younginnovations / iatipublisher

IATI Publishing Tool

Test with 100 activities with 10 results in each activity, 10 indicators in each result and 20 periods in each indicator #1557

Open · PG-Momik opened 2 months ago

PG-Momik commented 2 months ago

[!NOTE] According to previous discussion, for testing bulk publishing of 100 activities, we used 50 transactions, 50 results, 10 indicators (per result) and 10 periods (per indicator)

Context

This time, we used 1 to 3 transactions, with 10 results, 10 indicators per result and 20 periods per indicator, for each activity. By significantly reducing the number of results and their children, we were able to reduce the size of each activity XML to less than 2 MB. We created 100 similar activities for this test.

Findings

  1. Bulk publishing of 100 activities was successful, and the process did not crash.
  2. Total time taken for the bulk publish: around 1 hour and 3 minutes.
  3. The main contributor to the bulk publish time was validation against the IATI validator, which took approximately 50 minutes.
  4. There is a size limit of 60 MB for published files on the IATI Registry. (?) [screenshot: IATI Registry dataset page showing the file size limit]
  5. Publishing to the IATI Registry is possible with size > 100 MB, but the stats won't update. (?) In the above screenshot, we can see Download (4.5KB); this was the same stat before we published 100 activities.

Questions

  1. We tested with 10 results, 10 indicators (each), 20 periods (each). Do you have a sample 'payload' that you want us to test?
  2. Could Findings 4 and 5 be confirmed? They may be a misinterpretation on our part.
  3. If Finding 4 is correct, is it possible to increase the maximum file size allowed on the IATI Registry?

Possible changes

PG-Momik commented 2 months ago

@emmajclegg please have a look at this.

emmajclegg commented 2 months ago

Thanks @PG-Momik - some initial responses to this:

My questions:

PG-Momik commented 2 months ago

@emmajclegg


Most of the time consumed in a bulk publish is spent validating against the IATI Validator. As per my previous reply, changing how we validate against the IATI Validator will reduce this time to less than half (I'll get back to you on the actual time improvements).


I think it is reasonable to prevent bulk publishing of 100 activities if IATI Publisher can see that a particular publisher's data is e.g. exceeding the 60MB limit (or, even better, split data into multiple files before validation).

I'll look into both setting a maximum file size limit and chunking the dataset into 60 MB pieces when validating against the Validator.
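For illustration, the chunking idea could look something like the following Python sketch — the function name and the 60 MB cap are assumptions based on this thread, not IATI Publisher's actual implementation:

```python
# Hypothetical sketch: group per-activity XML strings into chunks that each
# stay under a byte cap, so each merged file can be validated separately.

MAX_CHUNK_BYTES = 60 * 1024 * 1024  # assumed cap, from the registry limit above

def chunk_activities(activity_xml_strings):
    """Yield lists of activity XML strings whose combined size stays under the cap."""
    chunk, chunk_size = [], 0
    for xml in activity_xml_strings:
        size = len(xml.encode("utf-8"))
        if chunk and chunk_size + size > MAX_CHUNK_BYTES:
            yield chunk  # current chunk is full; start a new one
            chunk, chunk_size = [], 0
        chunk.append(xml)
        chunk_size += size
    if chunk:
        yield chunk  # emit the final, partially filled chunk
```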

emmajclegg commented 2 months ago

@PG-Momik (cc' @praweshsth) - to summarise from our discussion this morning:

robredpath commented 2 months ago

This is really useful and interesting - thank you @PG-Momik !

60MB is a very large amount of IATI data, so it's not surprising that it is taking a long time to go through the pipeline. We can see from the Dashboard that very few files are larger than 20MB, and those that are approaching the 60MB limit are from some very large organisations - such as UNICEF's files for their HQ budgets and activities in India.

The 60MB limit will become more important over time as we make all of IATI's tools behave in a more similar way. For example, files over the limit already don't appear in the Datastore, and may soon be excluded from d-portal as well. As I shared on the call, I would like to see if we can reduce the limit: it's easier for systems to process 3 x 20MB files than one 60MB file.

50 minutes seems like a very long time for a single file to take to validate on the IATI Validator. I've just tested a 50MB file that I downloaded (one of the UNICEF files) and it validated in ~2 minutes. I see above that you're validating activities one-by-one; I suspect that is the cause of the slow validation as that's not how the Validator is intended to be used. It also misses out on some of the more advanced validation, such as activity ID lookups for related activities.

Whenever we talk about IATI Publisher's role in the IATI ecosystem, we talk about it being "for organisations that have a small amount of IATI data to publish" - if someone has enough data to hit the IATI pipeline limit, then IATI Publisher probably isn't suitable for their use case. If you're autogenerating the test file, then a more appropriate test might be to randomise the number of results, indicators and periods within each activity, with the maximums set at the current levels? I defer to @emmajclegg as to whether that's more realistic, however.
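As a rough sketch of that suggestion, a randomised test payload could be generated along these lines (the maximums mirror the levels used in this test; the output shape is purely illustrative):

```python
import random

# Assumed maximums, mirroring the levels used in this test.
MAX_TRANSACTIONS, MAX_RESULTS, MAX_INDICATORS, MAX_PERIODS = 3, 10, 10, 20

def random_activity_shape():
    """Randomise how many transactions, results, indicators and periods one test activity gets."""
    return {
        "transactions": random.randint(1, MAX_TRANSACTIONS),
        "results": [
            {
                "indicators": [
                    {"periods": random.randint(1, MAX_PERIODS)}
                    for _ in range(random.randint(1, MAX_INDICATORS))
                ]
            }
            for _ in range(random.randint(1, MAX_RESULTS))
        ],
    }

test_activities = [random_activity_shape() for _ in range(100)]  # 100 test activities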

PG-Momik commented 2 months ago

Hi @robredpath

Regarding the 60 MB limit

if someone has enough data to hit the IATI pipeline limit, then IATI Publisher probably isn't suitable for their use case.

We're on the same page. With this in mind, we've decided to keep track of the total file size an organization has published to date. When performing a publish, we'll check whether the cumulative file size exceeds 60 MB and prevent publishing accordingly.
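A minimal sketch of that check, assuming the 60 MB figure from this thread (how the published total is stored and looked up is a hypothetical stand-in):

```python
CUMULATIVE_LIMIT_BYTES = 60 * 1024 * 1024  # 60 MB registry limit discussed above

def can_publish(published_bytes_to_date: int, new_file_bytes: int) -> bool:
    """Allow publishing only while the organization's cumulative size stays within the limit."""
    return published_bytes_to_date + new_file_bytes <= CUMULATIVE_LIMIT_BYTES
```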


Regarding use of the Validator

We're currently implementing validation of all activities in one go. One challenge we've run into is the errors and the line numbers each error is on. The IATI Validator validates the entire merged XML in one pass and returns the error line numbers for the merged XML. IATI Publisher shows the error messages received from the IATI Validator in 2 places for each activity:

  1. When downloading the published XML.
  2. In the activity detail page.

Since the line numbers in the error messages for the merged XML do not directly correspond to line numbers within each activity's original XML file, it becomes challenging to display the error messages accurately in IATI Publisher.

Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file? I'm currently working on a mapper to map error lines, but it's taking significantly longer than I initially anticipated. If changes to the IATI validator are not feasible, or would require a substantial amount of time to implement, I'll continue working on this approach. However, this is delaying the completion of the task more than I had estimated.

cc: @BibhaT @emmajclegg @Sanilblank

emmajclegg commented 1 month ago

Hi @PG-Momik

We're on the same page. With this in mind, we've decided to keep track of the total file size an organization has published to date. When performing a publish, we'll check whether the cumulative file size exceeds 60 MB and prevent publishing accordingly.

This makes sense. As we discussed last week, it will likely be beneficial to set an even lower warning threshold for file size (20 or 30MB?). This doesn't need to prevent publishing, but should warn in cases where publishing is going to take a long time.

The IATI Validator validates the entire merged XML in one pass and returns the error line numbers for the merged XML. IATI Publisher shows the error messages received from the IATI Validator in 2 places for each activity:

  1. When downloading the published XML.
  2. In the activity detail page.

Case 1 (if you're referring to the error message below) has actually been on our snagging list for a while, and is something I was going to raise an issue about removing:

[screenshot: validation feedback shown on XML download]

We don't think many users use the XML download functionality, and the feedback above is not particularly readable or actionable. IATI Publisher's validator feedback within the activity detail page is much more useful, and users can run their XML files through the IATI Validator itself if they really need to.

Therefore - if removing the XML download feedback pictured above makes this bulk publishing work simpler, by all means remove it.

Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file? I'm currently working on a mapper to map error lines, but it's taking significantly longer than I initially anticipated.

I've asked within the team about this request for activity-level error lines - I don't know if it's easy to implement or something we see as valuable elsewhere. In the meantime, can you let me know roughly how much time you think will be required for the error mapping? (and how much will this impact on the time bulk publishing takes?)

If you're autogenerating the test file, then a more appropriate test might be to randomise the number of results, indicators and periods within each activity, with the maximums set at the current levels?

Lastly, to respond to @robredpath's suggestion above - yes, this sounds like a good way to get a better sense of average publishing times. It's still useful to know what IATI Publisher's limits are in terms of maximums, but most users will have significantly less data. Possible maximums to use for random experimentation are 50 transactions and 10 results, with 5 indicators each and 5 periods per indicator. Again, looking at 1-2 AidStream users with a lot of data as examples could be used as another source of maximums.

robredpath commented 1 month ago

Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file?

I'm hoping that we'll be able to carry out a substantial overhaul of Validator messaging in 2025 - the line/column references are a common cause of complaints about the Validator, as most of the tools that people use to create their XML don't work that way. But I wouldn't expect any changes around that before mid-2025, realistically.

I've asked a colleague to see if we can provide per-activity line numbers (or, presumably, an offset from the start of the file would be ok?) but I think that's quite a substantial change so I wouldn't expect it to be straightforward. I'll let you know when I hear back.

Do you send multiple activities to be validated in parallel? I'm not sure quite how many parallel validation processes we can run but I think that 3-5 would be fine. We can always look at additional capacity if necessary.
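For what that could look like on the client side, here is a sketch that caps in-flight validations with a semaphore (the `validate` coroutine and its signature are placeholders, not the Validator's actual API):

```python
import asyncio

MAX_PARALLEL_VALIDATIONS = 4  # within the 3-5 range suggested above

async def validate_all(files, validate):
    """Run the placeholder `validate` coroutine over all files, at most N at a time."""
    semaphore = asyncio.Semaphore(MAX_PARALLEL_VALIDATIONS)

    async def bounded(f):
        async with semaphore:  # blocks while N validations are already in flight
            return await validate(f)

    return await asyncio.gather(*(bounded(f) for f in files))
```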

Alternatively, would it make sense for you to run your own instance of the Validator? All of the code is on GitHub (although it's all written for Azure). We could arrange a call with our developers if you wanted to discuss what that might involve.

PG-Momik commented 1 month ago

Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file?

Thank you @robredpath. I'm a bit late following up on this topic, but I've completed the logic to map error line numbers to the activity level based on offsets. It seems to be working well, so you can disregard my request above.
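For reference, the offset approach amounts to something like this simplified sketch (the actual mapper in IATI Publisher may differ):

```python
import bisect

def build_offsets(activity_line_counts):
    """Return the 1-indexed starting line of each activity within the merged file.

    Assumes activities appear back-to-back; any wrapper lines around the merged
    <iati-activities> root are ignored for simplicity.
    """
    offsets, start = [], 1
    for count in activity_line_counts:
        offsets.append(start)
        start += count
    return offsets

def map_error_line(merged_line, offsets):
    """Map a merged-file line number to (activity index, line within that activity)."""
    idx = bisect.bisect_right(offsets, merged_line) - 1
    return idx, merged_line - offsets[idx] + 1
```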

cc: @emmajclegg

PG-Momik commented 1 month ago

@emmajclegg I feel like we've drifted a bit from our primary objective: increasing bulk publish to 100 activities.

1. Regarding the file size limit

As I stated in our discussion, we should address the import and publish size limits in a separate issue. There, we can consider whether to show a warning or an error at a certain threshold.


2. Regarding the error message on download

Therefore - if removing the XML download feedback pictured above makes this bulk publishing work simpler, by all means remove it.

If we're looking to decrease bulk publish time, removing this module from the download flow will not result in any performance gains in either bulk publish or download.


In the meantime, can you let me know roughly how much time you think will be required for the error mapping? (and how much will this impact on the time bulk publishing takes?)

My initial concern with error mapping was the hours it would take me; I was hoping there'd be a quicker solution on the Validator side. But I've managed to write a function to map the errors. In terms of its impact on bulk publish time, it takes under 2 minutes to map errors (depending on the payload, even less).


From our initial stress test, we know that it is possible to publish 100 files. I think we ought to focus on approaches to reduce bulk publish time. It took an average of 15 to 19 minutes to validate the initial (60 MB) payload.

I've just tested a 50MB file that I downloaded (one of the UNICEF files) and it validated in ~2 minutes.

As @robredpath mentioned, it should take ~2 minutes to validate this payload - and yes, it does just take ~2 minutes to actually validate. Most of the time is taken up by XML generation and the upload to S3. (I think the S3 upload is related to other features; I'll leave a follow-up on this.) I'll look further into ways to reduce bulk publish time.

Possible maximums to use for random experimentation are 50 transactions and 10 results, with 5 indicators each and 5 periods per indicator. Again, looking at 1-2 AidStream users with a lot of data as examples could be used as another source of maximums.

I'll comment with the bulk publish time taken for publishing 100 activities with the said payloads.

cc: @BibhaT