PG-Momik opened 2 months ago
@emmajclegg please have a look at this.
Thanks @PG-Momik - some initial responses to this:
My questions:
@emmajclegg
| Time for validation complete | Uploading individual XMLs to S3 | Merged file manipulation and upload to S3 | Publish to Registry |
|---|---|---|---|
| ~48-50 mins | ~7-8 mins | ~2-4 mins | <1 min |
Most of the bulk publish time is consumed by validating against the IATI Validator. As per my previous reply, changing how we validate against the IATI Validator will reduce this time to less than half (I'll get back to you on the actual time improvements).
I think it is reasonable to prevent bulk publishing of 100 activities if IATI Publisher can see that a particular publisher's data exceeds, e.g., the 60MB limit (or, even better, to split the data into multiple files before validation).
I'll look into both setting a maximum file size limit and chunking the dataset into batches of up to 60MB when validating against the validator.
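For concreteness, the chunking could look something like the sketch below (PHP, since IATI Publisher is a Laravel app; function and variable names are illustrative, not the final implementation):

```php
<?php

/**
 * Sketch only: split per-activity XML snippets into batches whose
 * combined size stays under the validator's limit.
 *
 * @param string[] $activityXmls
 * @return string[][]
 */
function chunkActivitiesBySize(array $activityXmls, int $maxBytes = 60 * 1024 * 1024): array
{
    $chunks = [];
    $current = [];
    $currentSize = 0;

    foreach ($activityXmls as $xml) {
        $size = strlen($xml); // strlen counts raw bytes, which is what the limit is in

        // Close the current batch before it would exceed the limit.
        if ($current !== [] && $currentSize + $size > $maxBytes) {
            $chunks[] = $current;
            $current = [];
            $currentSize = 0;
        }

        // Note: a single activity larger than $maxBytes would still need
        // to be rejected upstream; it cannot be split here.
        $current[] = $xml;
        $currentSize += $size;
    }

    if ($current !== []) {
        $chunks[] = $current;
    }

    return $chunks;
}
```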
@PG-Momik (cc' @praweshsth) - to summarise from our discussion this morning,
60MB is a hard maximum for data files on the IATI Registry, and this limit may even be reduced in future.
Therefore, IATI Publisher should definitely detect and stop an attempt to publish a file size greater than 60MB. We probably want to set the file size limit lower in IATI Publisher, depending on final processing times that come out of this work. I would not expect a user to wait more than a few minutes maximum on a data publishing or import screen; any longer than that and we should question whether what they are trying to do is sensible.
If all AidStream files are smaller than 25MB, we should treat this as representative of our IATI Publisher-Aidstream users. (I think @robredpath said today that most published activity files are actually smaller than 10MB). It is reasonable to set limits on who IATI Publisher is for (in terms of data volumes) and direct organisations with more data elsewhere.
This is really useful and interesting - thank you @PG-Momik !
60MB is a very large amount of IATI data, so it's not surprising that it is taking a long time to go through the pipeline. We can see from the Dashboard that very few files are larger than 20MB, and those that are approaching the 60MB limit are from some very large organisations - such as UNICEF's files for their HQ budgets and activities in India.
The 60MB limit will become more important over time as we make all of IATI's tools behave in a more similar way. For example, files over the limit already don't appear in the Datastore, and may soon be excluded from d-portal as well. As I shared on the call, I would like to see if we can reduce the limit: it's easier for systems to process 3 x 20MB files than one 60MB file.
50 minutes seems like a very long time for a single file to take to validate on the IATI Validator. I've just tested a 50MB file that I downloaded (one of the UNICEF files) and it validated in ~2 minutes. I see above that you're validating activities one-by-one; I suspect that is the cause of the slow validation as that's not how the Validator is intended to be used. It also misses out on some of the more advanced validation, such as activity ID lookups for related activities.
Whenever we talk about IATI Publisher's role in the IATI ecosystem we talk about it being "for organisations that have a small amount of IATI data to publish" - if someone has enough data to hit the IATI pipeline limit, then IATI Publisher probably isn't suitable for their use case. If you're autogenerating the test file then maybe a more appropriate test file might be to randomise the number of results, indicators and periods within each activity with the maximums set at the current levels? I defer to @emmajclegg as to whether that's more realistic, however.
Hi @robredpath
if someone has enough data to hit the IATI pipeline limit, then IATI Publisher probably isn't suitable for their use case.
We're on the same page. With this in mind, we've decided to keep track of the total file size an organization has published to date. When performing a publish, we'll check whether the cumulative file size would exceed 60MB and prevent the publish accordingly.
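A minimal sketch of the planned check (constant and function names are illustrative, not final):

```php
<?php

// Sketch of the planned pre-publish check: compare the organisation's
// cumulative published size plus the new payload against the Registry's
// 60MB ceiling.
const REGISTRY_MAX_BYTES = 60 * 1024 * 1024;

function canBulkPublish(int $publishedBytesToDate, int $newPayloadBytes): bool
{
    return ($publishedBytesToDate + $newPayloadBytes) <= REGISTRY_MAX_BYTES;
}
```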
We're currently implementing validation of all activities in one go. One challenge we've run into is the error line numbers: the IATI Validator validates the entire merged XML at once and returns error line numbers relative to that merged file. IATI Publisher shows the error messages received from the IATI Validator in two places for each activity:
- When downloading the published XML.
- In the activity detail page.
Since the line numbers of error messages for the merged XML do not directly correspond to the line numbers of individual activities within their original XML files, it is challenging to display the error messages accurately in IATI Publisher.
Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file? I'm currently working on a mapper to map error lines, but it's taking significantly longer than I initially anticipated. If changes to the IATI validator are not feasible or would require a substantial amount of time to implement, I'll continue working on this approach. However, this is delaying the completion of this task beyond what I had estimated.
cc: @BibhaT @emmajclegg @Sanilblank
Hi @PG-Momik
We're on the same page. With this in mind, we've decided to keep track of the total file size an organization has published to date. When performing a publish, we'll check whether the cumulative file size would exceed 60MB and prevent the publish accordingly.
This makes sense. As we discussed last week, it will likely be beneficial to set an even lower warning threshold for file size (20 or 30MB?). This doesn't need to prevent publishing, but should warn in cases where publishing is going to take a long time.
The IATI Validator validates the entire merged XML at once and returns error line numbers relative to that merged file. IATI Publisher shows the error messages received from the IATI Validator in two places for each activity:
- When downloading the published XML.
- In the activity detail page.
Case 1 (if you're referring to the error message below) has actually been on our snagging list for a while, and is something I was going to raise an issue to remove:
We don't think many users use the XML download functionality, and the feedback above is not particularly readable or actionable. IATI Publisher's validator feedback within the activity detail page is much more useful, and users can run their XML files through the IATI Validator itself if they really need to.
Therefore - if removing the XML download feedback pictured above makes this bulk publishing work simpler, by all means remove it.
Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file? I'm currently working on a mapper to map error lines, but it's taking significantly longer than I initially anticipated.
I've asked within the team about this request for activity-level error lines - I don't know if it's easy to implement or something we see as valuable elsewhere. In the meantime, can you let me know roughly how much time you think will be required for the error mapping? (and how much will this impact on the time bulk publishing takes?)
If you're autogenerating the test file then maybe a more appropriate test file might be to randomise the number of results, indicators and periods within each activity with the maximums set at the current levels?
Lastly, to respond to @robredpath's suggestion above - yes, this sounds like a good way to get a better sense of average publishing times. It's still useful to know what IATI Publisher's limits are in terms of maximums, but most users will have significantly less data. Possible maximums to use for random experimentation are 50 transactions, 10 results with 5 indicators each and 5 periods. Again, looking at 1-2 AidStream users with a lot of data as examples could be used as another source of maximums.
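As a rough illustration, randomising within those maximums might look like this (a sketch; all names are illustrative):

```php
<?php

// Sketch: random per-activity shape within the maximums suggested above
// (50 transactions; 10 results with 5 indicators each; 5 periods).
function randomActivityShape(): array
{
    $results = [];

    for ($r = 0, $n = random_int(1, 10); $r < $n; $r++) {
        $indicators = [];

        for ($i = 0, $m = random_int(1, 5); $i < $m; $i++) {
            $indicators[] = ['periods' => random_int(1, 5)];
        }

        $results[] = ['indicators' => $indicators];
    }

    return [
        'transactions' => random_int(1, 50),
        'results'      => $results,
    ];
}
```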
Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file?
I'm hoping that we'll be able to carry out a substantial overhaul of Validator messaging in 2025 - the line/column references are a common cause of complaints about the validator as most of the tools that people use to create their XML don't work that way. But, I wouldn't expect any changes around that before mid-2025, realistically.
I've asked a colleague to see if we can provide per-activity line numbers (or, presumably, an offset from the start of the file would be ok?) but I think that's quite a substantial change so I wouldn't expect it to be straightforward. I'll let you know when I hear back.
Do you send multiple activities to be validated in parallel? I'm not sure quite how many parallel validation processes we can run but I think that 3-5 would be fine. We can always look at additional capacity if necessary.
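As a rough illustration of bounded concurrency (a sketch using Guzzle's request pool, which Laravel apps typically already include; the validator URL here is a placeholder, not the real endpoint contract):

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// Sketch: send several validation requests in parallel, capped at 5
// concurrent requests as suggested above.
$xmlFiles = [/* per-file XML strings */];

$client = new Client();

$requests = function (array $files) {
    foreach ($files as $xml) {
        yield new Request(
            'POST',
            'https://validator.example/api/validate', // placeholder URL
            ['Content-Type' => 'application/xml'],
            $xml
        );
    }
};

$pool = new Pool($client, $requests($xmlFiles), [
    'concurrency' => 5,
    'fulfilled'   => function ($response, $index) {
        // store the per-file validation report
    },
    'rejected'    => function ($reason, $index) {
        // record the failure; retry or surface an error
    },
]);

$pool->promise()->wait();
```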
Alternatively, would it make sense for you to run your own instance of the Validator? All of the code is on GitHub (although it's all written for Azure). We could arrange a call with our developers if you wanted to discuss what that might involve.
Would it be possible to update the IATI validator to provide a response that includes error information with line numbers for each individual activity within the merged file, rather than just the error line numbers for the entire merged file?
Thank you @robredpath. I'm a bit late following up on this topic, but I've completed writing the logic to map error line numbers to the activity level based on the offset. It seems to be working well, so you can disregard my earlier request.
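For reference, the idea behind the offset-based mapping is roughly this (a simplified sketch, not the exact code):

```php
<?php

/**
 * Simplified sketch: given the (sorted, 1-based) starting line of each
 * activity inside the merged XML, translate a merged-file error line
 * back to an activity index and a line number local to that activity.
 *
 * @param int[] $activityStartLines
 * @return array{activity: int, localLine: int}
 */
function mapMergedLineToActivity(int $mergedLine, array $activityStartLines): array
{
    $activityIndex = 0;

    // Find the last activity whose start line is at or before the error line.
    foreach ($activityStartLines as $i => $startLine) {
        if ($mergedLine >= $startLine) {
            $activityIndex = $i;
        } else {
            break;
        }
    }

    // Error lines before the first activity (e.g. the merged file header)
    // would need separate handling.
    return [
        'activity'  => $activityIndex,
        'localLine' => $mergedLine - $activityStartLines[$activityIndex] + 1,
    ];
}
```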
cc: @emmajclegg
@emmajclegg I feel like we've diverged a bit from our primary objective: increasing the bulk publish limit to 100 activities.
As I stated in our discussion, we should address the import and publish size limits in a separate issue. There we can decide whether to show a warning or an error at a certain threshold.
Therefore - if removing the XML download feedback pictured above makes this bulk publishing work simpler, by all means remove it.
If we're looking to decrease bulk publish time, removing this module from the download flow will not result in any performance gains for bulk publish or download.
In the meantime, can you let me know roughly how much time you think will be required for the error mapping? (and how much will this impact on the time bulk publishing takes?)
My initial concern with the error mapping was the developer hours it would take me; I was hoping there would be a quicker solution on the validator side. But I've managed to write a function to map errors. In terms of its impact on bulk publish time, it takes under 2 minutes to map errors (depending on the payload, even less).
From our initial stress test, we know that it is possible to publish 100 files. I think we ought to focus on approaches to reduce bulk publish time. It took an average of 15 to 19 minutes to validate the initial (60MB) payload.
I've just tested a 50MB file that I downloaded (one of the UNICEF files) and it validated in ~2 minutes.
As @robredpath mentioned, it should take ~2 minutes to validate this payload, and it does indeed take only ~2 minutes to actually validate. Most of the time is taken up by XML generation and the upload to S3. (I think the S3 upload is related to other features; I'll leave a follow-up on this.) I'll look further into ways to reduce bulk publish time.
Possible maximums to use for random experimentation are 50 transactions, 10 results with 5 indicators each and 5 periods. Again, looking at 1-2 AidStream users with a lot of data as examples could be used as another source of maximums.
I'll comment with the bulk publish time taken for 100 activities using said payloads.
cc: @BibhaT
Context
This time, we used 1 to 3 transactions, with 10 results, 10 indicators for each result, and 20 periods for each indicator, for each activity. By significantly reducing the number of results and their children, we were able to reduce the size of each activity XML to less than 2MB. We created 100 similar activities for this test.
Findings
Questions
Possible changes
We showed progress text like `Validating activities (x/N)` in our previous design. Since our current design does not have a progress bar to display text like `Validating activities (x/N)`, we could opt to validate all 100 activities in one go if they are under 60MB (the IATI Validator doesn't accept XML files larger than 60MB). This would decrease our bulk publish time.