zimmerman-team / IATI.cloud

IATI datastore powered by Apache Solr. Automatically extracts and parses IATI XML files referenced in the IATI Registry, refreshed every 3 hours. IATI is a global initiative to improve the transparency of development and humanitarian resources and their results for addressing poverty and crises.
https://datastore.iati.cloud
MIT License

IOM data not updating #2504

Closed markbrough closed 3 years ago

markbrough commented 3 years ago

Describe the bug IOM data was updated on 2020-11-17. However, it has not been updated in the IATI Datastore.

To Reproduce Steps to reproduce the behavior:

  1. View one IOM activity in the IATI Datastore: https://iatidatastore.iatistandard.org/search/activity?q=iati_identifier:(XM-DAC-47066-CC.0016)&wt=xslt&tr=activity-xml.xsl&rows=1
  2. See that there are no humanitarian-scope elements
  3. View the same IOM activity in D-Portal: http://d-portal.org/q.xml?aid=XM-DAC-47066-CC.0016
  4. See that there are humanitarian-scope elements
  5. Check the IATI Validator: http://validator.iatistandard.org/?perm=https:__data.iom.int_data_acc-IOM.xml_1606251659
  6. See that the file validates against the IATI Standard
  7. Check the IATI Datastore Dataset API query: https://iatidatastore.iatistandard.org/api/datasets/?name=iom-activity&format=json
  8. See that the file was last updated in the Datastore on 2020-11-14T07:39:01.745614
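As an aside, the staleness check in steps 7 and 8 can be scripted. A minimal sketch, assuming a JSON payload shaped like the dataset API response; the field names (`registry_modified`, `datastore_updated`) are illustrative assumptions, not the API's actual schema:

```python
import json
from datetime import datetime, timedelta

# Hypothetical excerpt of a /api/datasets/ response; field names are assumptions.
payload = json.loads("""
{"results": [{"name": "iom-activity",
              "registry_modified": "2020-11-17T00:00:00",
              "datastore_updated": "2020-11-14T07:39:01.745614"}]}
""")

def is_stale(dataset, max_lag_hours=24):
    """True if the Datastore copy lags the Registry update by more than max_lag_hours."""
    registry = datetime.fromisoformat(dataset["registry_modified"])
    datastore = datetime.fromisoformat(dataset["datastore_updated"])
    return registry - datastore > timedelta(hours=max_lag_hours)

stale = [d["name"] for d in payload["results"] if is_stale(d)]
print(stale)  # ['iom-activity']: updated 2020-11-17, last parsed 2020-11-14
```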

N.B. D-Portal also helpfully copies a number of attributes from <iati-activities> to each <iati-activity>, such as the generated-datetime. I think this would be very helpful to include in the Datastore output (in a namespace, in the same way as D-Portal does it).

Expected behavior Datastore updated within a reasonable timeframe (e.g. 24 hours)

siemvaessen commented 3 years ago

Hi @markbrough - we run two data pipelines (one for production, one as a failover) that contain the Datastore data, and we switch between them when codebase updates require a complete clean parse and indexing process. We switched last Wednesday from one pipeline to the other.

The parse and indexing process itself is automated, but its management currently happens behind the scenes. We plan on setting up a service page allowing anyone to see the list of publishers in the Datastore along with some basic stats (from the Datastore itself, activity counts etc.), plus meta-information on when data was last retrieved from each Registry URI endpoint (as the data itself does not live in the Registry) and on when that data was processed in the Datastore. We hope this provides some clarity on when what happens, as some of this is a bit of a mystery to outsiders. I am re-opening this issue; feel free to add some ideas/specs for that basic service page so we can collect them and create it.

amy-silcock commented 3 years ago

This is also happening for Oxfam Novib. Their data in the datastore still has a last-updated-date of the 16th or 17th of November.

Their actual data should have a date of the 28th or 29th of November.

For example, activity NL-KVK-27108436-000023 in the datastore: [screenshot]

In d-portal: [screenshot]

amy-silcock commented 3 years ago

Please can you provide a timeline for when we'll be able to see Oxfam's data updated?

markbrough commented 3 years ago

Hi @siemvaessen -- thanks for the explanation! I understand you are planning to set up a "service page" that will make it easier to see this kind of information.

However, I am unclear why some data would sometimes not be updated, or take a long time to update. It is also a bit concerning if things are failing silently at this stage.

Is the process for determining when/which files get updated documented somewhere? How long does it currently take to update data? NB:

siemvaessen commented 3 years ago

Hi @markbrough I do not think you understand how data processing in the Datastore works, or the Validator dependency it has built in by design (unlike D-Portal, which has no schema validation, holds only a subset of IATI data on file and makes no use of the IATI Validator). This is something that needs clearer explanation, I guess.

I think it’s best we clarify this data process in a slide deck, for a better understanding among everyone working with IATI data of how data is processed and how fast or slow that is. And I kindly request you not to point me to the ToR in future conversations; as you may imagine, I know everything in there by heart. Not sure what the point is.

The Datastore has all of these bits integrated.

As every IATI dataset travels through the IATI ecosystem (currently the Registry, Datastore and Validator), presuming that continuously polling for new files = instant processing is not correct.

We will set up a slide deck on how the Datastore processes data, including a comparison with D-Portal's data processing. Hopefully that will clarify things and also make way for faster data processing in the future.

markbrough commented 3 years ago

Hi @siemvaessen, many thanks for your reply. Yes, I think it would help to have some clearer explanation - I am not aware that this process is documented in detail anywhere, but if I have missed something then feel free to point me towards it.

The reason I pointed to the TOR was to explain why I expect that the Datastore would be updated much more quickly than is currently the case. However, it sounds like you are saying that I should not expect data to be updated within ~24 hours. If that is the case, I think it would be good to state this explicitly somewhere.

siemvaessen commented 3 years ago

Ah yes, I can explicitly confirm that ~24 hours is not some kind of default.

Currently a full clean parse & index takes 8 days (only done when data model changes are introduced via codebase updates). Incremental updates work the same way, but all XML datasets need to be downloaded in a continuous cycle and compared to spot any change. The Datastore either skips a file after comparing it (identical file) or, if it spots a change (different file), adds that file to the data queue for full processing again. It then needs to wait for the Validator to process the file, which may have its own internal queue. Once the Validator has processed the file, the Datastore uses the Validator's message to either process the file (it is schema-valid) or ignore it (validation error). The Datastore is then triggered to parse and store the file in PostgreSQL, and once PostgreSQL has finished it triggers Apache Solr to update its index(es). The Apache Solr API is the main gateway for the Query Builder.

The continuous queue is a linear process: it starts at dataset #1 and works through to the last dataset available (#8,000-something), checking each one in turn: download the file from the publisher's location, compare, offer it to the Validator or not, etc. Even if only one transaction has been updated, the complete file needs to be fully reprocessed as if it were completely new.

So in theory an update may take a minute, but it may take much longer: in the worst case every file has changed and all of them need to be reprocessed.
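The loop described above can be sketched roughly as follows. This is a minimal illustration of the skip-or-reprocess logic under stated assumptions, not the actual IATI.cloud code, and all function names here are hypothetical:

```python
import hashlib

def file_changed(new_bytes, stored_sha1):
    """Compare a freshly downloaded dataset with the hash of the last processed copy."""
    return hashlib.sha1(new_bytes).hexdigest() != stored_sha1

def process_cycle(dataset_names, stored_hashes, download, validate, parse_and_index):
    """One linear pass over all datasets: download, compare, validate, then index."""
    for name in dataset_names:                    # dataset #1 .. #8,000-something, in order
        data = download(name)                     # fetch from the publisher's own URL
        if not file_changed(data, stored_hashes.get(name)):
            continue                              # identical file: skip it
        if validate(data):                        # wait for the Validator's verdict
            parse_and_index(data)                 # store in PostgreSQL, then reindex Solr
        stored_hashes[name] = hashlib.sha1(data).hexdigest()
```

Note how even a one-transaction change flips the whole-file hash, which is why the entire file is reprocessed as if it were new.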

Once we have that slide deck finished we will share it to clarify this, and see which parts of the IATI Data Travel can be optimised.

markbrough commented 3 years ago

Thanks for the explanation @siemvaessen - this is very helpful. I agree it would be great to think about how this process can be optimised.

leostolk commented 3 years ago

@siemvaessen I understand this is a linear process that needs some time to proceed through all the steps, and that the process depends on the Validator's queue. I just looked up the ONL data files on the Validator: reports were generated on 30/11 and 01/12 without fatal errors, so I assume they were validated. That would explain a time difference of one or two days.
Still, the query builder produces XML records with 17/11 as the last-updated date.

We ask publishers to increase publishing frequency for COVID-19 specifically and for all humanitarian activities in general. So we should have clarity on how a daily publishing frequency translates into a Datastore refresh frequency, as an SLA. My expectation is daily too, and I find the current process length of two weeks discouraging for both publishers and data users. This supports @markbrough's plea to "optimise" processes.

alexlydiate commented 3 years ago

For clarity on the Validator in its present incarnation, it takes about a minute or so to produce a validation report, and it does so in series - ie, it can only do one at a time.

It relies on the Datastore to know whether a file is available to be validated, as it does not presently query the Registry directly.

It queues its files based on the Downloaded date as received from the Datastore, and it runs its sync with the Datastore every hour. So the length of its queue is dependent on the number of files reported as new or changed by the Datastore, and as a rule of thumb you can multiply that number by one minute to gauge how long it will take to produce each report.

If 120 files are reported, for example, the last report in the queue will take around two hours to appear.
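That rule of thumb is just serial queueing arithmetic, for example:

```python
def worst_case_wait_minutes(files_in_queue, minutes_per_report=1):
    # Validation runs in series, so the last report waits behind all the others.
    return files_in_queue * minutes_per_report

print(worst_case_wait_minutes(120))  # 120 minutes, i.e. the ~2 hour wait mentioned above
```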

We're looking to address this variable queue with the next major version by processing in parallel - in conjunction with considering how we provide a high-performing API - but, that's at an early stage of thought.

leostolk commented 3 years ago

Thanks @alexlydiate. So the fact that the Validator shows ONL files as downloaded on the 30th of November and the 1st of December means that the Datastore provided that information in a more or less timely way. And reports were generated without fatal errors on the 28th of November and, most recently, this morning on the 2nd of December.
For me as a database layman, that brings the issue back to the Datastore: why is there no refreshed data in the query builder?

markbrough commented 3 years ago

Thanks @alexlydiate for the reply, I agree the Validator processing multiple files in parallel sounds like it would be a definite improvement.

I was wondering if it is possible to roughly quantify the amount of time spent at each stage of the process? That could help identify other ways in which the process could be optimised.

It sounds to me like a lot of time is spent waiting for the Validator to reply. However, a lot of what the Validator is doing is not relevant for the Datastore (it doesn't care about the ruleset tests, it only cares about schema validation). For example, that IOM dataset took ~5 minutes on the new Validator, but around ~5 seconds on the old validator (which only does schema validation). This makes me think that one way of optimising the process would be to either:

Michelle-IOM commented 3 years ago

For what it is worth, IOM updated its data on 17 Nov, when we started including HRP data in the humanitarian element. While the HumPortal has picked up that we have HRP data now (visible 9 days later, on the 26th), it still indicates that our last publication date was 30 October. And on 27 Nov we published additional projects, such that we now have 2050 activities. The HumPortal still shows 2020 (the count for both 30 Oct and 17 Nov), and as of today the date of last publication is still 30 October. While it has only been 5 days since our last publication round, that doesn't explain the date issue. Perhaps it is a HumPortal-only issue, in which case Mark B. will address it, but I would second the question raised by @stormnl (sorry, I don't know who that is; Leo maybe?): how is it that we can push for frequent publishing in emergencies (i.e. COVID) if it takes more than a week for the data to show up in the publicly available tools? I have no clue how to fix this technically, but I know this wasn't the expectation from the member or user community. I look forward to hearing that there is a solution though.

markbrough commented 3 years ago

@Michelle-IOM - just to confirm, there are currently 2020 IOM activities in the IATI Datastore (which the Humportal follows). There are 2050 IOM activities in D-Portal.

I have separately raised the issue of the last updated date not correctly reflecting the last updated date of the dataset (this is also coming from the Datastore but perhaps there is an issue with the query or something).

Sources:

leostolk commented 3 years ago

@Michelle-IOM stormnl was indeed Leo Stolk (old account, now changed to leostolk ;-)). And today the query builder delivers ONL activities with a last-update date of 2020-12-01, so the gap was 13 days. I will monitor this over the next month.

PetyaKangalova commented 3 years ago

@siemvaessen following up on this one: Oxfam Novib's activities data is again delayed in refreshing:

https://iatidatastore.iatistandard.org/api/datasets/?publisher_identifier=NL-KVK-27108436&format=json [screenshot]

Query builder still shows last update 6th of December: https://iatidatastore.iatistandard.org/search/activity?q=reporting_org_ref:(NL-KVK-27108436)&wt=xslt&tr=activity-xml.xsl&rows=50 [screenshot]

siemvaessen commented 3 years ago

Hi @PetyaKangalova well, not 'again'.

By design requirement, the DS is connected to the Validator, which downloaded a new file from ONL on 19/12 but only just processed it, 2 days after download: https://iativalidator.iatistandard.org/organisation/onl - now that the Datastore has a report from the Validator, it will start to process that new file.

I will start closing issues like this, as the original issue was solved and this is not a specific DS issue but rather a Validator issue: it took a very long time for that file to be processed. The Validator is a service the Datastore has zero control over.

siemvaessen commented 3 years ago

Once the per-activity sha1 functionality is in place (ETA March 2021), updating activities (updates, new activities, etc.) will be done on a per-activity basis rather than per dataset.
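A rough sketch of what per-activity change detection could look like, so only modified activities are requeued instead of the whole dataset. This is purely illustrative and not the planned implementation:

```python
import hashlib
import xml.etree.ElementTree as ET

def activity_hashes(xml_text):
    """Map each iati-identifier to the sha1 of its serialised <iati-activity> element."""
    root = ET.fromstring(xml_text)
    return {
        act.findtext("iati-identifier"): hashlib.sha1(ET.tostring(act)).hexdigest()
        for act in root.findall("iati-activity")
    }

# Toy before/after datasets: only XM-1's title changes between the two versions.
OLD = ("<iati-activities>"
       "<iati-activity><iati-identifier>XM-1</iati-identifier><title>old</title></iati-activity>"
       "<iati-activity><iati-identifier>XM-2</iati-identifier></iati-activity>"
       "</iati-activities>")
NEW = ("<iati-activities>"
       "<iati-activity><iati-identifier>XM-1</iati-identifier><title>new</title></iati-activity>"
       "<iati-activity><iati-identifier>XM-2</iati-identifier></iati-activity>"
       "</iati-activities>")

old, new = activity_hashes(OLD), activity_hashes(NEW)
changed = [i for i in new if old.get(i) != new[i]]  # only these need reprocessing
print(changed)  # ['XM-1']: XM-2 is byte-identical and would be skipped
```

This also catches brand-new activities, since their identifier has no stored hash to match.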

markbrough commented 3 years ago

This still seems to be a problem. e.g. World Bank data was last updated on 24th March: https://iatiregistry.org/dataset/worldbank-cd

However, the file still has not been parsed two weeks later (the last-updated-datetime is 2021-02-22): https://iatidatastore.iatistandard.org/search/activity?q=iati_identifier:(44000-P149233)&wt=xslt&tr=activity-xml.xsl&rows=1

This data is visible in D-Portal (the last-updated-datetime is 2021-03-24) http://d-portal.org/q.xml?aid=44000-P149233

You can see in D-Portal that this transaction exists, but is missing in the DSv2 output:

<transaction>
  <transaction-type code="3"/>
  <transaction-date iso-date="2021-02-28"/>
  <value value-date="2021-02-28">314318</value>
  <description xml:lang="en">
    <narrative>Total Disbursement in First Quarter of 2021</narrative>
  </description>
  <provider-org ref="44002">
    <narrative>International Development Association</narrative>
  </provider-org>
  <flow-type code="10"/>
  <finance-type code="410"/>
</transaction>

It looks like the Validator was asked today to parse this file (dataset is worldbank-cd): https://iativalidator.iatistandard.org/organisation/worldbank