Closed · anjesh closed this issue 2 years ago
The data process should have two steps: data initialization and subsequent data updates. The first step is data initialization, which is done only once. The subsequent process updates and creates records based on the org-data published by the publishers. These subsequent changes will also be logged, so we can see how the data evolves over time.
The following db structure is identified to maintain the org-data.
Table: Organisation
Fieldname | Type |
---|---|
id | int |
identifier | varchar |
type | int |
country | varchar |
is_publisher | tinyint |
Table: name
Fieldname | Type |
---|---|
organisation_id | int |
lang | varchar |
name | varchar |
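The two tables above could be sketched roughly as follows. This is a minimal illustration using SQLite; the actual database engine, constraints, and key definitions are not specified in this issue, so treat them as assumptions.

```python
import sqlite3

# Illustrative schema only; engine/constraints are assumptions, not the final design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organisation (
    id           INTEGER PRIMARY KEY,
    identifier   VARCHAR NOT NULL,
    type         INTEGER,
    country      VARCHAR,
    is_publisher TINYINT
);

CREATE TABLE name (
    organisation_id INTEGER REFERENCES organisation(id),
    lang            VARCHAR,
    name            VARCHAR
);
""")

# One sample row from the example data further down.
conn.execute("INSERT INTO organisation VALUES (1, 'GB-CHC-1074937', 1, 'GB', 1)")
conn.execute("INSERT INTO name VALUES (1, 'en', 'ADRA-UK')")
row = conn.execute(
    "SELECT o.identifier, n.name FROM organisation o "
    "JOIN name n ON n.organisation_id = o.id"
).fetchone()
print(row)  # ('GB-CHC-1074937', 'ADRA-UK')
```

Keeping names in a separate table allows multiple languages per organisation, which is why `name` carries a `lang` column.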
This is a one-time process. We will read the organisation-data XML files and the publishers' information from the registry. These two sets of data will be combined into a clean data set. Organisation data will only be included in the clean set if it has a valid organisation identifier.
This information will also be manually checked for errors during the initialization process. The data will be available to AidStream through an API, and could also be browsed through the explorer.
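The initialization step might look something like the sketch below: merge org-data entries with the publisher list from the registry, and keep only rows whose identifier looks valid. The `{AGENCY}-{ID}` regex and the allowance for purely numeric identifiers are assumptions for illustration, not the project's actual validation rule.

```python
import re

# Assumed identifier shape, e.g. GB-CHC-1074937 (prefix + one or more segments).
IDENTIFIER_RE = re.compile(r"^[A-Z]{2}(-[A-Z0-9]+)+$")

def looks_valid(identifier):
    # Purely numeric codes (e.g. multilateral identifiers like 46002) are also kept.
    return bool(IDENTIFIER_RE.match(identifier)) or identifier.isdigit()

def initialize(org_entries, publisher_ids):
    """One-time merge of org-data rows with registry publisher identifiers."""
    clean = []
    for entry in org_entries:
        if not looks_valid(entry["identifier"]):
            continue  # drop rows without a valid organisation identifier
        entry = dict(entry, is_publisher=entry["identifier"] in publisher_ids)
        clean.append(entry)
    return clean

orgs = [
    {"identifier": "GB-CHC-1074937", "name": "ADRA-UK"},
    {"identifier": "???", "name": "broken row"},
]
print(initialize(orgs, {"GB-CHC-1074937"}))
```

The result would then go through the manual check described above before being dumped into the database.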
For example, this might be the first set of data prepared (two tables from above merged for simplicity)
id | identifier | type | country | is_publisher | lang | name |
---|---|---|---|---|---|---|
1 | GB-CHC-1074937 | 1 | GB | True | en | ADRA-UK |
2 | 46002 | 1 | | True | en | African Development Bank |
3 | KE-NGC-3372 | 1 | KE | False | en | I CHOOSE LIFE AFRICA |
There will be regular pulling of data (both organisation-data and publishers' data), which will be checked for any changes against the data prepared in the "data initialization" step. Data from step 1 might also be updated manually, as change suggestions will be possible through the AidStream interface as well as the explorer.
If there are recent changes in the organisation-data or publishers' data, these will appear as suggestions to the moderator, who will approve or reject them.
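The update step described above could be sketched as a simple diff between the stored clean set and a fresh pull, queuing differences as suggestions for the moderator. The field names follow the tables above; the `("create"/"update", identifier, payload)` tuple shape is just an assumption for illustration.

```python
def diff_records(stored, pulled):
    """Compare stored clean data with freshly pulled data; return suggestions."""
    suggestions = []
    for identifier, new in pulled.items():
        old = stored.get(identifier)
        if old is None:
            # Organisation not seen before: suggest creating it.
            suggestions.append(("create", identifier, new))
            continue
        changed = {k: v for k, v in new.items() if old.get(k) != v}
        if changed:
            # Only the changed fields go to the moderator for approval.
            suggestions.append(("update", identifier, changed))
    return suggestions

stored = {"GB-CHC-1074937": {"name": "ADRA-UK", "country": "GB"}}
pulled = {
    "GB-CHC-1074937": {"name": "ADRA UK", "country": "GB"},
    "KE-NGC-3372": {"name": "I CHOOSE LIFE AFRICA", "country": "KE"},
}
print(diff_records(stored, pulled))
```

Logging each accepted or rejected suggestion would give the change history mentioned at the top of this issue.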
https://github.com/younginnovations/iati-organisations-cleanup has the details/scripts for the auto-cleanup process. The auto-cleaned data now needs to be manually checked for other issues and then dumped into the database for the API.
The script still needs to be updated/improved, as we see there are ways to improve the data based on the existing auto-cleaned output.
There are 400+ organisation XML files in the IATI Registry from 550+ publishers. The organisation data (name, identifier, type) from these files, together with the publishers' data, are to be scraped and explored to see how they could be cleaned up.
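Scraping the registry for the organisation files might start from the registry's CKAN search API (`package_search`). The payload shape in `sample` below is an assumption for illustration; the real response should be checked against the registry, and requests should be paginated.

```python
import json

def extract_org_files(payload):
    """Pull publisher name and file URL for organisation-type packages
    out of an assumed CKAN-style package_search payload."""
    files = []
    for pkg in payload["result"]["results"]:
        extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
        if extras.get("filetype") == "organisation":
            files.append({
                "publisher": pkg["organization"]["name"],
                "url": pkg["resources"][0]["url"],
            })
    return files

# Hypothetical single-package response; field names are assumptions.
sample = json.loads("""{"result": {"results": [
  {"name": "adra-org",
   "organization": {"name": "adra"},
   "extras": [{"key": "filetype", "value": "organisation"}],
   "resources": [{"url": "https://example.org/adra-org.xml"}]}
]}}""")
print(extract_org_files(sample))
```

The resulting list of file URLs would then feed the scraping and cleanup exploration described above.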
This exercise will guide us in preparing the API spec to be used in the org-module.