younginnovations / aidstream-org-data


Data (org data and publishers data) exploration for the explorer UI and API #3

Closed anjesh closed 2 years ago

anjesh commented 7 years ago

There are 400+ organisation XML files in the IATI Registry by 550+ publishers. The organisation data (name, identifier, type) from these files and the publishers' data are to be scraped and explored to see how they could be cleaned up.


This exercise will guide us in preparing the API spec to be used in the org-module.
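As a rough illustration of the scraping step, here is a minimal sketch that pulls identifier, type and names out of one organisation file. It assumes the IATI 2.x organisation layout (`organisation-identifier`, `name/narrative`, and the `type` attribute on `reporting-org`); older 1.x files are laid out differently and would need their own handling. The `extract_organisations` helper and the sample XML (including the type code) are made up for illustration and are not taken from the cleanup scripts.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative sample only; real files come from the IATI Registry.
SAMPLE = """<iati-organisations version="2.02">
  <iati-organisation xml:lang="en">
    <organisation-identifier>GB-CHC-1074937</organisation-identifier>
    <name><narrative>ADRA-UK</narrative></name>
    <reporting-org ref="GB-CHC-1074937" type="21">
      <narrative>ADRA-UK</narrative>
    </reporting-org>
  </iati-organisation>
</iati-organisations>"""


def extract_organisations(xml_text):
    """Return one dict per <iati-organisation> with identifier, type and names."""
    root = ET.fromstring(xml_text)
    records = []
    for org in root.iter("iati-organisation"):
        default_lang = org.get(XML_LANG, "")
        identifier = (org.findtext("organisation-identifier") or "").strip()
        reporting_org = org.find("reporting-org")
        org_type = reporting_org.get("type", "") if reporting_org is not None else ""
        names = {}
        for narrative in org.findall("name/narrative"):
            # A narrative without its own xml:lang inherits the organisation default.
            lang = narrative.get(XML_LANG, default_lang)
            names[lang] = (narrative.text or "").strip()
        records.append({"identifier": identifier, "type": org_type, "names": names})
    return records


records = extract_organisations(SAMPLE)
```

Keeping names as a language-to-text mapping leaves room for publishers that provide the name in more than one language.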

anjesh commented 7 years ago

The data process should have two steps: data initialization and subsequent data updates. The first step, data initialization, is done only once. The subsequent process updates and creates records based on the org-data published by the publishers. These subsequent changes will also be logged to see how the data is updated over time.


Table structure

The following database structure has been identified to hold the org-data.

Table: Organisation

| Fieldname | Type |
| --- | --- |
| id | int |
| identifier | varchar |
| type | int |
| country | varchar |
| is_publisher | tinyint |

Table: name

| Fieldname | Type |
| --- | --- |
| organisation_id | int |
| lang | varchar |
| name | varchar |
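The two tables above could be created along these lines. This is only a sketch using SQLite through Python's `sqlite3` module; the actual database engine is not decided in this thread, and the primary key, `UNIQUE`, `NOT NULL` and foreign key constraints are assumptions added for illustration.

```python
import sqlite3

# Column names and types follow the tables above; constraints are assumptions.
SCHEMA = """
CREATE TABLE organisation (
    id           INTEGER PRIMARY KEY,
    identifier   VARCHAR NOT NULL UNIQUE,
    type         INTEGER,
    country      VARCHAR,
    is_publisher TINYINT NOT NULL DEFAULT 0
);
CREATE TABLE name (
    organisation_id INTEGER NOT NULL REFERENCES organisation(id),
    lang            VARCHAR NOT NULL,
    name            VARCHAR NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one organisation and its English name.
conn.execute(
    "INSERT INTO organisation (identifier, type, country, is_publisher) VALUES (?, ?, ?, ?)",
    ("GB-CHC-1074937", 1, "GB", 1),
)
org_id = conn.execute(
    "SELECT id FROM organisation WHERE identifier = ?", ("GB-CHC-1074937",)
).fetchone()[0]
conn.execute("INSERT INTO name VALUES (?, ?, ?)", (org_id, "en", "ADRA-UK"))

# Join the two tables back together, one row per (organisation, language).
rows = conn.execute(
    "SELECT o.identifier, n.lang, n.name "
    "FROM organisation o JOIN name n ON n.organisation_id = o.id"
).fetchall()
```

Splitting names into their own table allows one organisation to carry a name per language without repeating the other fields.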

Step 1: Data Initialization Process

This is a one-time process. We will read the organisation-data XML files and the publishers' information from the registry. These two sets of data will be combined and a clean data set will be prepared. Only organisations with a valid organisation identifier will be included in the clean set.

This information will also be manually checked for errors during the initialization process. The data will be available to AidStream through the API and can also be browsed through the explorer.

For example, this might be the first set of data prepared (the two tables from above merged for simplicity):

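| id | identifier | type | country | is_publisher | lang | name |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | GB-CHC-1074937 | 1 | GB | True | en | ADRA-UK |
| 1 | 46002 | 1 | | True | en | African Development Bank |
| 1 | KE-NGC-3372 | 1 | KE | False | en | I CHOOSE LIFE AFRICA |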
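The combining step in Step 1 could look roughly like the sketch below. What counts as a "valid" organisation identifier is not defined in this thread, so the rule used here (non-empty, no spaces, only letters, digits, `-`, `_` and `.`) is purely an assumption, chosen so that both prefixed identifiers and plain codes such as `46002` pass. The record shape and the `is_valid_identifier` and `build_clean_set` helpers are hypothetical as well.

```python
import re

# Assumed validity rule; the real rule still has to be agreed on.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.\-]*")


def is_valid_identifier(identifier):
    return IDENTIFIER_RE.fullmatch(identifier) is not None


def build_clean_set(org_records, publisher_records):
    """Merge both sources by identifier, dropping records with invalid identifiers."""
    publisher_ids = {p["identifier"].strip() for p in publisher_records}
    clean = {}
    for record in [*org_records, *publisher_records]:
        identifier = record["identifier"].strip()
        if not is_valid_identifier(identifier):
            continue  # excluded from the clean set
        row = clean.setdefault(identifier, {
            "identifier": identifier,
            "type": None, "country": None, "lang": None, "name": None,
            "is_publisher": identifier in publisher_ids,
        })
        # The first non-empty value wins; organisation files are read first.
        for field in ("type", "country", "lang", "name"):
            if row[field] is None and record.get(field):
                row[field] = record[field]
    return list(clean.values())


org_files = [
    {"identifier": "GB-CHC-1074937", "type": 1, "country": "GB", "lang": "en", "name": "ADRA-UK"},
    {"identifier": "KE-NGC-3372", "type": 1, "country": "KE", "lang": "en", "name": "I CHOOSE LIFE AFRICA"},
    {"identifier": "not an id", "lang": "en", "name": "Unknown"},
]
publishers = [
    {"identifier": "GB-CHC-1074937", "name": "ADRA-UK"},
    {"identifier": "46002", "type": 1, "lang": "en", "name": "African Development Bank"},
]
clean = build_clean_set(org_files, publishers)
```

The output of a pass like this would still go through the manual check described above before being loaded for the API.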

Step 2: Data updates process

Data will be pulled regularly, both organisation-data and publishers' data, and checked for changes against the data prepared in the "data initialization process". The data from step 1 might also be updated manually, since changes can be suggested through the AidStream interface as well as the explorer.

If the changes in the organisation-data/publishers' data are recent, they will appear as suggestions to the moderator, who will approve or reject them.
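One possible shape for the update step, sketched under assumptions: freshly pulled data is compared with the stored data field by field, each difference becomes a pending suggestion, and every moderator decision is appended to a change log. The `diff_records` and `review` helpers, the tracked fields and the record shape are hypothetical, and the "recent" check is left out because the thread does not define how recency is measured.

```python
from datetime import datetime, timezone

TRACKED_FIELDS = ("type", "country", "name")  # assumed set of compared fields


def diff_records(stored, incoming):
    """Return pending suggestions: new organisations and changed fields."""
    suggestions = []
    for identifier, new in incoming.items():
        old = stored.get(identifier)
        if old is None:
            suggestions.append({"identifier": identifier, "action": "create",
                                "field": None, "old": None, "new": dict(new)})
            continue
        for field in TRACKED_FIELDS:
            if new.get(field) != old.get(field):
                suggestions.append({"identifier": identifier, "action": "update",
                                    "field": field, "old": old.get(field),
                                    "new": new.get(field)})
    return suggestions


def review(suggestion, stored, change_log, approve):
    """Apply the suggestion if approved; log the decision in both cases."""
    if approve:
        if suggestion["action"] == "create":
            stored[suggestion["identifier"]] = dict(suggestion["new"])
        else:
            stored[suggestion["identifier"]][suggestion["field"]] = suggestion["new"]
    change_log.append({**suggestion,
                       "status": "approved" if approve else "rejected",
                       "reviewed_at": datetime.now(timezone.utc).isoformat()})


stored = {"KE-NGC-3372": {"type": 1, "country": "KE", "name": "I CHOOSE LIFE AFRICA"}}
incoming = {
    "KE-NGC-3372": {"type": 1, "country": "KE", "name": "I Choose Life Africa"},
    "46002": {"type": 1, "country": None, "name": "African Development Bank"},
}
change_log = []
pending = diff_records(stored, incoming)
review(pending[0], stored, change_log, approve=True)   # moderator accepts the rename
review(pending[1], stored, change_log, approve=False)  # moderator rejects the new record
```

Logging rejected suggestions alongside approved ones keeps a record of what publishers changed, even when the clean data stays as it was.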

anjesh commented 7 years ago

https://github.com/younginnovations/iati-organisations-cleanup has the details and scripts for the auto-cleanup process. The auto-cleaned data now needs to be manually checked for other issues and then loaded into the database for the API.

The script still needs to be improved, as the existing auto-cleaned data shows further ways to improve the data.