Data (org data and publishers data) exploration for the explorer UI and API

anjesh commented 7 years ago

There are 400+ organisation XML files in the IATI Registry from 550+ publishers. The organisation data (name, identifier, type) from these files, together with the publishers' data, are to be scraped and explored to see how they could be cleaned up.

This exercise will guide us in preparing the API spec to be used in the org-module.
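As a rough starting point for the scraping, here is a minimal sketch of pulling identifier, type and names out of one organisation file. It assumes the IATI 2.x organisation layout (`organisation-identifier`, `name/narrative`, `reporting-org/@type`); older files may differ. The sample XML, the `XX-REG-12345` identifier and the `extract_organisations` helper are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an IATI 2.x organisation file (not real data).
SAMPLE_XML = """<iati-organisations version="2.03">
  <iati-organisation xml:lang="en">
    <organisation-identifier>XX-REG-12345</organisation-identifier>
    <name>
      <narrative>Example Organisation</narrative>
      <narrative xml:lang="fr">Organisation Exemple</narrative>
    </name>
    <reporting-org ref="XX-REG-12345" type="21">
      <narrative>Example Organisation</narrative>
    </reporting-org>
  </iati-organisation>
</iati-organisations>"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_organisations(xml_text):
    """Return one dict per <iati-organisation> with identifier, type and names."""
    root = ET.fromstring(xml_text)
    records = []
    for org in root.iter("iati-organisation"):
        default_lang = org.get(XML_LANG, "")
        identifier = (org.findtext("organisation-identifier") or "").strip()
        # Assumption: the organisation type is taken from reporting-org/@type.
        reporting = org.find("reporting-org")
        org_type = reporting.get("type", "") if reporting is not None else ""
        names = [
            {"lang": n.get(XML_LANG, default_lang), "name": (n.text or "").strip()}
            for n in org.findall("name/narrative")
        ]
        records.append({"identifier": identifier, "type": org_type, "names": names})
    return records


records = extract_organisations(SAMPLE_XML)
```

Each name carries its own language so that the result maps directly onto one organisation with several names.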

anjesh commented 7 years ago

Data processing should have two steps: data initialization and subsequent data updates. The first step, data initialization, is done one time only. The subsequent process updates and creates records based on the org-data published by the publishers. These subsequent changes will also be logged to show how the data is updated over time.

link to above diagram

Table structure

The following db structure is identified to maintain the org-data.

Table: Organisation

Fieldname	Type
id	int
identifier	varchar
type	int
country	varchar
is_publisher	tinyint

Table: name

Fieldname	Type
organisation_id	int
lang	varchar
name	varchar
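The two tables above could be sketched as follows, using an in-memory SQLite database for illustration. The column names and types come from the tables; the constraints (primary key, `UNIQUE`, `NOT NULL`, the foreign key) and the choice of SQLite are assumptions, not something decided in this thread.

```python
import sqlite3

# Columns follow the table structure above; constraints are assumptions.
SCHEMA = """
CREATE TABLE organisation (
    id           INTEGER PRIMARY KEY,
    identifier   VARCHAR NOT NULL UNIQUE,
    type         INTEGER,
    country      VARCHAR,
    is_publisher TINYINT NOT NULL DEFAULT 0
);
CREATE TABLE name (
    organisation_id INTEGER NOT NULL REFERENCES organisation(id),
    lang            VARCHAR NOT NULL,
    name            VARCHAR NOT NULL
);
"""


def create_db():
    """Create an in-memory database holding the two org-data tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
cur = conn.execute(
    "INSERT INTO organisation (identifier, type, country, is_publisher) "
    "VALUES (?, ?, ?, ?)",
    ("GB-CHC-1074937", 1, "GB", 1),
)
conn.execute(
    "INSERT INTO name (organisation_id, lang, name) VALUES (?, ?, ?)",
    (cur.lastrowid, "en", "ADRA-UK"),
)

# Join the two tables back into a single merged view.
rows = conn.execute(
    "SELECT o.identifier, n.lang, n.name "
    "FROM organisation o JOIN name n ON n.organisation_id = o.id"
).fetchall()
```

Keeping names in a separate table lets one organisation carry a name per language without repeating the identifier, type and country.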

Step 1: Data Initialization Process

This is a one-time process. We will read the organisation-data XML files and the publishers' information from the registry. These two sets of data will be combined and a clean data set will be prepared. Organisation data will be included in the clean set only if it has a valid organisation identifier:

identifier is present in the organisation-codelist maintained in iatistandard
the organisation-registration-agency component of the identifier is present in the org-id.guide list
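The two checks above might look roughly like this. The sets below are tiny hypothetical stand-ins for the organisation codelist and the org-id.guide list, filled with values taken from the example identifiers in this thread. Treating an identifier as valid when either check passes is an assumption; the rule could just as well require something stricter.

```python
# Hypothetical stand-ins for the two reference lists (not the full lists).
ORG_CODELIST = {"46002"}
REGISTRATION_AGENCIES = {"GB-CHC", "KE-NGC"}


def find_agency(identifier, agencies=REGISTRATION_AGENCIES):
    """Return the registration-agency prefix of the identifier, or None."""
    # Try longer codes first so a longer agency code wins over a shorter one.
    for agency in sorted(agencies, key=len, reverse=True):
        prefix = agency + "-"
        if identifier.startswith(prefix) and len(identifier) > len(prefix):
            return agency
    return None


def check_identifier(identifier, codelist=ORG_CODELIST, agencies=REGISTRATION_AGENCIES):
    """Run both checks and report each result separately."""
    identifier = identifier.strip()
    return {
        "in_codelist": identifier in codelist,
        "agency": find_agency(identifier, agencies),
    }


def is_valid(identifier):
    """Assumption: passing either check is enough to count as valid."""
    result = check_identifier(identifier)
    return result["in_codelist"] or result["agency"] is not None
```

Reporting the two checks separately keeps the information needed for the manual review, for example to see which rule let an identifier through.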

This information will also be manually checked for errors during the initialization process. The data will be available to AidStream through the API and could also be browsed through the explorer.

For example, this might be the first set of data prepared (the two tables above merged for simplicity):

id	identifier	type	country	is_publisher	lang	name
1	GB-CHC-1074937	1	GB	True	en	ADRA-UK
2	46002	1		True	en	African Development Bank
3	KE-NGC-3372	1	KE	False	en	I CHOOSE LIFE AFRICA

Step 2: Data updates process

Data will be pulled regularly, both organisation-data and publishers' data, and checked for any changes from the data prepared in the "data initialization process". Data from step 1 might also be updated manually, since changes can be suggested through the AidStream interface as well as the explorer.

If the changes in the organisation-data/publishers' data are recent, then these will appear as suggestions to the moderator, who will approve or reject these changes.
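The update step could be sketched along these lines: compare freshly pulled records against the stored ones, turn every difference into a pending suggestion, and log the moderator's decision. All names here (`TRACKED_FIELDS`, `build_suggestions`, `moderate`) and the dict-based storage are hypothetical. This sketch treats any detected difference as a suggestion; what counts as "recent" is not defined above and is left out.

```python
from datetime import datetime, timezone

# Hypothetical list of fields compared between stored and incoming records.
TRACKED_FIELDS = ("type", "country", "is_publisher", "name")


def diff_record(stored, incoming, fields=TRACKED_FIELDS):
    """Return {field: (old, new)} for every tracked field that differs."""
    return {
        f: (stored.get(f), incoming.get(f))
        for f in fields
        if stored.get(f) != incoming.get(f)
    }


def build_suggestions(stored_by_id, incoming_records, now=None):
    """Create pending suggestions for new or changed organisations."""
    now = now or datetime.now(timezone.utc)
    suggestions = []
    for rec in incoming_records:
        stored = stored_by_id.get(rec["identifier"])
        action = "create" if stored is None else "update"
        changes = diff_record(stored or {}, rec)
        if changes:
            suggestions.append({
                "identifier": rec["identifier"],
                "action": action,
                "changes": changes,
                "status": "pending",
                "detected_at": now.isoformat(),
            })
    return suggestions


def moderate(stored_by_id, suggestion, approve, log):
    """Apply an approved suggestion, then log the decision either way."""
    suggestion["status"] = "approved" if approve else "rejected"
    if approve:
        record = stored_by_id.setdefault(
            suggestion["identifier"], {"identifier": suggestion["identifier"]}
        )
        for field, (_old, new) in suggestion["changes"].items():
            record[field] = new
    log.append(suggestion)


# Usage with a record from the example table and a hypothetical change.
stored = {"KE-NGC-3372": {"identifier": "KE-NGC-3372", "type": 1, "country": "KE",
                          "is_publisher": False, "name": "I CHOOSE LIFE AFRICA"}}
incoming = [dict(stored["KE-NGC-3372"], is_publisher=True)]
change_log = []
pending = build_suggestions(stored, incoming)
moderate(stored, pending[0], approve=True, log=change_log)
```

Because every suggestion keeps the old and new values along with its status, the log doubles as the history of how the data changed over time.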

anjesh commented 7 years ago

https://github.com/younginnovations/iati-organisations-cleanup has the details and scripts for the auto-cleanup process. The auto-cleaned data now needs to be manually checked for other issues and then loaded into the database for the API.

The script still needs to be improved, as the existing auto-cleaned data shows further ways to improve the data.
