wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Update Reach Pipelines to populate Citations to PG12 #504

Open jdu opened 4 years ago

jdu commented 4 years ago

We're using the warehouse schema to house data for our processes across Reach and the data-oriented applications, as there's some overlap between them.

The below is rough, but should get us about 90% there.

Overview

[ ] Create a new table in postgres to hold citation information for a given publication.

CREATE TABLE IF NOT EXISTS warehouse.reach_citations (
    uuid UUID PRIMARY KEY NOT NULL DEFAULT uuid_generate_v4(),
    epmc_id UUID NOT NULL REFERENCES warehouse.epmc_metadata (uuid),
    policies UUID[] NOT NULL DEFAULT '{}',
    created TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
    modified TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc')
);

[ ] Update the Reach pipelines to populate citation data into the new reach_citations table above.

[ ] Update the Reach citation search to use an array_agg over JSONB rows to fetch the policies related to a given publication search result.

SELECT
    pub.pmcid,
    pub.pmid,
    pub.title,
    (SELECT array_to_json(array_agg(row_to_json(t))) FROM (
        SELECT * FROM warehouse.reach_policies AS rps
        WHERE rps.uuid = ANY(pub_rel.policies)
    ) t) AS citations
FROM warehouse.epmc_metadata AS pub
LEFT JOIN warehouse.reach_citations AS pub_rel ON pub_rel.epmc_id = pub.uuid
WHERE <search_query>

[ ] Update the YAML pipelines for the various Reach workflows with any changes necessary to run this.
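For the pipeline step that populates citations, something like the following upsert per publication should work. This is only a sketch, not code from the Reach repo: the function name is hypothetical, and it assumes a UNIQUE constraint on epmc_id so ON CONFLICT can merge the policy arrays.

```python
# SQL for inserting/extending one publication's citation row.
# Assumes reach_citations has a UNIQUE constraint on epmc_id.
UPSERT_CITATIONS = """
    INSERT INTO warehouse.reach_citations (epmc_id, policies)
    VALUES (%s, %s)
    ON CONFLICT (epmc_id) DO UPDATE
    SET policies = (
            SELECT ARRAY(
                SELECT DISTINCT unnest(
                    warehouse.reach_citations.policies || EXCLUDED.policies
                )
            )
        ),
        modified = (NOW() AT TIME ZONE 'utc')
"""

def upsert_citation(cursor, epmc_uuid, policy_uuids):
    """Insert or extend the citation row for one publication.

    `cursor` is an open psycopg2 cursor; the caller owns the
    transaction and commits when the batch is done.
    """
    cursor.execute(UPSERT_CITATIONS, (epmc_uuid, policy_uuids))
```

The DISTINCT unnest keeps the policies array de-duplicated when the same publication turns up in multiple pipeline runs.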

Notes

You can deploy updates to datalabs-infra-utils if necessary by:

  1. Exporting your AWS credentials
  2. cd pipelines
  3. make docker-push

Most of the workflows are configured to run the latest image from the repo; you can find them in k8s/datalabs0/argo/workflows/.

Export (Download CSV)

For CSV downloads from the individual UI components, use Postgres's COPY semantics through psycopg2 to stream a bulk data set in one pass, rather than iterating over a cursor and writing chunks to an output file.
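As a sketch of that approach (the function name here is illustrative, not from the Reach codebase), psycopg2's copy_expert hands the whole result set to Postgres's COPY, which serialises the CSV server-side instead of iterating rows in Python:

```python
def export_csv(conn, query, outfile):
    """Stream the results of `query` to `outfile` as CSV via COPY.

    `conn` is an open psycopg2 connection; `outfile` is a writable
    file object. COPY ... TO STDOUT does the CSV serialisation in
    Postgres, so Python never loops over individual rows.
    """
    copy_sql = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(query)
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql, outfile)
```

A UI download handler would then call e.g. export_csv(conn, "SELECT pmcid, title FROM warehouse.epmc_metadata", response_stream) and let the web framework stream the file out.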