wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

SQLite / Document ID Refactor for Scrapers #407

Open jdu opened 4 years ago

jdu commented 4 years ago

This is a fairly sizeable refactor target which covers a few different issues.

Warning: Before anyone starts on this, note that this issue might be made redundant by the architectural changes proposed in #419.

General Design Notes

Refactor the scraper pipelines to store their data in SQLite databases configured with WAL (write-ahead logging) and the JSON1 extension.

The idea is to move from a JSON manifest file to a relational database. This will provide the following:

  1. The ability to easily query on multiple properties of a scraped document (DocumentId, file hash, source URL, title, etc.) to decide whether we need to use that document.
  2. Lower overhead, since we no longer need to load the whole manifest into memory to edit it and then flush it back to disk.
  3. The ability to easily store alternate IDs for a given document, using JSON1 arrays in the schema such as alternate_hashes, alternate_dids and alternate_urls. This gives each document multiple identifiers, which reduces duplication caused by file hash changes or a missing DID.
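
As a rough illustration of the points above, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the project's actual schema. The `documents` table, every column other than `alternate_hashes`, `alternate_dids` and `alternate_urls`, and the helper names `open_manifest`, `find_by_hash` and `add_alternate_hash` are assumptions made for illustration. It also assumes the underlying SQLite build has JSON1 available, which is needed for `json_each`.

```python
import json
import sqlite3

# Hypothetical schema: only the alternate_* column names come from the issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id               INTEGER PRIMARY KEY,
    document_id      TEXT,                      -- the DID, may be missing
    file_hash        TEXT NOT NULL,
    source_url       TEXT NOT NULL,
    title            TEXT,
    alternate_hashes TEXT NOT NULL DEFAULT '[]', -- JSON array
    alternate_dids   TEXT NOT NULL DEFAULT '[]', -- JSON array
    alternate_urls   TEXT NOT NULL DEFAULT '[]'  -- JSON array
);
"""


def open_manifest(path):
    """Open (or create) the manifest database with WAL journaling enabled."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def find_by_hash(conn, file_hash):
    """Return the row id of a document matching the primary or any alternate hash."""
    row = conn.execute(
        """
        SELECT d.id
        FROM documents AS d
        WHERE d.file_hash = ?
           OR EXISTS (
               SELECT 1 FROM json_each(d.alternate_hashes) AS alt
               WHERE alt.value = ?
           )
        LIMIT 1
        """,
        (file_hash, file_hash),
    ).fetchone()
    return row[0] if row else None


def add_alternate_hash(conn, doc_id, new_hash):
    """Append a hash to a document's alternate_hashes array if not already present."""
    (raw,) = conn.execute(
        "SELECT alternate_hashes FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    hashes = json.loads(raw)
    if new_hash not in hashes:
        hashes.append(new_hash)
        conn.execute(
            "UPDATE documents SET alternate_hashes = ? WHERE id = ?",
            (json.dumps(hashes), doc_id),
        )
        conn.commit()
```

With something like this, a scraper that re-downloads a file whose hash has changed could call `find_by_hash` first, then record the new hash with `add_alternate_hash` instead of creating a duplicate entry. The same pattern would apply to `alternate_dids` and `alternate_urls`.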

Tasks

Other Notes

ivyleavedtoadflax commented 4 years ago

This should resolve https://github.com/wellcometrust/reach/issues/48

kristinenielsen commented 4 years ago

This is related to the architecture work and will be rewritten.