wellcomecollection / catalogue-pipeline

The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License

catalogue-pipeline


The catalogue pipeline creates the search index for our unified collections search. It populates an Elasticsearch index with data that our catalogue API can then read. This lets users search data from all our catalogues in one place, rather than searching multiple systems that each have a different view of the data.

Requirements

The catalogue pipeline is designed to:

High-level design

We have a series of "adapters" that fetch records from our source systems. The adapters are responsible for staying up-to-date with changes in the source systems.

The adapters feed a transformation pipeline, which transforms source records into a common model, adds a pipeline identifier, and combines records from different systems. The structure and logic of the transformation pipeline evolve over time, as we find new and better ways to transform the data.

Once the transformation pipeline has finished processing the records, it stores them in a search index, which can be read by the catalogue API.
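The flow described above (adapters, then transform, identify, combine, and index) can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: every name here (`SourceRecord`, `Work`, `mint_pipeline_id`, `same_as`) is hypothetical, the identifier scheme is an arbitrary hash chosen for the example, and a plain dictionary stands in for the search index.

```python
from dataclasses import asdict, dataclass
import hashlib
from typing import Dict, List


@dataclass(frozen=True)
class SourceRecord:
    """A record as fetched by an adapter (hypothetical shape)."""
    source_system: str
    source_id: str
    payload: dict


@dataclass
class Work:
    """The common model that every source record is transformed into (assumed)."""
    pipeline_id: str
    title: str
    source_ids: List[str]


def mint_pipeline_id(source_system: str, source_id: str) -> str:
    # Assumption: a stable ID derived from the source identifier.
    key = f"{source_system}/{source_id}".encode()
    return hashlib.sha256(key).hexdigest()[:8]


def transform(record: SourceRecord) -> Work:
    """Map one source record into the common model and add a pipeline ID."""
    return Work(
        pipeline_id=mint_pipeline_id(record.source_system, record.source_id),
        title=record.payload.get("title", "").strip(),
        source_ids=[f"{record.source_system}/{record.source_id}"],
    )


def merge(works: List[Work], same_as: Dict[str, str]) -> List[Work]:
    """Combine works from different systems.

    `same_as` maps a source identifier to the source identifier of the
    record it should be folded into (a stand-in for real matching logic).
    """
    by_source = {work.source_ids[0]: work for work in works}
    merged: Dict[str, Work] = {}
    for source_id, work in by_source.items():
        target = by_source.get(same_as.get(source_id, source_id), work)
        canonical = merged.setdefault(
            target.pipeline_id, Work(target.pipeline_id, target.title, [])
        )
        canonical.source_ids.append(source_id)
    return list(merged.values())


def run_pipeline(records: List[SourceRecord], same_as: Dict[str, str]) -> Dict[str, dict]:
    """Transform, merge, and store records in an in-memory 'search index'."""
    search_index: Dict[str, dict] = {}
    for work in merge([transform(r) for r in records], same_as):
        search_index[work.pipeline_id] = asdict(work)
    return search_index
```

For example, calling `run_pipeline` with one record from a "library" system and one from an "archive" system, plus `same_as={"archive/9": "library/1"}`, would produce a single indexed document that lists both source identifiers.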

The catalogue pipeline runs entirely in AWS, with no on-premises infrastructure required.

Usage

We always have at least one pipeline populating the currently live search index, but we may have more than one pipeline running at a time.

Running multiple pipelines means we can try experiments or breaking changes in a new pipeline, and keep them isolated from the live search index (and the public API). Over time, newer pipelines replace older pipelines, and the older pipelines are deleted.
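To illustrate the isolation between pipelines, here is a small Python sketch in which each pipeline owns its own index and readers only ever see the one marked as live. The `PipelineRegistry` class, its method names, and the pipeline names are all hypothetical and say nothing about how the real pipelines are deployed or switched over.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pipeline:
    """One pipeline and the index it populates (in-memory stand-in)."""
    name: str
    index: Dict[str, dict] = field(default_factory=dict)


class PipelineRegistry:
    """Tracks running pipelines and which one backs the live index."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, Pipeline] = {}
        self._live: Optional[str] = None

    def create(self, name: str) -> Pipeline:
        if name in self._pipelines:
            raise ValueError(f"pipeline {name!r} already exists")
        pipeline = Pipeline(name)
        self._pipelines[name] = pipeline
        return pipeline

    def promote(self, name: str) -> None:
        # Point readers at a different pipeline's index.
        if name not in self._pipelines:
            raise KeyError(name)
        self._live = name

    def retire(self, name: str) -> None:
        # Older pipelines can be deleted, but never the live one.
        if name == self._live:
            raise ValueError("cannot retire the live pipeline")
        del self._pipelines[name]

    def read_live(self, doc_id: str) -> Optional[dict]:
        # Readers only ever see the live index.
        if self._live is None:
            return None
        return self._pipelines[self._live].index.get(doc_id)

    def names(self) -> List[str]:
        return sorted(self._pipelines)
```

In this sketch, writes to a newly created pipeline's index are invisible to `read_live` until `promote` is called for it, which mirrors the idea of keeping experiments away from the live index and the public API.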

We publish our source code so that other people can learn from it, but it's very unlikely anybody would want to run it themselves. It contains a lot of Wellcome-specific logic, and would need extensive modification to be useful elsewhere.

Development

See docs/developers.md.

License

MIT.