wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License
13 stars 2 forks source link

Trigger path concatenator from within the matcher/merger subsystem #2527

Open paul-butcher opened 10 months ago

paul-butcher commented 10 months ago

Path concatenator is currently in the relation_embedder subsystem, but this is the wrong place for it as it writes to works-merged, so it belongs in the matcher/merger subsystem.

It could be triggered as part of the sendWorkOrImage function in the merger.

paul-butcher commented 6 months ago

This is a non-trivial refactor, so I'll put in some of my thinking now so I don't have to relearn it later.

  1. move the path_concatenator code - it should be in matcher_merger
  2. move the triggering code, it is currently in router, it should be in sendWorkOrImage
  3. change the downstream notification to send work ids to merger output.

Perhaps this can be better expressed without being a separate stage, and instead being run as part of the merger?

paul-butcher commented 6 months ago

It needs to be a (sub)stage after merger, as it relies on the right data being in the merged database.

paul-butcher commented 6 months ago

There is a minor conflict between efficiency and purity here. The current incarnation knows that there is a collectionPath in each record it messes with, so it can notify the relationEmbedder via the pathSender.

However, I want us to be able to describe each full stage (either individual standalone apps, or a subsystem like relation_embedder or matcher_merger) as accepting and sending work ids.

As a result, the concatenator, as the final stage within the matcher_merger subsystem, should notify downstream with work ids, which will then be used by the merger to retrieve the work and notify the next stage in its own subsystem with a path.

This inefficiency is pretty minor, and it's better to be clear about boundaries.

paul-butcher commented 6 months ago

I think it also needs to accept a work id. That way it knows what to notify downstream about if nothing changes.