wellcomecollection / concepts-pipeline

Some sort of ETL pipeline for concepts in the Wellcome Collection catalogue
MIT License

Use collapse in NotInIndexFlow #115

Closed by paul-butcher 1 year ago

paul-butcher commented 1 year ago

To make NotInIndexFlow even more efficient, it should collapse on the ids so that it only returns one record per input identifier.
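A minimal sketch of what collapsing on the ids could look like, assuming the index is Elasticsearch and the lookup is a `terms` query. The function name `build_not_in_index_query` and the field name `canonicalId` are hypothetical, not taken from the pipeline code; `collapse` itself is a standard Elasticsearch search option, which requires a single-valued keyword or numeric field.

```python
def build_not_in_index_query(ids, id_field="canonicalId"):
    """Build a search body that asks for at most one hit per requested id.

    `id_field` is a hypothetical field name used for illustration only.
    """
    # De-duplicate the batch so `size` reflects the number of distinct ids.
    unique_ids = sorted(set(ids))
    return {
        # Match any record whose id field contains one of the requested ids.
        "query": {"terms": {id_field: unique_ids}},
        # Collapse hits so that each distinct id value yields a single hit.
        "collapse": {"field": id_field},
        # With collapsing, one hit per id is the most we can get back.
        "size": len(unique_ids),
    }


body = build_not_in_index_query(["abc123", "def456", "abc123"])
```

With collapsing in place, `size` can safely equal the number of distinct ids in the batch, since no id can contribute more than one hit.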

paul-butcher commented 1 year ago

Currently, NotInIndexFlow may return some false negatives. If any id in a batch matches multiple records in the database (which is now possible because of the way we handle sameAs relationships), the extra records can knock another identifier's record off the result list, because the query expects at most as many records back as ids requested.

NotInIndexFlow is itself just an optimisation (though a vital one for bulk aggregator runs, which would be far too slow without it), so perfecting it, in the sense of giving it perfect recall, is unnecessary. This is therefore not a problem that needs to be fixed.

It needs to not return any false positives, but as long as the number of false negatives it returns is low enough compared to the true positives, then no worries.
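The mechanism described in this comment can be illustrated with a small in-memory simulation. This is only a sketch of the scenario, not the real flow: the `index` list, the record shape, and the `not_in_index` helper are all hypothetical, and the result limit is assumed to equal the number of ids requested.

```python
def collapse(hits):
    """Keep only the first hit for each id, preserving order."""
    first_by_id = {}
    for record in hits:
        first_by_id.setdefault(record["id"], record)
    return list(first_by_id.values())


def not_in_index(index, ids, collapsed):
    """Return the requested ids for which no record came back.

    The simulated query returns at most len(ids) records, mirroring the
    assumption that one record per requested id is enough.
    """
    wanted = list(dict.fromkeys(ids))  # de-duplicate, keep order
    wanted_set = set(wanted)
    matches = [r for r in index if r["id"] in wanted_set]
    if collapsed:
        matches = collapse(matches)
    found = {r["id"] for r in matches[: len(wanted)]}
    return [i for i in wanted if i not in found]


# Hypothetical index in which id "a" matches three records.
index = [
    {"id": "a", "label": "A1"},
    {"id": "a", "label": "A2"},
    {"id": "a", "label": "A3"},
    {"id": "b", "label": "B"},
]

without_collapse = not_in_index(index, ["a", "b", "z"], collapsed=False)
with_collapse = not_in_index(index, ["a", "b", "z"], collapsed=True)
print(without_collapse)  # ['b', 'z']: "b" is indexed but was crowded out
print(with_collapse)     # ['z']
```

Without collapsing, the three records for "a" fill the whole result window, so "b" is reported as absent even though it is indexed. Collapsing restores one hit per id.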

paul-butcher commented 1 year ago

Actually, it's not possible to match more than one record, because each canonicalId only appears in one record in catalogue concepts. Incoming records with the same source id are all merged into a single record with multiple canonical ids.

It is only in the next stage that they become multiple records.
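As a rough illustration of the merging described here, the sketch below groups incoming records by source id so that each merged record carries a list of canonical ids. The field names `sourceId`, `canonicalId`, and `canonicalIds`, and the helper `merge_by_source_id`, are assumptions made for the example and may not match the real data model.

```python
from collections import defaultdict


def merge_by_source_id(incoming):
    """Merge records that share a source id into one record each.

    Each merged record lists every canonical id seen for that source id.
    """
    canonical_ids_by_source = defaultdict(list)
    for record in incoming:
        ids = canonical_ids_by_source[record["sourceId"]]
        if record["canonicalId"] not in ids:
            ids.append(record["canonicalId"])
    return [
        {"sourceId": source_id, "canonicalIds": ids}
        for source_id, ids in canonical_ids_by_source.items()
    ]


# Hypothetical incoming records: two share the same source id.
incoming = [
    {"sourceId": "src-1", "canonicalId": "aaaa1111"},
    {"sourceId": "src-1", "canonicalId": "bbbb2222"},
    {"sourceId": "src-2", "canonicalId": "cccc3333"},
]

merged = merge_by_source_id(incoming)

# Count how many merged records each canonical id appears in.
appearances = defaultdict(int)
for record in merged:
    for canonical_id in record["canonicalIds"]:
        appearances[canonical_id] += 1
```

In this toy data every canonical id ends up in exactly one merged record, so a lookup by canonical id cannot match more than one record at this stage.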