wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License

Six METS records are failing in the transformer #2546

Open paul-butcher opened 8 months ago

paul-butcher commented 8 months ago

The transformer appears to be failing on these records because it cannot find the corresponding data in storage. Each one yields a message similar to this in the logs:

ERROR w.p.transformer.TransformerWorker - TransformerWorker: TransformerError on MetsFileWithImages(s3://wellcomecollection-storage/digitised/b20442117,v2/data/b20442117.xml,List(v2/data/b20442117_0001.xml, v1/data/b20442117_0002.xml, v1/data/b20442117_0003.xml),2023-02-09T12:58:38.913Z,2) with Version(b20442117,2) (software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist.

b20442117, b31360051, b24875831, b2170594x, b21705938, b24873342
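To see which object is actually missing, it can help to pull the expected S3 keys out of the log line. The following is a rough sketch only: the regular expression is inferred from the single log line quoted above, the function name `expected_keys` is hypothetical, and it assumes that each file path is relative to the `s3://` prefix at the start of `MetsFileWithImages(...)`.

```python
import re

# Shape inferred from the one log line quoted in this issue; other
# transformer log lines may not match it.
METS_PATTERN = re.compile(
    r"MetsFileWithImages\("
    r"s3://(?P<bucket>[^/]+)/(?P<prefix>[^,]+),"  # bucket and root prefix
    r"(?P<root>[^,]+),"                           # root METS file
    r"List\((?P<manifestations>[^)]*)\)"          # manifestation files
)


def expected_keys(log_line):
    """Return (bucket, keys) that the log line implies should exist in S3.

    Assumption: every file path is relative to the root prefix.
    Returns (None, []) if the line does not match the inferred shape.
    """
    match = METS_PATTERN.search(log_line)
    if match is None:
        return None, []
    prefix = match.group("prefix")
    files = [match.group("root")]
    files += [
        name.strip()
        for name in match.group("manifestations").split(",")
        if name.strip()
    ]
    return match.group("bucket"), [f"{prefix}/{name}" for name in files]


if __name__ == "__main__":
    line = (
        "MetsFileWithImages(s3://wellcomecollection-storage/digitised/b20442117,"
        "v2/data/b20442117.xml,List(v2/data/b20442117_0001.xml, "
        "v1/data/b20442117_0002.xml, v1/data/b20442117_0003.xml),"
        "2023-02-09T12:58:38.913Z,2)"
    )
    bucket, keys = expected_keys(line)
    for key in keys:
        print(f"s3://{bucket}/{key}")
```

Each resulting key could then be checked against the bucket to find the one that triggers `NoSuchKeyException`. Note that in the quoted line the manifestation files are split between `v1` and `v2`.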

They are all b-numbered (and therefore Goobi) METS files.

This has occurred in the new 02-01 pipeline, so it may be due to recent changes to handle Archivematica METS. These records may previously have been filtered out or guarded against, and those changes have somehow removed that protection. Alternatively, it may also happen in previous pipelines and have gone unnoticed so far.

paul-butcher commented 8 months ago

None of these values are present in the logs for January, so this must be new to 02-01.

Actually, I've just searched again, and I have found them all, with the same error, on 2024-01-09. Evidently I set the date incorrectly when I first looked.

paul-butcher commented 8 months ago

None of these b-numbers return any results from a search on the website (currently pointing to the 01-09 pipeline).

paul-butcher commented 8 months ago

Apart from b31360051 (f5ndv2hu), these are all DELETED. I suspect the new pipeline is erroneously trying to create a full record for a DELETED one.
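If that suspicion is right, the missing protection would amount to a check that short-circuits before any storage lookup. The sketch below is purely illustrative Python, not the transformer's actual Scala code: `SourceRecord`, its `deleted` flag, `transform` and `fetch_mets` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SourceRecord:
    b_number: str
    deleted: bool  # hypothetical flag; the real payload may encode this differently


def transform(record: SourceRecord, fetch_mets: Callable[[str], str]) -> dict:
    """Build a work from a source record.

    A deleted record yields a minimal deleted work and never touches
    storage, so a missing object cannot raise for it.
    """
    if record.deleted:
        return {"id": record.b_number, "type": "Deleted"}
    # Only non-deleted records reach the storage lookup.
    xml = fetch_mets(record.b_number)
    return {"id": record.b_number, "type": "Visible", "xml": xml}
```

With a guard of this kind, a DELETED record would never reach the S3 lookup, so only a live record such as b31360051 could still fail this way.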

paul-butcher commented 8 months ago

This remains a problem, but I don't think it's panic-worthy, and certainly not a blocker for deploying the 02-01 pipeline to the API.