wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License

Extract CALM SDB_URLs as merge candidates #2481

Closed paul-butcher closed 4 months ago

paul-butcher commented 10 months ago
The CALM record has the UUID of the METS record:

"SDB_URL": [
  "dd29d9d6-8f61-48e3-b84d-0f5a1d12d2f5"
],

That UUID is present in the METS premis:object (in mets:mets/mets:dmdSec[@ID="dmdSec_1"]/mets:mdWrap/mets:xmlData):

<premis:object xmlns:premis="http://www.loc.gov/premis/v3" xsi:type="premis:intellectualEntity" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" version="3.0">
  <premis:objectIdentifier>
    <premis:objectIdentifierType>UUID</premis:objectIdentifierType>
    <premis:objectIdentifierValue>dd29d9d6-8f61-48e3-b84d-0f5a1d12d2f5</premis:objectIdentifierValue>
  </premis:objectIdentifier>
  <premis:originalName>SAFIH_B_2_7_10-dd29d9d6-8f61-48e3-b84d-0f5a1d12d2f5</premis:originalName>
</premis:object>

So the link may be able to go in that direction. I don't think we extract that value at all yet, but we could.
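As a rough illustration of what extracting that value could look like, here is a minimal Python sketch using only the standard library. It is not the pipeline's code: the function name `extract_premis_uuid` is hypothetical, the METS namespace URI is assumed to be the standard `http://www.loc.gov/METS/`, and the XPath simply follows the location quoted above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URIs: premis is taken from the snippet above;
# the METS URI is an assumption (the standard METS namespace).
NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "http://www.loc.gov/premis/v3",
}


def extract_premis_uuid(mets_xml: str) -> Optional[str]:
    """Return the UUID objectIdentifierValue from dmdSec_1, or None if absent.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(mets_xml)  # root is expected to be mets:mets
    path = (
        "mets:dmdSec[@ID='dmdSec_1']/mets:mdWrap/mets:xmlData/"
        "premis:object/premis:objectIdentifier"
    )
    for identifier in root.findall(path, NS):
        id_type = identifier.findtext("premis:objectIdentifierType", namespaces=NS)
        value = identifier.findtext("premis:objectIdentifierValue", namespaces=NS)
        # Only identifiers explicitly typed as UUID are of interest here.
        if id_type == "UUID" and value:
            return value.strip()
    return None
```

Returning `None` rather than raising keeps METS files without such an identifier from being treated as errors; whether that is the right behaviour for the real transformer is an open question.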

Originally posted by @paul-butcher in https://github.com/wellcomecollection/catalogue-pipeline/issues/2476#issuecomment-1801884816

paul-butcher commented 8 months ago

It might be SDB_REF that I want, rather than SDB_URL.

https://wellcomecollection.org/works/mgtqmung is GC/253/A/34/9, which has an SDB_REF of 4182633e-8cf0-4f2a-9065-5740c23d3a12 and an SDB_URL of http://sdb.wellcome.ac.uk/explorer/explorer.html#prop:7&4182633e-8cf0-4f2a-9065-5740c23d3a12

However, looking at the METS for that, the UUID does not correspond!

IIIF: https://iiif.wellcomecollection.org/presentation/collections/archives/GC/253/A/34/9
Manifestation: https://iiif.wellcomecollection.org/dash/Manifestation/GC253_A_34_9
METS: https://iiif.wellcomecollection.org/dash/Peek/XmlView/GC253_A_34_9/METS.a765b7de-d8dd-45fe-94c9-8c4632e02178.xml

I wonder if there was something anomalous about SAFIH_B_2_7_10.

paul-butcher commented 8 months ago

Having reconsidered this, I think that using the CALM RefNo is probably the right approach. This would also be consistent with the way existing METS files link with Sierra.

However, the CALM RefNo is not the identifier used for the item derived from a CALM record, so this may present a challenge.

paul-butcher commented 8 months ago

Actually, panic over. It looks like my initial plan stands. SDB_URL is documented as containing the Archivematica UUID.

This means that it is just that the examples I was looking at have not yet been updated accordingly.

An advantage of this usage is that if the field is not a UUID, then I can ignore it.
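The "ignore it if it is not a UUID" rule could be sketched roughly as follows. This is an assumption-laden illustration, not the pipeline's implementation: `uuid_merge_candidates` is a hypothetical name, and the input is assumed to be the list of strings found under a CALM record's SDB_URL field.

```python
import uuid
from typing import Iterable, List


def uuid_merge_candidates(sdb_urls: Iterable[str]) -> List[str]:
    """Keep only SDB_URL values that are canonical UUID strings.

    Hypothetical helper: anything else (old explorer URLs, empty
    strings, free text) is silently ignored.
    """
    candidates = []
    for raw in sdb_urls:
        value = raw.strip()
        try:
            parsed = uuid.UUID(value)
        except ValueError:
            continue  # not a UUID at all, so not a merge candidate
        # uuid.UUID is lenient (it accepts braces, missing hyphens, urn: prefixes),
        # so additionally require the canonical hyphenated form.
        if str(parsed) == value.lower():
            candidates.append(str(parsed))
    return candidates
```

With this check, a bare value such as `dd29d9d6-8f61-48e3-b84d-0f5a1d12d2f5` is kept, while an older-style value like the `http://sdb.wellcome.ac.uk/explorer/explorer.html#prop:7&...` URL seen earlier in this thread is dropped.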

paul-butcher commented 8 months ago

Hmm... SDB_URL: 114edb90-809b-411e-80ab-d4c2bb241c30 is present in two CALM records: PP/SUL/B/7/2/6/1 and PP/SUL/B/7/2/6/2 (8c32507f-7e32-4f1d-818e-fbf58c9072a9 and 38ba14ca-e844-4a1a-834e-637e94031d72). I wonder if that's going to cause problems.

It would cause those two to merge, but maybe they should, or maybe one of them is hidden, I don't know. I haven't looked yet.
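One way to find out how common this is would be to group CALM records by their SDB_URL value and list any value shared by more than one record. The sketch below assumes the records are available as (RefNo, SDB_URL) pairs; `shared_sdb_urls` is a hypothetical name, not something in the pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shared_sdb_urls(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each SDB_URL value to the RefNos sharing it, keeping only duplicates.

    `records` is assumed to be an iterable of (ref_no, sdb_url) pairs.
    """
    by_sdb_url: Dict[str, List[str]] = defaultdict(list)
    for ref_no, sdb_url in records:
        by_sdb_url[sdb_url].append(ref_no)
    # A value used by a single record cannot cause an unintended merge.
    return {value: refs for value, refs in by_sdb_url.items() if len(refs) > 1}


# The pair of records mentioned above.
duplicates = shared_sdb_urls([
    ("PP/SUL/B/7/2/6/1", "114edb90-809b-411e-80ab-d4c2bb241c30"),
    ("PP/SUL/B/7/2/6/2", "114edb90-809b-411e-80ab-d4c2bb241c30"),
])
```

Any entry in the result is a group of CALM records that would be pulled into the same merged work if SDB_URL were used as a merge candidate.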

paul-butcher commented 4 months ago

Some discussion in Slack

paul-butcher commented 4 months ago

It turns out this is wrong. The id in question is unstable.