wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Identity / URLs of IIIF Manifests and Collections #4659

Closed tomcrane closed 3 years ago

tomcrane commented 4 years ago

What is the source of identity for IIIF Manifests and Collections? For the majority of works, this is easy - it's the b number. The b number gets used as part of a URL:

https://wellcomelibrary.org/iiif/b28047345/manifest

In the new iiif-builder, the URL will be whatever the outcome of this discussion says, e.g.,

https://iiif.wellcomecollection.org/presentation/b28047345

This is fine when the work identifier aligns with a IIIF Manifest. But this is not always the case:

https://wellcomelibrary.org/item/b18031511#?m=1

Here's a 6 volume work. The IIIF resource at https://wellcomelibrary.org/iiif/b18031511/manifest redirects to a IIIF Collection, which in turn contains 6 manifests, one for each volume. Each of these is available at its own URL, e.g., for volume 2:

https://wellcomelibrary.org/iiif/b18031511-2/manifest

The PDF for this volume is at https://dlcs.io/pdf/wellcome/pdf-item/b18031511/2

That 2 in these URLs is deduced by the DDS - it's the ordinal number of the manifest in the b number. It's the second manifestation the DDS encountered when traversing the METS. It doesn't necessarily indicate "Volume 2".

It turns out that it's not such a good idea to use this ordinal as part of the identity of a manifest. Things can get added, missed, re-ordered. Identity based on ordinal position is not stable and is not authoritative.

This becomes really apparent with Chemist and Druggist, where we have manifests like https://wellcomelibrary.org/iiif/b19974760-4545/manifest. This URL is meaningless, and may point at something else later if issues are reordered, some additional ones found and inserted, etc. C&D is an outlier, but the problem is present in other multiple manifestations too. A PDF for a manifestation is cached once produced - and if the sequence changes, the PDF is the wrong PDF for its identifier. A URL for one volume's manifest might suddenly start pointing at different content. This is rare, but it does happen, and needs to be fixed by re-synchronising.

The dashboard doesn't have this problem, because it isn't authoritative for the identity of the parts of digital objects - it uses structMap identifiers from METS. Here, you might have b23232323_0001, b23232323_0002, b23232323_0004, b23232323_0005 as 4 volumes ("3" is missing). The dashboard always shows the right thing for an identifier in the address bar, because the identifier is produced from the METS the DDS is reading. Identity for IIIF resources is minted downstream of the dashboard, by the DDS, following a convention that hasn't changed since launch, pre-IIIF. This is the chance to fix that problem, because everything's getting new URLs anyway.
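To make the instability concrete, here is a minimal Python sketch. It is not DDS code: the `ordinal_ids` helper is hypothetical, and it assumes the ordinal is 1-based and appended as `-n`, following the `b18031511-2` pattern above. It shows what happens to ordinal-based identifiers when the missing "3" is later added to the METS.

```python
def ordinal_ids(b_number, mets_ids):
    """Hypothetical model of DDS-style identity: position in traversal order.

    Assumes a 1-based ordinal appended as "-n", as in b18031511-2 above.
    """
    return {f"{b_number}-{i}": mets_id for i, mets_id in enumerate(mets_ids, start=1)}


# structMap identifiers as read from METS; "3" is missing at first...
before = ["b23232323_0001", "b23232323_0002", "b23232323_0004", "b23232323_0005"]
# ...and is found and inserted later.
after = ["b23232323_0001", "b23232323_0002", "b23232323_0003",
         "b23232323_0004", "b23232323_0005"]

old_map = ordinal_ids("b23232323", before)
new_map = ordinal_ids("b23232323", after)

# Ordinal identifiers that now point at different content than before.
moved = sorted(k for k in old_map if new_map.get(k) != old_map[k])
print(moved)  # ['b23232323-3', 'b23232323-4']

# The METS-derived identifiers themselves are unaffected by the insertion.
assert set(before) <= set(after)
```

Anything cached against `b23232323-3` (a PDF, say) would now belong to a different volume, whereas anything keyed on `b23232323_0004` stays correct.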

Here's the same thing on the dash: https://dds-stage.dlcs.io/Dash/Manifestation/b18031511_0002

This looks similar, but it isn't - that b18031511_0002 is an opaque string that has come from an attribute in the METS that gives identity to this manifestation. Goobi assigned it, and Goobi can keep track of it. The DDS is not assuming anything. Goobi might be making use of a counting variable in assigning identity to manifestations - almost certainly is - but crucially that is Goobi's concern, not the DDS's. The DDS defers to Goobi for identity of parts of works. Goobi can keep track of volumes and issues as it sees fit, Goobi can ensure that a structural identifier in METS is stable - the DDS has no way of knowing.

This also allows meaningful bibliographic identity to be assigned to parts of a work (e.g., the 1978 volume of The Guardian) if that becomes a thing later... the DDS doesn't care, it's just an opaque string. The identity is assigned by someone driving Goobi, not by logic in the DDS. It would also allow the use of identifiers that are not b numbers in the METS.

We should use the METS-derived identifier as the identity throughout, and not introduce a computed ordinal. This would mean that public IIIF Manifests have the same last path segment as the dashboard pages for the same manifestation - they must have the same path segment, by definition.

Examples:

https://dds-stage.dlcs.io/Dash/Collection/b18031511
https://dds-stage.dlcs.io/Dash/Manifestation/b18031511_0002
https://dds-stage.dlcs.io/Dash/Collection/b19974760 (C&D)*
https://dds-stage.dlcs.io/Dash/Collection/b19974760_14 (Volume)
https://dds-stage.dlcs.io/Dash/Manifestation/b19974760_14_0022 (Issue)**

The equivalents would be

https://iiif.wellcomecollection.org/presentation/b18031511 (Collection)
https://iiif.wellcomecollection.org/presentation/b18031511_0002 (Manifest)
https://iiif.wellcomecollection.org/presentation/b19974760 (Collection)
https://iiif.wellcomecollection.org/presentation/b19974760_14 (Collection)
https://iiif.wellcomecollection.org/presentation/b19974760_14_0022 (Manifest)

They have the appearance of hierarchical identifiers, because that's how Goobi assigns them - but they are opaque strings as far as the DDS is concerned.

* This is Chemist and Druggist. Its storage map is huge, and in the dashboard it is only cached for 3 mins. The first hit on this will take 30s or so to load.

** At time of writing, needs another deployment to work.
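The mapping from a METS-derived identifier to its public URL is deliberately trivial, because the identifier is treated as an opaque string. A rough sketch, where `presentation_url` is a hypothetical helper and the optional `v2`/`v3` segment follows the explicit-path form used on the iiif-test stubs later in this thread:

```python
from urllib.parse import quote

PRESENTATION_BASE = "https://iiif.wellcomecollection.org/presentation"


def presentation_url(identifier, version=None):
    """Build a public IIIF URL from an opaque METS-derived identifier.

    No parsing of the identifier takes place: whether it names a Collection
    or a Manifest is not inferred from its shape.
    """
    if version not in (None, "v2", "v3"):
        raise ValueError(f"unknown Presentation API version: {version!r}")
    parts = [PRESENTATION_BASE]
    if version:
        parts.append(version)
    # Percent-encode defensively; current identifiers need no escaping.
    parts.append(quote(identifier, safe=""))
    return "/".join(parts)
```

For example, `presentation_url("b19974760_14_0022")` gives the Manifest URL listed above, and the same function works unchanged if Goobi later assigns identifiers that are not b numbers.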

tomcrane commented 4 years ago

Another driver for this is that it allows a chunk of DDS code to be thrown away, but the identity issue is a stronger case.

tomcrane commented 4 years ago

One consequence of this is that we'd need a new pdf query in DLCS, with parameters for space and string3

string3 in DLCS for http://localhost:8085/Dash/Manifestation/b19974760_6_0031 is b19974760_6_0031

jtweed commented 4 years ago

This is a good idea and definitely preferable; it's also how things are arranged in the storage service. My one concern, though: are there any impacts to this outside of DDS?

tomcrane commented 4 years ago

At other times there would be - but on this occasion, we are changing the URLs of IIIF resources anyway - so anything dependent on those URLs will break (or better, be prepared to follow a redirect). So we might as well use this chance to implement the most desirable URL scheme.

For redirection, a stub service handling old wellcomelibrary.org iiif paths can translate an old-form identifier like b12345678/5 by traversing the b12345678 resource to find the fifth manifestation, see what it's called, and redirect to its new identity (and cache the result).
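A minimal sketch of that stub, in Python for illustration only. Everything here is an assumption: `MANIFESTATIONS` and `list_manifestations` stand in for actually traversing the b number's resource, the regex guesses at the shape of old-form identifiers (`b12345678/5` or `b12345678-5`), and the ordinal is treated as 1-based ("the fifth manifestation").

```python
import re
from functools import lru_cache

NEW_BASE = "https://iiif.wellcomecollection.org/presentation"

# Hypothetical stand-in for traversing the b number's resource (e.g. its METS).
MANIFESTATIONS = {
    "b18031511": [f"b18031511_{n:04d}" for n in range(1, 7)],
}

# Assumed old forms: "b18031511", "b18031511/2" or "b18031511-2".
OLD_FORM = re.compile(r"^(b\d{7}[\dxX])(?:[-/](\d+))?$")


def list_manifestations(b_number):
    """Return METS-derived identifiers in traversal order (hypothetical lookup)."""
    return MANIFESTATIONS.get(b_number, [])


@lru_cache(maxsize=None)  # cache the result, as suggested above
def redirect_target(old_identifier):
    """Translate an old-form identifier to its new URL, or None if unknown."""
    match = OLD_FORM.match(old_identifier)
    if match is None:
        return None
    b_number, ordinal = match.group(1), match.group(2)
    if ordinal is None:
        return f"{NEW_BASE}/{b_number}"
    parts = list_manifestations(b_number)
    index = int(ordinal) - 1  # assumes 1-based ordinals; adjust if 0-based
    if not 0 <= index < len(parts):
        return None
    return f"{NEW_BASE}/{parts[index]}"
```

Note that the redirect is only as good as the ordering at the moment of lookup, which is the same weakness the old scheme had; caching the result at least freezes it.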

This raises another possibility - you could use the CALM ref as the identifier in the IIIF URL for archive material, rather than the b number. This wouldn't come from METS but DDS could have a hook to override the METS identifier from another source for the public facing resource.

e.g.,

https://iiif.wellcomecollection.org/presentation/PPCRI/D/4/3

or perhaps, for separation:

https://iiif.wellcomecollection.org/presentation/archive/PPCRI/D/4/3

...or whatever.

If the DDS maintains the same metadata knowledge it does now (but not acquired from Sierra) then it can easily respond to the CALM form; this already works at the Collection level but the Manifest falls back to the b number form:

https://wellcomelibrary.org/service/collections/archives/PPCRI/D/4

jtweed commented 4 years ago

Yeah, for various reasons not so keen on exposing the Calm IDs. I think we stick with what comes from Goobi, but agree that it makes sense to actually do that in all cases.

tomcrane commented 4 years ago

(From Slack discussion)

It's not just manifests within collections that use an index-based identifier in the current DDS.

Canvas: https://wellcomelibrary.org/iiif/b28047345/canvas/c178

ALTO seeAlso: https://wellcomelibrary.org/service/alto/b28047345/0?image=178

Annotations for canvas: https://wellcomelibrary.org/iiif/b28047345/contentAsText/178

These could instead use an identifier from METS that has a greater correspondence with the digitised page.

For example (all of these ARE canvas c178, look at the discrepancies!)

      <mets:file ID="FILE_0179_OBJECTS" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/b28047345_0181.jp2" />
      </mets:file>

      <mets:file ID="FILE_0179_ALTO" MIMETYPE="application/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="alto/b28047345_0181.xml" />
      </mets:file>
      <mets:div ADMID="AMD_0179" ID="PHYS_0179" ORDER="179" ORDERLABEL="161" TYPE="page">
        <mets:fptr FILEID="FILE_0179_OBJECTS" />
        <mets:fptr FILEID="FILE_0179_ALTO" />
      </mets:div>

b28047345/PHYS_0179 is attractive because it's the METS identifier that most closely corresponds with a canvas. It allows for multiple assets, even. The downside is that it isn't related to the page image identifier (b28047345_0181.jp2).
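For anyone wanting to see the discrepancy programmatically, here is a small Python sketch that resolves a physical div to its files. The wrapper elements (`mets:mets`, `mets:fileSec`, `mets:fileGrp`, `mets:structMap`) are added here only so the fragment above parses on its own; `page_files` is a hypothetical helper, not part of the DDS.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/", "xlink": "http://www.w3.org/1999/xlink"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# The fragment above, wrapped in minimal (assumed) parent elements.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp>
    <mets:file ID="FILE_0179_OBJECTS" MIMETYPE="image/jp2">
      <mets:FLocat LOCTYPE="URL" xlink:href="objects/b28047345_0181.jp2" />
    </mets:file>
    <mets:file ID="FILE_0179_ALTO" MIMETYPE="application/xml">
      <mets:FLocat LOCTYPE="URL" xlink:href="alto/b28047345_0181.xml" />
    </mets:file>
  </mets:fileGrp></mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ADMID="AMD_0179" ID="PHYS_0179" ORDER="179" ORDERLABEL="161" TYPE="page">
      <mets:fptr FILEID="FILE_0179_OBJECTS" />
      <mets:fptr FILEID="FILE_0179_ALTO" />
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_files(root, phys_id):
    """Follow a physical div's fptr elements to the file locations they name."""
    files = {f.get("ID"): f for f in root.iterfind(".//mets:file", NS)}
    for div in root.iterfind(".//mets:div", NS):
        if div.get("ID") != phys_id:
            continue
        hrefs = []
        for fptr in div.iterfind("mets:fptr", NS):
            flocat = files[fptr.get("FILEID")].find("mets:FLocat", NS)
            hrefs.append(flocat.get(XLINK_HREF))
        return {
            "order": int(div.get("ORDER")),
            "order_label": div.get("ORDERLABEL"),
            "hrefs": hrefs,
        }
    return None


page = page_files(ET.fromstring(SAMPLE), "PHYS_0179")
```

Here `page` reports order 179, order label "161" and filenames ending `_0181`, all for the canvas the current DDS calls c178: four different numbers for one page.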

tomcrane commented 4 years ago

Now having second thoughts about PHYS_nnnn. The image filename is at least the filename on a disk of a photographed real world thing.

tomcrane commented 4 years ago

API Paths under iiif.wellcomecollection.org

(/text is under api.wellcomecollection.org but still served by same app, so either work, up to ALB rules)

Manifests and Collections that are digitised items (can be related to one work)

They all support conneg. Will return Presentation 3 resources by default.

Explicit paths are also available - v2 = Presentation 2.1, v3 = Presentation 3.0 (for now).
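As an illustration of the conneg behaviour, here is a simplified Python sketch that picks a version from the `profile` parameter of the Accept header and otherwise falls back to Presentation 3. The profile URIs are the IIIF Presentation context URIs; `negotiate_version` is a hypothetical name, and the sketch ignores q-values and takes the first recognised profile, which a real implementation would want to handle properly.

```python
PROFILES = {
    "http://iiif.io/api/presentation/2/context.json": "v2",
    "http://iiif.io/api/presentation/3/context.json": "v3",
}


def negotiate_version(accept_header, default="v3"):
    """Return "v2" or "v3" based on the Accept header's profile parameter.

    Simplified: q-values are ignored and the first recognised profile wins.
    """
    for media_range in (accept_header or "").split(","):
        # Parameters follow the media type, separated by semicolons.
        for param in media_range.split(";")[1:]:
            name, _, value = param.strip().partition("=")
            if name.lower() == "profile":
                version = PROFILES.get(value.strip().strip('"'))
                if version:
                    return version
    return default
```

A request with `Accept: application/ld+json;profile="http://iiif.io/api/presentation/2/context.json"` would get the 2.1 resource; a browser sending `text/html` gets the default.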

Collections that are sets of different digitised items (e.g., by subject)

(Also support conneg, and explicit paths)

Search API

v1 and v2 placeholder. v2 could just return v1, but we'll do a little tidying up. NB v1 is the version of the search service, not the Presentation API

Text

(nb will be at api.wellcomecollection.org) No conneg here. The /v1/ is Wellcome Collection API version (see #4603)

Canvas identity

Some operations are for a single canvas (in practice, related to a single JP2 or video). E.g., the text of a single page in Annotation or ALTO form.

We also need Canvas identity in Manifests, whether or not we want to make them dereferenceable.

In v2 and v3 manifests, the Canvas identity must be the same canonical form, as it is the target of annotations. But where do we get the identifier from? We are abandoning DDS-inferred index-based IDs in favour of information from METS, which is a better source of truth.

Candidates are:

The first is more aesthetically appealing and doesn't confusingly end in an image file extension. The second and third are consistent with the DLCS identifier, which is more practical for consumers and spelunkers (you can cut and paste a bit of this identifier to a DLCS image service).

The second has some redundancy, as we already guarantee global uniqueness of the filename-derived identifier (see RFC). On the other hand, the second allows operations like /annotations/... to have manifest AND canvas level actions.

I think my preference is for

This form is used in the following examples.

Annotations

These don't need conneg, we won't do that for now but could put it in later. The already-negotiated manifest links to explicit paths for further linked resources, because you have already made a version decision.

As there is an absolute mapping between IIIF version and Anno model, we should use the Presentation API versions, same as Manifests, to distinguish between Open Annotation and W3C Web Anno Data Model.

We should also include a slot in the path for text granularity, even if we only provide line level (as now) - https://iiif.io/api/extension/text-granularity/#2-text-granularity-levels-and-the-textgranularity-property

(Offering other levels in future is trivially easy, we just don't do it right now. Apart from glyph, for which we don't have the OCR data).
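A sketch of how the two path slots might be validated, in Python for illustration. The level names are those listed in the IIIF text granularity extension linked above; `resolve_annotation_request` and the `SUPPORTED` set are hypothetical, and no URL pattern is implied.

```python
# Presentation API version -> annotation model (the "absolute mapping" above).
ANNO_MODEL = {
    "v2": "Open Annotation",
    "v3": "W3C Web Annotation Data Model",
}

# Levels defined by the IIIF text granularity extension.
GRANULARITY_LEVELS = ("page", "block", "paragraph", "line", "word", "glyph")

# Only line level is offered for now.
SUPPORTED = {"line"}


def resolve_annotation_request(version, granularity="line"):
    """Validate both slots and return (annotation model, granularity)."""
    if version not in ANNO_MODEL:
        raise ValueError(f"unknown Presentation API version: {version!r}")
    if granularity not in GRANULARITY_LEVELS:
        raise ValueError(f"unknown text granularity: {granularity!r}")
    if granularity not in SUPPORTED:
        raise NotImplementedError(f"{granularity} level is not offered yet")
    return ANNO_MODEL[version], granularity
```

Distinguishing "unknown" from "not offered yet" means adding a level later is a one-line change to `SUPPORTED`.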

Manifest level

We have two manifest level annotation pages.

This does mean that "all" and "images" are magic strings, as they occupy the same slot as the Canvas ID in the following:

Annotations for one page

(v2 and v3 are Presentation API versions; map to OA and W3C)

METS-ALTO

Versioning is hard here. What exactly are you specifying? METS-ALTO versioning varies over time depending on when something was digitised and who did it.

Poster Image (not DLCS)

Work-level thumbnail

https://wellcomelibrary.org/service/workthumb/b18310916/0

This is not needed as wc.org will have better logic for this. (?)

Repository Logo

Current: https://wellcomelibrary.org/biblio/b24967646/repositoryLogo

NB get the text to match from noteType: {id: "location-of-original"...}

PDF Cover Page

(not the full PDF which comes from DLCS)

(NB this is the Manifest level ID, not the collection; this allows variant cover pages for volumes of a multi-volume work)

Current: https://wellcomelibrary.org/service/pdfcoverpageaspdf/b18310916

(aside) Consider making this a Python service or even a Lambda that gets what it needs from the Manifest in S3. There are better PDF libraries available in Python than .NET Core - existing .NET libraries often make Win32 calls that won't work on Linux. Everything it needs should be in the Manifest (this is a good exercise for considering what goes in the provider Presentation 3 structure).

MOH

(MOH API omitted from here for now)

Other stuff that isn't so public, that we can just port with sensible options

Semi-internal API (used in dashboard and testing tools)

https://wellcomelibrary.org/service/bNumberSuggestion?q=john%20dee
https://wellcomelibrary.org/service/bNumberSuggestion?q=b14658

Used by UV

https://wellcomelibrary.org/service/playerconfig
https://wellcomelibrary.org/service/playerconfig?v=moh
(also licenses and options config)

FileInfo

There are many endpoints used by the dashboard for instrumentation on caches that are redundant now, I won't list them here.

Healthchecks (used by Pingdom, and then Wellcome monitoring, and our own monitoring)

Example: https://wellcomelibrary.org/service/DdsHealthCheck.Release1

Existing API that isn't used any more

https://wellcomelibrary.org/biblio/b14658161
https://wellcomelibrary.org/service/snippet/b28047345/0?snippetSize=100&intList=5005
https://wellcomelibrary.org/service/imagelist/b28047345/0

Was consumed by Encore full text harvester:

https://wellcomelibrary.org/service/BornDigitalBibs?from=2020-01-01
https://wellcomelibrary.org/service/AltoBibs?from=2020-06-01&to=2020-06-30

(Preservica wrapper obviously not working any more..)

tomcrane commented 4 years ago

At time of writing we already have stubs (IIIF Precursors) here:

https://iiif-test.wellcomecollection.org/presentation/b18031511
https://iiif-test.wellcomecollection.org/presentation/b18031511_0003

(observe the difference in comment, and collapse catalogueMetadata to see a clearer difference)

And also

https://iiif-test.wellcomecollection.org/presentation/v2/b18031511
https://iiif-test.wellcomecollection.org/presentation/v3/b18031511

(compare the iiifVersion property at the end)

This will start becoming IIIF from next week.

jtweed commented 4 years ago

Thanks @tomcrane, this is incredibly useful. I have also taken a note for when I start working on docs, as I can use this as a basis for that too.

kenoir commented 4 years ago

Looks nice to me. I defer to @jtweed in all things URL related.

One comment: /pdf-cover feels wrong putting the file type in the path like that.

tomcrane commented 3 years ago

Picking up the migration again, and considering Cloudfront rules:

DDS paths are covered in the comment above: https://github.com/wellcomecollection/platform/issues/4659#issuecomment-686448554

iiif.wc.org => DLCS rewrites are discussed in https://github.com/wellcomecollection/platform/issues/4603#issuecomment-653121630 and the comments that follow that one

Example:

        ____
https://iiif.wellcomecollection.org/image/b28047345_0032.jp2/info.json 
   => https://dlcs.io/iiif-img/wellcome/5/b28047345_0032.jp2/info.json
                                       _^_
        _________
https://iiif-test.wellcomecollection.org/image/b28047345_0032.jp2/info.json 
   => https://dlcs.io/iiif-img/wellcome/6/b28047345_0032.jp2/info.json
                                       _^_
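The rewrite in the example above amounts to swapping the host and the `/image/` prefix for the DLCS path, with the space number (5 or 6) chosen by hostname. A Python sketch of that logic, for illustration only: the real rule would live in Cloudfront, `to_dlcs_image_url` is a hypothetical name, and query strings are ignored here.

```python
from urllib.parse import urlsplit

# Hostname -> DLCS space, taken from the two examples above.
DLCS_SPACES = {
    "iiif.wellcomecollection.org": 5,
    "iiif-test.wellcomecollection.org": 6,
}

IMAGE_PREFIX = "/image/"


def to_dlcs_image_url(public_url):
    """Map a public image-service URL to its DLCS origin URL, or None."""
    parts = urlsplit(public_url)
    space = DLCS_SPACES.get(parts.hostname)
    if space is None or not parts.path.startswith(IMAGE_PREFIX):
        return None
    remainder = parts.path[len(IMAGE_PREFIX):]
    return f"https://dlcs.io/iiif-img/wellcome/{space}/{remainder}"
```

Because the asset identifier (`b28047345_0032.jp2`) passes through untouched, the same filename-derived identifier works on both sides of the rewrite.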

https://github.com/wellcomecollection/platform/issues/4997 is for Cloudfront => DLCS

donaldgray commented 3 years ago

Updated PDF service to run under pdfcoverpage rather than pdf-cover, to avoid a rewrite in Cloudfront.

See: https://github.com/wellcomecollection/iiif-builder/pull/76

tomcrane commented 3 years ago

This issue is too big and should be closed.

Before closing, replace with an issue that has a final version of the paths in https://github.com/wellcomecollection/platform/issues/4659#issuecomment-686448554, with checklist for each.

This list needs to take into account everything in UriPatterns as well as things still not handled anywhere, like text-zip.

tomcrane commented 3 years ago

Superseded by #5039