wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Identify and provide data required for IIIF rebuild in snapshot #4815

Closed kenoir closed 3 years ago

kenoir commented 3 years ago

In order to achieve: https://github.com/wellcomecollection/platform/issues/4813 we need to update the snapshot_generator to provide the required data for the IIIF build.

We need to identify what that data is and ensure it is available in the snapshots.

tomcrane commented 3 years ago

Here's what IIIF-builder does at runtime to get a Work for a b number.

1) query, using the b number as identifier: https://api.wellcomecollection.org/catalogue/v2/works?identifiers=b14658197&include=identifiers

2) having found the work id (nydjbrr7): https://api.wellcomecollection.org/catalogue/v2/works/nydjbrr7?include=identifiers,items,subjects,genres,contributors,production,notes,parts,partOf,precededBy,succeededBy

Earlier, we were trying to do it all in one hit by specifying all of the include values on the initial identifiers query - but not everything comes back that way (the archival tree in particular). Hence the two-step approach.
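As a rough illustration of the two-step approach, here is a minimal Python sketch that only builds the two URLs shown above (it makes no HTTP calls). The function names `search_url` and `work_url` are made up for this example and are not part of IIIF-Builder, which is a C# codebase.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.wellcomecollection.org/catalogue/v2/works"

# The include values used in step 2 above.
FULL_INCLUDES = [
    "identifiers", "items", "subjects", "genres", "contributors",
    "production", "notes", "parts", "partOf", "precededBy", "succeededBy",
]


def search_url(b_number: str) -> str:
    """Step 1: find the work by using the b number as an identifier."""
    query = urlencode({"identifiers": b_number, "include": "identifiers"})
    return f"{API_ROOT}?{query}"


def work_url(work_id: str) -> str:
    """Step 2: fetch the single work by id, with every include."""
    # safe="," keeps the commas readable rather than percent-encoding them.
    query = urlencode({"include": ",".join(FULL_INCLUDES)}, safe=",")
    return f"{API_ROOT}/{work_id}?{query}"
```

Calling `search_url("b14658197")` and then `work_url("nydjbrr7")` reproduces the two URLs in steps 1 and 2.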

For initial population, I was thinking of using the snapshot just to get a source of b numbers that have digital locations;

e.g., https://api.wellcomecollection.org/catalogue/v2/works/nydjbrr7?include=items

very pseudo-codey:

from works where work has .items -> locations -> locationType.id == "iiif-presentation" ...yield work.identifiers...value where identifierType.id == "sierra-system-number"

(should give us ~250,000 b numbers)
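A slightly less pseudo-codey version of the same filter, as a Python sketch. It assumes each work is a parsed JSON dict with the field names used in the pseudo-code above (`items`, `locations`, `locationType.id`, `identifiers`, `identifierType.id`, `value`); the function names are invented for illustration.

```python
from typing import Iterable, Iterator


def has_iiif_presentation(work: dict) -> bool:
    """True if any location on any item has type 'iiif-presentation'."""
    return any(
        location.get("locationType", {}).get("id") == "iiif-presentation"
        for item in work.get("items", [])
        for location in item.get("locations", [])
    )


def b_numbers(work: dict) -> Iterator[str]:
    """Yield the value of every sierra-system-number identifier on the work."""
    for identifier in work.get("identifiers", []):
        if identifier.get("identifierType", {}).get("id") == "sierra-system-number":
            yield identifier["value"]


def digitised_b_numbers(works: Iterable[dict]) -> Iterator[str]:
    """Yield b numbers only for works that have a IIIF presentation location."""
    for work in works:
        if has_iiif_presentation(work):
            yield from b_numbers(work)
```

Note that the sketch only checks the location type and never follows the location URL.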

I hadn't thought of using the snapshot data to also get the full work data, instead of querying the API.

I guess this would be a tradeoff. We could write another implementation of this class - one that gets Works from a snapshot dump instead of HTTP calls:

https://github.com/wellcomecollection/iiif-builder/blob/master/src/Wellcome.Dds/Wellcome.Dds/Catalogue/ICatalogue.cs

Or we could just use the snapshot to get a list of 250,000+ b numbers, and run them through IIIF-Builder as if Goobi had triggered them, which right now would mean steps 1 and 2 above.

tomcrane commented 3 years ago

Current Catalogue API, HTTP Client implementation of ICatalogue:

https://github.com/wellcomecollection/iiif-builder/blob/master/src/Wellcome.Dds/Wellcome.Dds.Repositories/Catalogue/WellcomeCollectionCatalogue.cs

It might be easy to make another impl that pulls the same data out of a dump - depends how it's arranged.

alexwlchan commented 3 years ago

The dumps are gzip-compressed files in S3, with one work per line (newline-delimited JSON). The JSON for each work should be the same as if you'd retrieved it from the Catalogue API and specified all the includes.
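Given that layout, reading a snapshot can be sketched in a few lines of Python using only the standard library. This assumes the file is already available as a binary stream (for example a downloaded copy opened with `open(path, "rb")`); `iter_works` is a name made up for this example.

```python
import gzip
import json
from typing import IO, Iterator


def iter_works(stream: IO[bytes]) -> Iterator[dict]:
    """Yield one parsed work per line from a gzip-compressed snapshot stream."""
    # gzip.open accepts an existing binary file object as well as a filename.
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # tolerate a trailing blank line
                yield json.loads(line)
```

Because the works are streamed line by line, the whole snapshot never has to be held in memory at once.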

You can see a fairly recent snapshot here: https://data.wellcomecollection.org/catalogue/v2/works.json.gz

And the docs, what there are: https://developers.wellcomecollection.org/datasets

tomcrane commented 3 years ago

So...

Part 1: Iterate through the entire dump matching works with digital locations, yielding a list of b numbers. Save each matching line of JSON... somewhere. They need to be look-uppable by the b number.

Part 2: Deploy a build of IIIF-Builder that has a "dump" impl of ICatalogue, one that can look things up by b number in the pulled-out dump lines. Feed it the b numbers from Part 1.
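To make this first option concrete, here is a hypothetical Python sketch of both parts: an index of matching dump lines keyed by b number, and a lookup class on top of it. `DumpCatalogue` and `get_work` are invented names and do not mirror the real ICatalogue interface; keeping everything in an in-memory dict is also just an assumption for the sketch, and a real run over ~250,000 works might want a key-value store instead.

```python
import json
from typing import Dict, Iterable, Optional


def index_by_b_number(lines: Iterable[str]) -> Dict[str, str]:
    """Part 1: keep the raw JSON line of each digitised work, keyed by b number."""
    index: Dict[str, str] = {}
    for line in lines:
        work = json.loads(line)
        digitised = any(
            location.get("locationType", {}).get("id") == "iiif-presentation"
            for item in work.get("items", [])
            for location in item.get("locations", [])
        )
        if not digitised:
            continue
        for identifier in work.get("identifiers", []):
            if identifier.get("identifierType", {}).get("id") == "sierra-system-number":
                index[identifier["value"]] = line
    return index


class DumpCatalogue:
    """Part 2: look works up by b number in the pulled-out dump lines."""

    def __init__(self, index: Dict[str, str]) -> None:
        self._index = index

    def get_work(self, b_number: str) -> Optional[dict]:
        line = self._index.get(b_number)
        return json.loads(line) if line is not None else None
```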

Or -

Part 1: Iterate through the entire dump matching works with digital locations, yielding a list of b numbers. (We're done with the dump now.)

Part 2: Feed these to the regular IIIF-Builder.

This second version will hit the Catalogue API about 500,000 times (twice per b number, for roughly 250,000 b numbers).

Do you mind that? It will be fairly spaced out.
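For a feel of what "spaced out" could mean, here is a back-of-envelope Python sketch. The request rate is purely a hypothetical input; nothing in this thread fixes an actual rate.

```python
def total_requests(b_number_count: int, calls_per_b_number: int = 2) -> int:
    """Two Catalogue API calls per b number: the identifier query, then the work fetch."""
    return b_number_count * calls_per_b_number


def duration_hours(b_number_count: int, requests_per_second: float) -> float:
    """Hours needed to make every call at a steady (hypothetical) request rate."""
    return total_requests(b_number_count) / requests_per_second / 3600
```

For 250,000 b numbers that is 500,000 requests; at an illustrative 5 requests per second it works out to 100,000 seconds, a little under 28 hours.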

tomcrane commented 3 years ago

I favour the second approach, not just because it's less work!

"Save each matching line of JSON... somewhere."

You already have it saved, in Elasticsearch, and have given us an HTTP means of fetching it from there.

kenoir commented 3 years ago

Noting that the snapshots include all includes by default (all work data will be available): https://github.com/wellcomecollection/catalogue/blob/master/snapshots/snapshot_generator/src/main/scala/uk/ac/wellcome/platform/snapshot_generator/services/SnapshotService.scala#L55

kenoir commented 3 years ago

The location data in the Catalogue API may be updated in the future (though the format will remain the same).

As far as preparing snapshot data goes, this is complete - /cc @jamesgorrie @tomcrane will need to know if the location data is going to change.

tomcrane commented 3 years ago

At some point the location of the IIIF resource in the catalogue API will change from wl.org/iiif.. to iiif.wc.org/..

The new IIIF-builder will handle old wl.org/iiif/ paths with a "moved permanently" redirect, forever (no reason to ever stop) - #4755. For our code, we obviously know that a wl.org and wc.org iiif path are equivalent and can handle both. We're not even following the location URL, just noting that it has a location of type iiif-presentation.

So for us, the timing of a change to the location is non-critical, and for third parties the timing of a change may not be important if they follow redirects. The wl.org manifest would redirect to an explicit IIIF v2 version (as that's what anyone asking for that URL would be expecting).

This leaves the wc.org UI. You might want to start using the v3 manifest as soon as you can (especially for AV). The iiif.wc.org canonical path returns the v3 manifest unless you explicitly use conneg to ask for the v2 manifest. So the timing of a change to the location of IIIF resources is probably most important to the experience team.
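As an illustration of the conneg step, here is a hedged Python sketch that builds (but does not send) a request asking for the v2 manifest. It assumes the server honours the `Accept` header form described in the IIIF Presentation API, with the Presentation 2 context URI as the profile; the manifest URL is left as a parameter because the final iiif.wc.org path is not spelled out here, and `v2_manifest_request` is a made-up name.

```python
from urllib.request import Request

# Context URI for IIIF Presentation API 2, used as the media type profile.
IIIF_V2_PROFILE = "http://iiif.io/api/presentation/2/context.json"


def v2_manifest_request(manifest_url: str) -> Request:
    """Build a request that asks for the IIIF v2 manifest via content negotiation."""
    accept = f'application/ld+json;profile="{IIIF_V2_PROFILE}"'
    return Request(manifest_url, headers={"Accept": accept})
```

A client that sends no such header would, per the comment above, get the v3 manifest from the canonical path.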