wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Initial Population of new DDS (IIIFBuilder) #4752

Closed tomcrane closed 3 years ago

tomcrane commented 4 years ago

While we could copy already-created serialised binary objects, text maps, etc. from the old on-premise DDS to the new one, this is not really a good idea.

We're moving from .NET 4.5 (many years old) to .NET Core 3.1, and from running on IIS on Windows to Kestrel on Debian. There will have been minor code changes too.

It's safer to rebuild everything from a clean start, which also proves that we've got it all right - see #4751

(This doesn't mean resyncing to DLCS - the images are already done.)

We'll need to scale up the WorkflowRunner environment to get this done - more than one instance running, with plenty of horsepower. IIIFBuilder runs on all b numbers. We can do some on stage (e.g., 10% of all b numbers), then run them all on production.

donaldgray commented 4 years ago

Implementation note: we need to ensure the WorkflowRunner picks a job and marks it as not waiting in a single operation to avoid multiple instances picking up the same job.
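To illustrate the note above, here is a minimal sketch of an optimistic claim, written against SQLite only so that it runs standalone. The `jobs` table and its `identifier` and `waiting` columns are hypothetical and are not the real WorkflowRunner schema. The key point is that the `UPDATE` only succeeds while the row is still waiting, so two instances cannot both claim the same job.

```python
import sqlite3


def claim_next_job(conn):
    """Claim one waiting job and return its identifier, or None if none are waiting."""
    while True:
        row = conn.execute(
            "SELECT id, identifier FROM jobs WHERE waiting = 1 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, identifier = row
        # Compare-and-set: this single statement flips the flag only if the
        # job is still waiting, so a competing instance updates zero rows.
        cursor = conn.execute(
            "UPDATE jobs SET waiting = 0 WHERE id = ? AND waiting = 1", (job_id,)
        )
        conn.commit()
        if cursor.rowcount == 1:
            return identifier
        # Another instance claimed it first; loop round and try the next job.


def make_demo_queue(identifiers):
    """Build an in-memory queue (hypothetical schema) for trying this out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, identifier TEXT, waiting INTEGER)"
    )
    conn.executemany(
        "INSERT INTO jobs (identifier, waiting) VALUES (?, 1)",
        [(identifier,) for identifier in identifiers],
    )
    conn.commit()
    return conn
```

On a server database the same effect might be achieved with a single `UPDATE ... RETURNING` or row locking; which is appropriate depends on the database actually in use.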

tomcrane commented 3 years ago

@jtweed What's the best way to get a list / generator of all the b numbers that should have IIIF representations?

In the past, the existing DDS could make them on demand, even if the link in Sierra hadn't been made yet; over time the DDS amassed data on all the b numbers but it never technically needed to know what all the possible digitised b numbers were.

When we did the storage migration, we used the copied METS bucket in S3 as a source of keys; simple Python and .NET generators could yield whatever chunk of distinct available b numbers you needed.
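For illustration, a generator along those lines might look like the sketch below. It is not the code used for the migration. It assumes the b number appears somewhere in each key and has the shape "b", seven digits, then a check character (a digit or "x"); the key layout in the test data is made up. Listing the bucket itself is left out, so `keys` can be any iterable of strings.

```python
import re
from itertools import islice

# Assumed b number shape: "b", seven digits, then a check digit or "x".
B_NUMBER = re.compile(r"b\d{7}[\dx]")


def distinct_b_numbers(keys):
    """Yield each b number found in the keys once, in first-seen order."""
    seen = set()
    for key in keys:
        match = B_NUMBER.search(key)
        if match and match.group() not in seen:
            seen.add(match.group())
            yield match.group()


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch
```

Usage would be something like `for batch in chunks(distinct_b_numbers(keys), 1000): ...`, with each batch handed to a worker.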

Anything new or updated coming through Goobi will trigger IIIF building, but we need to run the whole lot through. Is there a way of querying the Catalogue API, using the presence of a DigitalLocation as a filter? Would we miss any that way? How does the Catalogue API decide that a work has a digital location?

We need to be able to:

foreach (var bnumber in digitisedBNumbers)
{
   BuildIIIF(bnumber);
}

I'd rather not use the old DDS database as a source, or Sierra directly; just new stuff, that's repeatable and up to date. Given that any b number that needs to be built must have a Work retrievable from the Catalogue API, it seems sensible that the Catalogue API is also the source of all the b numbers that should have IIIF.

jtweed commented 3 years ago

There is indeed: there is a locationType filter that you can use like so:

?items.locations.locationType=iiif-presentation

However, @jamesgorrie is currently doing some work to properly sort the locations model, which will break this. It will ultimately become something more like:

?items.locations.type=DigitalResource

But that is not there yet.
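As a rough sketch of how a client might build paged requests with the current filter: the base URL and the `page` / `pageSize` parameter names below are assumptions and should be checked against the API documentation; only the `items.locations.locationType` filter comes from this thread. The sketch builds URLs only and makes no requests.

```python
from urllib.parse import urlencode

# Assumed works endpoint; verify against the API documentation.
WORKS_URL = "https://api.wellcomecollection.org/catalogue/v2/works"


def works_page_url(page, page_size=100):
    """Build the URL for one page of works that have a IIIF presentation location."""
    params = {
        "items.locations.locationType": "iiif-presentation",
        "page": page,          # assumed paging parameter
        "pageSize": page_size, # assumed paging parameter
    }
    return f"{WORKS_URL}?{urlencode(params)}"
```

If the filter name changes as described above, only the key in `params` would need updating.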

We add a digital location when we find a METS file, so we don’t rely on the dlinks or existence of an “e” bib.

tomcrane commented 3 years ago

> We add a digital location when we find a METS file, so we don’t rely on the dlinks or existence of an “e” bib.

That's good, matches the DDS (needs to be this way because the old DDS will often process something before Sierra has been updated with dlinks etc). But what's the sequence of events that gets the digital location into the Catalogue API Work? The DDS needs the Work to fully build the manifest, and uses the digital location to pick a better thumbnail. If there is no digital location yet, the DDS won't mind, it just won't get that better thumbnail (although it could use the same logic to pick it itself*, usually the title page). If Goobi notifies DDS about a newly digitised work before the catalogue API has been updated, it will see the work without a digital location.

On standup yesterday - discussed that API returns a maximum of 10,000 works, so using it as a hose for initial population might still need some direct Elasticsearch access.

Also the digital location filter sounds related to another discussion on Slack - https://digirati.slack.com/archives/CBT40CMKQ/p1599653653152000 - about including counts of total descendants AND total digitised descendants, pre-computed, on all archive levels so that you can see these values from any tree node on the members of parts, partOf, precededBy and succeededBy.

*There are a couple of places (https://github.com/wellcomecollection/docs/pull/30 is another) where the IIIF-Builder code duplicates logic elsewhere in the platform; when the new DDS beds in maybe a review of some of this?

alexwlchan commented 3 years ago

> On standup yesterday - discussed that API returns a maximum of 10,000 works, so using it as a hose for initial population might still need some direct Elasticsearch access.

It occurs to me that this is exactly the sort of thing the API snapshots (https://developers.wellcomecollection.org/datasets#catalogue-snapshot) might be useful for, although the snapshots are currently quite out-of-date.

tomcrane commented 3 years ago

That sounds perfect...

When we're ready, we start new DDS off, listening to Goobi. So new stuff coming through is processed. Then we take a snapshot, and process that. There might be a couple of things that get processed twice because they came through after DDS was turned on but before the snapshot was taken, but in the grand scheme of things, not really a big deal.

jtweed commented 3 years ago

Yes, we've been talking a bit about this today and I think we should put some work into properly fixing the snapshots so that you can use them for this.

tomcrane commented 3 years ago

Update: snapshots are now ideal for this and we have successfully generated b numbers for workflow processing from a snapshot.
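A minimal sketch of that kind of extraction is below; it is not the code actually used. It assumes the snapshot is a gzipped file with one JSON work per line, that the b number is the identifier whose `identifierType.id` is `sierra-system-number`, and that digitised works carry a location whose `locationType.id` is `iiif-presentation`. Those field names are assumptions to check against a real snapshot.

```python
import gzip
import json


def has_iiif_location(work):
    """True if any item location on the work is a IIIF presentation location."""
    return any(
        location.get("locationType", {}).get("id") == "iiif-presentation"
        for item in work.get("items", [])
        for location in item.get("locations", [])
    )


def digitised_b_numbers(snapshot_path):
    """Stream the snapshot and yield the b number of each digitised work."""
    with gzip.open(snapshot_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            work = json.loads(line)
            if not has_iiif_location(work):
                continue
            for identifier in work.get("identifiers", []):
                if identifier.get("identifierType", {}).get("id") == "sierra-system-number":
                    yield identifier["value"]
```

Because it streams line by line, the whole snapshot never needs to fit in memory.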

Tasks:

Once Manifests and Collections are in "release candidate" form, we need to do a spike of testing and validation.

At the moment:

For the full content creation run, we don't want to do any DLCS synchronisation (DLCS already has the images in space 5). Control of this is covered in https://github.com/wellcomecollection/iiif-builder/pull/78.

For limited test runs (e.g., 10% of the content), if we do that on iiif-test, then we'll do a huge amount of unnecessary DLCS image registration (as space 6 doesn't have the images).

We could either make iiif-test use space 5 for this kind of work, or run them into iiif.wc.org (production). We can clear it out multiple times. This latter option seems cleaner and keeps the test environment for other kinds of testing, but it might look more official than desired - are these real Wellcome manifests? Does that matter?

Once we are happy with this, we need to roll the presses: generate the ~300-400K IIIF resources (x 2) and the ~30m annotation pages (x 1). This is a massively parallelisable process.
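To show the shape of that fan-out, here is a single-process sketch using a thread pool. `build_iiif` is a hypothetical stand-in for the real per-b-number build, and the real run would more likely spread the work across several WorkflowRunner instances than across threads. Failures are collected rather than stopping the run, so they can be retried afterwards.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_iiif(b_number):
    """Hypothetical stand-in for the real per-b-number build."""
    return f"built {b_number}"


def build_all(b_numbers, build=build_iiif, workers=8):
    """Build every b number concurrently; return (built_count, failures)."""
    built = 0
    failures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(build, b_number): b_number for b_number in b_numbers}
        for future in as_completed(futures):
            b_number = futures[future]
            try:
                future.result()
                built += 1
            except Exception as error:
                # Record the failure and carry on with the rest of the run.
                failures[b_number] = repr(error)
    return built, failures
```

Each b number is independent of the others, which is what makes this approach workable.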

tomcrane commented 3 years ago

Superseded by #5040