wellcomecollection / platform

Implement IIIF Content Search 1.0 #4740

Closed by tomcrane 3 years ago

tomcrane commented 4 years ago

This is fairly trivial, as it's almost the version currently offered by wl.org, which is actually 0.9.3 (the IIIF Search spec was developed at the same time, but later tweaks to the Search API didn't go back into the wl.org implementation).

This is the search version that will be referenced from the IIIF Presentation 2 manifests.

tomcrane commented 3 years ago

Hits vs Pages

The current (wl.org) implementation of IIIF Search is adapted from the pre-IIIF Wellcome Player implementation. This previous version did what it needed to do to serve the UI of the Wellcome Player / UV, but no more. In that UI, results are conveyed as pages - 1 match on page 37, 2 matches on page 53, etc. In the client, page markers on a sparkline represent the extent of the book.

In adapting to IIIF, this page-level result was mapped to the IIIF Search model of a "Hit". But the IIIF Search API serves a wider range of use cases than the UV. The current wl.org implementation conflates page and hit, because of the particular UI (Player/UV) it targets.

Suppose the search term is "the red cat".

This page has one hit, returned as one rectangle (one annotation to draw over the image):

--------- page 17 ---------
blah blah blah blah blah blah
blah blah blah blah blah the red cat
blah blah blah blah blah blah
blah blah blah blah blah blah

This page has one hit, returned as TWO rectangles (two annotations to draw over the image):

--------- page 21 ---------
blah blah blah blah blah blah
blah blah blah blah blah blah blah the
red cat blah blah blah blah blah
blah blah blah blah

This page has two hits, returned as two rectangles:

--------- page 29 ---------
blah blah blah blah blah blah the red cat
blah blah blah blah blah blah blah
the red cat blah blah blah blah blah blah

And this page has two hits, returned as THREE rectangles:

--------- page 35 ---------
blah blah blah blah blah blah the red cat
blah blah blah blah blah blah blah blah blah blah the
red cat blah blah blah

And finally this is only one hit, spread across rectangles on two different pages:

--------- page 37 ---------
blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah the

--------- page 38 ---------
red cat blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah
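As a rough sketch of how the Search 1.0 response shape can express this last scenario: one search:Hit lists two annotations, and each annotation targets a different canvas. The property names follow the Content Search 1.0 spec as far as I know; the base URI, canvas and annotation identifiers, xywh values and the helper names are all made up for illustration.

```python
# Hypothetical base URI, identifiers and coordinates; only the shape matters.
BASE = "https://example.org/iiif/b0000000"


def annotation(anno_id, canvas, xywh, chars):
    """One rectangle drawn over one canvas (Presentation 2 style annotation)."""
    return {
        "@id": f"{BASE}/annos/{anno_id}",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": chars},
        "on": f"{BASE}/canvas/{canvas}#xywh={xywh}",
    }


response = {
    "@context": [
        "http://iiif.io/api/presentation/2/context.json",
        "http://iiif.io/api/search/1/context.json",
    ],
    "@id": f"{BASE}/search?q=the+red+cat",
    "@type": "sc:AnnotationList",
    # Two rectangles: "the" at the foot of page 37, "red cat" at the top of 38.
    "resources": [
        annotation("a1", "c37", "900,1400,60,30", "the"),
        annotation("a2", "c38", "100,120,150,30", "red cat"),
    ],
    # ...but only ONE hit, which groups both rectangles.
    "hits": [
        {
            "@type": "search:Hit",
            "annotations": [f"{BASE}/annos/a1", f"{BASE}/annos/a2"],
            "match": "the red cat",
            "before": "blah blah blah ",
            "after": " blah blah blah",
        }
    ],
}


def canvases_for_hit(response, hit):
    """Resolve a hit's annotation ids to the canvases they target."""
    by_id = {anno["@id"]: anno for anno in response["resources"]}
    return [by_id[anno_id]["on"].split("#")[0] for anno_id in hit["annotations"]]
```

A client that wants page markers can still get them from the canvases each hit touches, while a client that wants results can count the entries in hits.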

In its current form, the Wellcome IIIF Search API doesn't distinguish between these scenarios. All hits target only one page, and say how many matches there are on that page:

--------- page 17 --------- one match

--------- page 21 --------- two matches

--------- page 29 --------- two matches

--------- page 35 --------- three matches

--------- page 37 --------- one match

--------- page 38 --------- one match
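The loss of information can be seen in a small sketch. Here each hit from the scenarios above is written as a list of pages, one entry per rectangle it needs. This representation is invented for the example and is not the wl.org data model. Collapsing to per-page counts reproduces the output above, and the number of actual hits can no longer be recovered from it.

```python
from collections import Counter

# Each hit is a list of pages, one entry per rectangle (illustrative only).
hits = [
    [17],       # page 17: one hit, one rectangle
    [21, 21],   # page 21: one hit wrapping over two lines
    [29],       # page 29: first of two hits
    [29],       # page 29: second of two hits
    [35],       # page 35: first hit, one rectangle
    [35, 35],   # page 35: second hit wrapping over two lines
    [37, 38],   # one hit spanning the page break
]


def matches_per_page(hits):
    """Count rectangles per page, ignoring which hit they belong to."""
    return Counter(page for hit in hits for page in hit)


per_page = matches_per_page(hits)
for page, count in sorted(per_page.items()):
    print(f"page {page}: {count} match(es)")
```

There are seven hits but ten rectangles, and the per-page view only ever reports the rectangles.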

The last two especially are fine for the UV's UI - it only wants to drop a marker/navigation device, so you can get to the page and just see the box highlights.

But it doesn't work for a UI that wants to show contextual results, like this:

[image: screenshot of a search UI showing contextual results]

That image isn't the best example; I need to find one that matches the scenarios above. This UI distinguishes between a "result" and a "page that has one or more results, or parts of results, on it". The current UV implementation is ONLY the latter, whereas the IIIF Search API allows for UIs of both kinds and many more. The current wl.org implementation doesn't say what the results actually are; it only says which pages have boxes drawn on them. You could imagine a far more textual search results UI that showed results in context, maybe even without images until you navigate to a result.
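A minimal sketch of the textual kind of UI, assuming the server returns Search 1.0 hits with the optional before, match and after properties. The function name, the bracket markup and the sample hit are invented for the example.

```python
def snippets(hits):
    """Return one text snippet per hit, with the match wrapped in brackets."""
    out = []
    for hit in hits:
        # before/match/after are optional on a Hit, so default to empty.
        before = hit.get("before", "")
        match = hit.get("match", "")
        after = hit.get("after", "")
        out.append(f"{before}[{match}]{after}")
    return out


# Hypothetical hit; the annotation URI is a placeholder.
example_hits = [
    {
        "@type": "search:Hit",
        "annotations": ["https://example.org/annos/a1"],
        "before": "blah blah ",
        "match": "the red cat",
        "after": " blah blah",
    }
]

for line in snippets(example_hits):
    print(line)
```

No image or canvas lookup is needed to build this list, and that is only possible when the response says what the results are rather than just where the boxes go.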

What to do...

The current IIIF Search API on Wellcome Library is not the finished version anyway - it informed the IIIF spec with use cases, but work continued on the IIIF Search API after work on search on wl.org was finished.

It's obviously possible to produce a UV/sparkline page-based result with the proper 1.0 Search API. The consuming client just needs to be aware that:

We need to check what the UV does with Hits that follow the IIIF spec as above. The danger is that other people have adapted their server-side search implementation to the UV, rather than have the UV updated (I need to check with a proper search test). A new client can't really do contextual results that show parts of the page, or understand results that span page breaks, without making some changes to the current implementation, but I think they are minor for both client and server.

Luckily, the current Wellcome Search identifies itself as version 0, not version 1:

[image: screenshot of the current Wellcome search service identifying itself as version 0]

This means we could have two services: one that conflates hits and pages as now, and one that doesn't and is a correct Search 1.0 implementation (this might be a simple change). Our backwards-compatible IIIF 2.1 manifests can retain the current version 0 behaviour, and the new IIIF 3 ones can have what is essentially the same API but with the more subtle Hit behaviour.
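A sketch of what the two service descriptions might look like, keyed by the manifest's Presentation version. The 1.0 context and profile URIs are the ones from the Search 1.0 spec; the version 0 URIs are assumed by analogy, and the service URL pattern, identifier and function names are hypothetical. How the block is embedded in a Presentation 3 manifest is left out.

```python
SEARCH_PROFILES = {
    0: "http://iiif.io/api/search/0/search",  # assumed by analogy with 1.0
    1: "http://iiif.io/api/search/1/search",
}


def search_service(service_id, version):
    """Build a Presentation 2 style search service block for a version."""
    return {
        "@context": f"http://iiif.io/api/search/{version}/context.json",
        "@id": service_id,
        "profile": SEARCH_PROFILES[version],
    }


def service_for_manifest(presentation_version, identifier):
    """IIIF 2.x manifests keep the version 0 behaviour; IIIF 3 gets 1.0."""
    search_version = 0 if presentation_version < 3 else 1
    # Hypothetical URL pattern, not the real Wellcome one.
    url = f"https://example.org/search/v{search_version}/{identifier}"
    return search_service(url, search_version)


legacy = service_for_manifest(2, "b0000000")
current = service_for_manifest(3, "b0000000")
```

Because the version is carried in the context and profile, a client can tell the page-conflating service from the hit-aware one without inspecting any results.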

tomcrane commented 3 years ago

After discussion with @jennpb, and FYI @jtweed and @gestchild

I don't want to veer off the critical path - we should implement IIIF Search properly, but other things are more pressing. I still need to migrate Search functionality to new DDS (working on that today), and while I'm reading/porting that code I'll keep the above comment in mind. If supporting both patterns is trivial I'll do it, but if there is any complexity I'll reproduce as-is.

jtweed commented 3 years ago

The URL that redirects from wl.org should behave as per current DDS. We can look to implement Search 1.0 as the new canonical URL, but only once everything else is up and running.

tomcrane commented 3 years ago

...leaves room for the v1 later.