wellcomecollection / alpha

Alpha version of a catalogue explorer for Wellcome Library (deprecated).

Newest digitised stuff feed #97

Closed jennpb closed 8 years ago

jennpb commented 8 years ago

@tomcrane is going to do some thinking about how best to build a 'hot off the press' feed from the DDS so we can show digitised materials as they're made available in the viewer (rather than waiting for a Sierra process to run twice weekly)

frankieroberto commented 8 years ago

@jennpb 👍

I'm using http://wellcomelibrary.org/resource/collections/access/all-open/ in the meantime, to identify newly digitized b-numbers (by comparing status to our database).

The one thing this doesn't give us is metadata for b-numbers which aren't in our database at all – for those, we'll have to get the metadata from the live sierra site somehow (e.g. by parsing the MARC text).
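The comparison described above can be sketched roughly as follows. This is only an illustration: `classify_bnumbers`, the list of open b-numbers and the `db_digitised` mapping are hypothetical names and shapes, not taken from the actual alpha codebase.

```python
def classify_bnumbers(open_bnumbers, db_digitised):
    """Split b-numbers from the all-open list into two groups.

    open_bnumbers: iterable of b-number strings taken from the feed (assumed shape).
    db_digitised:  dict mapping b-number -> bool, i.e. whether our own database
                   already records the item as digitised (assumed shape).
    """
    newly_digitised = []  # known to us, but not yet marked as digitised
    unknown = []          # not in our database at all; metadata needed from elsewhere
    for bnumber in open_bnumbers:
        if bnumber not in db_digitised:
            unknown.append(bnumber)
        elif not db_digitised[bnumber]:
            newly_digitised.append(bnumber)
    return newly_digitised, unknown
```

The second list is the problematic one: those are the b-numbers whose metadata would have to come from the live Sierra site.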

tomcrane commented 8 years ago

(Note to self) - look at "digitisation complete" notifications from goobipdf and work back from there. Avoid b numbers with errors, which are likely to feature more prominently in a "recent" feed generated this way because Wellcome staff will keep re-running them through the finish line as they diagnose the problem.

Hi @frankieroberto - although it's way out of scope, you might want to have a look at https://github.com/pulibrary/marc_liberation, which takes MARC records (in this case from Princeton's Voyager library catalogue) and turns them into JSON-LD like this:

https://bibdata.princeton.edu/bibliographic/4609321/ => https://bibdata.princeton.edu/bibliographic/4609321/jsonld

This currently works with the Voyager LMS, but it is very modular. I'd love to see an adapter for Sierra.

frankieroberto commented 8 years ago

@tomcrane thanks. It looks like that's using this gem: https://github.com/ruby-marc/ruby-marc/ – which I also took a look at yesterday, but couldn't get to work with the MARC format at http://search.wellcomelibrary.org/iii/encore/record/C__Rb2190277?lang=eng&marcData=Y yet – looks like the code is expecting a batch file in .dat format...?

tomcrane commented 8 years ago

This seemingly simple task opened a can of worms which I won't go into here. Anyway...

...and so on.

I'm being a bit careful, let me know if you need to be more lenient than this:

http://wellcomelibrary.org/service/collections/recently-digitised/open/?from=2016-01-01

And once I'd done that, I made this for fun:

http://wellcomelibrary.org/service/collections/lucky-dip/open/

http://wellcomelibrary.org/service/collections/lucky-dip/any/

Returns 20 random single-image items (i.e., it won't return a book), on the grounds that a single image is usually something interesting. Repeated querying of this shows that the Wellcome Library has a particular range of niche interests...

@jennpb fyi

tomcrane commented 8 years ago

(to expand on the above) The reason I limited it to 8 days is that I have to do quite a bit of work to determine whether the item should be returned in this list. It needs to have ALL images in all sequences (volumes) passing for it to be included. This prevents problematic items appearing over and over again, but involves some intensive subqueries per bnumber. This sort of thing will improve dramatically with somewhere official to store this kind of data.
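The inclusion rule described above (every image in every sequence must pass) could be expressed along these lines. It is a sketch only: the nested-list data model and the `passed` flag are assumptions made for illustration and say nothing about how the DDS actually stores this.

```python
def is_fully_passing(sequences):
    """Return True only if every image in every sequence (volume) has passed.

    sequences: list of sequences, each a list of dicts with a boolean
               "passed" key (hypothetical shape).
    """
    if not sequences:
        return False  # nothing digitised yet, so nothing to show
    for images in sequences:
        if not images:
            return False  # an empty volume is treated as incomplete
        for image in images:
            if not image["passed"]:
                return False  # a single failing image excludes the whole item
    return True
```

Because one failure excludes the whole b-number, items that staff keep re-running while diagnosing a problem stay out of the list until they are clean.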

frankieroberto commented 8 years ago

@tomcrane looks great, thanks! What does the navDate refer to in the list? Is this something approximating date-of-digitisation?

BTW I had a bit more of a look into https://github.com/ruby-marc/ruby-marc/ as used by https://github.com/pulibrary/marc_liberation – and it seems to be expecting to read MARC in binary format, whereas what's published by Sierra is a text format? Have you ever had to use any tools to read MARC before?

tomcrane commented 8 years ago

Hi Frankie, you can treat the navDate as the date of digitisation.

http://iiif.io/api/presentation/2.1/#navdate

A date that the client can use for navigation purposes when presenting the resource to the user in a time-based user interface, such as a calendar or timeline.

btw I've spotted a bug this morning, I need to redeploy something. If you've already started consuming this feed you may lose some data from it - i.e., things that are in it now will drop out, but after today it should be fine.

A couple of years ago I did some experiments reading MARC from the library's Z39.50 interface:

https://github.com/tomcrane/MarcTools/blob/master/MarcTools/Controllers/MarcController.cs#L129

=> http://wellcomelibrarymarc.azurewebsites.net/marc/bibframe/b18035978?serialisation=turtle

You might be able to get it via the Z39.50 endpoint - that might be the binary format.

frankieroberto commented 8 years ago

@tomcrane Never heard of Z39.50 – looks painful to use: https://en.wikipedia.org/wiki/Z39.50 ("pre-Web")! Any idea where the endpoint is for Wellcome Library (if it exists?)

jennpb commented 8 years ago

http://wellcomelibrary.org/using-the-library/how-to/z3950/

Such exciting content!

frankieroberto commented 8 years ago

@jennpb holy moly! I'm still thinking it should be easier to parse the MARC from Encore via the Web (e.g. http://search.wellcomelibrary.org/iii/encore/record/C__Rb2020458?marcData=Y ) rather than fall back on a ye olde pre-Web interface?

tomcrane commented 8 years ago

It was quite painful to use... and it goes back to the 1970s. Probably drives an Austin Allegro.

You could try

...although surely the MARC XML is available directly from Sierra somewhere?

Those endpoints are not really for production, they were just me poking about to get something to convert to BIBFRAME. I don't know how much load they would take.

frankieroberto commented 8 years ago

@tomcrane I've learned not to say "surely" in the context of this project... 😉 But yeah - MARCXML (in the same format as the dumps) would be useful.

tomcrane commented 8 years ago

Is http://wellcomelibrarymarc.azurewebsites.net/marc/marcxml/b19784909 in the same format as the dumps?

@jennpb - http://tomcrane.github.io/wellcome-today/

(give it time to load... shows the most recent 24 open digitised items). The performance of this could be greatly improved by an additional param on the new feed:

http://wellcomelibrary.org/service/collections/recently-digitised/24

(i.e., give me the most recent x items). Would that be useful? It's so trivial I'll do it anyway.

tomcrane commented 8 years ago

http://wellcomelibrary.org/service/collections/recently-digitised/24/ now works.

And so does http://wellcomelibrary.org/service/collections/lucky-dip/24/

Test harness:

http://tomcrane.github.io/wellcome-today/

Notice the performance difference between the two. This is because the recently digitised material is "warm" in the cache. When this is all on the DLCS the difference between the two should be much less.

jennpb commented 8 years ago

Cool stuff @tomcrane ! Can I load up any IIIF collection, or only wellcome ones?

tomcrane commented 8 years ago

At the moment it expects that the loaded collection will be of manifests, and that each manifest has an IIIF image service available on the thumbnail property. All second-level Wellcome collections are like this, but generally IIIF collections would need a bit more navigating to pull out the image services. I will add some extra code to make it work with any collection at some point.
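As a rough illustration of that expectation, the following pulls the image service out of each manifest's thumbnail in an already-parsed collection. It assumes the IIIF Presentation 2.x layout (a `manifests` list, with `thumbnail` as an object carrying a `service` that has an `@id`); `thumbnail_services` is a made-up helper name, and this is not the test harness's actual code.

```python
def thumbnail_services(collection):
    """Return (manifest id, image service id) pairs from a collection dict.

    Manifests without a thumbnail image service are skipped, which is why a
    collection of collections yields nothing without extra navigation.
    """
    pairs = []
    for manifest in collection.get("manifests", []):
        thumbnail = manifest.get("thumbnail")
        if not isinstance(thumbnail, dict):
            continue  # thumbnail missing, or just a plain image URL
        service = thumbnail.get("service")
        if isinstance(service, dict) and "@id" in service:
            pairs.append((manifest.get("@id"), service["@id"]))
    return pairs
```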

If I added a IIIF icon to pages like this... http://wellcomelibrary.org/collections/browse/genres/Wet%20collodion%20negatives/ ...you could drag it into that page.

frankieroberto commented 8 years ago

@tomcrane yeah, http://wellcomelibrarymarc.azurewebsites.net/marc/marcxml/b19784909 is in the same format (MARCXML) as the dumps – is that data 'live'?

I wrote a bit of code over the weekend to parse the TXT format, which seems to work, so I can use that otherwise.

frankieroberto commented 8 years ago

@tomcrane so I'm guessing that your site at wellcomelibrarymarc.azurewebsites.net isn't using a live feed, as this recently-added b-number throws a server error: http://wellcomelibrarymarc.azurewebsites.net/marc/marcxml/b28037534 ?

jennpb commented 8 years ago

I can confirm that there's no live XML - I checked with Joao, Natalie and Branwen.

frankieroberto commented 8 years ago

@tomcrane one other thing: some of the navDates seem a bit suspicious (e.g. 0001-01-01T00:00:00.0000000) – is there a reason why these might be missing?
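Until the cause is known, a consumer of the feed could set such entries aside rather than treat year 1 as a real date. The sketch below assumes the feed is a IIIF collection with a `manifests` list and a `navDate` string on each manifest; `split_by_nav_date` is a hypothetical helper. For what it's worth, the example value is what an unset .NET `DateTime` serialises to, which may hint that the date was never populated, but that is a guess.

```python
# Prefix of the suspicious value seen in the feed (year 1, 1 January).
PLACEHOLDER_PREFIX = "0001-01-01"


def split_by_nav_date(collection):
    """Separate manifests with a usable navDate from those without one."""
    dated, undated = [], []
    for manifest in collection.get("manifests", []):
        nav_date = manifest.get("navDate")
        if not nav_date or nav_date.startswith(PLACEHOLDER_PREFIX):
            undated.append(manifest)
        else:
            dated.append(manifest)
    # Timestamps in one consistent ISO 8601 format sort chronologically as strings.
    dated.sort(key=lambda m: m["navDate"], reverse=True)
    return dated, undated
```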

frankieroberto commented 8 years ago

Closing this as it's done – but it'd still be good to know why some items in your feed are missing navDates @tomcrane.