REF/RB slow indexing

yulgit1 commented 4 years ago

Blacklight Indexing currently runs weekly starting at 6:30PM Saturday.

~60000 LIDO records take 50 minutes
~57000 MARC records used to take 17 hrs
~57000 MARC records now take 42 hrs

The bump from 17 to 42 hours occurred after making a change to look up Credit Line in the 583 field of the mfhd.

It's likely that the difference between LIDO (<1 hr) and MARC (~17 hr) is due to the lookup of language and call number in the DC harvester for the MARC records.

@flapka @robl @edgartdata @EdwardTown1 @KraigBinkowski @mxgold Is there a way of getting this information w/o the DC and MFHD lookups to make ingest more performant?

One suggestion: run a separate process to populate a lookup DB table with language, callnumber, and credit line.
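
The lookup-table suggestion above could be sketched roughly as follows. This is a minimal sketch, not the project's code: a JSON file stands in for the DB table, and the method names (`write_lookup`, `load_lookup`), the key `bib:123`, and the sample values are all hypothetical. The point is that the slow remote calls happen in a separate job, and the indexer only reads a local table.

```ruby
require "json"
require "tmpdir"

# Separate process: store language, call number, and credit line per record.
# `rows` is a hash keyed by record id (hypothetical shape).
def write_lookup(path, rows)
  File.write(path, JSON.generate(rows))
end

# Indexer: load the table once, then read from memory instead of
# calling the DC and MFHD services for every record.
def load_lookup(path)
  JSON.parse(File.read(path))
end

path = File.join(Dir.mktmpdir, "marc_lookup.json")

# Illustrative row only; these values are placeholders.
write_lookup(path, {
  "bib:123" => {
    "language"    => "eng",
    "call_number" => "PZ8 F34 no.3",
    "credit_line" => "Yale Center for British Art, Paul Mellon Collection"
  }
})

lookup = load_lookup(path)
row = lookup.fetch("bib:123", {}) # an absent id yields an empty row
puts row["call_number"]
```

A record missing from the table would simply index without those three fields, so the separate job can fail or lag without blocking the weekly ingest.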

flapka commented 4 years ago

@yulgit1 @robl @edgartdata @EdwardTown1 @KraigBinkowski @mxgold

I'm certainly open to using a different method for harvesting the MARC data, if it functions as needed.

On a related note (sort of), in our next Zoom meeting I'd like to discuss display of Credit Line information and whether it might be clustered with information about call numbers and Aeon requests. Such clustering would be especially useful when we have more than one copy of an item.

Compare the clustering of call number, credit line, and Aeon request links in this Orbis record: http://hdl.handle.net/10079/bibid/580800

to the same record in Blacklight: http://ycba-collections-dev.herokuapp.com/catalog/orbis:580800

The most significant problem with the Blacklight display is that it doesn't allow a patron to specify which copy she'd like to request (via Aeon).

I should have anticipated this problem earlier (sorry).

flapka commented 4 years ago

@yulgit1 @KraigBinkowski Returning to the above, taking this record again as an example: http://ycba-collections-dev.herokuapp.com/catalog/orbis:580800

At a minimum, YCBA libraries want to group information like this:

Collection: Rare Books and Manuscripts
Call Number: PZ8 F34 no.3
Credit Line: Yale Center for British Art, Paul Mellon Collection
Access: Accessible by request in the Study Room [Request]

Collection: Rare Books and Manuscripts
Call Number: PR3991.A1 J33 1854
Credit Line: Yale Center for British Art, Gift of Gloria Scheuer
Access: Accessible by request in the Study Room [Request]

In addition to that grouping, it'd be nice if there was some light visual demarcation for each cluster of information (as there is in Orbis, but it can be more subtle). Perhaps just a tad more spacing between these clusters and the fields that precede/follow?

flapka commented 4 years ago

@yulgit1 I recognize that we are also looking for a way to speed up the harvesting, now that we have to make a call to the related MFHD records.

If we grab the call number from the MFHD, that leaves just the language field for querying the related DC record. It would be nice if we could drop that DC query, by getting language from the primary MARCXML record (which ought to be possible). I will troubleshoot this issue again to see if there is a rock left unturned.

yulgit1 commented 4 years ago

@flapka - the call to the MFHD during indexing is proving to be problematic. In addition to adding 25 hours to the 1 hr process, the service itself appears to go down at 7AM daily, resulting in ~1000 errors each day (Sun 7AM and Mon 7AM in this weekend's ingest).

The call to DC for language isn't perfect either. This weekend it errored ~50 times, due to intermittent outages. I think this has gone unnoticed because it just results in a missing language field, which only shows up as a lower count in the facet. This DC language lookup adds 17 hrs to the 1 hr process.

I am open to suggestions about workarounds, but I cannot think of anything other than accepting +60 hr weekly ingests with ~2000 errors if these fields are critical.

I will send an email to Michael Appleby and Yue. Michael mentioned something in the meeting last week about providing an alternative for holdings lookup, which may help with this.

Also, the clustering you suggest sounds simple but cannot be done within the Blacklight framework. What might be possible is to create one field called "Holdings" or something similar, and collate the four parallel arrays (collection/call number/credit line/Aeon link) into it.
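
A minimal sketch of that collation, assuming the four arrays are parallel (entry 0 of each describes the same copy). The method name `collate_holdings` and the `example.org` Aeon links are hypothetical; the other values are the ones quoted earlier in this thread.

```ruby
# Merge four parallel arrays into one "Holdings" entry per copy.
# Assumption: index i in every array refers to the same copy.
def collate_holdings(collections, call_numbers, credit_lines, aeon_links)
  collections.zip(call_numbers, credit_lines, aeon_links).map do |collection, call_number, credit_line, aeon_link|
    {
      "Collection"  => collection,
      "Call Number" => call_number,
      "Credit Line" => credit_line,
      "Request"     => aeon_link
    }.reject { |_label, value| value.nil? || value.empty? } # skip missing parts
      .map { |label, value| "#{label}: #{value}" }
      .join(" | ")
  end
end

holdings = collate_holdings(
  ["Rare Books and Manuscripts", "Rare Books and Manuscripts"],
  ["PZ8 F34 no.3", "PR3991.A1 J33 1854"],
  ["Yale Center for British Art, Paul Mellon Collection",
   "Yale Center for British Art, Gift of Gloria Scheuer"],
  ["https://example.org/aeon?copy=1", "https://example.org/aeon?copy=2"] # placeholder links
)

puts holdings
```

If the arrays are not guaranteed to line up copy by copy, this approach would pair the wrong credit line with a call number, so the ordering assumption needs checking against the index.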

yulgit1 commented 4 years ago

@flapka, all: Another alternative: don't index the MFHD; fetch it on page load instead (which would slow down the item page just a little), and figure out a way to get the language from the MARCXML. There would then be no credit line facet, although I don't think we have that field now for library materials anyway, so it would still be an improvement. I think the library's quicksearch does a dynamic lookup, but they don't have the facet to worry about.

What do you think?

flapka commented 4 years ago

@yulgit1 I think your latter suggestion makes sense.

It's also possible that we could put the credit line information in the bib record and the MFHD:

in the bib record (541) so that it could be indexed for the credit line
in the MFHD so that we can connect the credit line with the copy to which it applies (can't do this in a strong way in the bib record)

It's not a particularly elegant solution, but there are plenty of inelegant things in our MARC. It'd be functional.

KraigBinkowski commented 4 years ago

Right now we just add the credit information in the bib record's 541 field, correct? Would adding the credit info to the MFHD mean re-cataloguing the MARC records we currently have, or would the process be automatic? As for elegance, multiple copies are displayed in both vufind and blacklight as just extra call numbers -- not very elegant to begin with. As far as I can see, copies are not always designated.

yulgit1 commented 4 years ago

@KraigBinkowski @flapka - please see the item pages for rb/ref. I was able to extend the Blacklight metadata at the bottom and added a "Holdings" block. This info isn't indexed; it is pulled dynamically from the library IT MFHD service.

Note that the call number, collection, credit line, and access fields above are now redundant, but they remain because many of the ref records I checked don't have MFHD holdings, and if the fields were removed that information would be completely absent.

flapka commented 4 years ago

Thanks @yulgit1, this is a promising step.

Before I respond at greater length, are you able to point me to one or two examples of items that didn't return MFHD holdings (which is unexpected)?

yulgit1 commented 4 years ago

https://ycba-collections-dev.herokuapp.com/catalog/orbis:8707133
https://ycba-collections-dev.herokuapp.com/catalog/orbis:4318533

flapka commented 4 years ago

@yulgit1 @KraigBinkowski That's helpful, thanks!

Probably the exceptions are all (?) cases where the holding code has a suffix added to bacrb, bacref, or bacia. The two examples above are bacrefv and bacrefp, respectively. Would it be possible to retrieve all MFHDs whose holding code begins with bacrb, bacref, or bacia (that is, bacrb*, bacref*, and bacia*)?
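
A prefix match along these lines would cover the suffixed codes. This is a sketch only: the constant and method names are hypothetical, and how the MFHD service actually exposes the holding code is not shown in this thread.

```ruby
# Base holding codes named in this thread.
HOLDING_PREFIXES = %w[bacrb bacref bacia].freeze

# True for a base code or any suffixed variant (e.g. bacrefv, bacrefp).
def ycba_holding_code?(code)
  code.to_s.downcase.start_with?(*HOLDING_PREFIXES)
end

%w[bacref bacrefv bacrefp bacrb bacia sml].each do |code|
  puts "#{code}: #{ycba_holding_code?(code)}"
end
```

One caveat: a bare prefix match would also accept any unrelated code that happens to start with the same letters, so an explicit list of known suffixes may be safer if such codes exist.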

(This mostly impacts Ref material. rb and ia have fewer than 10 items outside of the base codes. Ref has thousands.)

yulgit1 commented 4 years ago

~57000 MARC records: indexing used to take 17 hrs
~57000 MARC records: indexing without the Dublin Core lookup now takes 2 hrs

~57000 MARC records pulled using the old fieldhand harvester took ~6 min
~57000 MARC records pulled using the alternative ruby-oai harvester took ~1.5 hours

so total savings 13.5 hrs/ingest!
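
For reference, the arithmetic behind that figure, using only the timings quoted in this comment (the method name is just for illustration):

```ruby
# All values in hours, taken from the timings above.
def net_saving_hours
  index_before   = 17.0       # indexing with the Dublin Core lookup
  index_after    = 2.0        # indexing without it
  harvest_before = 6.0 / 60   # fieldhand harvester, ~6 min
  harvest_after  = 1.5        # ruby-oai harvester

  index_saving = index_before - index_after     # 15 hrs gained
  harvest_cost = harvest_after - harvest_before # ~1.4 hrs lost
  index_saving - harvest_cost
end

puts format("net saving: %.1f hrs per ingest", net_saving_hours)
```

The result is about 13.6 hrs, in line with the rounded 13.5 hrs above.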

It was necessary to change to the ruby-oai harvester because the original fieldhand harvester uses an XML parsing library called Ox, which collapses multiple spaces into one. That breaks the MARC 008 field, which is fixed-width and relies on whitespace to keep each value at its defined character position.
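
To illustrate why collapsed whitespace breaks the 008: it is a 40-character field read by position, with the language code at positions 35-37. The sketch below builds an illustrative 008 (the values are made up, not from a real record) and reads the language before and after collapsing runs of spaces. The helper names are hypothetical.

```ruby
# Build a 40-character 008 field from its fixed-width parts.
def sample_008
  [
    "850423",  # 00-05 date entered on file
    "s",       # 06    type of date
    "1854",    # 07-10 date 1
    " " * 4,   # 11-14 date 2 (blank)
    "enk",     # 15-17 place of publication
    " " * 17,  # 18-34 material-specific positions (left blank here)
    "eng",     # 35-37 language
    " ",       # 38    modified record
    "d"        # 39    cataloging source
  ].join
end

# The language code is read purely by position.
def language_from_008(field)
  field[35, 3]
end

# Mimics a parser that collapses runs of spaces into one.
def collapse_spaces(text)
  text.gsub(/ {2,}/, " ")
end

intact    = sample_008
collapsed = collapse_spaces(intact)

puts language_from_008(intact).inspect    # "eng"
puts language_from_008(collapsed).inspect # nil: the field is now too short
```

Once the blanks are collapsed, every position after the first run of spaces shifts left, so positional reads return the wrong characters or nothing at all.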

There is one small issue: when the ruby-oai harvester completes, it doesn't finish gracefully in the logs (it reports that there is no resumption token on the last request).
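
In OAI-PMH, the last page of a list response carries an empty resumption token, so a harvest loop can treat a missing or empty token as normal completion rather than an error. The sketch below shows that loop shape only: `fetch_page` is a hypothetical stand-in for whatever call fetches one page, not the ruby-oai API, and the stubbed pages are invented.

```ruby
# Harvest every page until the server signals the end of the list.
# fetch_page takes the previous token (nil for the first request) and
# returns [records, next_token].
def harvest_all(fetch_page)
  records = []
  token = nil
  loop do
    page_records, token = fetch_page.call(token)
    records.concat(page_records)
    # A nil or empty token means the list is complete; stop quietly.
    break if token.nil? || token.strip.empty?
  end
  records
end

# Stubbed pages for illustration: the final page has an empty token.
PAGES = {
  nil  => [%w[rec1 rec2], "t1"],
  "t1" => [%w[rec3], ""]
}.freeze

all_records = harvest_all(->(token) { PAGES.fetch(token) })
puts all_records.inspect
```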

Will let the whole process run in this weekend's ingest as a final assessment.

flapka commented 4 years ago

@yulgit1 @KraigBinkowski

I notice this afternoon that everything we've requested above has been successfully implemented, and works splendidly (I think)!

I see also that Blacklight passes the MFHD id to Aeon (as a parameter), which is a functional requirement when we have multiple copies. Works great! Many thanks Eric.

edgartdata commented 4 years ago

Keep issue open a little while longer for testing purposes.

yulgit1 commented 4 years ago

Indexing last weekend was successful. Keeping this ticket open as I've added an additional index, so there is one for test and one for prod, with a cron copy that can be turned off for prod, so testing can occur once we go live.

edgartdata commented 4 years ago

@yulgit1 will work on this over the weekend and update us next week.

edgartdata commented 4 years ago

This process to get at RBMS data from MFHD might change in the Fall when Sterling launches its data cloud repo.

ycba-cia / blacklight-collections2

REF/RB slow indexing #190