wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Create some spreadsheets to help the Collections team with the ephemera inventory #5674

Closed alexwlchan closed 1 year ago

alexwlchan commented 1 year ago

The Collections team are doing an inventory of the ephemera collections in the autumn, and have asked if we can extract some data from Sierra in a more useful structure than the exports they can get from Sierra directly.

I had a conversation with Alex H yesterday; this ticket is a summary of our conversation and what we're going to prepare as a first pass.

What is the ephemera collection?

For our purposes, it's anything for which:

How are records in the ephemera collection structured?

Useful background: basic bib and item record linking.

The inventory is only interested in physical items – although some of the ephemera collection has been digitised and there are items in Sierra for those digitised copies, we should ignore them in our analysis (and in the discussion below).

There are several cases which we need to think about.

Case #1: only a bib record for the box – the most minimal level of cataloguing
Case #2: a bib and item record for the box – this box-level item is the thing that can be ordered by readers
Case #3: a bib and item record for the box, and bib records for the individual objects inside the box, but no items. This may occur when the individual objects have been digitised; a bib record is created that digitised items can be attached to. (It's unclear whether the bib–bib link here and in case #4 is conceptual or is actually in the Sierra data.)

The box-level item record is linked to every individual bib so you can order the box from the catalogue page for the bib.
Case #4: as case #3, but also with item records for the individual objects. The item records on the individual records are for cataloguing only; the only orderable item is the box-level one, so you can't order individual objects from the box. This is the highest level of cataloguing.

For cases 3 and 4, I'm not sure whether there's anything in the Sierra metadata that identifies a bib record as being at box level, but it usually becomes clearer when you look at a connected component of all Sierra bibs/items that are linked together (using https://github.com/wellcomecollection/catalogue-pipeline/pull/2322):

[Image: Screenshot 2023-03-31 at 07 47 46]

Notice here that:

We can also look at the refno/shelfmarks, which have a structure to them – EPH36 identifies the box-level record, whereas the individual objects are EPH36:1, EPH36:2, EPH36:3. (Note: the suffix is not strictly numeric; it sometimes follows the chronological order of items in the box, so you can get e.g. EPH36:3a when an item has to be added to the sequence later.)
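As a rough illustration of that shelfmark structure, here's a minimal Python sketch that splits a refno/shelfmark into its box part and its per-object suffix. The regex and the names `parse_shelfmark` and `suffix_sort_key` are my own assumptions based on the EPH36 / EPH36:3a examples; real Sierra data may contain other forms that this doesn't handle.

```python
import re

# Assumed pattern (not confirmed against real Sierra data):
# "EPH" + digits for the box, then optionally ":" + a suffix for an object.
SHELFMARK_RE = re.compile(r"^(?P<box>EPH\d+)(?::(?P<suffix>\S+))?$")


def parse_shelfmark(shelfmark):
    """Return (box, suffix) for a shelfmark, or None if it doesn't match.

    suffix is None for a box-level shelfmark such as "EPH36".
    """
    match = SHELFMARK_RE.match(shelfmark.strip())
    if match is None:
        return None
    return match.group("box"), match.group("suffix")


def suffix_sort_key(suffix):
    """Sort key that copes with non-numeric suffixes such as "3a".

    Splits the leading digits from whatever follows, so "3a" sorts
    after "3" and before "10". Suffixes without leading digits go last.
    """
    match = re.match(r"^(\d+)(.*)$", suffix)
    if match is None:
        return (float("inf"), suffix)
    return (int(match.group(1)), match.group(2))
```

For example, `parse_shelfmark("EPH36:3a")` gives the box `EPH36` and the suffix `3a`, and sorting suffixes with `suffix_sort_key` keeps a later insertion like `3a` next to `3` rather than sorting it as plain text.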

What do the inventory team need?

Eventually they're going to do an item-level inventory of the ephemera collection, following case #4. However, it's not clear how many records have that granular level of metadata, and there's no easy way for them to work it out from Sierra alone. I think we can help them out here.

We can analyse the Sierra data in bulk, and group it into per-ephemera item "clusters". Then we'll produce three lists:

This will allow them to assess the scale of the problem; later we can discuss getting those lists into a form which is useful for inventory.

pollecuttn commented 1 year ago

I know nothing about how the ephemera is catalogued, but looking at Sierra re:

For cases 3 and 4, I'm not sure whether there's anything in the Sierra metadata that identifies a bib record as being at box level

i14883077 and b15629843, which you've inferred are box level bibs, have bib lvl c COLLECTION in their Sierra bib records.

i10733796 has bib lvl d COLL. SUBUNIT in the Sierra bib record.

Don't know if that's helpful or not or if Alex H can tell you how those are used.

bib lvl is a fixed-length field.

alexwlchan commented 1 year ago

I started by scraping the reporting cluster for any bibs/items with a matching refno/shelfmark, which gave me a complete list of affected b-numbers, attached below.

ephemera_bnumbers.txt

alexwlchan commented 1 year ago

I'm gonna copy the text of an email I sent Alex Hill with my analysis:


I did some analysis trying to identify the “clusters” of linked records like we discussed in March; attached is some initial analysis.

I started with an initial list of:

then I expanded to include every other bib/item that was linked to those records. These turned into the mini “clusters” of linked records that we discussed in March.
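The "expand to everything linked" step is essentially finding connected components in a graph of bib/item links. This isn't the script I actually ran, just a minimal sketch of the idea in Python; the function name `find_clusters`, the shape of the `links` input (pairs of record IDs) and the optional `records` argument are all assumptions for illustration.

```python
from collections import defaultdict


def find_clusters(links, records=()):
    """Group record IDs into clusters of records that are linked together.

    links   -- iterable of (record_id, record_id) pairs, e.g. (bib, item)
    records -- optional extra IDs, so that records with no links at all
               still come back as single-record clusters
    """
    # Build an undirected adjacency map of the links.
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for record_id in records:
        neighbours.setdefault(record_id, set())

    seen = set()
    clusters = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from `start`.
        cluster = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(neighbours[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters
```

Each returned set is one cluster; a box catalogued only at bib level (case 1) would appear as a single-record cluster if its ID is passed in `records`.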

I’ve been able to bin the clusters into two buckets:

1) Only two records in the cluster – a bib and an item

This is stuff that’s only catalogued at box level, if I understand the terminology correctly? They’re in the spreadsheet boxes.csv; 649 rows in total.

2) Everything else

Surprise! The data is quite messy. The attached zip graphs.zip contains 593 diagrams showing what each cluster of linked records looks like, along with their refno/shelfmark. I’m not sure I can write a good automated classifier for these, but they do give a shape of the data. I think the diagram gives a better picture of what’s going on than any spreadsheet I could create quickly, e.g. b15567175 is catalogued to item level, and b15610159 isn’t.
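For reference, the two-bucket split above can be sketched roughly like this. It's a simplified illustration rather than the exact code behind boxes.csv: it assumes each cluster is a set of Sierra record numbers, and that bibs start with "b" and items with "i" (as in the numbers quoted in this thread). The name `bin_clusters` is hypothetical.

```python
def bin_clusters(clusters):
    """Split clusters into (box_only, everything_else).

    A cluster counts as "box only" when it contains exactly two
    records: one bib and one item. Anything else goes in the second
    bucket for manual inspection.
    """
    box_only = []
    everything_else = []
    for cluster in clusters:
        # Assumption: record type is given by the ID prefix.
        bibs = [r for r in cluster if r.startswith("b")]
        items = [r for r in cluster if r.startswith("i")]
        if len(cluster) == 2 and len(bibs) == 1 and len(items) == 1:
            box_only.append(cluster)
        else:
            everything_else.append(cluster)
    return box_only, everything_else
```

This only captures the easy bucket; it makes no attempt to classify the messier clusters, which is why the diagrams are more useful there.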

alexwlchan commented 1 year ago

Closing as done for now; will reopen if Alex wants us to dig in further.