wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Tweak identifier model for genre/subject identifiers in Sierra #2030

Closed alexwlchan closed 6 years ago

alexwlchan commented 6 years ago

The MARC tags we read for identifiers on genre/subject either come from a couple of prefilled headings, or a freeform text field in a different subfield:

Second Indicator
0 - Library of Congress Subject Headings
1 - LC subject headings for children's literature
2 - Medical Subject Headings
3 - National Agricultural Library subject authority file
4 - Source not specified
5 - Canadian Subject Headings
6 - Répertoire de vedettes-matière
7 - Source specified in subfield $2

https://www.loc.gov/marc/bibliographic/bd651.html

It's that 7 which is causing us issues.

The current code rejects any second indicator that's not "0" or "2", and so we have ~250k records on the DLQ. Some of these have second indicator "4", which would only need a small patch.

The remaining genres/subjects come from all over the shop. This is a brief tally from the first 4000 records:

Counter({'aiatsiss': 2,
         'bidex': 5,
         'bisacsh': 57,
         'blmlsh': 6,
         'cct': 22,
         'dcs': 2,
         'eclas': 88,
         'eflch': 1,
         'embne': 4,
         'fast': 8107,
         'fssh': 3,
         'gmgpc': 392,
         'gmgpc ': 4,
         'gnd': 2,
         'gsafd': 1,
         'idszbz': 10,
         'idszbzes': 3,
         'inriac': 4,
         'jhpk': 3,
         'jlabsh/4': 1,
         'larpcal': 139,
         'lcgft': 27,
         'local': 198,
         'ram': 174,
         'rasuqam': 22,
         'rbgenr': 444,
         'renib': 16,
         'rero': 8,
         'retrosciences': 1,
         'sao': 1,
         'sigle': 7,
         'unbist': 6})
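
For reference, a tally like the one above could be produced with something along these lines. This is only a sketch: the record shape (`varFields`, `marcTag`, `ind2`, `subfields`, `tag`, `content`) and the default tag list are assumptions for illustration, not necessarily what the script used.

```python
from collections import Counter


def tally_subfield_2(records, tags=("650", "651", "655")):
    """Count the $2 values on fields whose second indicator is "7".

    `records` is assumed to be an iterable of dicts shaped like Sierra
    bib JSON; the field names and the default tags are hypothetical.
    """
    counts = Counter()
    for record in records:
        for field in record.get("varFields", []):
            if field.get("marcTag") not in tags or field.get("ind2") != "7":
                continue
            for subfield in field.get("subfields", []):
                if subfield.get("tag") == "2":
                    # Values are counted verbatim, so 'gmgpc' and 'gmgpc '
                    # (trailing space) end up as separate keys.
                    counts[subfield.get("content", "")] += 1
    return counts
```

Counting the raw strings is deliberate here, since it surfaces data-quality problems such as the trailing space on `'gmgpc '`.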

I think we have a couple of options:

  1. Ignore all of them
  2. Add enumerations for the most common identifier schemes, and ignore the rest
  3. This list seems to be (partially) populated from the MARC site: https://www.loc.gov/standards/sourcelist/genre-form.html. We could get rid of the enumeration and load this list as a resource (à la Miro contributors). If we go down this route, we might need to think about replacing the snake-case-strings with proper sentences (which we might want to do anyway!)

jtweed commented 6 years ago

Here's what I'm thinking:

So that gives us:

0 lc-subjects
1 lc-childrens-subjects
2 nlm-mesh
3 nal-subjects
4 - no id
5 lac-canadian-subject-headings
6 rvm-topics
7 marc-subjects-{$2}
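
As a rough sketch of the mapping in that table (in Python rather than the transformer's own code): the function name, stripping whitespace from `$2`, and returning `None` for a missing `$2` or an unrecognised indicator are all assumptions for illustration.

```python
# Second indicator values with a fixed scheme, taken from the table above.
FIXED_SCHEMES = {
    "0": "lc-subjects",
    "1": "lc-childrens-subjects",
    "2": "nlm-mesh",
    "3": "nal-subjects",
    "5": "lac-canadian-subject-headings",
    "6": "rvm-topics",
}


def scheme_for(ind2, subfield_2=None):
    """Return the identifier scheme for a second indicator, or None for no id."""
    if ind2 == "4":
        # "Source not specified": no identifier at all.
        return None
    if ind2 == "7":
        # Assumption: strip stray whitespace (e.g. 'gmgpc ') and treat a
        # missing or empty $2 as "no id" rather than an error.
        source = (subfield_2 or "").strip()
        return f"marc-subjects-{source}" if source else None
    # Assumption: any other, unrecognised indicator also means "no id".
    return FIXED_SCHEMES.get(ind2)
```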

But I'd like @silveroliver to weigh in on this.

alexwlchan commented 6 years ago

If we go down the route of lc-subjects, lc-childrens-subjects, and so on, I’d like to push for having a human-readable label. Some of the names are quite opaque, and not especially searchable.

alexwlchan commented 6 years ago

So chatting to Jonathan just now:

  1. As a temporary fix, I’ll modify the transformer to drop any identifiers which aren’t LCSH, MeSH or (no ID), which should empty the DLQs.

  2. We want to think about moving to a new model for identifiers, maybe something like:

    {
        "scheme": "lc-subjects",
        "description": "Library of Congress Subject Headings",
        "value": "lcsh-123",
        "type": "Identifier"
    }

    with the details to be fleshed out while chatting to Silver.

  3. Don't index the description field (or similar) in Elasticsearch.
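
The temporary fix in step 1 might look roughly like this. It is a Python sketch, not the actual transformer code: `identifier_for` and the dict it returns are hypothetical, and the scheme names are borrowed from the table earlier in the thread.

```python
# Only these second indicators produce an identifier for now.
KEPT_SCHEMES = {
    "0": "lc-subjects",  # LCSH
    "2": "nlm-mesh",     # MeSH
}


def identifier_for(ind2, value):
    """Return an identifier dict, or None if the identifier should be dropped.

    Returning None instead of raising means the record is still transformed
    (without this identifier) rather than being sent to the DLQ.
    """
    scheme = KEPT_SCHEMES.get(ind2)
    if scheme is None or not value:
        # Covers "4" (no id), "7" (source in $2) and everything else.
        return None
    return {"scheme": scheme, "value": value, "type": "Identifier"}
```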

silveroliver commented 6 years ago

Proposed change to identifier model:

"identifiers": [
  {
    "identifierType": {
      "id": "lc-subjects",
      "label": "Library of Congress Subject Headings",
      "type": "IdentifierType" 
    } ,
    "value": "lcsh-123",
    "type": "Identifier"
  }
]
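
To make the shape concrete, here is a minimal Python sketch of types that serialise to the JSON above. The class and method names are illustrative assumptions, not the platform's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierType:
    id: str     # machine-readable scheme, e.g. "lc-subjects"
    label: str  # human-readable name of the scheme

    def to_json(self):
        return {"id": self.id, "label": self.label, "type": "IdentifierType"}


@dataclass(frozen=True)
class Identifier:
    identifier_type: IdentifierType
    value: str

    def to_json(self):
        return {
            "identifierType": self.identifier_type.to_json(),
            "value": self.value,
            "type": "Identifier",
        }
```

Keeping the label on the identifier type rather than on each identifier means the human-readable name is defined once per scheme.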

alexwlchan commented 6 years ago

I’d be happy with the suggestion above.

silveroliver commented 6 years ago

If we are happy with the new model, this has implications for the identifier model across the transforms, for example contributors and production events.

jtweed commented 6 years ago

I believe the way @alexwlchan is implementing it will change identifiers across the board to the new serialisation.

alexwlchan commented 6 years ago

Blocked on #2190.