Closed alexwlchan closed 6 years ago
Here's what I'm thinking:
So that gives us:
0 lc-subjects
1 lc-childrens-subjects
2 nlm-mesh
3 nal-subjects
4 - no id
5 lac-canadian-subject-headings
6 rvm-topics
7 marc-subjects-{$2}
But I'd like @silveroliver to weigh in on this.
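For the sake of discussion, the list above could be sketched as a lookup like this. It is only an illustration: `SCHEME_BY_INDICATOR` and `scheme_for` are hypothetical names, not anything in the transformer, and raising on an unknown indicator is an assumption about how we'd want to fail.

```python
from typing import Optional

# Second-indicator values mapped to identifier schemes, as listed above.
# "4" means the heading has no identifier, so it maps to None.
SCHEME_BY_INDICATOR = {
    "0": "lc-subjects",
    "1": "lc-childrens-subjects",
    "2": "nlm-mesh",
    "3": "nal-subjects",
    "4": None,
    "5": "lac-canadian-subject-headings",
    "6": "rvm-topics",
}


def scheme_for(second_indicator: str, subfield_2: Optional[str] = None) -> Optional[str]:
    """Return the identifier scheme for a heading, or None if it has no id."""
    if second_indicator == "7":
        # With indicator 7 the source is named in subfield $2.
        if not subfield_2:
            raise ValueError("second indicator 7 requires a value in subfield $2")
        return f"marc-subjects-{subfield_2}"
    if second_indicator not in SCHEME_BY_INDICATOR:
        # Assumption: unknown indicators are an error rather than silently dropped.
        raise ValueError(f"unrecognised second indicator: {second_indicator!r}")
    return SCHEME_BY_INDICATOR[second_indicator]
```

Keeping the fixed values in a table and treating "7" as the only computed case means adding a scheme is a one-line change.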
If we go down the route of lc-subjects, lc-childrens-subjects, and so on, I’d like to push for having a human-readable label. Some of the names are quite opaque, and not especially searchable.
So chatting to Jonathan just now:
As a temporary fix, I’ll modify the transformer to drop any identifiers that aren’t LCSH, MeSH or (no ID), which empties the DLQs.
We want to think about moving to a new model for identifiers, maybe something like:
{
  "scheme": "lc-subjects",
  "description": "Library of Congress Subject Headings",
  "value": "lcsh-123",
  "type": "Identifier"
}
with the details to be fleshed out while chatting to Silver.
Don't index the description field (or similar) in Elasticsearch.
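As a rough sketch of the "don't index the description" point, an Elasticsearch mapping for an identifier with the fields suggested above might look like the following. The field types (`keyword`, `text`) and the name `IDENTIFIER_MAPPING` are assumptions for illustration, not what the API currently uses. The `"index": false` setting keeps the field in `_source` but makes it unsearchable.

```python
import json

# Hypothetical mapping for the identifier object sketched above.
IDENTIFIER_MAPPING = {
    "properties": {
        "scheme": {"type": "keyword"},
        # Returned in results via _source, but not searchable.
        "description": {"type": "text", "index": False},
        "value": {"type": "keyword"},
        "type": {"type": "keyword"},
    }
}

# Render the mapping as the JSON body Elasticsearch would receive.
mapping_json = json.dumps(IDENTIFIER_MAPPING, indent=2)
print(mapping_json)
```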
Proposed change to identifier model:
"identifiers": [
  {
    "identifierType": {
      "id": "lc-subjects",
      "label": "Library of Congress Subject Headings",
      "type": "IdentifierType"
    },
    "value": "lcsh-123",
    "type": "Identifier"
  }
]
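To make the proposed shape concrete, here is a minimal sketch of how it could be modelled and serialised. The class and method names (`IdentifierType`, `Identifier`, `to_json`) are hypothetical and only mirror the JSON above; the real models may well look different.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierType:
    id: str      # machine-readable scheme, e.g. "lc-subjects"
    label: str   # human-readable name for display

    def to_json(self) -> dict:
        return {"id": self.id, "label": self.label, "type": "IdentifierType"}


@dataclass(frozen=True)
class Identifier:
    identifier_type: IdentifierType
    value: str

    def to_json(self) -> dict:
        # Field names follow the proposed serialisation above.
        return {
            "identifierType": self.identifier_type.to_json(),
            "value": self.value,
            "type": "Identifier",
        }


lcsh = IdentifierType(id="lc-subjects", label="Library of Congress Subject Headings")
identifier = Identifier(identifier_type=lcsh, value="lcsh-123")
print(json.dumps({"identifiers": [identifier.to_json()]}, indent=2))
```

Splitting the type out into its own object means the label lives in one place per scheme instead of being repeated as free text on every identifier.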
I’d be happy with the suggestion above.
If we are happy with the new model, this has implications for the identifier model across the transforms, for example contributors and production events.
I believe the way @alexwlchan is implementing it will change identifiers across the board to the new serialisation.
Blocked on #2190.
The MARC tags we read for identifiers on genre/subject come either from a couple of prefilled headings or from a freeform text field in a different subfield:
It's that 7 which is causing us issues.
The current code rejects anything that's not "0" or "2", and so we have ~250k records on the DLQ. Some of these have second indicator "4" (no identifier), which would be a small patch to handle.
The remaining genres/subjects come from all over the shop. This is a brief tally from the first 4000 records:
I think we have a couple of options: