Mint IDs for unidentified concepts in the catalogue

paul-butcher commented 2 years ago

We ignore FAST subjects in the transformer, because they often result in duplication (e.g. against the same concept in MeSH in the same record)

I wonder if the same problem is true WRT unidentified concepts. Would this introduce a problem further downstream?

paul-butcher commented 2 years ago

Example with unidentified concepts: b24303318 (qrvudmy7)

A treatise on bear's grease, with observations to prove how indispensible the use of that incomparable substance, to preserve the head of hair

650  0 Hair|xCare and hygiene|vEarly works to 1800. 
650  0 Hairstyles|xHistory|vEarly works to 1800.

Currently yields this in the API

{
  "subjects": [
    {
      "label": "Hair - Care and hygiene - Early works to 1800",
      "concepts": [
        {
          "label": "Hair",
          "type": "Concept"
        },
        {
          "label": "Care and hygiene",
          "type": "Concept"
        },
        {
          "label": "Early works to 1800",
          "type": "Concept"
        }
      ],
      "type": "Subject"
    },
    {
      "label": "Hairstyles - History - Early works to 1800",
      "concepts": [
        {
          "label": "Hairstyles",
          "type": "Concept"
        },
        {
          "label": "History",
          "type": "Concept"
        },
        {
          "label": "Early works to 1800",
          "type": "Concept"
        }
      ],
      "type": "Subject"
    }
  ]

paul-butcher commented 2 years ago

There was a bit of a lack of testing in this space - I think I could have added this functionality without breaking any existing tests. I've added a few, as well as the happy path for this functionality in concepts.

TODO: I still need to extend this to the other concept-like things we want to identify in this way, and ensure that all possibilities are covered.

paul-butcher commented 2 years ago

Something I spotted when testing was that the tests in this space function more like mocks than matchers.

There are places where the return value from the SUT is checked to match against an object generated by the various object.apply calls that are found within the SUT.

The "extraction" part of it, where it finds the data in the Sierra data to pass to those apply calls is being tested, but what goes on inside apply is just being called again in the test.

The result of this is that we don't actually know what the member values are in any of the instances. If we changed (e.g. ParsedPeriod.apply() to reverse the label it's trying to parse, we'd never know.

Many of the tests in this space are currently fairly readable, given that they assume there are no ids anywhere, but would become much harder to read when all these new IDs are implemented.

As a result, I have created some Matchers so that the tests can be read in a more narrative fashion. Rather than "calling this function like this returns the same thing as calling that function like that", the new tests read more like "calling this function returns something that has these properties and those properties".

A further step might be to populate them with given/when/then/and, to make the narrative even clearer (because there are things in each test that don't directly answer the question posed by the test title, but are still important), but there's enough yak shaving going on here already.

paul-butcher commented 2 years ago

https://github.com/wellcomecollection/catalogue-pipeline/pull/2149 (Draft PR for visibility)

paul-butcher commented 2 years ago

There is an inconsistency between Subject and Genre. Currently:

Subject has its own id, if available in $0, which represents the whole MARC field, and no Concepts have ids Genre has no id, and the first Concept accepts the id from $0

My policy with this change is that everything that currently has an identifier will continue having that same identifier. Only things that do not currently end up with an identifier will change.

paul-butcher commented 2 years ago

This record has a 110 field with no second indicator, but the id in 0 is the LCNames identifier that matches the label

https://catalogue.wellcomelibrary.org/search/a?searchtype=Y&searcharg=Glucodin%2C+Minadex%2C+Adexolin+and+Adexocal+&searchscope=12&SORT=D

I assume that the behaviour should be as though indicator2=0 is the default (in subject/contributor fields).

That said, this has the potential to introduce error. If the field is actually in the "wrong" scheme, and the id value coincidentally corresponds with an LCNames id, which is unrelated to the subject/contributor in question, then we would end up with dodgy links.

Perhaps the safest thing to do is this:

indicator2	subfield 0	no subfield 0
"0"	lc names	label derived
""	unidentifiable	label derived
anything else	unidentifiable	unidentifiable

i.e.

only create ids when there is no id.
only extract ids when we know what the id is
don't create or extract ids when the scheme is unexpected

paul-butcher commented 2 years ago

Actually, it looks like it should assume LCNames, according to tests for Contributors., e.g. https://github.com/wellcomecollection/catalogue-pipeline/blob/704cec1f6c43496313aebe0cc167e8b5aac32021/pipeline/transformer/transformer_sierra/src/test/scala/weco/pipeline/transformer/sierra/transformers/SierraContributorsTest.scala#L226-L255

wellcomecollection / concepts-pipeline

Mint IDs for unidentified concepts in the catalogue #3