wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License

Data in the reporting cluster does not match data elsewhere #2541

Open paul-butcher opened 9 months ago

paul-butcher commented 9 months ago


While investigating https://github.com/wellcomecollection/catalogue-pipeline/issues/2536, I noticed that one of the problematic records has an invalid MARC 650 field. It declares that it holds a Library of Congress id (ind2=0), but it contains a MeSH id (subfield 0 starts with D).

The record in question is b1269556, but the problem is not unique to that record. I have also seen it with D009524Q000266, D010297, and plenty of other incorrectly marked MeSH ids.

650  0 Newspapers|xEnglish.|0D009524. 

When this record last went through the pipeline, it logged an error:

Could not determine LoC scheme from id 'D009524'

When I looked it up in VHS, the field in question was incorrect, as expected (ind2=0, subfield 0=D009524):

{
  "fieldTag": "d",
  "marcTag": "650",
  "ind1": " ",
  "ind2": "0",
  "subfields": [
    {
      "tag": "a",
      "content": "Newspapers"
    },
    {
      "tag": "x",
      "content": "English."
    },
    {
      "tag": "0",
      "content": "D009524."
    }
  ]
},
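The mismatch described above can be checked mechanically. What follows is only a minimal sketch, not code from the pipeline: it assumes varfields shaped like the VHS excerpt, and it relies on the heuristics used in this thread (a MeSH id starts with D followed by digits, ind2=0 means Library of Congress). The function names are hypothetical.

```python
import re

# Heuristic from this thread, not from any cataloguing standard:
# a MeSH id looks like "D" followed by digits (e.g. D009524, D009524Q000266).
MESH_ID = re.compile(r"^D\d+")


def subfield_contents(varfield, tag):
    """Return the contents of every subfield with the given tag."""
    return [
        sf.get("content", "")
        for sf in varfield.get("subfields", [])
        if sf.get("tag") == tag
    ]


def is_mesh_marked_as_loc(varfield):
    """True if a 6xx varfield claims LoC (ind2=0) but carries a MeSH-looking id."""
    marc_tag = varfield.get("marcTag") or ""
    if not marc_tag.startswith("6"):
        return False
    if varfield.get("ind2") != "0":
        return False
    return any(MESH_ID.match(c.strip()) for c in subfield_contents(varfield, "0"))


# The varfield from b1269556, as it appears in VHS.
varfield = {
    "fieldTag": "d",
    "marcTag": "650",
    "ind1": " ",
    "ind2": "0",
    "subfields": [
        {"tag": "a", "content": "Newspapers"},
        {"tag": "x", "content": "English."},
        {"tag": "0", "content": "D009524."},
    ],
}

print(is_mesh_marked_as_loc(varfield))  # True
```

Running a check like this over the 6xx varfields of every record would give a rough count of affected records, independent of what the reporting cluster currently holds.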

I decided to build a report on this, to see how widespread the issue is and to help get it resolved.

Imagine my surprise when the reporting cluster appeared to contain no 6xx varfields with ind2=0 and a MeSH id. I know this to be incorrect, as I am currently looking at one. I searched reporting for a few other known offenders, and they all have the correct ind2 value (2) there.

I cannot work out where this is coming from. How does the reporting cluster end up with different content to everywhere else?

paul-butcher commented 9 months ago

I did recently fix a problem in the adapters that could have caused reporting to miss some changes that the pipeline picked up.

However, that's the wrong way round.

paul-butcher commented 9 months ago

I have just run a window covering the last known change on that document {"start": "2023-08-18T09:35:00.000000+00:00", "end": "2023-08-18T09:35:36+00:00"}, and ind2 in reporting is now 0.
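For anyone repeating this, a window payload of the shape shown above can be built as follows. This is a minimal sketch: `make_window` is a hypothetical helper, and how the resulting message is delivered to the adapter is not covered here.

```python
import json
from datetime import datetime, timezone


def make_window(start, end):
    """Build a {"start", "end"} window payload from two timezone-aware datetimes."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be before end")
    return {"start": start.isoformat(), "end": end.isoformat()}


# The window from the comment above.
window = make_window(
    datetime(2023, 8, 18, 9, 35, 0, tzinfo=timezone.utc),
    datetime(2023, 8, 18, 9, 35, 36, tzinfo=timezone.utc),
)
print(json.dumps(window))
```

Requiring timezone-aware datetimes avoids accidentally sending a local-time window, which could silently miss the change being targeted.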

This is now the only ind2=0 varfield whose code does not start with sh.

Perhaps they were all correct originally, and then a batch process on Sierra broke them all. This record was correct in 2021.

paul-butcher commented 9 months ago

Right. Very odd.

b1269556 was originally correct, then at 2023-08-10T18:08:07Z it changed ind2 from 2 to 0.

So, we have (had) two problems here:

  1. The source data problem. It's wrong.
  2. The reporting hole. Perhaps I was wrong when I estimated that it had been broken since September. The data is now going into reporting as expected, but there is still a gap of some kind.

This source data problem would have been easier to spot and deal with if we had a better way to report dodgy content to collections staff.

paul-butcher commented 9 months ago

The same is true of b30489313. In August last year (2023-08-17 15:53:15Z), it changed from the correct ind2=2 to the incorrect ind2=0.