Data in the reporting cluster does not match data elsewhere

paul-butcher commented 9 months ago

While investigating https://github.com/wellcomecollection/catalogue-pipeline/issues/2536, I noticed that one of the problematic records has an invalid MARC 650 field. The field declares a Library of Congress id (ind2=0), but contains a MeSH id (subfield 0 starts with D).

The record in question is b1269556, but this is not unique to that record. I have also seen this error occur with D009524Q000266 and D010297 and plenty of other incorrectly marked MeSH ids.

650  0 Newspapers|xEnglish.|0D009524.

When this record last went through the pipeline, it logged an error:

Could not determine LoC scheme from id 'D009524'

When I looked it up in VHS, the field in question was incorrect, as expected (ind2=0, subfield 0=D009524):

{
  "fieldTag": "d",
  "marcTag": "650",
  "ind1": " ",
  "ind2": "0",
  "subfields": [
    {
      "tag": "a",
      "content": "Newspapers"
    },
    {
      "tag": "x",
      "content": "English."
    },
    {
      "tag": "0",
      "content": "D009524."
    }
  ]
}
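The mismatch described above can be checked mechanically. The following is a minimal Python sketch, assuming varfields shaped like the VHS excerpt above; the function name `mislabelled_mesh_ids` is hypothetical, and the "starts with D" test is only the heuristic used in this issue, not a full MeSH id validator.

```python
def mislabelled_mesh_ids(varfield):
    """Return subfield 0 values that look like MeSH ids in a 6xx field
    whose second indicator claims Library of Congress (ind2=0).

    Assumption: `varfield` is a dict shaped like the VHS excerpt above.
    """
    marc_tag = varfield.get("marcTag") or ""
    if not marc_tag.startswith("6"):
        return []
    if varfield.get("ind2") != "0":
        return []
    return [
        subfield["content"]
        for subfield in varfield.get("subfields", [])
        # Heuristic from this issue: a MeSH id in subfield 0 starts with "D".
        if subfield.get("tag") == "0"
        and subfield.get("content", "").strip().startswith("D")
    ]


# The 650 field from b1269556, as shown above.
example = {
    "fieldTag": "d",
    "marcTag": "650",
    "ind1": " ",
    "ind2": "0",
    "subfields": [
        {"tag": "a", "content": "Newspapers"},
        {"tag": "x", "content": "English."},
        {"tag": "0", "content": "D009524."},
    ],
}

print(mislabelled_mesh_ids(example))
```

The same field with ind2 set to 2 would return an empty list, since that is the correct indicator for a MeSH id.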

I decided I should make a report on this, to see how widespread the issue is, and facilitate its resolution.

Imagine my surprise when the report showed no 6xx varfields with ind2=0 and a MeSH id. I know this to be incorrect, as I am currently looking at one. I searched for a few other known offenders, and in the reporting cluster they all have the correct ind2 value (2).

I cannot work out where this is coming from. How does the reporting cluster end up with different content to everywhere else?

paul-butcher commented 9 months ago

I did recently fix a problem in the adapters that could have caused reporting to miss some changes that the pipeline picked up.

However, that's the wrong way round.

The source data error is the kind of error one makes when originally writing the content. I can't imagine someone coming along and changing ind2 to the wrong value.
I have looked back through some of the history of this record on VHS. It last changed before that problem arose, and this field seems to have always contained ind2=0.

paul-butcher commented 9 months ago

I have just run a window covering the last known change on that document {"start": "2023-08-18T09:35:00.000000+00:00", "end": "2023-08-18T09:35:36+00:00"}, and ind2 in reporting is now 0.

This is now the only ind2=0 varfield whose code does not start with sh.

Perhaps they were already all correct and then a batch process on Sierra broke them all. This record was correct in 2021.

paul-butcher commented 9 months ago

Right. Very odd.

b1269556 was originally correct, then at 2023-08-10T18:08:07Z it changed ind2 from 2 to 0.

So, we have (had) two problems here:

1. The source data problem. It's wrong.
2. The reporting hole. Perhaps I was wrong when I estimated it had been broken since September. The data is now going into reporting as expected, but there is still a gap of some kind.

The existence of this source data problem could have been easier to spot and deal with if we had a better way to report dodgy content to collections staff.

paul-butcher commented 9 months ago

The same is true of b30489313. In August last year (2023-08-17 15:53:15Z) it changed from the correct ind2=2 to the incorrect ind2=0.
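Finding the moment a field flipped can be sketched as a scan over a record's stored versions. This is a rough Python sketch, not how VHS is actually queried: the `versions` list, its `modified` and `ind2` keys, and the function name `ind2_changes` are all assumptions for illustration. The earlier snapshot timestamp is a placeholder; only the 2023-08-10T18:08:07Z change comes from this thread.

```python
def ind2_changes(versions):
    """Return (timestamp, old_ind2, new_ind2) for each version in which
    ind2 differs from the preceding version.

    Assumption: each version is a dict with an ISO 8601 UTC `modified`
    string (same format throughout, so string order is time order)
    and the `ind2` value of the field of interest.
    """
    ordered = sorted(versions, key=lambda version: version["modified"])
    changes = []
    for previous, current in zip(ordered, ordered[1:]):
        if previous["ind2"] != current["ind2"]:
            changes.append(
                (current["modified"], previous["ind2"], current["ind2"])
            )
    return changes


# Illustrative history for b1269556. The first timestamp is a placeholder,
# not a real snapshot; the second is the change reported above.
history = [
    {"modified": "2021-01-01T00:00:00Z", "ind2": "2"},
    {"modified": "2023-08-10T18:08:07Z", "ind2": "0"},
]

print(ind2_changes(history))
```

Running the same scan across other affected records would show whether the changes cluster around the same dates, which would support the batch-process theory.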

wellcomecollection / catalogue-pipeline

Data in the reporting cluster does not match data elsewhere #2541