Open paul-butcher opened 9 months ago
I did recently fix a problem in the adapters that could have caused reporting to miss some changes that the pipeline picked up.
However, that's the wrong way round.
I have just run a window covering the last known change on that document, {"start": "2023-08-18T09:35:00.000000+00:00", "end": "2023-08-18T09:35:36+00:00"}, and ind2 in reporting is now 0.
This is now the only ind2=0 varfield whose code does not start with sh.
Perhaps they were already all correct and then a batch process on Sierra broke them all. This record was correct in 2021.
Right. Very odd.
b1269556 was originally correct, then at 2023-08-10T18:08:07Z it changed ind2 from 2 to 0.
So, we have (had) two problems here:
The existence of this source data problem could have been easier to spot and deal with if we had a better way to report dodgy content to collections staff.
The same is true of b30489313. In August last year (2023-08-17 15:53:15Z) it changed from the correct ind2=2 to the incorrect ind2=0.
When investigating https://github.com/wellcomecollection/catalogue-pipeline/issues/2536, I noticed that one of the problematic records has an invalid MARC 650 field. It declares that it contains a Library of Congress id (ind2=0), but it actually contains a MeSH id (subfield 0 starts with D).
The record in question is b1269556, but this is not unique to that record. I have also seen this error occur with D009524Q000266 and D010297 and plenty of other incorrectly marked MeSH ids.
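The mismatch described above could be detected with a check along these lines. This is only a rough sketch, not the pipeline's or the reporting cluster's actual code: the varfield shape (marcTag, ind2, and a list of subfields with tag and content) is an assumption about how the Sierra data is serialised, and the function name and the id pattern are made up for illustration.

```python
import re

# Assumed pattern for a MeSH id: "D" plus digits, optionally followed by a
# qualifier ("Q" plus digits), e.g. D009524 or D009524Q000266.
MESH_ID = re.compile(r"^D\d+(Q\d+)?$")


def subfield_values(varfield, tag):
    """Return the stripped contents of every subfield with the given tag."""
    return [
        (sf.get("content") or "").strip()
        for sf in varfield.get("subfields", [])
        if sf.get("tag") == tag
    ]


def is_mislabelled_mesh(varfield):
    """True if a 6xx field declares LCSH (ind2=0) but subfield 0 looks like a MeSH id.

    Hypothetical helper: the dictionary keys used here are assumptions.
    """
    marc_tag = varfield.get("marcTag") or ""
    if not marc_tag.startswith("6"):
        return False
    if varfield.get("ind2") != "0":
        return False
    return any(MESH_ID.match(value) for value in subfield_values(varfield, "0"))


# Illustrative varfields; only the ids D009524 and D009524Q000266 come from this issue.
broken = {
    "marcTag": "650",
    "ind2": "0",
    "subfields": [{"tag": "0", "content": "D009524"}],
}
correct = {
    "marcTag": "650",
    "ind2": "2",
    "subfields": [{"tag": "0", "content": "D009524"}],
}

print(is_mislabelled_mesh(broken))
print(is_mislabelled_mesh(correct))
```

Running a check like this over the 6xx varfields in both VHS and the reporting cluster would show whether the two stores disagree on the same records.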
When this record last went through the pipeline, it logged an error:
When I looked in VHS for it, the field in question was incorrect as expected (ind2=0, subfield 0=D009524).
I decided I should make a report on this, to see how widespread the issue is, and facilitate its resolution.
Imagine my surprise when there appeared to be no 6xx varfields with ind2=0 and a MeSH id. I know this to be incorrect, as I am currently looking at one. I searched for a few other known offenders, and they all have the correct ind2 value (2).
I cannot work out where this is coming from. How does the reporting cluster end up with different content to everywhere else?