wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License
13 stars 2 forks source link

There's something strange going on with hierarchies on Works. #2500

Closed paul-butcher closed 11 months ago

paul-butcher commented 11 months ago

https://wellcome.slack.com/archives/C8X9YKM5X/p1702638185461429

https://wellcomecollection.org/works/b62y2grn is erroneously appearing as a child of https://wellcomecollection.org/works/vsyg427x within the FPA archive.

paul-butcher commented 11 months ago

My initial thought is that the candidate causes for this kind of thing include:

paul-butcher commented 11 months ago

in /pipeline/matcher/scripts $ yarn getMatcherGraph b62y2grn returns a graph with a single node, so it should not be merging with anything.

Looking at works-identified-2023-11-09/_doc/b62y2grn it has no mergeCandidates

...
"state": {
      "sourceIdentifier": {
        "identifierType": {
          "id": "sierra-system-number"
        },
        "ontologyType": "Work",
        "value": "b33292656"
      },
      "canonicalId": "b62y2grn",
      "sourceModifiedTime": "2023-12-13T12:00:48Z",
      "mergeCandidates": [], <-------------HERE-----------
...
paul-butcher commented 11 months ago

Also in works-identified-2023-11-09/_doc/b62y2grn, it has a collectionPath with no slashes, which means it expects to be the root of its own hierarchy.

...
    "collectionPath": {
        "path": "3329265i"
      },
  ...

This property is set from the Sierra MARC field 001 on any documents that contain fields of type 774, which can be seen here (VPN required) b3329265

paul-butcher commented 11 months ago

Oddly, in works-merged (works-merged-2023-11-09/_doc/b62y2grn), the document has a collectionPath of "SAFPA/CB/2/41/43", which is exactly where it is curently being displayed. I don't yet know what could have set collectionPath like this.

There are no redirectSources in the version in works-merged, so nothing has been merged into it.

paul-butcher commented 11 months ago

Another possibility in cases like this is that there was a source data problem, and although it has since been resolved, the pipeline has hit a snag, and it's not been updated past a certain stage. (it doesn't look like this is the case here)

The final (indexed) version has the following times: "modifiedTime": "2023-12-13T12:00:48Z" "mergedTime": "2023-12-13T12:03:59.519348Z", "indexedTime": "2023-12-13T12:23:39.438640Z",

The merged version has the following times in its state: "mergedTime": "2023-12-13T12:03:59.519348Z", "sourceModifiedTime": "2023-12-13T12:00:48Z",

The identified version "sourceModifiedTime": "2023-12-13T12:00:48Z",

So at every stage, it has the same source modified time, which suggests that they are all built from the same version of the source data. Similarly, mergedTime matches between merged and indexed and the times on the later stages are after those on the earlier stages. So it looks like this document has passed through the system as expected.

paul-butcher commented 11 months ago

Just to be sure, I've checked the DLQs. the merger has some (7) messages on its DLQ, five of which are sub3yrcq(which is part of a different story. However, the three distinct messages there do not look relevant (sub3yrcq , zzdb6wka and dd63gq8m).

So this points to a problem somewhere in the relation embedder (including the path concatenator)

paul-butcher commented 11 months ago

The path concatenator is where that paucity of data mentioned in the first comment is resolved. When faced with a path branch/leaf it looks up root/branch and returns root/branch/leaf.

It's not obvious how this could result in what we are seeing here, but it's worth a look. This has highlighted to me that the path concatenator hampers observability

paul-butcher commented 11 months ago

This is interesting. The James Gardiner Collection has suffered the same fate, and in the same position.

If you visit the preceding or following sibling, and refresh the page (the hierarchy is not reloaded when you just click on the link), then the Drawings of Childbirth entry in the tree changes to James Gardiner.

This makes it look like there's something special about the "parent" record that's making it grab hold of these two collections.

paul-butcher commented 11 months ago

Using the message injector, I pushed b62y2grn onto id_minter_output.

If this does not fix the problem, then there is something awry in the processing of that document itself. If it does, then either this is something wrong with vsyg427x or it was some strange ephemeral thing I may not be able to reproduce.

I will eventually do the same with ctu3s9j3 and vsyg427x to get everything back to normal, but I want to check the behaviour one-at-a-time.

paul-butcher commented 11 months ago

I wonder if something funny happened in the Batcher. See logs&_a=(columns:!(log),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!t,index:cb5ba262-ec15-46e3-a4c5-5668d65fe21f,key:ecs_cluster,negate:!f,params:(query:catalogue-2023-11-09),type:phrase),query:(match_phrase:(ecs_cluster:catalogue-2023-11-09)))))&breadcrumb=%23%2F%3F_g%3D(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-7d%2Cto%3Anow))%26_a%3D(columns%3A!(log)%2Cfilters%3A!(('%24state'%3A(store%3AappState)%2Cmeta%3A(alias%3A!n%2Cdisabled%3A!f%2Cindex%3Acb5ba262-ec15-46e3-a4c5-5668d65fe21f%2Ckey%3Aecs_cluster%2Cnegate%3A!f%2Cparams%3A(query%3Acatalogue-2023-11-09)%2Ctype%3Aphrase)%2Cquery%3A(match_phrase%3A(ecs_cluster%3Acatalogue-2023-11-09))))%2Cindex%3Acb5ba262-ec15-46e3-a4c5-5668d65fe21f%2Cinterval%3Aauto%2Cquery%3A(language%3Akuery%2Cquery%3ASAFPA)%2Csort%3A!(!('%40timestamp'%2Cdesc))))

All those 332963i/3329263i.nn paths are part of the Rigshospitalet tree, In this log in particular%26_a%3D(columns%3A!(log)%2Cfilters%3A!(('%24state'%3A(store%3AappState)%2Cmeta%3A(alias%3A!n%2Cdisabled%3A!f%2Cindex%3Acb5ba262-ec15-46e3-a4c5-5668d65fe21f%2Ckey%3Aecs_cluster%2Cnegate%3A!f%2Cparams%3A(query%3Acatalogue-2023-11-09)%2Ctype%3Aphrase)%2Cquery%3A(match_phrase%3A(ecs_cluster%3Acatalogue-2023-11-09))))%2Cindex%3Acb5ba262-ec15-46e3-a4c5-5668d65fe21f%2Cinterval%3Aauto%2Cquery%3A(language%3Akuery%2Cquery%3ASAFPA)%2Csort%3A!(!('%40timestamp'%2Cdesc)))), they are all in an input batch with SAFPA/CB/2/41/43.

However, the selectors created look correct - it correctly identifies three distinct trees and provides the selectors for them.

paul-butcher commented 11 months ago

b62y2grn has gone all the way through now, and the problem is still there.

paul-butcher commented 11 months ago

So I tried vsyg427x, with no success. Though now the Rigshospital content is the one that appears in that slot. I assume that the last one to be processed gets presented there. James Gardiner is still stuck in the tree if you go straight to it.

paul-butcher commented 11 months ago

Got it! This is an interesting one, so I thought I'd just exclaim in here, then explain in the next comment.

paul-butcher commented 11 months ago

In the path concatenator, when a record is trying to find its parent or its children, it splits off the head or tail of its own path, then constructs a wildcard query for it.

So, given trunk/branch, it looks for its parent */trunk and its children branch/*

All paths are constructed with this / path separator.

When we have a path with just one segment (i.e. it is the root of its own tree), it still does the same attempt at a query, using the split path. However, the head of the path, as used by the query, is not the whole of the path(because it's looking for the parent of this record), but an empty string. So the parent query looks for */. The tail is the whole string, so root works as expected to find its children: root/*.

(admittedly this has always been inefficient, and it has now proven to be incorrect as well)

The / separator is expected to only go between path segments. We don't expect to find it at the start or end of a collectionPath.

Enter hzg8xkez (b1953498x&_a=(columns:!(),filters:!(),index:d5ea4b8c-f58d-409c-8fb7-8a9973ea67f7,interval:auto,query:(language:kuery,query:'varField.subfields.content.keyword:%20SAFPA%2FCB%2F2%2F41%2F43%2F'),sort:!()))), which has a collectionPath of SAFPA/CB/2/41/43/

So now, any document that is ingested as the root of its own tree is erroneously turned into a child of SAFPA/CB/2/41/43/. Even better, because it has happened due to the lack of its own path segment, it becomes SAFPA/CB/2/41/43/, which is then "corrected" to SAFPA/CB/2/41/43.

paul-butcher commented 11 months ago

This will need some reindexing to fully fix, but if we're lucky, then it has only hit the ones mentioned above, so I can do a targeted reindex of just those.

I'll do some analysis today to see how widespread it is.

paul-butcher commented 11 months ago

There are 48 records that have been affected by this, so it's simple enough to just push those through when we're ready.

works-merged-2023-11-09/_search  {
  "query": {
    "bool":{
    "filter": {
      "wildcard": {
        "data.collectionPath.path.keyword": {
          "value": "SAFPA/CB/2/41/43"
        }
      }
    } 
    }
  }  ,
  "_source":false,
"size": 100
}
kenoir commented 11 months ago

There are 48 records that have been affected by this

How did you find those records?

paul-butcher commented 11 months ago

By running the query in that comment on works-merged-2023-11-09 in pipeline-2023-11-09 (using the developer console in ES).

I found hzg8xkez earlier in the thread with a similar query containing "value": "*/" (in fact, this is why the 48 record query uses an unnecessary wildcard query, because I just used the same JSON and changed the query term), and discovered it is the only record with a trailing slash. So all the records that have been hit by this issue will have a path of SAFPA/CB/2/41/43.

Then I turned it into a list of ids for use in the message injector mentioned here thus:

$ grep "_id" 'response.json' | sed 's/.*: "//' | sed 's/".*//'

I think I will also have to reload hzg8xkez, to get it back into its rightful position, and purge whichever cuckoo is sitting in its place under vsyg427x at the time. (so it's actually 49)

paul-butcher commented 11 months ago

This is all fixed and working.

However, you may still see the wrong tree due to caching. I have checked with a different browser, and both https://wellcomecollection.org/works/b62y2grn and https://wellcomecollection.org/works/ctu3s9j3 look OK.

https://wellcomecollection.org/works/pss8y4yn shows that the James Gardiner Collection is back to normal and https://wellcomecollection.org/works/d6zfhy2b shows that the Rigshospitalet collection is back to normal, even if the cached version appears not to be.