wellcomecollection / catalogue-pipeline

:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.
https://developers.wellcomecollection.org/catalogue
MIT License

Some works are merged non-deterministically #1561

Open jamieparkinson opened 3 years ago

jamieparkinson commented 3 years ago

Context

We have previously assumed that the graphs recorded by the matcher would be relatively simple and predictable: for example, a Sierra work and a METS work, or a Sierra work and several Miro works. However, it turns out that for some works the graphs can be extremely complex:

[image: example of a complex matcher graph]

These complexities most often occur when several Sierra works link to a single Miro work, although we've seen that they can also occur when Sierra works link to each other. I've only explored the Miro case so far.

The problem

The merger has no understanding of graphs and just operates on lists of works (it does not care how they are linked, as this is the job of the matcher). While this works well for most cases, it does not handle the complex graphs: we see non-deterministic behaviour depending on the order in which the merger encounters the works. In the case of high-degree Miro works, this usually manifests as images being linked to the wrong Sierra works and so seeming to disappear.
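
The order dependence described above can be illustrated with a toy sketch. `toy_merge` and its "attach to the first Sierra work seen" rule are hypothetical and are not the real merger's rules; they only show how a merge over a flat list can give different results for the same set of works.

```python
from itertools import permutations

# Toy works as (source, id) pairs. Both Sierra works link to the same Miro work,
# but a list-based merge never sees those links.
works = [("sierra", "S1"), ("sierra", "S2"), ("miro", "M1")]


def toy_merge(works):
    """Hypothetical rule: attach all Miro works to the first Sierra work seen."""
    target_id = next(wid for source, wid in works if source == "sierra")
    image_ids = [wid for source, wid in works if source == "miro"]
    return target_id, image_ids


# Collect the merge target for every possible ordering of the same works.
targets = {toy_merge(list(order))[0] for order in permutations(works)}
print(sorted(targets))  # more than one target means the result is order-dependent
```

In this toy case the image ends up on either S1 or S2 depending purely on list order, which is the kind of "disappearing image" symptom described above.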

In some cases, we can identify these as errors in the catalogue records (i.e. incorrect linking). However, in many others, the links are valid representations - see the next comment for some examples.

Incomplete solutions

It seems likely that we can automatically break up graphs in the matcher by applying some sort of business logic, but this has turned out to be both difficult and incomplete. Perhaps we can be OK with this, but it needs some more work. Some initial work on this is given below.

Further work on this would probably aim to handle some of these cases automatically, and warn (or even error) on ambiguous cases. It may be that the solution is not to merge the ambiguous cases at all so as to ensure that no information is hidden unexpectedly.

jamieparkinson commented 3 years ago

Examples of ambiguous graphs

jamieparkinson commented 3 years ago

Graph splitting algorithm (incomplete, work in progress)

Def. An overlinked Miro work is one that has a Miro source identifier and is linked to by more than one subgraph - the number of such subgraphs is called its number of links. Note that a Miro work can have multiple neighbours and not be overlinked if those neighbours are still in the same (sub)graph when the Miro work is discounted.
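
One possible reading of this definition, as a minimal sketch: discount the Miro work, then count how many connected subgraphs its neighbours fall into. The names `build_adjacency`, `number_of_links`, `is_overlinked` and the `sources` mapping (work ID to source type) are assumptions made for illustration, not part of the matcher.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (work_id, work_id) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def number_of_links(adj, m):
    """Count the subgraphs that m's neighbours fall into once m is discounted."""
    seen = set()
    count = 0
    for start in adj[m]:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first walk that never passes through m
            node = stack.pop()
            for nxt in adj[node]:
                if nxt != m and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


def is_overlinked(adj, sources, m):
    """sources maps work ID -> source type, e.g. "miro" or "sierra" (assumed)."""
    return sources.get(m) == "miro" and number_of_links(adj, m) > 1


sources = {"S1": "sierra", "S2": "sierra", "M1": "miro"}

# Two Sierra works that are only connected through M1: M1 is overlinked.
separate = build_adjacency([("S1", "M1"), ("S2", "M1")])

# The same works, but S1 and S2 also link to each other: M1 has two
# neighbours yet only one link, so it is not overlinked.
joined = build_adjacency([("S1", "M1"), ("S2", "M1"), ("S1", "S2")])
```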

Def. A canonical parent record is a Sierra work that links to a Miro work and can be regarded as a description of the digitised content therein. Sierra records can link to Miro works but not be canonical parents - for example, a record describing a photo album that links to several Miro works containing individual digitised pages would not be a canonical parent if there also existed Sierra records for those individual pages (the latter would be the canonical parents).

Desiderata: Split works graphs such that non-ambiguous overlinked Miro works are eliminated and that the required partitions never separate them from their canonical parent.

Algorithm

G is a graph that contains at least one overlinked Miro work.

  1. Select all of the overlinked Miro works in G, call this ordered set M.

  2. Sort M by the number of links of each work in it.

  3. For each work m in M:

    If any neighbour of m has a neighbour that is a METS work, then remove all edges of m except the one connecting m to that neighbour. Otherwise:

    1. Find the lowest degree d of any neighbour of m.
    2. Find the set N of neighbours that have a weighted degree higher than d.
    3. If N is empty but d > 1, set N to all of the neighbours of m.
    4. Remove all edges that connect m to members of N.
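
The steps above can be sketched roughly as follows. This is one interpretation of a work-in-progress algorithm, not the matcher's implementation, and it makes several assumptions the text leaves open: M is sorted in ascending order of links, "weighted degree" is treated as plain degree, degrees are taken from the original graph G rather than updated after each removal, and when several neighbours have a METS neighbour the first one (by ID) is kept. The names `split_graph`, `count_links` and `sources` are hypothetical.

```python
from collections import defaultdict


def count_links(adj, m):
    """Number of subgraphs m's neighbours fall into when m is discounted."""
    seen, count = set(), 0
    for start in adj[m]:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt != m and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


def split_graph(edges, sources):
    """Return the edges left after splitting; sources maps work ID -> source type."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Assumption: degrees are measured on the original graph G.
    degree = {node: len(neighbours) for node, neighbours in adj.items()}

    # Steps 1 and 2: select the overlinked Miro works, sorted by number of links.
    overlinked = [n for n in adj if sources.get(n) == "miro" and count_links(adj, n) > 1]
    overlinked.sort(key=lambda n: (count_links(adj, n), n))  # ascending is assumed

    # Step 3: decide which edges to cut for each overlinked work m.
    for m in overlinked:
        neighbours = sorted(adj[m])
        with_mets = [
            n for n in neighbours
            if any(sources.get(x) == "mets" for x in adj[n])
        ]
        if with_mets:
            # Keep only the edge to a neighbour that has a METS neighbour.
            to_cut = [n for n in neighbours if n != with_mets[0]]
        else:
            d = min(degree[n] for n in neighbours)
            to_cut = [n for n in neighbours if degree[n] > d]
            if not to_cut and d > 1:
                to_cut = neighbours
        for n in to_cut:
            adj[m].discard(n)
            adj[n].discard(m)

    return {frozenset((a, b)) for a in adj for b in adj[a]}


# Photo album A links to two Miro works that also have their own page records.
album_sources = {"A": "sierra", "P1": "sierra", "P2": "sierra", "M1": "miro", "M2": "miro"}
album_edges = [("A", "M1"), ("A", "M2"), ("P1", "M1"), ("P2", "M2")]
album_result = split_graph(album_edges, album_sources)
```

Under these assumptions the album example keeps each Miro work with its page record (P1 with M1, P2 with M2) and detaches the album, which matches the canonical parent description above. Different choices for the open points, such as recomputing degrees after each removal, can give different partitions.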
jamieparkinson commented 3 years ago

Further examples

These are work IDs where the graph in which the work belongs contains an overlinked Miro work. Some of them are in the same graph.

Expand for full list ``` ['wunesuh2', 'j3g4x6zx', 'c4hvjeu5', 'skfue9bq', 'sgwmd3gw', 'd49j3kp9', 'nnsy84wp', 'frd2hcdg', 'n6x987r2', 'x8743kg9', 'r3edq87t', 'hcbs9g2j', 'qp2va8md', 'vq2gazss', 'gcwj7z5y', 'hw56qb3v', 'zbwyf3rx', 'jz7j3ff6', 'k2vxsbnw', 'cza66jd5', 'gya2thz8', 'gwc4g852', 'nf8dabab', 'zyxsshpg', 'w63af7tm', 'an5eje2w', 'ehh3km8t', 'zrj8d9f3', 'w7264r9g', 'rw78zyu3', 'gghs744b', 'rqhbeex3', 'qq6kevx3', 'pmf5ku6m', 'kkygbwqt', 'q2rcrknr', 'a7wwbxdn', 'xu8kksk8', 'gdty3mgp', 'tjfdz7hj', 'xz5axkyn', 'yg9gsxc3', 'sg9pf6f7', 'ub348n36', 't7xz5umn', 'qjme3ju4', 'urd6f2zn', 'rjh4berq', 'veyywkur', 'rts8fhzh', 'nanexq44', 'vup7874s', 'gya6pdnb', 'xdkabzvk', 'mesggkzj', 'dwvwnupa', 'u2vff6nw', 'p266z5et', 'cvyb3ezq', 'q846ec43', 'cjae2rqu', 'qwuu45g2', 'dp5kbp9a', 'gxn88qub', 'a55ztwxc', 'by88d7s8', 'r9xh85bq', 'mxvn2r59', 'g9uxt75f', 'zcmkns4g', 'ayauwm5y', 'f9zes3hw', 'yuxj6gfs', 'ehd9me9s', 'bkhsm54k', 'nzemx5a9', 'gcjqmh2u', 'g6rzk6y7', 'zd5mppgj', 'r2n4q6cm', 'mjztyzyq', 'a2swmpjh', 'q8k6bwc3', 's85zsm74', 'zmvfebeh', 'r3bc722s', 'uxqaw3tq', 'pnu996ns', 'hcbkgh5v', 'tbmj9gpr', 'rp84w5nk', 'j67zx644', 'yycyn7be', 't9srfpyz', 'nc2gm7wn', 'yv9htevp', 'fk4ujywn', 'me28acfk', 'xszs738h', 'x6mftkpy', 'a2urpfj6', 'vj6sa2fn', 'dhd7v633', 'dg3ptkm9', 'rjad2u6d', 'yd3vcwce', 'zwm6rxsn', 'gbu7cwft', 'ya2rewag', 'wyuwq9bj', 'et5wydjg', 'g8hfbh3r', 'v5gpcwsm', 'akaqd2a3', 'sr5mr527', 'cpb9a47c', 'ephaug8v', 'hv633444', 'hgnuxr8t', 'mw3w47td', 'd6e5myxw', 'r4a3vjnf', 'as49a2tk', 'syny8fc6', 'rj3np6h9', 'y4upwe97', 'hrzdawgu', 'enhppkth', 'ma2vj6sv', 'hrj5qnqj', 'qh8z8mzb', 'ft6xrv9c', 'u233ptqs', 'ebwgt5y7', 'k36hmdjr', 'pzqt6qy7', 'wbbae4qv', 'ypuf7qfm', 'uev7tts9', 'r4acvyxg', 'p89vq9eq', 'ees3mtrq', 'hsr83rsc', 'adugj8qy', 'b6trg8n8', 'kf94xrc9', 'wphdkw9j', 'hbp47pug', 'r5ymbrjj', 't8fwnmg8', 'd9nttj52', 'fma25m9j', 'mnrmze9k', 'tsxxawtf', 'bvntsvpy', 'z33nefwn', 'bzh6npvj', 'dk5829ay', 'rkunme52', 'nnzbjm9j', 'ev95hgj9', 'mvqssg7s', 'zm8dvktw', 'kjac67g7', 
'ke6mexmz', 'vhmgwdug', 'p5aq9dgn', 'ptpw56cq', 'z8w2mua3', 't77cw3ys', 'hpmj9w7e', 'g34dm4bu', 'gcaavu2b', 'qkr9k6e5', 'hauktadt', 'gkg99z76', 'rarsp9rz', 'cy8s2n7s', 'cftnuyht', 'e6cjsxr5', 'zrkk2wke', 'zyvcgxph', 'pwjch4ep', 'p8kbjjft', 'sr6p8bfh', 'cctpj6nc', 'rt9xdrhe', 'yd2zu86e', 'g8w2yg9h', 'd2pvqaef', 'metnuqny', 'x9n68pqg', 'w6q3xzvw', 'jg63sr4e', 'm35cabf8', 'cashknc9', 'awrwda5r', 'thrxsfev', 'g6kn7fj5', 'd8z8f2cx', 'w94hh6d2', 'fdg5h4mg', 'xdfvnn5n', 'yapascbg', 'rfw6us2n', 'yb7akc82', 'redfk4wt', 'fpsp62bf', 'murhqufb', 'y9qzzae4', 'n8hez93z', 'm4kg6rkf', 'fzkqgtm8', 'krbhwjct', 'jys43px7', 'jg978du4', 'tr4n82cx', 'fqtm5kr7', 'w4u3jejf', 'nhs5rgb6', 'nm4rwexg', 'dtmfj3zj', 'acsdnuhp', 'yun6mp3w', 'qwaz5nxb', 'hvxruajd', 'hr2nr6w4', 'z3wdvevs', 'sdnzs7vb', 'dc84hmk4', 'ek72ayzw', 'vfgsn9y8', 'wsvn2br6', 'weug74bu', 'adrj6nh2', 'yagwf7xh', 'a7ukwqyx', 'dbpmaqf4', 'krzw3556', 'yaay9hw7', 'vhhuqta5', 't2yrz68t', 'hxwbmnsj', 't3nz4h7v', 'd3rnsdy8', 'quh7snuk', 'xrbbmh67', 'aqzjup2t', 'gwteq7wm', 'ds9k3vjc', 'bf28kued', 'j34q2g3q', 'eye85t2p', 'p528emrk', 'vku43aft', 'kx4cf4rb', 'yu9tqetu', 'wu9x7bxj', 'aqmfax69', 'v3mu7n96', 'zzj896t7', 'kuw9t86g', 'aycgr7h4', 'cc4z9qh6', 'nqytucnc', 'm3uhz5xf', 'fvwnqv54', 'u3jcxq2k', 'akqub6fs', 'stuarksm', 'kh6zwpkm', 'd96kpndf', 'csrxq7gd', 'w8b88ezp', 'b53sndut', 'teg2xtrc', 'ybxuh6pa', 'e3m547pu', 'swqtf2ua', 'ka89smsp', 'bdacqfe9', 'uqqc2mfd', 'h5k4zd5x', 'zkq9kbyn', 'f3f6eefu', 'v77b9ad8', 'fmbb5eka', 'veehpvvp', 'vr9vyfgh', 't8ne3c93', 'pcvn5abd', 's79ww48c', 'srzaafaf', 'qh8r2jgj', 'zgsb4wy3', 'm29ymke2', 'gux8sx4m', 'z948dh2b', 'm38m6np2', 'dh8e48jf', 'grsny3fs', 'jgxbfcm7', 's82gggrw', 'buqfjjee', 'y6mg83n7', 'jjga75kr', 'k855fqu2', 'm8wq3cfy', 'h8z87etb', 'mbghvhcp', 'fmqytvwm', 'vv3qhvr4', 'xf3j87bj', 'g76bdkfj', 'n53jq623', 'jfeua395', 'gdwkcfpx', 'zu4ahq8z', 'njk6c97e', 'jcdnbbm8', 'eeavpdva', 'amtwwdc2', 'hm5thm42', 'vcjha9u4', 'vm2x3q5c', 'y9a8xrqm', 'wywmd7bt', 'zzmb46na', 'zzwpmvpk', 'cuasrwvx', 
'gujxtrbj', 'twx5b8c6', 'wvr4wuw3', 'fnaspv8v', 'pxjjfh2x', 'gmm6f92u', 'pwtgj7q8', 'q6fxhatn', 'ad354tk3', 'mfzeebgj', 'sfjxgd8e', 'gtuhmyfb', 'rrufyc2b', 'xq54hjye', 'zq88vnzc', 'swfzr8tg', 'wmnwyxxc', 'mycy9pqh', 'hfzd7tyx', 'a9rcsnxe', 'f5q8jrpj', 'rtr5zr2g', 'x4svfndb', 'q7xp3k55', 'bymcqnjk', 'zfzg7p82', 'xgcxhb87', 'yjub6dsb', 'rrhzpnr9', 'xcux653v', 'u9xntt6m', 'bmmcmnwb', 'hdv6m5g8', 'abgdsbw3', 'jxp629f3', 'my3nf5f4', 'em8uzz4n', 'uxh3c5uq', 'k7mj93ud', 'rrzrcdfx', 'vccdt4xt', 'k6j4spja', 'amnqbp2f', 'gy22denv', 'jp46jz4s', 'bz889mq3', 'b5b4dh93', 'eunp2apv', 'ff6p7crx', 'jhbn9xdh', 'cxxbrtpm', 'kfxqvy7d', 'ekuv3df9', 'bn29vm7e', 'enpm4vak', 'a6kpjyyh', 'ttnwq8hz', 'u253sgpx', 'xyee682t', 'p5axej94', 'wke9r2t8', 'jrbkbdgr', 'ar4qarv8', 't3apkkc7', 'fqgrbcgy', 'u9f7mqqt', 'kzqcpycs', 'w7wddw6a', 'dnmmmk5j', 'c7hfrmjs', 'ebyfdp8p', 'cepeys33', 'cych6w3e', 'rq66na6d', 'eshxnppc', 'fwjaaej3', 'hsxspey6', 'gwfvzmhz', 'hwy4kkv9', 'uvekmvn7', 'p27cwcey', 'utn47ecb', 'gz5hzn4t', 'bs3zgxwj', 'hjhy5ath', 'yda94ufp', 'gnh4kuh7', 'vvxpq8aq', 'rfdqm5wa', 'wqz8w96b', 'n9hc56w8', 'f5tjrmhw', 'ev5ye67y', 'xqkc8my2', 'fua72ea5', 'awpssn6j', 'bg6jm5w9', 'vhre2z5c', 'sxzwqck8', 'q7ugpa9a', 'ejtypb97', 'ckkd5sh4', 'kq9vbuuu', 'ces76u5e', 'fgutqqay', 'c9k2wjhh', 'e57udgt4', 'fde7xm37', 'pg2ytwg9', 'hqj4zaem', 'qjxnq9es', 'pqubzu42', 'tb9evgka', 'wpedb5am', 'ra8ay5q8', 'tg9rfxre', 'btnkqq5f', 'npgrm34a', 'wq5utzpw', 'zq4v7eek', 'vuaxksa5', 'gejkbznm', 'gqr3uc8w', 'e9h5a9bs', 'cwq6uwqr', 'd2phfmdq', 'vr9tqxeb', 'x9xs9ujx', 'wfpye6h9', 'tgkz9s6b', 'wg8abeeu', 'y8yfwu8k', 'u6c8uvss', 'm4fjhmqv', 'eqddrzv7', 'jsza366f', 'jd2wf5kx', 'zx9w9aeb', 'atqg5ygj', 'zyub6gbx', 'xwn6g36c', 'sv76t25r', 'yu8pk969', 'vehbgqh9', 'dm3zs6rm', 'f7xas2yw', 'c6f43xue', 'nz3u5st3', 'szrqt38w', 'mfcbe895', 'u5quqv5r', 'pfyc22aa', 'ycxckds6', 'c7stu3jd', 'vwhscbzw', 'pvrg8q6y', 'axwmhxvy', 'u88r7swm', 'n69tsmxj', 'ankarrkr', 'kx5gspe7', 'j2f63pxm', 'mk7jxuaj', 'tyfy63x3', 'suetj89b', 'f4jcpkyb', 
'drbfaek9', 'y546mmzm', 'vkzutmkw', 's3f9ccu9', 'eftf4uek', 'szjj6gqy', 'xngnkf2v', 's5f4wxj4', 'hcp7etnh', 'tg4d4erm', 'gzw5sv6j', 'n5rfzk9a', 'mckwcvx7', 'zeuvfttb', 'bsaqumcq', 'ahvh5dvs', 'xeqgtce7', 'h3dkcjnp', 's3jvx6ag', 'cdsz5wt8', 'e2b4xpwg', 'enp5bdmp', 'v3xwn39t', 'f9uwz3ar', 'rmskqmhg', 'wy5w8w4t', 'kpreuaus', 'wjzzhm83', 'p6cm8pee', 'xrskjt7x', 'h2cs5f2x', 'pcaqt2y4', 'sbmkzfak', 'hfvaer53', 'c5ahp95u', 's7qt7a55', 'fdaawytm', 'gr5a4h5d', 'nca3m8hb', 'pcz7nbdn', 'pmk4cwyv', 'cb3cc9aj', 'syj55bc7', 'dnnwagqy', 'xyfqdw7k', 'q553rcqy', 'b24c5zgy', 'tg5rmhmx', 'h2vu6ypp', 'au3kk3sp', 'd6h9szft', 'yjrd8f84', 'a5aa75q9', 'cga2f22h', 'xvt2uk7s', 'wdf8mhvz', 'wrtykhg7', 'sk2tckew', 'jsjq5baz', 'yj3rm3nk', 'b5zs3efh', 'esent4sj', 'nszmhqmc', 'acyutm8f', 'ytm39q9e', 'jz2hm8ek', 'skjmfmvw', 'b2xxc6ck', 'qfy77ym5', 'faxzxsqr', 'k6durmnz', 't46zrtbb', 'jnx586zh', 'a3f3cy3d', 'wy3b2h6a', 's58tvxv2', 'yse7tt59', 'bb9ufk9p', 'a5g8wztv', 'wvv5ktqz', 'acmseadm', 'n2ur9asy', 'cb4zq8fs', 'm4nahwcu', 'fg95wmy8', 'bmtyrdr8', 'df8dsnev', 'je6n8un9', 'gg4ja66v', 'ffyk6tr9', 'vhcb8mt8', 'bqmzbdp6', 'as3nun54', 'd2vtr6cd', 'a36hf2f5', 'fmh8z8pc', 'h9uuz845', 'dymn2e9d' ```
jamieparkinson commented 3 years ago

Jupyter notebook

This notebook contains some snippets that were used in investigating these works (you will need to either change the kernel or create your own matcher-nb kernel).

jamieparkinson commented 2 years ago

A somewhat relevant conversation, which shows another case of the merger's lack of knowledge of graph topology (i.e. it only has knowledge of the nodes) causing an issue: https://wellcome.slack.com/archives/C8X9YKM5X/p1628601560049300?thread_ts=1628597088.049100&cid=C8X9YKM5X

alexwlchan commented 2 years ago

Another potential wrinkle I thought of while chatting with @cbowskill recently:

Consider a journal which has been catalogued in Sierra with:

Where do you put the digitised items for the issues? Ideally you have them on both, but that's completely impossible in our current approach (and there's no way to render multiple items on the journal page).

paul-butcher commented 5 months ago

I noticed this when thinking about Archivematica/CALM merges a few weeks ago (Slack). I've not suffered from it; it was just something that I spotted.