wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Be better about handling Sierra records with a 260 and a 264 #2280

Closed alexwlchan closed 6 years ago

alexwlchan commented 6 years ago

Spun out from #2252.

There are 92 Sierra records with both a 260 and a 264, which the transformer currently flags as a cataloguing error.

It looks like, at least in some cases, the 264 field only contains a copyright statement, which we discard. For example:

260    Würzburg :|bKönigshausen & Neumann,|c[2015] 
264  4 |c©2015 

Afflicted records:
1528064 2064911 2119618 2119830 2124305 2124759 2124895 2124946 2126925 2126978 2132114 2133427 2136956 2148880 2150151 2153999 2159646 2163242 2163244 2163272 2173146 2177570 2199854 2199855 2200264 2200398 2200399 2200481 2256170 2264988 2267509 2277618 2347667 2397796 2398615 2438982 2438986 2439205 2472339 2472341 2472344 2472349 2474267 2474409 2474435 2475494 2475517 2476701 2476718 2477059 2477190 2477191 2477299 2490813 2491108 2491132 2491381 2496696 2497815 2498764 2500828 2500878 2535394 2707300 2823118 2826732 2846303 2846354 2846850 2846902 2846904 2846920 2847024 2849024 2851379 2851422 2851473 2852703 2853511 2855098 2855232 2855775 2855867 2871407 2881998 2928394 2992788 3006352 3015891 3016012 3034408 3043187

alexwlchan commented 6 years ago

We can apply two fairly blunt heuristics to reduce the size of this problem:

That leaves us with four records which will hit this error, which have moderately non-trivial resolutions. I've flagged them to Branwen, because I suspect the best fix is updating Sierra:

('2173146.json',
 {'260': [('a', '[Washington, D.C.] :'),
          ('b', 'Congressional Research Service,'),
          ('c', '[2014]')],
  '264': [('a', 'Marston Gate, Great Britain :'),
          ('b', 'amazon.co.uk, Ltd.,'),
          ('c', '[2014?]')]})
('2398615.json',
 {'260': [('b', 'Little Brown & Co'), ('c', '2015.')],
  '264': [('a', 'New York :'),
          ('b', 'Little, Brown and Company,'),
          ('c', '[2015]')]})
('2823118.json',
 {'260': [('a', '[Paris] (r. de Rivoli 140) :'),
          ('b', 'Marchant, Edit., Alliance des Arts,'),
          ('c', '[1852]')],
  '264': [('a', 'Paris :'), ('b', 'Imp. Bertauts.')]})
('3034408.json',
 {'260': [('a', 'Ekne'), ('b', 'Falstadsenteret1'), ('c', '2012')],
  '264': [('a', 'Ekne'), ('b', 'Falstadsenteret1')]})
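
For illustration, here is a minimal Python sketch of how two blunt heuristics of this kind could work. It assumes they are the ones mentioned elsewhere in this thread: ignore a 264 that only holds a copyright statement, and collapse a 260 and 264 that match. The names `CataloguingError`, `is_copyright_only` and `resolve_production` are hypothetical, not the transformer's actual code, and the subfields use the `(code, value)` shape from the dump above.

```python
class CataloguingError(Exception):
    """Raised when the 260 and 264 fields cannot be reconciled."""


def is_copyright_only(subfields):
    # Assumed test: every subfield is a $c whose value starts with "©".
    return bool(subfields) and all(
        code == "c" and value.strip().startswith("©")
        for code, value in subfields
    )


def resolve_production(field_260, field_264):
    """Return the subfields to build production information from."""
    # Heuristic 1 (assumed): a copyright-only 264 is discarded.
    if not field_264 or is_copyright_only(field_264):
        return field_260
    if not field_260:
        return field_264
    # Heuristic 2 (assumed): identical fields collapse to one.
    if field_260 == field_264:
        return field_260
    # Anything else is a genuine conflict that needs a cataloguer.
    raise CataloguingError("260 and 264 disagree")
```

Under these assumptions the Würzburg example resolves to its 260, while all four records listed above still raise `CataloguingError`, because their 260 and 264 differ and the 264 is not a bare copyright statement.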
alexwlchan commented 6 years ago

Reviewed; just waiting for a green Travis run.

alexwlchan commented 6 years ago

The code fix is merged; we're waiting for a view on the remaining four records.

jtweed commented 6 years ago

These come out as publishing events, right? So we can just add both, 260 then 264, as separate events? And if there's duplication that's a cataloguing error to fix.
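
As a rough sketch of that suggestion, the Python below emits one event per non-empty field, 260 first and then 264, without trying to reconcile them. The function name `production_events` and the shape of the event dictionary are assumptions made for illustration. The subfield meanings ($a place, $b publisher name, $c date) follow MARC 21.

```python
def production_events(field_260, field_264):
    """Return one event per non-empty field, 260 first, then 264."""
    events = []
    for tag, subfields in (("260", field_260), ("264", field_264)):
        if not subfields:
            continue
        events.append({
            "source": tag,  # which MARC field the event came from
            "places": [value for code, value in subfields if code == "a"],
            "agents": [value for code, value in subfields if code == "b"],
            "dates": [value for code, value in subfields if code == "c"],
        })
    return events
```

For record 2398615 this would yield two events, one from each field. Any duplication between them stays visible in the output instead of being resolved by the transformer.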

alexwlchan commented 6 years ago

This is a cataloguing error, so we should flag these. Dropping them onto the DLQ is good enough for now, unless and until we build a mechanism for reporting cataloguing errors more directly.
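
To make the DLQ behaviour concrete, here is a small Python sketch in which a record that fails to transform is set aside with its failure reason and the rest of the batch carries on. Both `transform` and `process` are hypothetical stand-ins, and the in-memory `dead_letters` list stands in for a real dead-letter queue.

```python
def transform(record):
    # Stand-in transformer: fails when the 260 and 264 conflict.
    f260, f264 = record.get("260"), record.get("264")
    if f260 and f264 and f260 != f264:
        raise ValueError(f"record {record['id']}: 260 and 264 disagree")
    return {"id": record["id"], "production": f260 or f264 or []}


def process(records):
    """Transform each record; failures go to the dead-letter list."""
    transformed, dead_letters = [], []
    for record in records:
        try:
            transformed.append(transform(record))
        except ValueError as err:
            # Keep the record and the reason so it can be reviewed later.
            dead_letters.append({"record": record, "reason": str(err)})
    return transformed, dead_letters
```

The trade-off under discussion shows up here: a miscatalogued record never reaches `transformed`, so it is missing from the index until someone fixes the source data.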

alexwlchan commented 6 years ago

(I asked Branwen to fix all the existing errors, and I'm not going to change the transformer behaviour for now.)

jtweed commented 6 years ago

Not convinced. Dropping on the DLQ doesn't get us clean indexes. Handling cleanly without failing is what we should do. And also report for fixing.

Be liberal in what you accept etc etc.

alexwlchan commented 6 years ago

> Dropping on the DLQ doesn't get us clean indexes.

We have clean reindexes now because we've fixed all the current errors – we'll drop any hypothetical future records that get miscatalogued on the DLQ.

(There are several other places where we might drop a record on the DLQ if it comes along and is catalogued badly, but we don't have anything that's catalogued that way, so we get clean reindexes in practice.)

> Handling cleanly without failing is what we should do.

We already have several heuristics for trying to fix this (e.g. when the two fields match, or when the 264 only contains a copyright statement); we fail only if we don't have enough information to decide what to do. What should we do if we get a conflict?

> And also report for fixing.

I already have a ticket for doing this in a more general way: #2562. This isn't the only place where we might get cataloguing errors.