zooniverse / shakespeares_world

Full text transcription project for the Folger Shakespeare Library
https://www.shakespearesworld.org

Multiple identical classifications submitted for same subject, same user #323

Closed. VVH closed this issue 6 years ago.

VVH commented 7 years ago

Recently, @astopy put together a file of transcriptions by a few select users from ShaxWorld whose transcriptions we trust. This is a short-term measure, because we've been having trouble with the aggregation code. Paul at the Folger noticed that some subjects in the CSV file have multiple identical transcriptions submitted by the same user. For example, a transcription by hwolfe for a page in V.a.140 (file 123203.jpg, subject id 1275389) appears 15 times. This has implications for aggregation, which uses a majority-rules mechanism to determine agreement for a line and to retire lines by putting the grey dots around them. If 15 transcriptions by the same user are thrown into the pot, they could overwhelm submissions by other users (see the sketch after the list below). I've talked to @astopy and @simoneduca about this, and we think the possibilities for where this error originates are as follows:

1) Somewhere on the front end, in the way transcriptions are captured and submitted to the database. If this is the origin, it's something @rogerhutchings would need to look at, because it could originate with AnnoTate.

2) Somewhere on the back end, but none of us could think how/where/why, so that's possibly a question for @camallen and @marten once we've ruled out front-end origins.

3) Somewhere in the pipeline that gets transcriptions into the database and then out again... but I don't really understand how that would work either.
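
For context, here is a minimal sketch (in TypeScript) of how duplicate submissions could swamp a majority-rules count; the line texts and helper are illustrative, not the real aggregation code:

// Illustrative majority-rules line aggregation, not the real SW code.
function majorityLine(transcriptions: string[]): string {
  const counts = new Map<string, number>();
  for (const text of transcriptions) {
    counts.set(text, (counts.get(text) ?? 0) + 1);
  }
  // Return the line text with the most votes.
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

// Three independent users agree on one reading...
const votes = ['ffor an Ague', 'ffor an Ague', 'ffor an Ague'];
// ...but 15 duplicates of a single user's different reading outvote them.
const dupes = new Array<string>(15).fill('for an Ague');
console.log(majorityLine([...votes, ...dupes])); // "for an Ague", 15 to 3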

A volunteer raised a very detailed series of questions over on Talk, which might be related. Carl began by saying that he was seeing the same subjects several times over. He also said that when he transcribes a page, his classification counter sometimes increases by 2 or 3 rather than 1, which might indicate that multiple transcriptions are being submitted per session. See more here.

Have we seen any behaviour like this elsewhere? @simensta, @hughdickinson, @eatyourgreens have you noticed these problems on text projects before? I doubt it's text-specific, but I figured it's best to open this investigation wide to start and narrow down from there. Thanks in advance.

simoneduca commented 7 years ago

I confirm there are 15 duplicates of the following transcription in the raw CSV file:

{
   "subject_id": 1275389,
   "text": "ffor an Ague\nA toy Seigneur s'entend ta Creature\nTake two quartes of good Ale make a possett and take\nof the Crudd then take a good handfull of Ribbe and\nEt en son tempe tu luy donner pasture\nboyle itt a good while in the possett drinke and putt\nOuurant ta main par ta faueur tres grande\nin a little pepp<ex>er</ex> and drinke itt in the morninge\nand fast an hower after and att night when you\nA toy Seigneur s'entend ta Creature\ngoe to bedd for foure or five dayes\nEt en son tempe tu luy donner pasture\ntoe comende wele\nThe herbes are to boyle till the vertue\nOuurant ta main par ta faueur tres grande\nbe boyled out of them the said quantitie\nA toy Seigneur s'entend ta Creature\nare to be droncke at fyue draughts\nEt en son tempe tu luy donner pasture\nblude warme./\nOuurant ta maint",
   "variants": "herbes, sentend, luy, boyled, donner, ffor, faueur, ouurant, fyue, draughts",
   "user_name": "hwolfe",
   "Author": "",
   "Call Number": "V.a.140",
   "Filename": "123203.jpg",
   "Genre": "Receipt Books",
   "Hamnet URL": "http://hamnet.folger.edu/cgi-bin/Pwebrecon.cgi?BBID=231384",
   "Luna URL": "http://luna.folger.edu/luna/servlet/detail/FOLGERCM1~6~6~1201046~197000",
   "Origin": "compiled ca. 1600",
   "Page Number": "folio 18 verso || folio 19 recto",
   "Page Sort": 24,
   "Priority": 28,
   "Title": "Receipt book [manuscript]."
 }
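
For anyone rerunning this check, a sketch of the duplicate count (assuming the CSV rows have already been parsed into objects with the field names shown above; the parsing itself is omitted):

// Count identical (user, subject, text) triples in the parsed export.
interface Row {
  subject_id: number;
  user_name: string;
  text: string;
}

function duplicateCounts(rows: Row[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = `${row.user_name}|${row.subject_id}|${row.text}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Keep only the combinations that occur more than once.
  return new Map([...counts].filter(([, n]) => n > 1));
}
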
simoneduca commented 7 years ago

SW and AnnoTate submit classifications to the API in different ways. SW has extra code that is supposed to check if there were any classifications that failed to submit from previous sessions and, if so, submit them again. I'll check if it does anything else weird.
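
The SW resubmission code isn't reproduced here, but the idea is roughly the sketch below; the storage key, payload shape, and endpoint are assumptions, not the real implementation:

// Sketch of a localStorage-backed retry queue of the kind described
// above. Key name, payload shape, and endpoint are assumptions.
const QUEUE_KEY = 'sw.failedClassifications';

function loadQueue(): object[] {
  return JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
}

function saveQueue(queue: object[]): void {
  localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

async function postClassification(c: object): Promise<void> {
  // Stand-in for the real API call.
  const res = await fetch('/classifications', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(c),
  });
  if (!res.ok) throw new Error(`submit failed: ${res.status}`);
}

async function submitWithRetry(classification: object): Promise<void> {
  // Retry anything that failed in a previous session, then the new one.
  const pending = loadQueue();
  saveQueue([]);
  for (const item of [...pending, classification]) {
    try {
      await postClassification(item);
    } catch {
      saveQueue([...loadQueue(), item]); // re-queue for next session
    }
  }
}

Worth noting that any retry scheme of this shape can produce a duplicate if the server accepts a submission but the response is lost in transit: the client re-queues and resends something the server already stored.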

simoneduca commented 7 years ago

Update: I spent the best part of today trying to reproduce the bug of multiple identical classifications, and failed to do so. I dug into the code that creates and submits classifications, but couldn't find anything wrong with it. I also tested the scenario of a network failure while submitting a classification: the classification gets resubmitted successfully, with all the correct data, on the next submission. @VVH it would be helpful to see if Heather can somehow reproduce this, but I appreciate that's a long shot.

VVH commented 7 years ago

Hey all, I think that the multiple submissions might be a browser bug, from what Chris and Brooke have said, but @vrooje flagged up another old bug in PFE that might somehow be related to Greg's code. It's a bit lengthy, but worth a look: https://github.com/zooniverse/aggregation/issues/165

simoneduca commented 7 years ago

@VVH from what I understand, although the GZ Bars issue seems pretty important and impacts aggregation in general, it's not relevant for SW. Or is it? The apps run different codebases.

vrooje commented 7 years ago

I think it's more that if the handling of duplicates is the same or similar across two different codebases, especially if they're written by the same person, a similar kind of sub-optimal result might occur.

I should note that the bug @VVH flagged actually causes more severe problems for the aggregation when the duplicate classifications are true duplicates of the annotations (either because they have the same classification ID or because the user is skilled enough to classify the same subject the same way every time). In the agglomeration described in my issue, it results in a complete failure to identify true clusters. In something more like a consensus analysis, it would strongly weight the reported aggregation toward whatever the duplicate classification content is, right or wrong.

simoneduca commented 7 years ago

@vrooje that makes sense. It's worth noting, though, that in the case observed in SW the "duplicates" have different classification IDs, so they're not true duplicates. It could just be that the user submitted multiple times because the page was lagging... Actually, I don't think the above is correct; they could be true dupes after all.

camallen commented 7 years ago

This sounds like multiple submissions of the same classification (though I haven't checked the data); it can happen if a button is pressed twice in quick succession. @simoneduca what are the timestamps on the classifications, and do they have different metadata, etc.?

Finally, on the aggregation side: could the algorithm take only the user's first classification by created date and ignore the rest?
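
That rule might look like the sketch below; field names are taken from the export later in this thread, and the comparison assumes the timestamps share a format and timezone:

// Keep only each user's first classification per subject, by server
// creation time. Illustrative, not the actual aggregation code.
interface Classification {
  classification_id: number;
  user_name: string;
  subject_id: number;
  server_created_at: string; // e.g. "2016-05-12 03:00:10 UTC"
}

function firstPerUserSubject(rows: Classification[]): Classification[] {
  const first = new Map<string, Classification>();
  for (const row of rows) {
    const key = `${row.user_name}|${row.subject_id}`;
    const seen = first.get(key);
    // Lexicographic comparison works while the timestamp format is uniform.
    if (!seen || row.server_created_at < seen.server_created_at) {
      first.set(key, row);
    }
  }
  return [...first.values()];
}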

simoneduca commented 7 years ago

@camallen I only have the raw CSV @VVH gave me, and unfortunately that doesn't include that information. Check the JSON above; those are all the fields I have.

camallen commented 7 years ago

@simoneduca @VVH it seems there were multiple classifications submitted for the subject/user combo linked above, e.g. hwolfe and subject_id 1275389:

classification_id, front_end_started_at, front_end_finished_at, server_created_at
12167051, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
12167056, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
12167052, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
12167063, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:06-04:00, 2016-05-12 03:00:12 UTC
12167057, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:11 UTC
12167060, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:11 UTC
12167058, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:06-04:00, 2016-05-12 03:00:11 UTC
12167062, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:07-04:00, 2016-05-12 03:00:12 UTC
12167066, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:08-04:00, 2016-05-12 03:00:13 UTC
12167054, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
12167055, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
12167064, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:06-04:00, 2016-05-12 03:00:12 UTC
12167061, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:11 UTC
12167067, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:07-04:00, 2016-05-12 03:00:13 UTC
12167053, 2016-05-11 22:47:31-04:00, 2016-05-11 23:00:02-04:00, 2016-05-12 03:00:10 UTC
simoneduca commented 7 years ago

That's good news in a way, because it leaves open the possibility of multiple clicking.

camallen commented 7 years ago

Yeah, check how finished_at gets set and see what UI event triggers it in your code.

eatyourgreens commented 7 years ago

Could the same classification be stored multiple times if submission fails?

simoneduca commented 7 years ago

Good idea, thanks @camallen @eatyourgreens. I checked for that scenario but can't confirm it. Every time I tested a submission failure, the classification was stored only once and submitted again at the next opportunity.

simoneduca commented 7 years ago

finished_at is set when a classification is submitted via one of the three options in the modal window that pops up when "I'm done" is clicked (bit of a mouthful); see https://github.com/zooniverse/shakespeares_world/blob/69c464991199b5f8380f332339343fc722529be2/app/modules/transcribe/classification.factory.js#L32

Couldn't find anything wrong in the code, but I did find something that might be related. To reproduce:

  1. Classify
  2. Click I'm done
  3. Choose one of the appropriate options for submitting your classifications
  4. Dismiss the next modal, by clicking outside it
  5. You'll still be on the same subject, so go back to step 2.

This will result in classifications with the same text but different IDs and timestamps. All the classifications below have the same text:

[Screenshot, 2017-05-03 16:49:30: a list of classifications with identical text]

I'll open a separate issue and link it here.
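
One possible guard, sketched below with illustrative names (not the actual SW code), is to ignore further submissions for the current subject once one has succeeded:

// Ignore repeat "I'm done" confirmations until a new subject loads.
let submittedForCurrentSubject = false;

function onSubjectLoaded(): void {
  // Reset the guard whenever a new subject is shown.
  submittedForCurrentSubject = false;
}

async function onDoneConfirmed(classification: object): Promise<void> {
  if (submittedForCurrentSubject) return; // already sent for this subject
  submittedForCurrentSubject = true;
  try {
    await postClassification(classification); // stand-in for the real call
  } catch (err) {
    submittedForCurrentSubject = false; // allow a genuine retry on failure
    throw err;
  }
}

async function postClassification(c: object): Promise<void> {
  const res = await fetch('/classifications', {
    method: 'POST',
    body: JSON.stringify(c),
  });
  if (!res.ok) throw new Error(`submit failed: ${res.status}`);
}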

VVH commented 6 years ago

This duplicate-classifications issue should no longer be a problem now that @CKrawczyk's new aggregation code is in place, as per issue #344.