zooniverse-glacier / notesFromNature

https://www.notesfromnature.org/
Apache License 2.0

Transcriptions/classifications that do not match the task/subject #370

Closed denslowm closed 8 years ago

denslowm commented 9 years ago

A number of transcriptions/classifications do not match their task/subject. The pattern is that the issue appears whenever the start and finish times are equal. For example, there are 10 cases from Jul 1~7 in which the transcription does not match the image at all. A quick sign that something is wrong is that the exact same transcription, from the same user, appears for a number of different tasks/subjects.

For example, on Jul 1 there are 6 identical transcriptions from foxx86 for the subjects SELU0010136, SELU0006547, SELU0004005, SELU0010669, SELU0006802, SELU0008005.
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795de3f6661fa5b2055f.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707958e3f6661fa5b1f77c.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707954e3f6661fa5b1ed95.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795fe3f6661fa5b20773.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707959e3f6661fa5b1f878.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795ae3f6661fa5b1fd29.jpg

This is potentially affecting 5.5% of the transcriptions, which is really significant.
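The symptom described above (the same user and the same transcription appearing for several different subjects) could be checked with something like the following. This is only a rough sketch: the field names `user`, `subject_id`, and `fields` are hypothetical and do not come from the actual NfN export format.

```python
from collections import defaultdict

def find_cross_subject_duplicates(records):
    """Group records by (user, transcribed content) and return the groups
    that cover more than one subject."""
    groups = defaultdict(list)
    for rec in records:
        # Hypothetical field names. Sorting the items makes the key
        # independent of the order in which fields were entered.
        key = (rec["user"], tuple(sorted(rec["fields"].items())))
        groups[key].append(rec)
    return [
        group for group in groups.values()
        if len({rec["subject_id"] for rec in group}) > 1
    ]

# Made-up example data, not real NfN records.
records = [
    {"user": "user_a", "subject_id": "S1", "fields": {"county": "X"}},
    {"user": "user_a", "subject_id": "S2", "fields": {"county": "X"}},
    {"user": "user_b", "subject_id": "S1", "fields": {"county": "Y"}},
]
suspicious = find_cross_subject_duplicates(records)
```

A match here is only a hint. Blank or very short transcriptions could legitimately repeat across subjects, so flagged groups would still need a manual look.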

chrissnyder commented 9 years ago

Has this occurred since July 7? Does the transcription that was submitted look like a legit transcription or was it blank? Did this only happen to Herbarium records?

robgur commented 9 years ago

I think this has been a persistent problem by all accounts...


chrissnyder commented 9 years ago

Yes, but I'm trying to narrow down what its root cause is. My questions serve two purposes:

To note, I don't think it's an API issue, as other projects have not reported similar patterns of classifications from users. So it has to be something within the NfN codebase itself that might cause a user to submit identical transcriptions.

denslowm commented 9 years ago

I am looping in @ammatsun. She can help us answer. We know it is the herbarium for sure at this point.

ammatsun commented 9 years ago

This is a persistent problem. I just selected July 1~7, 2015 to show that it has occurred recently, but I have observed this in records since 2013 (so it is not due to a recent change). My best guess at this time, without knowing the code, is that there is some concurrency problem and state from different workers is getting mixed and/or generating this situation. In particular, this might be happening when one worker skips a transcription task, but I could not locate a definitive pattern in the data.

I haven't looked at other collections closely, but I just glanced over the macrofungi collection and found that transcription 5313467447bc7245280007be for subject 52545d9e5c2a110000000b7d (image http://www.notesfromnature.org/subjects/macrofungi/mich/52545d9e5c2a110000000b7d.jpg) indicates Canada, although the image has nothing about Canada in it. The exact same transcription is also present for another subject, 525468915c2a11000000121e.

The issue also appears in cases where the start and finish times differ by only 1 or 2 seconds.

JoyceGross commented 9 years ago

I can confirm that this is happening with CalBug records too.

An example is subject 519e5c7eea30523400000457 (EMEC593148 Undetermined sp.jpg). There are 3 transcriptions with the correct locality information (Nevada), and a 4th one with completely different locality information (Minnesota).

The 4th record (transcription 54bbfdb9832cec520b0000c3) has the exact same data as transcription 54bbfdb929a6f6290f0000cb, which is for a different specimen. Both records were recorded at almost exactly the same time.

The date is January 2015 for the two above-mentioned transcriptions.

I remember seeing this problem a year or more ago with the CalBug data.

denslowm commented 9 years ago

Hey @chrissnyder, I just wanted to check to see if you have any updates on this issue. What do we need to do to move this forward?

denslowm commented 8 years ago

I just wanted to report that I am seeing this issue in the BRIT (herbarium) dataset that came in last night.

In addition to what has already been reported, I will note one other thing.

As noted earlier, start and end times of the problematic records are the same (within each record and across all erroneous records) EXCEPT one record will have an earlier start time. Usually this is the last record in the set, but not always. The record which has this earlier start time seems to be for the image that was transcribed and applied to all other records in the set.
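To illustrate the pattern just described, here is a rough sketch of how the record with the earlier start time might be separated from the rest of a set. It assumes the set has already been grouped (same user, same content, same finish time) and is non-empty. The field names `subject_id`, `started_at`, and `finished_at` are hypothetical.

```python
from datetime import datetime

def split_cluster(cluster):
    """Return (likely_source, suspected_copies) for one non-empty set of
    erroneous records. The likely source is the single record whose start
    time is strictly earlier than all the others; if there is no such
    record, return (None, all records)."""
    earliest = min(cluster, key=lambda rec: rec["started_at"])
    others = [rec for rec in cluster if rec is not earliest]
    if any(rec["started_at"] == earliest["started_at"] for rec in others):
        return None, list(cluster)
    return earliest, others

# Made-up example: three records sharing one finish time.
finish = datetime(2015, 7, 1, 12, 0, 30)
cluster = [
    {"subject_id": "S1", "started_at": finish, "finished_at": finish},
    {"subject_id": "S2", "started_at": finish, "finished_at": finish},
    {"subject_id": "S3", "started_at": datetime(2015, 7, 1, 11, 58, 0),
     "finished_at": finish},
]
source, copies = split_cluster(cluster)
```

In this example the record for S3 is picked as the likely source, and the other two are treated as suspected copies. Whether the earlier-start record really is the one that was transcribed would still need to be confirmed against the image.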

trouille commented 8 years ago

@denslowm Can you calculate how often you see this happening in the BRIT dataset? Since it appears as such a specific error, is it something you can code into your aggregation script to remove from the data? Have you already shared with the other research teams your approach for removing it, so they do the same with their data?

The reason to find out the rate is that this issue and #384 (which may be related) are proving very difficult to reproduce on command for our devs to troubleshoot. I am wondering whether, if the rate is less than a few %, we could do the following:

1) message very clearly in Talk that we're aware of the problem and also the rate at which it is happening and how you're removing it from the results so we know it's not contaminating the research
2) have people keep posting in Talk when they see it happen so we can have a sense for whether there's suddenly an uptick in breaks of this type
3) focus dev effort on the new platform rather than spending significantly more time trying to find the solution to this problem that may have a limited enough impact that we're willing to remove it in post-processing
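A post-processing step along these lines might look like the sketch below. It drops records whose start and finish times are within 2 seconds of each other (the threshold reported earlier in this thread) and reports the fraction dropped. The field names `id`, `started_at`, and `finished_at` are hypothetical, and this is not taken from any research team's actual aggregation script.

```python
from datetime import datetime, timedelta

def split_by_duration(records, max_suspect=timedelta(seconds=2)):
    """Split records into (kept, dropped), dropping those whose
    finish - start duration is at most max_suspect."""
    kept, dropped = [], []
    for rec in records:
        duration = rec["finished_at"] - rec["started_at"]
        if duration <= max_suspect:
            dropped.append(rec)
        else:
            kept.append(rec)
    return kept, dropped

# Made-up example data.
t = datetime(2015, 7, 1, 12, 0, 0)
records = [
    {"id": "a", "started_at": t, "finished_at": t},
    {"id": "b", "started_at": t, "finished_at": t + timedelta(seconds=1)},
    {"id": "c", "started_at": t - timedelta(minutes=2), "finished_at": t},
    {"id": "d", "started_at": t - timedelta(minutes=5),
     "finished_at": t - timedelta(minutes=3)},
]
kept, dropped = split_by_duration(records)
rate = len(dropped) / len(records) if records else 0.0
```

The threshold is a judgment call: a very fast but genuine submission would also be dropped, so it may be worth spot-checking a sample of the dropped records before relying on the rate.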

parrish commented 8 years ago

Closing since the app has been relaunched. This was a weirdly intermittent bug in the front end that I never did manage to reproduce.