Duplicate classifications

ggdhines-zz commented 9 years ago

Don't know if this is a bug - really not sure what is going on. Anyways, for the subject with Zooniverse ID "APK0002oi9" I get the following weird things (doubt that "APK0002oi9" is the only subject where this is happening, just the first I came across). With the DB dump from Jan 1st 2015:

    db.plankton_subjects.findOne({zooniverse_id:"APK0002oi9"})
    { ... "metadata" : { "counters" : { "blank" : 3 }, ... }

So this subject was retired because three people labelled it as blank. However, if I search for those three people using the following Python code:

    import pymongo
    client = pymongo.MongoClient()
    db = client['plankton_2015-01-01']
    classification_collection = db["plankton_classifications"]
    for classification in classification_collection.find({"subjects.zooniverse_id":"APK0002oi9"}):
        if "user_name" in classification:
            print classification["user_name"],classification["created_at"]
        else:
            print classification["user_ip"],classification["created_at"]

I get

    yshish 2013-12-12 16:15:13
    yshish 2013-12-12 16:16:28
    80.189.250.204 2013-12-23 18:10:43

So two of those classifications are from yshish, and they are identical. Given how closely together they were submitted, the most likely explanation is that the browser (or some other piece of software) submitted the same classification twice. BTW I think I came across something similar in Penguin Watch and mentioned it to Chris Snyder.
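A minimal sketch of how such repeats could be flagged for a single subject, assuming each classification is a plain dict with the `user_name` / `user_ip` and `created_at` fields used in the query above. The helper name `find_repeat_classifiers` and the sample records are hypothetical, and no database is needed to run it (written for Python 3).

```python
from collections import defaultdict
from datetime import datetime


def find_repeat_classifiers(classifications):
    """Return {user: [timestamps]} for users with more than one classification.

    Anonymous users are keyed by IP address, as in the query above.
    """
    by_user = defaultdict(list)
    for c in classifications:
        user = c.get("user_name", c.get("user_ip"))
        by_user[user].append(c["created_at"])
    return {u: sorted(ts) for u, ts in by_user.items() if len(ts) > 1}


# Hypothetical records mirroring the three classifications listed above.
sample = [
    {"user_name": "yshish", "created_at": datetime(2013, 12, 12, 16, 15, 13)},
    {"user_name": "yshish", "created_at": datetime(2013, 12, 12, 16, 16, 28)},
    {"user_ip": "80.189.250.204", "created_at": datetime(2013, 12, 23, 18, 10, 43)},
]

repeats = find_repeat_classifiers(sample)
for user, stamps in repeats.items():
    gap = (stamps[-1] - stamps[0]).total_seconds()
    print(user, len(stamps), "classifications,", gap, "seconds apart")
```

Run over every retired subject, the same grouping would give a rough count of how many subjects are affected.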

On top of that, if we look at the classifications from 80.189.250.204:

    classification = classification_collection.find_one({"user_ip":"80.189.250.204","subjects.zooniverse_id":"APK0002oi9"})
    print classification["annotations"]

we get

    [[{u'p2': [u'827.7861938476562', u'213.53382873535156'], u'p3': [u'830.7861938476562', u'353.53382873535156'], u'p0': [u'807.7861938476562', u'281.53382873535156'], u'p1': [u'844.7861938476562', u'281.53382873535156'], u'species': u''}, {u'p2': [u'326.78619384765625', u'437.53382873535156'], u'p3': [u'326.78619384765625', u'437.53382873535156'], u'p0': [u'326.78619384765625', u'437.53382873535156'], u'p1': [u'326.78619384765625', u'437.53382873535156']} ...]

So this person actually thought this image was not blank; however, for whatever reason, a species was not recorded with their markings. This seems to have resulted in Ouroboros (or something) counting this classification as "blank".

Given that these classifications happened over a year ago, I don't know if this problem has been fixed. However, I can't find any closed issues which appear to be relevant. If this problem has been solved, we need to figure out which classifications were affected and if any subjects need to be "unretired".

Sorry if I've missed something simple. Greg

chrissnyder commented 9 years ago

As with Penguin Watch, I don't have a great explanation for why there might be duplicate classifications returned, especially in this case when the classifications were over a minute apart. Occasionally I wonder if the classifications are accidentally being saved locally and then being sent shortly afterward, but we don't have a way to test if a classification came from localStorage or not.

For the second part, yes, a classification without a specified species would be counted as blank, despite there being marking information. I couldn't break the interface to make that scenario happen; how often did this occur?

ggdhines-zz commented 9 years ago

Out of the last 1000 subjects which were retired on Plankton Portal, 83 had repeat classifications by the same user, the most recent being in December. So this is definitely still happening. Out of the last 50000 classifications (which goes back to mid September 2014), there were 553 which did not have a species associated with the markings (so just over 1%). Of these 553, 153 had blank species strings and the rest just did not have the species attribute in the markings dictionary.
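For reference, a sketch of how markings can be sorted into these categories, assuming each marking is a dict shaped like the annotations printed earlier in this thread. The function name `species_status`, the species value `"copepod"` and the sample markings are made up for illustration.

```python
from collections import Counter


def species_status(marking):
    """Classify a marking by how its species field was recorded."""
    if "species" not in marking:
        return "missing"   # no species attribute at all
    if not marking["species"]:
        return "blank"     # empty string (or None)
    return "present"


# Hypothetical markings; coordinates shortened for readability.
markings = [
    {"p0": ["807.8", "281.5"], "p1": ["844.8", "281.5"], "species": ""},
    {"p0": ["326.8", "437.5"], "p1": ["326.8", "437.5"]},
    {"p0": ["100.0", "200.0"], "p1": ["150.0", "200.0"], "species": "copepod"},
]

counts = Counter(species_status(m) for m in markings)
print(dict(counts))
```

Keeping "missing" and "blank" separate matters here because the two cases may come from different paths through the interface.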

brianaharder commented 9 years ago

Repro for the two 'no species recorded' classification cases.

1) Click on the image to start an annotation.
2) Click Finished.

This returns a classification that has x and y data, and no species entry at all.

1) Click on the image to start an annotation.
2) Select a category.
3) Click Finished.

This returns x and y data, and species: null

ggdhines-zz commented 9 years ago

@brianaharder - those markings typically just get ignored

ggdhines-zz commented 9 years ago

sorry - @brianaharder, I don't think that has anything to do with duplicate classifications

brianaharder commented 9 years ago

@ggdhines, I know they're separate issues. Chris said he couldn't figure out how the blank species classifications were getting into the system, so that's at least one (old) mystery solved.

yshish commented 8 years ago

It has just happened again: I got this image to classify only a few minutes after I had classified it before: http://talk.planktonportal.org/?&_ga=1.262422924.1606832519.1429734835#/subjects/APK0007a03

yshish commented 8 years ago

User suzeroo reported a number of issues, one of which is getting images she has already classified:

I'm also suddenly getting the same images to classify. Started a couple of days ago, or so, Now in the past 2 days it is happening more often - most often when I get the out of data message, then I sign out and sign back in.. etc..

This is happening at my work and home laptop computer. Both are PC's, and I use Chrome. Windows 10 on my laptop, and the prior version on my work computer. I tried clearing the cache, without success. I tried using Mozilla Firefox on my laptop, but got the same thing. I've tried various other measures such as signing out of everything, then back in, rebooting my computer, etc. No help. I tried Penguin Watch - and had no problems at all. Haven't tried other projects.

Here is the list of other issues she's experiencing: http://talk.planktonportal.org/#/boards/BPK0000002/discussions/DPK0000ej9?page=1&comment_id=56a44c1aff903818de002028

camallen commented 8 years ago

@srallen any ideas on dup image issues? Maybe the subject queue for the user is not emptying properly under certain conditions?

srallen commented 8 years ago

@camallen I think this may be a confluence of a few different bugs happening at the same time. The out of data message folks are seeing is a bug that I haven't totally been able to track down and is extensively documented in #52. What I know is that it happens when volunteers use the return to classifying button in talk, they often see an out of data message instead of properly loading up the next subject. This one is difficult to debug since I can't use talk locally in development. What I think is happening is when the user returns to the classification page, there's a race condition between fetching the user and fetching the subject. What should be happening is that if a user is logged in, check their preferences for the subject group and then queue up subjects from that group. If the user does not have a set preference or the user is not logged in, then select a group randomly and then queue up subjects. If a user sets their preference, which can be switched between the two groups on the home page, then clear the subject queue and load up a new queue from the selected group.
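The intended flow described above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: `SubjectQueue`, `choose_group`, `fake_fetch`, the `preferred_group` key and the group names are all hypothetical stand-ins.

```python
import random

GROUPS = ("group_a", "group_b")  # placeholder names for the two subject groups


def choose_group(user, rng=random):
    """Use the logged-in user's preferred group, otherwise pick one at random."""
    if user is not None and user.get("preferred_group") in GROUPS:
        return user["preferred_group"]
    return rng.choice(GROUPS)


class SubjectQueue:
    """Holds upcoming subjects for exactly one group at a time."""

    def __init__(self, fetch_subjects):
        self.fetch_subjects = fetch_subjects  # callable: group -> list of subjects
        self.group = None
        self.subjects = []

    def load(self, group):
        # Switching groups discards whatever was queued for the old group.
        if group != self.group:
            self.subjects = []
            self.group = group
        if not self.subjects:
            self.subjects = list(self.fetch_subjects(group))


def fake_fetch(group):
    # Stand-in for the real subject request.
    return ["%s-subject-%d" % (group, i) for i in range(3)]


queue = SubjectQueue(fake_fetch)
queue.load(choose_group(None))                            # anonymous: random group
queue.load(choose_group({"preferred_group": "group_b"}))  # user arrives later
print(queue.group, queue.subjects)
```

The second `load` call mimics the user being fetched after a random group was already queued; whichever group was picked first, the queue should end up holding only subjects from the preferred group.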

There may also be a bug based on this report of the subject queue not emptying properly too. What may be happening is that because of the race condition, a random subject group's subject queue is loading, then the user is fetched and their preference is loaded and it happens to be the other subject group they're set to in their preferences. It should clear the queue if the other subject group is selected, but it may not be.

The way this works could use an entire refactor which may resolve the issues, but like I said, it's difficult to test and replicate to determine exactly what's happening and I haven't had time lately to look into it. I can give it a look today or tomorrow.

If I were to refactor this, I'd prefer the subject group selection to only be present on the classifier page rather than the home page. There's a mess of code written to make sure the state between the two is in sync with the user preferences. Having the group selection only in the classifier controller would simplify the code a bit and possibly eliminate a source of issues. I don't recall exactly why the group selection was placed on the home page to begin with, to be honest.

jiho commented 8 years ago

Hi everyone,

I did a little more digging in the latest classifications dump from PlanktonPortal (received on 2016-01-31).

In this log are recorded, among other things:

user name of the person doing the classification (user_name)
url of the classified image (image_url)
date and time of classification (created_at)
classification_id: a unique string, for a given combination of the three fields above (looks like some kind of hash sum).

I get 1092338 unique classification_id values as of this Sunday, which is compatible with what is displayed on the homepage of PP this morning (Monday), so I am assuming this is what gets counted as a "classification".

The same combination of user_name and image_url can get repeated if there are several organisms on the same image. In that case, all classifications are recorded at the same time and created_at should be the same => classification_id should be the same.

So I tracked down cases where user_name and image_url are identical but created_at is not identical. Those would be the same image presented to the same user at different times. I found 4812 such cases. I looked at a few such cases and indeed, the classification_ids are different; so those are counted as actual classifications.
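The selection rule can be sketched in Python as follows, assuming each row of the dump is available as a dict with the `user_name`, `image_url` and `created_at` fields listed above. The function name `repeated_pairs` and the sample rows are hypothetical.

```python
from collections import defaultdict


def repeated_pairs(rows):
    """Map (user_name, image_url) -> sorted distinct created_at values,
    keeping only pairs seen at more than one time."""
    times = defaultdict(set)
    for row in rows:
        times[(row["user_name"], row["image_url"])].add(row["created_at"])
    return {k: sorted(v) for k, v in times.items() if len(v) > 1}


# Hypothetical rows; the first two are two organisms marked in one sitting.
rows = [
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:00:00"},
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:00:00"},
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:01:40"},
    {"user_name": "bob", "image_url": "img1.jpg", "created_at": "2015-06-02 09:00:00"},
]

print(repeated_pairs(rows))
```

Collecting the timestamps in a set means several organisms recorded at the same instant count as one sitting, so only genuinely separate presentations are reported.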

I plotted the occurrence of such cases through time. Here are all occurrences:

[figure: diff_time]

and only those for which the time difference is >1h and <3000h

[figure: diff_time-zoomed]

There are a few in the early days of PP but since the new version of PP, the number of cases has largely increased. The shape of the difference in time is very regular which makes me think there is some kind of time shift problem going on somewhere.

The distribution of the difference in time between records involving the same person and image does not seem random at all, with one peak around 100 seconds and another one at 10^7.5 seconds, which is about one year.

[figure: diff_time-histogram]
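The "about one year" reading of the second peak can be checked with a few lines of Python. This is only arithmetic, using a 365.25-day year as the (assumed) definition of one year.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # 31,557,600 s

print(10 ** 7.5)                        # position of the second peak, in seconds
print(math.log10(SECONDS_PER_YEAR))     # one year on a log10(seconds) scale
print(math.log10(100))                  # the first peak, 100 s
```

One year lands within a hundredth of 7.5 on a log10 scale, so the second peak does correspond to repeats roughly a year apart.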

When the same image is seen several times by the same person, the most common scenario is that it is seen twice; but sometimes it actually occurs many more times, up to 6. The occurrences are:

times seen      2     3     4     5     6
cases        4307   389    73    30    13
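As a quick consistency check on this table, here is a small Python sketch. Treating every viewing after the first as a "surplus" classification is an assumption of this example, not something computed from the dump.

```python
# Number of (user, image) pairs by how many times the image was seen,
# copied from the table above.
times_seen = {2: 4307, 3: 389, 4: 73, 5: 30, 6: 13}

total_pairs = sum(times_seen.values())
# Assumption: every viewing after the first one is a surplus classification.
surplus = sum((n - 1) * count for n, count in times_seen.items())

print(total_pairs)
print(surplus)
```

The counts sum to 4812, matching the number of cases found above, and under that assumption they correspond to 5489 surplus sittings.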

I hope this will help track down this strange and kind of nasty bug (which creates new classifications which are not actually "new"). In the meantime, we'll probably just exclude those multiple classifications from the data after the fact.
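One possible way to do that exclusion, as a sketch only: keep the earliest sitting for each user and image and drop the later ones. It assumes rows are dicts with `user_name`, `image_url` and `created_at`, that `created_at` is a sortable string such as `YYYY-MM-DD HH:MM:SS`, and that every row has a `user_name`. The function name `drop_repeats` and the sample rows are hypothetical.

```python
def drop_repeats(rows):
    """Keep only the earliest sitting for each (user_name, image_url) pair.

    Rows sharing the earliest created_at are all kept, since several
    organisms marked on one image are recorded at the same time.
    """
    first_seen = {}
    for row in rows:
        key = (row["user_name"], row["image_url"])
        if key not in first_seen or row["created_at"] < first_seen[key]:
            first_seen[key] = row["created_at"]
    return [
        row for row in rows
        if row["created_at"] == first_seen[(row["user_name"], row["image_url"])]
    ]


# Hypothetical rows: alice sees img1.jpg twice, 100 seconds apart.
rows = [
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:00:00"},
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:00:00"},
    {"user_name": "alice", "image_url": "img1.jpg", "created_at": "2015-06-01 10:01:40"},
    {"user_name": "bob", "image_url": "img1.jpg", "created_at": "2015-06-02 09:00:00"},
]

kept = drop_repeats(rows)
print(len(kept), "of", len(rows), "rows kept")
```

Keeping the first sitting rather than the last is a design choice; either is defensible, as long as it is applied consistently.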

srallen commented 8 years ago

@jiho I've been thinking about the current codebase and what to potentially work on, but am still having difficulty replicating either duplicates or the 'out of subjects' message. Regarding the duplicates, I am wondering: Are the duplicates from the same subject group or tend to be from the same subject group? Do we know for sure that there aren't duplicates in the original set of subject images?

jiho commented 8 years ago

A "subject group" would be Mediterranean vs California? In that case, yes this is happening for both; more for the Med dataset since it launched but I am guessing this is also coming from the fact that it has been consulted more.

I recomputed the duplicates with both image_url (i.e. the image name on PP) and image_name (i.e. the original name of the image we sent to PP) and the results are exactly the same. So indeed the same physical file is shown several times to the same person. That does not eliminate the possibility of duplicates (two files with different names but the same content) but this is not related to this bug (and, for the Med dataset at least, I prepared the images and I don't see how duplicates would be possible with the code I used; so I am very confident there aren't any).

Regarding the "out of data" message, I have only rarely seen it, but every time I saw it, it was on a "bad" internet connection. I never saw it in the lab, where we of course have very high speed internet, or at my home, where I have a good DSL connection (~15 Mbps download). The persons reporting it the most also seem to be on not-ideal connections. So I would actually suspect a timeout that is too short somewhere... But all this is not really quantitative evidence ;-) Maybe try on a slow tethered cellphone connection or a very bad wifi?

yshish commented 8 years ago

Update on the problems of getting the Out of Data message and duplicate images by suzeroo

For a very brief time after I posted this problem - for a few days - I was able to have a good run of classifying (maybe 20 min), before I would get the Out of Data message. Once that happens, I can sign out, then sign back in, and may be able to classify, but at that point I can count on getting duplicate images.

And once that Out of Data message occurs, I'll get that every time I access PP that day - either right away or after a minute or two - so I'm done classifying on PP for the day - because even if I sign out and back in, I might be able to classify for a minute or so, but it will be duplicate images. It's as if the system reaches into my history of classifications and pulls them out for me to classify again - I'll get recent and sometimes not so recent duplicate images.

The next day, I usually have better luck, maybe will be able to classify for a few minutes before getting Out of Data message.

I can go to Talk, then try back to classifying - and it would work minimally. Usually when I'd go back to classifying, from Talk, I'll get the Out of Data message. It pretty much stops me in my tracks.

For a while I noticed that if I was able to classify, then went to Talk to comment, when I tried to go back to classifying I'd get the Out of Data message. So I tried just marking photos as "favorite" and stayed within the classification interface. That did help - but after perhaps 20 minutes, I'd still get the Out of Data message.

I'm able to get into Talk, so then I'd go to my "favorites," review them and make comments - but any attempt to get back to classifying will be met with the Out of Data message - and that pretty much guarantees I'll get duplicate images.

This is not happening on Penguin Watch, Snapshot Serengeti or Wildcam Gorongosa, if that's helpful information.

My original post has the computer info listed, and the other efforts I've made (such as clearing the cache, etc). At this point, I'm nearly shut down on PP, getting the Out of Data message very quickly, if not right away.

I'm assuming it's not my computer, since I'm using three different ones, and experiencing the same problem on all. Let me know if there's anything else I can try! /Sue

yshish commented 8 years ago

User suzeroo complains about getting duplicates together with Out of data message:

I'm back to getting Out of Data message every time I login, no matter what computer. I've worked around it by signing out, classifying for a bit, then if I want to comment on an image, I click Discuss, sign in, and comment. If I click on Return to Classifying - I get Out of Data message.

Once in a while I get to classify while signed in, BUT, it's accompanied by images that I've classified before - as if the program is drawing from my history. Some are recent (so I recognize that I've just done that image a few days or weeks ago), and some are remote (looks suspiciously familiar - and when I look at it in the Talk page, I see my own comment from 1 yr, 2 yrs ago). Difficult to say if ALL the images are ones that I've done before, but I see enough that I do recognize.

I use PC's, Chrome, Windows 7 at work, Windows 10 at home. Happens on all computers.

I had a good run for a while, now I'm back down to out of commission

jiho commented 8 years ago

The issue of an image being presented to the same person twice popped up again on the forums and @yshish reported it in an email to @camallen and myself.

I've inspected the data again and, indeed, repeated classifications still occur since Feb 1st, when I first compiled the report. The plots and conclusions are very similar so I won't post them again.

I think it is separate from the "out of data" message (which should have its own issue, as @yshish knows ;-) ).

jiho commented 8 years ago

As of last Sunday, there were 11,449 such repeated classifications out of a total of 1,885,657 classifications: a small percentage (about 0.6%) but still a non-negligible amount.

zooniverse / planktonportal

Duplicate classifications #31