zooniverse / panoptes

Zooniverse API to support user defined volunteer research projects
Apache License 2.0

One annotation in classification data dump CSV has wrong format #2034

Closed rafelafrance closed 6 years ago

rafelafrance commented 7 years ago

Project: "Notes from Nature" Workflow: 2563, "Herbarium_Arkansas Dendrology: Part 2: Magnolias, pawpaws, sassafras, and Dutchman's pipe -- 19 September 2016" Classification ID: 17187451

In the data dump this annotation is completely off. It has the form of

[{
    "task": "T3",
    "value": 1,
    "task_label": "Location"
}, {
    "task": "T28",
    "value": [{
        "task": "T0",
        "value": [{
            "value": "618fc8027f37a",
            "option": true
        }]
    }, {
        "task": "T1",
        "value": [{
            "value": "cd606cfabb922",
            "option": true
        }]
    }, {
        "task": "T2",
        "value": [{
            "value": "c9a03ad17da9a",
            "option": true
        }]
    }]
}, {
    "task": "T29",
    "value": [{
        "task": "T4",
        "value": [{
            "value": "40b15309a0d1b",
            "option": true
        }]
    }, {
        "task": "T5",
        "value": 0
    }, {
        "task": "T6",
        "value": 0
    }]
}, {
    "task": "T30",
    "value": [{
        "task": "T8",
        "value": "David M. Johnson"
    }, {
        "task": "T9",
        "value": ""
    }]
}, {
    "task": "T31",
    "value": [{
        "task": "T13",
        "value": ""
    }, {
        "task": "T14",
        "value": "pipevine"
    }, {
        "task": "T16",
        "value": 1
    }]
}, {
    "task": "T32",
    "value": [{
        "task": "T17",
        "value": 0
    }, {
        "task": "T18",
        "value": "HXC000875"
    }]
}]

When it really should look similar to

[{
    "task": "T5",
    "task_label": null,
    "value": [{
        "task": "T3",
        "value": "Arkansas Valley Sequatchi- Philo. T7N R14W1/2 N. W. 1/4 section 3. 5.5 mi. N. of Wooster, Alluvial woods along N. Cadron Creek, Mallet Town bridge area.",
        "task_label": "Location"
    }, {
        "task": "T4",
        "value": "Alluvial woods. Canopy-- Quercus, Acer, Celtis; Subcanopy-- Cornus, asmina, Ulmus, Lindera; Herb cover-- Cassia, Campsis, Eupatorium, Perilla, Uniola, and dense Urtica. Red Berries, spicy aroma.",
        "task_label": "Habitat \u0026 Description"
    }]
}, {
    "task": "T11",
    "task_label": null,
    "value": [{
        "task": "T6",
        "value": "Tim Jessup",
        "task_label": "Collected By"
    }, {
        "task": "T7",
        "value": "16",
        "task_label": "Collector Number"
    }, {
        "task": "T8",
        "value": [{
            "select_label": "Month",
            "option": true,
            "value": 9,
            "label": "9 - September"
        }]
    }, {
        "task": "T9",
        "value": [{
            "select_label": "Day",
            "option": true,
            "value": 7,
            "label": "7"
        }]
    }, {
        "task": "T10",
        "value": [{
            "select_label": "Year",
            "option": true,
            "value": 1976,
            "label": "1976"
        }]
    }]
}]
camallen commented 7 years ago

I can't see any tasks above T11 for that workflow. Annotations for tasks like T28 have no reference on the workflow tasks, so we can't cross-reference them; when this happens we return the raw annotation in its place.
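The fallback described here can be sketched roughly as follows. This is a hypothetical illustration, not the actual Panoptes code; the function and the shape of `workflow_tasks` are assumptions:

```python
def resolve_annotation(annotation, workflow_tasks):
    """Attach the workflow's task_label when the task id is known;
    otherwise pass the raw annotation entry through unchanged."""
    resolved = []
    for entry in annotation:
        if entry.get("task") in workflow_tasks:
            labelled = dict(entry)
            labelled["task_label"] = workflow_tasks[entry["task"]].get("label")
            resolved.append(labelled)
        else:
            # No reference on the workflow tasks (e.g. T28 here),
            # so the raw annotation is returned in its place.
            resolved.append(entry)
    return resolved

# The workflow defines nothing above T11, so T28 falls through.
tasks = {"T3": {"label": "Location"}}
out = resolve_annotation(
    [{"task": "T3", "value": 1}, {"task": "T28", "value": []}], tasks)
```

This would explain why the dump contains the raw nested structure for T28-T32: the dump code simply has no label to attach.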

I'm not sure where those annotation task IDs are coming from; do you have other workflows with those task IDs? It may be a bad client submitting incorrect data. How common is this error in your dump?

rafelafrance commented 7 years ago

So far it's a single bad record in 10K - 20K records processed. We have another 50K+ records to process.

There are other workflows with these task numbers and in this order. However, they all have two or more task_labels whereas this record has only one. So the forms don't match exactly.

My question to you is: does this record look like this in the DB? I want to narrow down where in the pipeline the error occurred: pre- or post-data-entry, or possibly even post-delivery.

FYI:

~/notesFromNature/label_reconciliations/temp$ grep -P '""T3"".+""T28"".+""T1"".+""T2"".+""T29"".+""T5"".+""T6"".+""T30"".+""T8"".+""T9"".+""T31"".+""T13"".+""T14"".+""T16"".+""T32"".+""T17"".+""T18""' notes-from-nature-classifications.csv |wc
   1949  146376 5256187
~/notesFromNature/label_reconciliations/temp$ grep -P '""T3"".+""T28"".+""T1"".+""T2"".+""T29"".+""T5"".+""T6"".+""T30"".+""T8"".+""T9"".+""T31"".+""T13"".+""T14"".+""T16"".+""T32"".+""T17"".+""T18""' notes-from-nature-classifications.csv | grep -P '""task_label"".+""task_label""' | wc
   1948  146335 5253622
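The same two-stage check can be replicated in Python on the decoded annotation strings (the greps use doubled quotes because the JSON sits inside a CSV cell; after CSV parsing the quotes are single again). This is a sketch; how you extract the annotations column from the dump is left to the caller:

```python
import re

# Rows whose annotation mentions the T3..T18 task sequence in order,
# and which of those also carry two or more task_label keys.
TASK_SEQ = re.compile(
    r'"T3".+"T28".+"T1".+"T2".+"T29".+"T5".+"T6".+"T30".+"T8".+"T9".+'
    r'"T31".+"T13".+"T14".+"T16".+"T32".+"T17".+"T18"')
HAS_LABELS = re.compile(r'"task_label".+"task_label"')

def count_suspects(annotations):
    """Return (rows matching the task sequence, rows among those that
    lack the two-or-more task_label keys a good record carries)."""
    matched = labelled = 0
    for annotation in annotations:
        if TASK_SEQ.search(annotation):
            matched += 1
            if HAS_LABELS.search(annotation):
                labelled += 1
    return matched, matched - labelled
```

The grep output above (1949 matches, 1948 with two or more task_labels) corresponds to `count_suspects` returning a difference of one: the single bad record.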
camallen commented 7 years ago

@rafelafrance as the project owner / collaborator, you should be able to get the classification resource representation via the API: http://docs.panoptes.apiary.io/#reference/classification/classification/retrieve-a-single-classification

We also have a Python client, https://github.com/zooniverse/panoptes-python-client/ : classification = Classification.find("17187451")

rafelafrance commented 7 years ago

So the data is in Panoptes in the wrong format and your data dumps are faithful. My data reconciliation programs will definitely pick up cases where the data is really wrong (like this one), but if there are more subtle data issues we are in a very bad place.

denslowm commented 7 years ago

So it sounds like we can delete this one record and proceed with the reconciliation this time. If this comes up again in the future then maybe it will require a deeper dive into the data. Rafe, if that seems reasonable to you then I am fine with it as well.

rafelafrance commented 7 years ago

A while back I already put code in reconciler.py to skip records like this (with an error message). This issue is not about fixing reconciler.py; it is about Panoptes.
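reconciler.py's actual implementation isn't shown in this thread; a minimal sketch of that skip logic, assuming the two-or-more-task_labels heuristic mentioned earlier, might look like:

```python
import json

def annotation_is_well_formed(raw_annotation):
    """Heuristic from this thread: a good record for these workflows
    carries two or more task_label keys; the bad one has only one."""
    try:
        json.loads(raw_annotation)  # must at least be valid JSON
    except ValueError:
        return False
    return raw_annotation.count('"task_label"') >= 2

def reconcile(records):
    """Keep well-formed records; warn about and skip the rest.
    records is an iterable of (classification_id, raw_annotation)."""
    kept = []
    for classification_id, raw in records:
        if annotation_is_well_formed(raw):
            kept.append(classification_id)
        else:
            print("WARNING: skipping malformed classification %s"
                  % classification_id)
    return kept
```

Skipping with a warning keeps the reconciliation run alive while leaving an audit trail of which classification IDs were dropped.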

camallen commented 7 years ago

@rafelafrance the api stores the annotation data as is upon receipt from the client. This could be an issue with the front end submitting malformed annotations?

rafelafrance commented 7 years ago

Possibly. However, given the shape of the bad data, it's unlikely to be directly related to that. It could be in the panoptes-client code too. Maybe it's a race condition with multiple tabs... we don't know, and I'm still trying to gather information.

All we know for sure is that it is not in the data distribution end.

front-end --> panoptes-client --> panoptes --> data dump etc.
possibly      possibly            possibly     no from this point forward
camallen commented 7 years ago

Just to reiterate: in the Panoptes API, classifications are not touched after the metadata updates here. It's write-once and read-only from then on; we never touch the task annotations, which are stored exactly as they are received from the client.

rafelafrance commented 7 years ago

If it's impossible for the bug to be in Panoptes then where in the system did it go wrong? Where does this issue get posted? PanoptesFrontend?

camallen commented 7 years ago

Not saying it's impossible, just highly unlikely to be an API issue. As I stated before:

This could be an issue with the front end submitting malformed annotations?

Try reporting on the front-end repo and seeing if you get some traction there. Do you have any frequency stats on this type of malformed classification event? And is there a coherent time window for them? If it was a bad deploy / bug then you should see the offending classifications only in the time window between the bad deploy and the fix going out.
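The time-window check suggested here is easy to run once you have the classification timestamps. A sketch, with hypothetical data and a stand-in predicate for the real malformed-annotation check:

```python
from datetime import datetime

def malformed_time_window(classifications, is_malformed):
    """Return (first, last) created_at among malformed records.
    A tight window suggests a bad deploy; a wide spread suggests a
    persistent client bug. classifications is an iterable of
    (created_at, annotation) pairs."""
    times = sorted(created_at
                   for created_at, annotation in classifications
                   if is_malformed(annotation))
    if not times:
        return None
    return times[0], times[-1]

# Hypothetical rows: (created_at, malformed?) pairs standing in for
# real dump rows and a real malformed-annotation predicate.
rows = [(datetime(2016, 9, 20, 10), True),
        (datetime(2016, 9, 20, 11), False),
        (datetime(2016, 9, 21, 9), True)]
window = malformed_time_window(rows, lambda flagged: flagged)
```

If the window came back spanning only a few hours or days, that would point at a specific deploy to investigate.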

eatyourgreens commented 7 years ago

Could this classification have been generated by a staged version of the PFE dropdown tool, testing against the production database? option and value are keys generated by the react-select component.

Anyway, as @camallen says this is more than likely a bug in zooniverse/Panoptes-Front-End since the annotations are generated there, and not modified by Panoptes.

camallen commented 7 years ago

It certainly could have (I'm not sure we track metadata about the URL source for an old event like this). However, I checked the user and it doesn't seem to be someone on the dev team... but they may have got a link from GitHub, a share, etc.

camallen commented 7 years ago

@eatyourgreens any thoughts on storing some PFE source-origin indicator in the classification metadata? We could use the Referer header API-side to mark it as well.

eatyourgreens commented 7 years ago

@camallen PFE records the user agent string here: https://github.com/zooniverse/Panoptes-Front-End/blob/master/app/pages/project/classify.cjsx#L155. It doesn't know anything about the environment (production vs. staging), as that's all handled by the API client and hidden from the classification. The only other variable I can think of tracking would be the version of PFE, maybe via a git commit hash or something (e.g. https://github.com/zooniverse/Panoptes-Front-End/blob/master/views/index.ejs#L22).
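The idea of carrying a build identifier in the classification metadata could look something like this. The field names here are hypothetical, not an agreed Panoptes schema:

```python
def build_metadata(user_agent, client_version, extra=None):
    """Assemble classification metadata that carries a client build
    identifier, so a data dump row can be traced back to a deploy."""
    metadata = {
        "user_agent": user_agent,
        "client_version": client_version,  # e.g. the deployed git SHA
    }
    if extra:
        metadata.update(extra)
    return metadata

meta = build_metadata("Mozilla/5.0 (example)", "a1b2c3d")
```

With something like this in place, a malformed classification could be matched against the deploy that produced it, rather than guessed at from timestamps.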

This could well be a bug in the react-select module or the dropdown task. I did a bunch of work, with Mark, updating those before Christmas, which might also have resolved this. See zooniverse/Panoptes-Front-End#3233

camallen commented 7 years ago

Thanks @eatyourgreens, I'll have a think about this API-side instead.

camallen commented 6 years ago

Closing this for now, for the same reasons as in this post: https://github.com/zooniverse/Panoptes/issues/2491#issuecomment-342492355