vrooje opened this issue 8 years ago
Some further checks have shed a bit more light on this.
I've found several more examples where the bar is obvious, p_bar is high, and I'd have expected the clustering performed in the aggregations to have found something, but it didn't: (subject IDs 493235, 493211, 465153, 465079)
And here are 3 where p_bar is actually lower, so in general fewer lines were drawn in total, yet the aggregations still managed to return something (yellow lines are aggregated lines, with transparency scaled by p_true_positive): (subject IDs 465075, 465074, 493231)
The aggregated lines sometimes miss one of the measurements as well (there's no yellow line showing the width immediately above, or the length directly above that) -- but in these cases it's not because the lines aren't detected after clustering; it's because multiple lines with very similar properties were detected at roughly equal (and low) probability.
This is best shown by plotting all the lines reported by the aggregations for the last galaxy above in the same strength of yellow regardless of their p_true_positive:
The aggregation for this subject has detected multiple lines for both the length and the width, and the lines are almost on top of one another. With so many detections, the probability is diluted, so none of them come up as real. For any reasonable cutoff on p_true_positive, I'm often going to miss a measurement that's there in the raw data.
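To see how quickly this dilutes, here's a toy illustration (made-up numbers, not the real export; I'm assuming p_true_positive behaves roughly like my working definition below, i.e. marks-in-cluster over classifications-with-marks):

```python
# Toy illustration of the dilution effect (not the actual aggregation code).
# Suppose 20 classifications each drew a bar-length line, but the clusterer
# split those 20 nearly identical marks into 4 overlapping clusters.
n_classifications = 20
cluster_sizes = [6, 5, 5, 4]  # marks assigned to each detected cluster

for size in cluster_sizes:
    print(size / n_classifications)  # 0.30, 0.25, 0.25, 0.20

# A single correct cluster would have scored 20/20 = 1.0; after splitting,
# no fragment clears even a modest cutoff like p_true_positive > 0.3.
```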
To me this says the criteria for defining a cluster are too restrictive. I think the raw (green) line drawings are really very good in all these subjects (especially in the one with 2 barred galaxies, where every single volunteer actually followed our instructions and only classified the main galaxy), so I suspect this tightness of line clustering is pretty typical of any project involving line drawings. Thus I'm concerned the line aggregations need more tweaking before they can be used more generally with confidence.
(Side note: we should really give projects the option of tweaking their aggregation parameters... but that's an enhancement.)
After looking through things here I'm wondering if this is related to #144? (cc @alexbfree @chrislintott) These marks (in this project) overlap by request -- but I also don't understand why that should be a problem, as the two marks are very different from each other. If I plot the raw marks for any of these subjects by, e.g., [origin, r, theta] or [slope, intercept, length], they separate pretty well into 2 different regions of parameter space. So for reasonable choices of distance cutoff, a hierarchical clustering algorithm ought to do the trick.
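As a sanity check on that claim, here's a minimal sketch of what I mean (assuming each raw mark arrives as its two endpoints; the function names are mine, and scipy's standard agglomerative tools stand in for whatever the aggregation engine actually uses):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def line_features(marks):
    """marks: (n, 4) array of line endpoints (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = marks.T
    dx = np.where(x2 - x1 == 0, 1e-9, x2 - x1)  # guard vertical lines
    slope = (y2 - y1) / dx
    intercept = y1 - slope * x1
    length = np.hypot(x2 - x1, y2 - y1)
    return np.column_stack([slope, intercept, length])

def cluster_lines(marks, cutoff):
    feats = line_features(marks)
    # Standardize so slope, intercept, and length contribute comparably
    # to the pairwise distances.
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
    Z = linkage(feats, method="average")
    return fcluster(Z, t=cutoff, criterion="distance")
```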
I've tested this by applying my own hierarchical clustering, choosing the cutoff distance for defining a cluster as the maximum distance permitted while still requiring that no cluster register more marks than the total number of classifications. That's slightly different from requiring that a cluster not contain 2 marks from the same user, but I did it that way because we know we have a (small but) non-zero duplication rate, and I haven't removed duplicates yet.
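For concreteness, here's roughly what that cutoff selection looks like in code (my reconstruction, not the exact script; `feats` is a per-mark feature array like the [slope, intercept, length] one sketched above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pick_cutoff(feats, n_classifications):
    """Largest cutoff such that no cluster holds more marks than classifications."""
    Z = linkage(feats, method="average")
    # Candidate cutoffs are the dendrogram merge heights, tried largest first.
    for t in sorted(Z[:, 2], reverse=True):
        labels = fcluster(Z, t=t, criterion="distance")
        if np.bincount(labels).max() <= n_classifications:
            return t, labels
    # Fallback: every mark in its own cluster.
    return 0.0, np.arange(1, len(feats) + 1)
```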
Taking the last example shown above, here's how that comes out:
In my spot-checking, there's still the occasional case where either the bar length or the bar width has more than 1 clustered result with p_true_positive > 0.3 (where I'm defining p_true_positive as the number of marks in the cluster divided by the number of classifications in which the user made any marks), but subjects with at least 2 detected clusters above that threshold are only ~0.5% of the total, so I think that's quite good.
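For reference, the metric and the check I'm describing are just the following (assumed inputs: `labels` from the clustering above, `n_marked` = the number of classifications in which the user made any marks):

```python
import numpy as np

def p_true_positive(labels, n_marked):
    """Marks per cluster divided by classifications in which any mark was made."""
    sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
    return sizes / n_marked

def has_ambiguous_measurement(labels, n_marked, threshold=0.3):
    # The occasional failure mode above: more than one cluster clears the cutoff.
    return int((p_true_positive(labels, n_marked) > threshold).sum()) > 1
```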
This has made me think that the cluster definition in the current aggregations is too restrictive for line drawings, and that a hierarchical approach with an adaptive distance cutoff, like the one above, recovers the measurements that are clearly there in the raw data.
No problems with any of the others, either; I checked across the range of p_bar values within the set of ~500 subjects where the aggregations reported a single line with "NA" for the median/mean probability. So I'm left thinking this is really a bug rather than a choice of method. If the clustering depends on having no duplicate marks from the same user in the same cluster, then, given that some of the early stages of live Panoptes had much higher duplicate rates, could that be what's causing the problem here?
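If duplicates are the culprit, a quick way to measure and remove them from the classification export would be something like this (column names are my guesses at the export schema, not the real one):

```python
import pandas as pd

clf = pd.read_csv("bar-lengths-classifications.csv")  # hypothetical export

# Count (subject, user) pairs with more than one classification.
dupes = (clf.groupby(["subject_id", "user_id"]).size()
            .rename("n").reset_index().query("n > 1"))
print(f"{len(dupes)} (subject, user) pairs with duplicate classifications")

# Keep only the first classification per user per subject before clustering.
deduped = clf.sort_values("created_at").drop_duplicates(
    subset=["subject_id", "user_id"], keep="first")
```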
Just started going through the aggregations for GZ Bar Lengths (project_id == 3). In early spot-checking I came across this subject (493223):
which has 49 line drawings in the relevant workflow (workflow_id = 3, workflow_version = 56.13). Here they are in raw form:
I can't show the aggregated lines because the aggregations report no lines for this subject (despite overall reporting 21,554 aggregated line markings for 7,716 subjects).
Something's gone wrong, and I'm not sure what; I haven't checked many other subjects yet, but I wanted to report this right away.
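The obvious next check, once I have time, is to count how many subjects have raw marks but no aggregated lines at all; something like this (file and column names are placeholders):

```python
# Hypothetical sanity check: how many subjects have raw line drawings
# but zero rows in the aggregated output? (Paths/columns are assumptions.)
import pandas as pd

raw = pd.read_csv("raw_line_marks.csv")    # one row per raw mark
agg = pd.read_csv("aggregated_lines.csv")  # one row per aggregated line

missing = sorted(set(raw["subject_id"]) - set(agg["subject_id"]))
print(f"{len(missing)} subjects with raw marks but no aggregated lines")
```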
Really hope this is minor and just because I'm doing something wrong.