zooniverse / planet-four

Identify and measure features on the surface of Mars
https://www.planetfour.org/
Apache License 2.0

Mongo dump should include version column #112

Closed michaelaye closed 10 years ago

michaelaye commented 10 years ago

The new 'version' column, which identifies the version of the tools, is not yet included in the database dump we receive. The current columns are:

"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"

Could it be added, please?

mschwamb commented 10 years ago

Perhaps this should go in with ticket #107. In addition to adding the version ID, rather than having different science team members maintain their own code to correct this, I think we should do it in one place, in the CSV file. Better yet, the database itself should be corrected, with the old incorrect fan angles stored in a separate variable. Then if someone else gets a copy of the database, if we make the data public (which is the plan with the first paper), or if a new science team member comes on, they have the correct fan angles.

[Edit: there should have been a comma rather than a period after "version ID". Sorry about that.]

@chrissnyder, @parrish, @brian-c, @stuartlynn what do you think?

michaelaye commented 10 years ago

  1. You write "in addition to adding the version ID". What is the difference between the version ID and the version of the tools I am asking for here?
  2. Why would different science team members have different codes for correction?
  3. I am fine with the correction being done for us, but the date of the change to the tool still needs to be recorded somewhere visible.

mschwamb commented 10 years ago

@michaelaye Because I also make my own plots and checks of the data using different methods than you do. I now need to write a function to correct the angle when plotting things for talks, and for any other analysis I might do with the gold standard data for the first paper that does not use your clustering code. That means I would be writing my own function to correct this angle, and you would have your own in your clustering code.

As it stands, in some database entries the spread value really is the spread and in others it isn't. I think that's a bad data legacy. If we don't fix this in the database, someone in the future has to remember that this happened in order to know to apply the correction to the fan angles with version 1.0. Currently the only record of it would be a GitHub ticket and a commit entry.

mschwamb commented 10 years ago

I'm just advocating for making the correction in the database, writing the version ID to the CSV file as well, and storing the original incorrect angle in a new variable in the database. That way at least the spread variable in the database and the CSV would be consistent across all the markings and mean the same thing.

chrissnyder commented 10 years ago

Our current strategy is to add a version number to the fan tool data export and to correct the spread value on any fan data with version 1. Note that this correction would be applied only in the CSV output, not in MongoDB.

I think it would be smart of us to treat the database classifications as immutable. On the really off chance we have to make another change to the output value, having to go through a couple steps from database value to output value would be a nightmare. I realize that's unlikely, but having to go through multiple transformations of the data in multiple locations would be nasty.
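A minimal sketch of that export-time approach, written in Python rather than taken from the project's actual export code. The field names `marking`, `spread`, and `version` follow the CSV header and this thread; the marking value `"fan"` and a numeric version are assumptions. `correct_spread` is a hypothetical stand-in for the real correction discussed in #107, which is not reproduced here, so the caller passes it in.

```python
from typing import Any, Callable, Dict


def export_row(record: Dict[str, Any],
               correct_spread: Callable[[float], float]) -> Dict[str, Any]:
    """Build one CSV row from a stored classification without mutating it."""
    # Shallow copy: the stored record is treated as immutable.
    row = dict(record)

    # Assumption: fan markings recorded with tool version 1 are the only
    # ones whose spread needs correcting, as described in the thread.
    needs_fix = (
        row.get("marking") == "fan"
        and row.get("version") == 1
        and row.get("spread") is not None
    )
    if needs_fix:
        # correct_spread is hypothetical; the real formula lives in #107.
        row["spread"] = correct_spread(row["spread"])
    return row
```

Because the correction happens in exactly one place, there is a single step between the database value and the exported value.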

To answer the original issue, I've poked the appropriate people to get the version number added. We'll have it to you shortly.

mschwamb commented 10 years ago

Having the transformation in the code that makes the csv is something I can live with. Can you also output the original bad value as some other column in the csv for posterity?
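To illustrate what keeping the original value could look like, here is a rough Python sketch. The column name `spread_original` is hypothetical (no name was agreed in this thread), and the corrected value is supplied by the caller rather than computed here.

```python
import csv
import io
from typing import Any, Dict, List


def with_original_spread(row: Dict[str, Any],
                         corrected_spread: float) -> Dict[str, Any]:
    """Return a copy of row holding both the corrected and the raw spread."""
    out = dict(row)
    out["spread_original"] = row["spread"]  # hypothetical column name
    out["spread"] = corrected_spread
    return out


def rows_to_csv(rows: List[Dict[str, Any]], fieldnames: List[str]) -> str:
    """Serialize rows with every field quoted, like the existing dump header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With both columns present, anyone reading the CSV can see the corrected value and still recover what was originally recorded.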

mschwamb commented 10 years ago

@parrish and @chrissnyder Looking at the CSV file output from the August 24th email, I can't find the version number, so I presume the export hasn't yet been updated to give us the corrected fan spreads.

Here's the current header:

"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"

Can you please add the version number column and correct the spread value on any fan data with version 1, based on Brian's CoffeeScript function (see issue #107)?

Many Thanks.
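For anyone checking a future dump, a small Python sketch like the following can confirm whether the column has arrived. The column name `version` is an assumption about what the export will call it, and `missing_columns` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, List

# Header of the current dump, copied from this comment.
CURRENT_HEADER = (
    '"classification_id","created_at","image_id","image_name","image_url",'
    '"user_name","marking","x_tile","y_tile","acquisition_date",'
    '"local_mars_time","x","y","image_x","image_y","radius_1","radius_2",'
    '"distance","angle","spread"'
)


def missing_columns(csv_text: str,
                    required: Iterable[str] = ("version",)) -> List[str]:
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [name for name in required if name not in header]


print(missing_columns(CURRENT_HEADER))
```

Run against the header above, this reports `version` as missing; an empty list would mean the export has been updated.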

parrish commented 10 years ago

Ah, sorry, I had migrated the data but forgot to add the column. It'll be fixed on the next run.

mschwamb commented 10 years ago

Thanks @parrish