Closed: @ggdhines-zz closed this issue 9 years ago.
I initially agree with @aliburchard that one file is easier to work with and keep track of. BUT, if it's just csv, with no JSON embedded, then I think those csv files are going to be very unwieldy if they combine different shapes.
That could look like:
```
subject_id  shape_type  x1  y1  x2  y2
846583      rectangle   33  45  91  99
846583      ellipse     66  67  19   7
```
(confusing!) or
```
subject_id  rect_x1  rect_y1  rect_x2  rect_y2  ell_x  ell_y  ell_maj  ell_min
846583      33       45       91       99
846583                                          66     67     19       7
```
(bulky)
So my vote is to have different shapes in different files, since whatever post-processing is going to be done by the scientists is going to have to parse by shape anyway.
PS. What are these files going to look like for polygons? Many columns of x and y?
I'm against multiple files. The first type, as shown by @mkosmala, is very much the standard of any relational database (provided actual commas are used rather than irregular whitespace), where one column serves as a guide to how to interpret the rest of the row. And I don't see the confusion: the difference in meaning of x1/x2 and y1/y2 clearly stems from the difference in the mathematical definition of each shape. Obviously, though, a CSV file should use commas to actually separate the fields. Please avoid a mix of tabs and spaces as delimiters at all costs and make the fields fully CSV compliant, so that any CSV parser can read them without extra manual parsing. Since a properly formatted CSV is so easily imported into anything else, I don't really agree with any "human-readable" argument as a requirement on CSV files (apart from the fact that they would no longer really be CSV files at all if such requirements were applied).
The polygon question is interesting, though. Since a polygon usually requires a column pair for each extra point, how do we keep the CSV file format consistent? It would mean carrying a whole bunch of columns filled with N/As for all shapes that are not polygons. In this case one might want an extra file for the polygons, so it looks like different marking tasks/projects may require different output formats anyway. I'd say that is to be expected and shouldn't come as a surprise, as the tasks can be wildly different. In P4 we were just lucky that everything could be matched into one file without too much overhead. So don't take my preference for one file as something that should apply to all projects; I think one should weigh the pros and cons in each case.
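The CSV-compliance point above can be sketched with Python's standard `csv` module, which handles comma delimiting and quoting so any compliant parser can read the result back. The rows and column names here are just the illustrative ones from earlier in the thread, not an actual export schema:

```python
import csv
import io

# Illustrative rows mixing shapes in one "narrow" file; the column names are
# stand-ins from the discussion above, not the real export schema.
rows = [
    ["subject_id", "shape_type", "x1", "y1", "x2", "y2"],
    [846583, "rectangle", 33, 45, 91, 99],
    [846583, "ellipse", 66, 67, 19, 7],
]

buf = io.StringIO()
writer = csv.writer(buf)  # comma delimiter and quoting handled per RFC 4180
writer.writerows(rows)

# Any compliant parser can read it back without manual splitting.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1])  # ['846583', 'rectangle', '33', '45', '91', '99']
```

The point is that once the delimiter and quoting follow the standard, "human readability" becomes a non-issue: every downstream tool can import the file unchanged.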
@michaelaye The format I used was just so we could read it easily here on GitHub, not an actual expected CSV format (which is hard to read in plain text).
But no, it's not like a relational database. If you were using a relational database, you'd have different tables for different shapes, because (1) they don't all have exactly 4 parameters (circle has just 3, polygon has an arbitrary number); and (2) they actually mean very different things. In a database, you have different values for each field, but not different types. Each field is a single type, and I think that philosophy should apply here, as well. A radius is not the same thing as an x-coordinate.
While I will easily admit that I don't have any formal SQL education, I've been working with data tables and their different layouts for quite some years and am very much used to columns carrying the keys for how to treat a row. I thought that is one of the definitions of the narrow layout scheme, which has the advantage of being able to add a new type without any problems? (As discussed here, for example: http://stackoverflow.com/questions/16447903/table-design-wide-table-vs-columns-as-properties)
I agree with the radius not being the same as a positional coordinate, that's why we have both positions and radii in our P4 database and fill either one with N/A when it's not applicable to this shape. Storage doesn't cost anything these days. ;)
The stackoverflow question and answers lay out the situation nicely. Yes, you can put it in a wide format. But then it's not in a "normal form", which, for those of us used to databases, feels icky. The semantics come down to "does each row represent a single 'thing'?" I argue that no, a rectangle != an ellipse. And if you're going to use different columns for different shapes (my bottom solution above), it would be much cleaner to just put them in separate files. (Storage may not cost much, but bandwidth does.)
I think about what I am going to use the results for. My first reaction is "visualization". If I'm going to visualize the results, I'll need separate routines for each shape. Since I have to do that anyway, it's computationally faster (and cleaner to code) to run all the "draw circle" routines on one file rather than parse each line to determine which routine to send that row's data to. The same will be true no matter what your end application is: you'll either need to send each separate file through different processing, or parse one big file row by row to determine the correct processing.
I'll add that the multi-file solution is more extensible. If Zooniverse wanted to add a five-pointed-star tool, for example, they'd only need to create a new file format for that new tool -- and not have to worry about how to fit it into the single existing file format.
The last point is potentially the winning argument, since sometimes one learns about a missing piece of the marking-tool design during a project. It would be easy to add something new if the next new thing just becomes one more output file, without cramming it into an existing structure. As I stated above, more than one file can totally make sense, depending on the project, and I absolutely agree that I'd prefer more than one file to that nightmare column design of your second example above in one file.
But you haven't convinced me (yet) with the earlier comment about type-dependent visualization (or other processing routines). I routinely (pardon the pun) use groupby operations to solve this (and I thought extensive groupbys are the recent fashion of any proper database system?). So I keep a little dictionary that maps the type key to the correct visualization routine, and then I do a simple `dataframe.groupby(shape_type)` and dispatch each group to its routine. Sure, something has to look row by row at what type it is, but I thought that's an offered (and praised?) feature of modern databases, so if they can do that, why should I avoid it? Unless you say that's not advisable for some reason?
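The groupby dispatch described above might look like the following pandas sketch. The frame, its column names, and the draw routines are all hypothetical stand-ins, not anything from the actual exports:

```python
import pandas as pd

# Hypothetical markings frame; column names are assumptions for this sketch.
df = pd.DataFrame({
    "shape_type": ["rectangle", "ellipse", "rectangle"],
    "x1": [33, 66, 10],
    "y1": [45, 67, 12],
})

def draw_rectangles(group):
    # A real routine would plot; here we just report what we'd draw.
    return f"drew {len(group)} rectangles"

def draw_ellipses(group):
    return f"drew {len(group)} ellipses"

# Map the type key to the matching routine, then dispatch per group.
dispatch = {"rectangle": draw_rectangles, "ellipse": draw_ellipses}

results = {
    shape: dispatch[shape](group)
    for shape, group in df.groupby("shape_type")
}
print(results)
```

The dictionary-of-routines keeps the per-shape code separate while still reading a single mixed-shape file, which is the trade-off being debated here.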
I guess to your last point, I'd just ask: why not have all that grouping-by already done for you by @ggdhines? Simplifies your code and saves you time.
I can see both sides here. Part of me prefers just putting everything in one file because that's cleaner and only requires a couple of lines of code for me to separate (probably fewer lines than reading multiple files).
Then again, everyone discussing this here is the type to just use the original classification exports and aggregations as JSON. The CSV exports aren't really for that user type. They're for someone new to this type of data reduction, possibly new to programming, i.e. quite possibly trying to use Excel or similar. Which works best for them?
I was originally advocating for a single file, but realistically I think there's no reason the end-user-in-Excel would need to have everything in the same file, and actually separating things by shape might be doing them a favor.
I prefer one file. We have entries in the database with blank columns because it's set up for ellipse and fan marking tools, which have different properties, so the Planet Four team got used to dealing with one file.
Why can't this be a user preference option in the project builder?
People who are more able to program and sort files with huge numbers of rows will have no problem with a single file, but others might want them separated out if that kind of parsing task is difficult, or for the points that Margaret raised.
At least to me it seems straightforward enough to accommodate both output formats and have a slider on the project builder to set the flag for single versus multiple files
One clarification - the P4 team gets csv files not json if that matters.
So json is not a native format for me, I mainly learned to parse it in earnest for the project builder outputs.
We asked for that CSV format since it was easier for us to digest as a team. I put it into MySQL, and @michaelaye uses pandas primarily I believe
I somehow was not expecting Zooniverse to cater to an end-user-in-Excel (and the P4 dataset would certainly crash Excel easily), but taking that into account, I agree with @vrooje's statement that multiple output files don't matter, especially considering that another type of marking could easily go in another sheet of the same Excel file; that's kind of ideal for that style of data keeping.
To @mschwamb's comments: I think the discussion I had with @mkosmala makes it clear that there's always some coding overhead, either for properly combining different files (and merging stuff CAN be a nightmare, but doesn't have to be if all indexes are healthy), or for filtering the data because the rows are essentially different things that need different treatment code.
As there seems to be no clear or easy preference from these points of view, maybe we should look at it the other way around and ask what's easiest for the production system? I could always create a pipeline that deals with anything that comes my way, as long as it's stable and reliable, so maybe the way to go is to find a mechanism that's most stable/least effort on the Zooniverse system/team?
I would just add that since I'm dealing with polygons, I'd rather parse a JSON format than an arbitrary-number-of-columns CSV. If we go with a JSON format, I'm all for a single file. But for flat files, I see one-file-per-shape as the cleaner solution.
I think (I hope) the JSON will always be available in the raw exports. Now that I'm more familiar with them I think I prefer them too. But I'm imagining someone who's got a smaller project and who doesn't know how to code.
I think @mschwamb's idea of giving the project builder the choice is intriguing. @ggdhines how much effort would that be? Or does it make sense to, say, include everything in 1 file by default until the number of columns exceeds some number, and then split into different files?
Sorry - had not expected such a good discussion on this :) Taking the comments in reverse order - @mschwamb and @vrooje - I'm not sure how difficult it would be to include such options in the project builder. We are going to have to offer some options to the user - I think putting in an extra button or two wouldn't be hard. I don't have an ETA on that happening. It would, however, be a real pain for me to support both approaches. The real difficulty with my code right now is not supporting one type of analysis for one task or another, or one approach to output or another, but supporting all of them in combination. This is making my code slightly unwieldy and I'm keen to keep that to a minimum.
I'm not a fan of having everything in one file until a maximum number of columns is exceeded - that means people running two different projects could be getting two very differently structured output files. I think that could get confusing.
@mkosmala for polygons I am going to give two files - one with summary stats per subject (for example the % of an image highlighted by each type of polygon - that way if you don't care about xy coords of the polygons, you could for example quickly determine how much kelp is in an image). I'll also provide a second file which contains the detailed xy coordinates for every polygon. I'll list the coordinates for each polygon as a json list with commas and surround that list with quotation marks which should make everything fit into a fixed set of columns.
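The quoted-JSON-cell scheme for polygon coordinates described above could work like this minimal sketch; the row layout and values are made up for illustration:

```python
import csv
import io
import json

# Hypothetical polygon export row: the per-polygon xy coordinates are stored
# as a JSON list inside a single quoted CSV cell, so the column count stays fixed.
polygon = [[0, 0], [10, 0], [10, 5], [0, 5]]
row = ["846583", "polygon", json.dumps(polygon)]

buf = io.StringIO()
# The writer quotes the JSON cell automatically because it contains commas.
csv.writer(buf).writerow(row)

# Reading it back: parse the CSV first, then decode the JSON cell.
cell = next(csv.reader(io.StringIO(buf.getvalue())))[2]
coords = json.loads(cell)
print(coords)  # [[0, 0], [10, 0], [10, 5], [0, 5]]
```

Because the coordinate list collapses into one cell, polygons with any number of vertices fit the same fixed set of columns.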
@michaelaye from my point of view I think both approaches are reasonably similar for me to implement. I do think that flexibility down the road for additional tools is important (I know we have some ideas). What exactly do you mean by "group by" in a single-file solution? Do you mean list all of the circles first, then the rectangles, etc.?
I just meant the analysis group-by operation, like SQL's "GROUP BY" (http://www.w3schools.com/sql/sql_groupby.asp) or the respective implementations in other systems (pandas and R dataframes have a groupby) as well.
Question - if the end solution is one-file-per-shape, how hard would it be to concatenate them into the same (e.g.) dataframe in python, i.e. to get back to the single-file data structure? I haven't tried much concatenation in pandas.
@michaelaye I think that sort of analysis would be left up to the scientists (in response to @mkosmala's comment, unless there is something specific). @vrooje I don't know much about pandas but I don't think concatenation would in theory be too hard. It would however be impossible (I think) if objects had different dimensions.
@ggdhines @mkosmala was just pointing out that if I have one shape per file I don't need the groupby. @vrooje When concatenating dataframes with asymmetric columns, you have to work out the best way to concat/merge them. pandas is almost too powerful/flexible here; there's always more than one way to do it, one just has to find it through quite complex syntax at times.
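For the concatenation question, a minimal pandas sketch (with hypothetical per-shape column names) showing that `pd.concat` reunites per-shape files into one frame, filling the columns missing from each side with NaN:

```python
import pandas as pd

# Two hypothetical per-shape files read back into frames with different columns.
rects = pd.DataFrame(
    {"subject_id": [846583], "x1": [33], "y1": [45], "x2": [91], "y2": [99]}
)
ells = pd.DataFrame(
    {"subject_id": [846583], "x": [66], "y": [67], "maj": [19], "min": [7]}
)

# Tag each frame with its shape, then concatenate; columns absent from one
# frame are filled with NaN, reconstructing the single-file structure.
combined = pd.concat(
    [rects.assign(shape_type="rectangle"), ells.assign(shape_type="ellipse")],
    ignore_index=True,
)
print(combined.columns.tolist())
```

So getting back from one-file-per-shape to the single-file layout is a couple of lines, at the cost of the N/A-padded wide format discussed earlier.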
@ggdhines it sounds like from your reply that you've got your answer: if it's difficult for your code to output a single file, then support the multiple files. But I do think a CSV output (with some JSON elements in it if needed, like for polygons), perhaps in addition to the JSON version that Margaret wants, should exist, since that's what most scientists are familiar with and use, rather than JSON only. I'm out in Oxford near the end of October, so happy to talk more then about how I use the output from the P4 CSVs if that's useful.
Sorry for the misunderstanding @michaelaye, I think we are on the same page. Yeah, @mschwamb, I think I do have my answer.
fwiw, after joining the party late (but helping to instigate it in my conversation with greg), I'm increasingly coming around to the multiple file approach. @ggdhines I'm pretty sure that's the answer you've arrived at?
@ggdhines: re: polygon files -- sounds great!
Looks like the discussion reached a conclusion. Closing, reopen if need be.
I am creating the csv files that will provide the basic aggregation results. For marking tasks you can have different shapes in the same task. How would people want this represented in the csv output? One option is to have the clustering results for different shapes in the same file. This would mean that the column headers might have slightly ambiguous titles - i.e. for ellipses x1 and y1 would represent the center of the ellipse, but for rectangles x1, y1 would be a corner of the rectangle. The alternative would be to have separate csv files for each shape. Either way I will be providing a README file with every aggregation export explaining the structure of the files. Ali prefers the first approach because the second increases the amount of work you have to do combining results over multiple files. Thoughts? @mschwamb and @michaelaye - thoughts on this for Planet Four? @mkosmala? @aliburchard and @vrooje have given their thoughts already but are free to join in any conversation.