Closed michaelaye closed 10 years ago
Additionally, there is a lot of binary stuff in the last line that is parsed as an empty line by standard CSV parsers.
Looking into this.
FYI: I confirmed the exact same behavior for yesterday's dump file, so it's not a one-off hiccup.
I think this has to do with a problem we ran into last week during data processing.
If this problem persists this week, I'll look into it again.
But did you see what I wrote above, that it's in both the 2nd and the 9th of June data dumps? So the problem must have existed for more than a week?
Same story:
```
2014-06-15_planet_four_classifications.csv0000644000000000000003122202146112347321372017203 0ustar rootroot"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"
```
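That leading blob is a 512-byte tar header glued onto the front of the CSV text. As a minimal sketch (in Python, names made up here), you can detect this case cheaply: POSIX/GNU tar writes the magic string `ustar` at byte offset 257 of the first header block, which an honest CSV file will never contain there.

```python
# Detect a tar archive masquerading as a CSV: POSIX tar stores the magic
# string "ustar" at byte offset 257 of the first 512-byte header block.

def looks_like_tar(first_block: bytes) -> bool:
    """Return True if the block carries the tar magic at offset 257."""
    return len(first_block) >= 262 and first_block[257:262] == b"ustar"

# Build a fake tar-style header block to demonstrate (synthetic data).
block = bytearray(512)
block[0:10] = b"2014-06-15"   # the name field starts at offset 0
block[257:262] = b"ustar"     # the magic written by GNU/POSIX tar
print(looks_like_tar(bytes(block)))                            # True
print(looks_like_tar(b'"classification_id","created_at"\n'))   # False
```

Reading only the first 512 bytes of the download before handing it to a CSV parser would have flagged this immediately.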
Weirdly, I'm not having this issue. So I took a deeper look. When I start uncompressing the .gz I get 2014-06-15_planet_four_classifications.csv, which has the weirdness that Michael reported. But my Archive Utility says it is still expanding 2014-06-15_planet_four_classifications.csv. Interestingly, after the expansion finishes (which takes some time), another file, '2014-06-15_planet_four_classifications.csv 2', appears without that problem, and 2014-06-15_planet_four_classifications.csv disappears. Is this just a difference in uncompressing behavior between Archive Utility and gzip because the file is so big? Michael, what are you using to unzip it?
I've put my uncompressed version of 2014-06-15_planet_four_classifications.csv 2 on dropbox. It's a ~3.3 GB file https://dl.dropboxusercontent.com/u/56971802/2014-06-15_planet_four_classifications%202.csv
Also give it like 3 hours from now for the file to upload to Dropbox before trying to grab the file.
I'm using `gunzip` on Linux, and I did not look at anything before `gunzip` was finished.
And thanks, but you didn't need to do the Dropbox upload. I coded work-arounds that replace the header line with what I know it should be, and I scan the whole database for empty lines, which gets rid of the empty line at the end. (Of course I could just remove the last line, but that's too hackish for my taste.) These work-arounds don't remove the need to solve this, though, IMHO: one day we might add other columns to the database dump, and the reduction routines should just be able to read the header from the first row.
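The work-around described above could be sketched like this (a hedged sketch, assuming Python; the function name is made up, and the header is abridged from the dump's header row quoted earlier):

```python
# Work-around sketch: force a known-good header row and drop blank lines,
# rather than trusting the first line of the dump. Header abridged.
KNOWN_HEADER = ('"classification_id","created_at","image_id","image_name",'
                '"image_url","user_name","marking","x_tile","y_tile"')

def repair_dump(lines):
    """Replace the first line with the known header; strip empty lines."""
    cleaned = [line for line in lines[1:] if line.strip()]
    return [KNOWN_HEADER] + cleaned

# A tiny synthetic dump: mangled header, one record, trailing blanks.
raw = ["garbage\x00header", '"APF0001","2014-06-15",...', "", "   "]
fixed = repair_dump(raw)
print(fixed[0] == KNOWN_HEADER)   # True
print(len(fixed))                 # 2
```

Scanning for blank lines everywhere (instead of chopping the last line) is the less hackish variant the comment above argues for, since it also catches any blank line that might appear mid-file.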
The file might need to be uncompressed again with gunzip once it's a .csv; you might see if that works. Mac's Archive Utility seems to work. Anyway, I'll let the Dropbox upload continue in case you want to compare and see if there are any other differences, or in case the development team wants to look at it.
That's it:

```
$ file 2014-06-15_planet_four_classifications.csv
2014-06-15_planet_four_classifications.csv: POSIX tar archive (GNU)
```
So the solution is to provide the files with the correct and standard extension `.tar.gz`. Once I used `tar zxvf` and not only `gunzip`, all is fine.
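For anyone handling these dumps in code rather than on the command line, the equivalent of `tar zxvf` is a one-step read with Python's stdlib `tarfile` module, which unpacks the gzip layer and the tar framing together (the filename below is illustrative, and the archive is built in memory to keep the example self-contained):

```python
import io
import tarfile

# Build a tiny .tar.gz in memory to stand in for the real dump.
payload = b'"classification_id","created_at"\n"APF1","2014-06-15"\n'
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="classifications.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# mode="r:gz" (or the auto-detecting "r:*") reads through both layers,
# so no intermediate gunzip step ever exposes the raw tar bytes.
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.getmembers()[0]
    text = tar.extractfile(member).read().decode()

print(member.name)           # classifications.csv
print(text.splitlines()[0])  # clean CSV header, no tar bytes attached
```

Using `mode="r:*"` would also cope gracefully if a future dump were plain gzip after all, since `tarfile` auto-detects the compression.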
This also solved the last empty line, by the way.
I think this is due to something getting switched when I was updating the backup scripts recently. Today’s backup is already in progress, but I’ll fix it for tomorrow’s.
Oh, wait, ignore me :) I thought this was referring to the raw database dumps. It’s not related to what I was working on.
Got me then. @parrish ?
Seems to be linked to here. I think the file extension should be '.tar.gz' as the compress method uses tar.
I can't really understand why you're ending up with a mangled header. The format is tar/gzip, which hasn't changed. Archive Utility always appends a "2" to the filename for some random reason, but still produces a valid file. `tar xzvf` also produces valid output. I'll try changing the extension to be more explicit.
It was only a 'mangled' header when expecting text. I was looking at the tar itself, which in this case, for a single file, is basically the text file with the tar header and trailer attached. The 'corruption' was the tar header. All this because I only used `gunzip` to unpack, since the extension did not indicate that the format was tar/gzip, only gzip.
Last Sunday's header line is corrupted in the data file. It looks like this:
while before it looked clean, like this: