Closed xrotwang closed 8 years ago
Can you provide a minimal working example?
I suggest breaking gl_languages3.csv into halves to locate the source of the issue.
Without gl_languages3.csv (or another example) it's difficult to reproduce and diagnose this error.
Ok, I followed your advice splitting the file into ever smaller chunks. It turns out that in two rows a single " (double quote) appears. So apparently, the two rows in between these quotes are hidden, which may be correct behaviour:
$ csvstat -H --count problem.csv
Row count: 8
$ csvstat -H -c 1 problem.csv
1. column1
<type 'unicode'>
Nulls: False
Unique values: 8
Max length: 8
Row count: 8
But the mystery remains, why csvstat --count
on the big file reported a different row count.
It is also strange, that I couldn't get csvstat
to ignore the " (double quote) as quote character:
$ csvstat -b -H --count problem.csv
Row count: 8
$ csvstat -q _ -H --count problem.csv
Row count: 8
The 10-line chunk looks as follows:
asss1237,As,Austronesian,asz,Papunesia,0,-0.75,131.73
asum1237,Asumboa,Austronesian,aua,Papunesia,0,-11.3,166.5
auhe1237,"Auhelawa,Austronesian,kud,Papunesia,0,-10.10,151.00
bwan1241,Bwanabwana,Austronesian,tte,Papunesia,2,-10.66,151.03
oyao1237,Oyaoya",Austronesian,oyy,Papunesia,0,-10.54,150.66
aulu1238,Aulua,Austronesian,aul,Papunesia,2,-16.36,167.7
aust1304,Austral,Austronesian,aut,Papunesia,4,-23.38,-149.46
avau1237,Avau,Austronesian,avb,Papunesia,1,-6.24,150.43
awad1244,Awad Bing,Austronesian,bcu,Papunesia,5,-5.6,146.35
axam1237,Axamb,Austronesian,ahb,Papunesia,0,-16.48,167.71
So the real issue seems to be that with --count
all other arguments except for the file, and "No header row" (such as the delimeter) are ignored.
This occurs because of the short circuit logic around line 60 of csvstat.py.
Makes sense. Meanwhile I narrowed the test file down to 360 lines. Cutting off 20 lines at the beginning makes the issue disappear, as well as cutting off 20 lines at the end.
And maybe I stumbled upon another issue: The double quotes in my example above have been inserted by csvsort
(see #421). A diff of the sort
ed lines of my minimal example and a csvsort
ed file contains:
237,240c237,240
< ormu1248,Ormu,Austronesian,orz,Papunesia,0,-2.51,140.60
< otda1235,Ot Danum,Austronesian,otd,Papunesia,10,-0.40,113.47
< ouma1237,Ouma,Austronesian,oum,Papunesia,0,-10.23,149.10
< oyao1237,Oya'oya,Austronesian,oyy,Papunesia,0,-10.54,150.66
---
> ormu1248,Ormu,Austronesian,orz,Papunesia,0,-2.51,140.6
> otda1235,Ot Danum,Austronesian,otd,Papunesia,10,-0.4,113.47
> ouma1237,Ouma,Austronesian,oum,Papunesia,0,-10.23,149.1
> oyao1237,Oyaoya",Austronesian,oyy,Papunesia,0,-10.54,150.66
The diff is a bit messy, though, because csvsort
does also normalize the float representation.
Again, I couldn't easily minimize this example beyond the 360 lines ...
This last bit does also mean, that the minimal example exhibiting the original issue doesn't actually contain any double quotes :) - just single quotes, which seem to have a special meaning for csvkit
.
Probably the original issue had to do with the "CSV sniffing"; it tries to detect what the separator and quote are. I guess csvstat -c 1
guessed the separator as ', and csvstat --count
guessed the separator as ". This is somewhat related to #166.
Maybe csvstat -c 1 -q '"' gl_languages3.csv
would give the correct result.
@edwardros Indeed:
$ csvstat -c 1 -q '"' gl_languages3.csv
1. id
<type 'unicode'>
Nulls: False
Unique values: 7941
Max length: 8
Row count: 7941
$ csvstat -c 1 gl_languages3.csv
1. id
<type 'unicode'>
Nulls: False
Unique values: 7939
Max length: 8
Row count: 7939
But having the sniffing behaviour on by default seems wrong to me. It may lead to subtle bugs and errors which are hard to debug.
Agreed that CSV sniffing produces subtle bugs that are hard to debug. Centralizing discussion in #166.
Running
csvstat
with different options reports inconsistent row counts:vs.