wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License
6k stars 605 forks source link

Inconsistent row counts reported by csvstat #433

Closed xrotwang closed 8 years ago

xrotwang commented 9 years ago

Running csvstat with different options reports inconsistent row counts:

$ csvstat -c 1 gl_languages3.csv 
  1. id
    <type 'unicode'>
    Nulls: False
    Unique values: 7939
    Max length: 8

Row count: 7939

vs.

$ csvstat --count gl_languages3.csv 
Row count: 7941
edwardros commented 9 years ago

Can you provide a minimal working example?

I suggest breaking gl_languages3.csv into halves to locate the source of the issue.

Without gl_languages3.csv (or another example) it's difficult to reproduce and diagnose this error.

xrotwang commented 9 years ago

Ok, I followed your advice splitting the file into ever smaller chunks. It turns out that in two rows a single " (double quote) appears. So apparently, the two rows in between these quotes are hidden, which may be correct behaviour:

$ csvstat -H --count problem.csv 
Row count: 8
$ csvstat -H -c 1 problem.csv 
  1. column1
    <type 'unicode'>
    Nulls: False
    Unique values: 8
    Max length: 8

Row count: 8

But the mystery remains, why csvstat --count on the big file reported a different row count. It is also strange, that I couldn't get csvstat to ignore the " (double quote) as quote character:

$ csvstat -b  -H --count problem.csv 
Row count: 8
$ csvstat -q _  -H --count problem.csv 
Row count: 8

The 10-line chunk looks as follows:

asss1237,As,Austronesian,asz,Papunesia,0,-0.75,131.73
asum1237,Asumboa,Austronesian,aua,Papunesia,0,-11.3,166.5
auhe1237,"Auhelawa,Austronesian,kud,Papunesia,0,-10.10,151.00
bwan1241,Bwanabwana,Austronesian,tte,Papunesia,2,-10.66,151.03
oyao1237,Oyaoya",Austronesian,oyy,Papunesia,0,-10.54,150.66
aulu1238,Aulua,Austronesian,aul,Papunesia,2,-16.36,167.7
aust1304,Austral,Austronesian,aut,Papunesia,4,-23.38,-149.46
avau1237,Avau,Austronesian,avb,Papunesia,1,-6.24,150.43
awad1244,Awad Bing,Austronesian,bcu,Papunesia,5,-5.6,146.35
axam1237,Axamb,Austronesian,ahb,Papunesia,0,-16.48,167.71
edwardros commented 9 years ago

So the real issue seems to be that with --count all other arguments except for the file, and "No header row" (such as the delimeter) are ignored.

This occurs because of the short circuit logic around line 60 of csvstat.py.

xrotwang commented 9 years ago

Makes sense. Meanwhile I narrowed the test file down to 360 lines. Cutting off 20 lines at the beginning makes the issue disappear, as well as cutting off 20 lines at the end.

And maybe I stumbled upon another issue: The double quotes in my example above have been inserted by csvsort (see #421). A diff of the sorted lines of my minimal example and a csvsorted file contains:

237,240c237,240
< ormu1248,Ormu,Austronesian,orz,Papunesia,0,-2.51,140.60
< otda1235,Ot Danum,Austronesian,otd,Papunesia,10,-0.40,113.47
< ouma1237,Ouma,Austronesian,oum,Papunesia,0,-10.23,149.10
< oyao1237,Oya'oya,Austronesian,oyy,Papunesia,0,-10.54,150.66
---
> ormu1248,Ormu,Austronesian,orz,Papunesia,0,-2.51,140.6
> otda1235,Ot Danum,Austronesian,otd,Papunesia,10,-0.4,113.47
> ouma1237,Ouma,Austronesian,oum,Papunesia,0,-10.23,149.1
> oyao1237,Oyaoya",Austronesian,oyy,Papunesia,0,-10.54,150.66

The diff is a bit messy, though, because csvsort does also normalize the float representation.

Again, I couldn't easily minimize this example beyond the 360 lines ...

xrotwang commented 9 years ago

This last bit does also mean, that the minimal example exhibiting the original issue doesn't actually contain any double quotes :) - just single quotes, which seem to have a special meaning for csvkit.

edwardros commented 9 years ago

Probably the original issue had to do with the "CSV sniffing"; it tries to detect what the separator and quote are. I guess csvstat -c 1 guessed the separator as ', and csvstat --count guessed the separator as ". This is somewhat related to #166.

Maybe csvstat -c 1 -q '"' gl_languages3.csv would give the correct result.

xrotwang commented 9 years ago

@edwardros Indeed:

$  csvstat -c 1 -q '"' gl_languages3.csv
  1. id
    <type 'unicode'>
    Nulls: False
    Unique values: 7941
    Max length: 8

Row count: 7941
$ csvstat -c 1 gl_languages3.csv
  1. id
    <type 'unicode'>
    Nulls: False
    Unique values: 7939
    Max length: 8

Row count: 7939

But having the sniffing behaviour on by default seems wrong to me. It may lead to subtle bugs and errors which are hard to debug.

jpmckinney commented 8 years ago

Agreed that CSV sniffing produces subtle bugs that are hard to debug. Centralizing discussion in #166.