wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

Show sniffed delimiter on exception #1011

Open wataash opened 5 years ago

wataash commented 5 years ago
# colA,colB
# aaaaa...aaaaa zzzzz...zzzzz  \
# ...                           } 10 or 100 rows
# aaaaa...aaaaa zzzzz...zzzzz  /
#
# \___________/ \___________/
#  1000chars     1000chars

# 10 rows
# "," is used as delimiter
python3 -c "print('colA,colB') ; [print('a'*1000 + ' ' + 'z'*1000) for _ in range(10)]" | csvstat
# => ok

# 100 rows
# " " is used as delimiter
python3 -c "print('colA,colB') ; [print('a'*1000 + ' ' + 'z'*1000) for _ in range(100)]" | csvstat
# => Row 0 has 3 values, but Table only has 2 columns.

In the latter case, the sample is trimmed and loses the header colA,colB, so the space character " " is chosen as the delimiter.
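
The effect can be explored outside csvkit with a minimal sketch built on the standard library's `csv.Sniffer`. It assumes csvkit's sniffing behaves similarly to the standard sniffer, which this thread does not confirm; `sniffed_delimiter`, `make_input`, and the `limit` parameter are hypothetical names introduced only for illustration.

```python
import csv


def sniffed_delimiter(sample, limit=None):
    """Return the delimiter csv.Sniffer guesses for `sample`, or None if it gives up.

    `limit` is a hypothetical stand-in for a sniff limit: when set, only the
    first `limit` characters of the sample are passed to the sniffer.
    """
    if limit is not None:
        sample = sample[:limit]
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        # The sniffer raises csv.Error when it cannot determine a delimiter.
        return None


def make_input(rows):
    """Build the input from the report: a header plus `rows` long lines."""
    header = "colA,colB\n"
    line = "a" * 1000 + " " + "z" * 1000 + "\n"
    return header + line * rows


if __name__ == "__main__":
    # Print the guess for both cases from the report; results may differ
    # from csvkit's, since this calls the standard sniffer directly.
    for rows in (10, 100):
        print(rows, "rows ->", repr(sniffed_delimiter(make_input(rows))))
```

Printing the guessed delimiter like this is essentially what the requested debug output would do inside csvkit.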

This behavior was hard for me to track down. So how about showing which delimiter is used in:

  1. Debug output
$ csvstat -v ...
inferred delimiter: ' '
  2. Error message
$ csvstat -v ...
Row 0 has 3 values, but Table only has 2 columns (delimiter: ' ').

And how about showing a warning when the input exceeds SNIFF_LIMIT?

$ csvstat -v ...
warning: input (XXX bytes) exceeds SNIFF_LIMIT (YYY bytes), delimiter guessing may be incorrect (NOTE: SNIFF_LIMIT can be changed by -y flag)
warning: guessed delimiter: ' '
Row 0 has 3 values, but Table only has 2 columns.
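
A rough sketch of how such a warning could be produced, assuming the caller already knows the input size in bytes, the sniff limit, and the guessed delimiter. `sniff_limit_warning` and `report` are hypothetical helpers, not csvkit functions, and the message text simply follows the wording proposed above.

```python
import sys


def sniff_limit_warning(input_size, sniff_limit):
    """Return the proposed warning text, or None when the input fits in the sample.

    A `sniff_limit` of None is treated here as "no limit" (an assumption).
    """
    if sniff_limit is None or input_size <= sniff_limit:
        return None
    return (
        f"warning: input ({input_size} bytes) exceeds SNIFF_LIMIT "
        f"({sniff_limit} bytes), delimiter guessing may be incorrect"
    )


def report(input_size, sniff_limit, delimiter, stream=sys.stderr):
    """Write the size warning (if any) and the guessed delimiter to `stream`."""
    message = sniff_limit_warning(input_size, sniff_limit)
    if message is not None:
        print(message, file=stream)
    print(f"warning: guessed delimiter: {delimiter!r}", file=stream)


if __name__ == "__main__":
    # Example values only: a 200 KB input checked against a 1024-byte limit.
    report(input_size=200_000, sniff_limit=1024, delimiter=" ")
```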

jpmckinney commented 5 years ago

Thanks - we'll try to do this as part of the next version.

jpmckinney commented 11 months ago

Hmm, agate raises ValueError for errors like "Row 0 has 3 values, but Table only has 2 columns." in agate/table/__init__.py. We'd have to introduce a new error class (subclassing ValueError, in case anyone catches these). We'd also have to handle it all over the place, because we need access to the reader to print the dialect.
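
As a minimal sketch of that idea, the example below defines a hypothetical `DelimiterValueError` that subclasses `ValueError` and carries the delimiter taken from the reader's dialect. Neither the class nor `read_rows` exists in agate or csvkit; `read_rows` is only a stand-in for the row-length check, written against the standard library's `csv.reader`.

```python
import csv
import io


class DelimiterValueError(ValueError):
    """Hypothetical ValueError subclass that also reports the delimiter in use."""

    def __init__(self, message, delimiter):
        # Append the delimiter to the original message, as proposed in this issue.
        super().__init__(f"{message.rstrip('.')} (delimiter: {delimiter!r}).")
        self.delimiter = delimiter


def read_rows(text, delimiter):
    """Parse `text`, raising if a row has more values than the header.

    This is a stand-in check for illustration, not agate's implementation.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = []
    for i, row in enumerate(reader):
        if len(row) > len(header):
            # The reader is in scope here, so its dialect can be reported.
            raise DelimiterValueError(
                f"Row {i} has {len(row)} values, "
                f"but Table only has {len(header)} columns.",
                reader.dialect.delimiter,
            )
        rows.append(row)
    return header, rows


if __name__ == "__main__":
    try:
        read_rows("colA,colB\nx y z\n", " ")
    except ValueError as exc:  # existing `except ValueError` handlers still match
        print(exc)
```

Because the new class subclasses `ValueError`, code that already catches `ValueError` keeps working, while new code can read `exc.delimiter`.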

Debug output

This is a good idea. As above, we'd have to add it in a lot of places. Happy to merge a PR!

And how about showing a warning when the input exceeds SNIFF_LIMIT?

The snifflimit was reduced in 1.0.7 to avoid sniffing huge files (which is very slow). So, this warning would now be emitted too frequently to be useful.