wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

Dialect Sniffer warning on innocuous operation(s) #1198

Closed steve-estes closed 1 year ago

steve-estes commented 1 year ago

I have a very small (~1 KB) test file, 4 columns with 9 data rows, chopped down to minimal size, which yields a weird agate sniffer warning when doing a normal csvsort operation. The file is attached (csvkit-error-poc-3.csv) and contains nothing sensitive. In its entirety:

source_file,SA_ID,Rate Plan,Mail Address
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,2224704,W-NRESC ,PO BOX 48620   CUMBERLAND
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,6587198,W-RESC  ,4671 Revelstroke Rd   HOPE MILLS
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,4184127,W-RESC  ,4671 Revelstroke Rd   HOPE MILLS
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,2645511,W-RESC  ,
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,4257540,W-RESC  ,639 EXECUTIVE PL SUITE 400   Fayetteville
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,9647673,WW-RESC ,837 Shaw Mill Rd Apt C   Fayetteville
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,5308257,101TC   ,4671 Revelstroke Rd   HOPE MILLS
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,8111490,110TC   ,2735 FREEDOM PARKWAY DR - UNIT C-7   Fayetteville
Point-StartServicesFrom7dayspriortolastdatarefresh2023.02.17.csv,2505637,W-RESC  ,2000 BEDLOE ST   Fayetteville

And I get this warning when sorting by any column:

(base) data % csvsort -c 1 csvkit-error-poc-3.csv > sorted-csvkit-error-poc.csv
/Users/steve/anaconda3/lib/python3.8/site-packages/agate/table/from_csv.py:74: RuntimeWarning: Error sniffing CSV dialect: Could not determine delimiter

Notes:

jpmckinney commented 1 year ago

The linked issue is unrelated. (Sniffing is deterministic, so intermittent failures have another cause.)

In this case, it's known that sniffing can be wrong. From Python's docs:

"This method is a rough heuristic and may produce both false positives and negatives."

The solution is to either skip sniffing (-y 0) or sniff the entire file (-y -1).
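
For anyone hitting the same thing from Python rather than the command line, the same idea can be sketched with the standard library's csv.Sniffer. This is only an illustration, not csvkit's or agate's actual code: pick_dialect and its sniff_limit parameter are hypothetical names, and the fallback to csv.excel (plain comma-separated) is an assumption about what a sensible default would be.

```python
import csv


def pick_dialect(text, sniff_limit):
    """Return a csv dialect for `text`, falling back to csv.excel.

    Hypothetical helper. sniff_limit loosely mirrors the -y values
    mentioned above: 0 skips sniffing, a negative value sniffs all of
    `text`, and any other value sniffs that many leading characters.
    """
    if sniff_limit == 0:
        # Sniffing disabled: assume ordinary comma-separated input.
        return csv.excel
    sample = text if sniff_limit < 0 else text[:sniff_limit]
    try:
        return csv.Sniffer().sniff(sample)
    except csv.Error:
        # The sniffer raises csv.Error (for example "Could not determine
        # delimiter") when its heuristic gives up; use the default.
        return csv.excel
```

The returned dialect can be passed straight to csv.reader(f, dialect=...). When the delimiter is already known, skipping the sniffer entirely (sniff_limit of 0 here) avoids the heuristic altogether.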

steve-estes commented 1 year ago

To be clear, this warning is raised even when sniffing the entire file. I reduced my original input file to the smallest file that still exhibits the behavior. It doesn't matter how much you sniff, unless the sample is smaller than this file.
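
One way to check this kind of claim on a given file is to run the standard library sniffer on progressively longer prefixes and see where it starts to fail. A rough sketch, assuming the file's contents are already loaded into a string; sniff_by_line_count is a hypothetical helper name, and it calls csv.Sniffer directly rather than going through agate:

```python
import csv


def sniff_by_line_count(text):
    """Sniff the first 1, 2, ... n lines of `text`.

    Returns a list of (line_count, delimiter) pairs, where delimiter is
    None when csv.Sniffer raised csv.Error for that prefix.
    """
    lines = text.splitlines(keepends=True)
    results = []
    for n in range(1, len(lines) + 1):
        sample = "".join(lines[:n])
        try:
            delimiter = csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            delimiter = None
        results.append((n, delimiter))
    return results
```

Calling it on the contents of csvkit-error-poc-3.csv would show which prefixes, if any, the sniffer can handle.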

jpmckinney commented 1 year ago

Yes - in some circumstances, Python's sniffer is always wrong.