wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

Numerical csvsort with --no-inference #1125

Closed niklaswallerstedt closed 3 years ago

niklaswallerstedt commented 3 years ago

I'm trying to sort a list of customer IDs numerically. A somewhat similar issue is #637.

Given this made-up list:

id,date,boolean
5009,2021-06-08 15:09:11,true
515,2020-11-08 15:09:11,false

If I run this:

csvsort -c 1 input.csv

the sorting works as expected, except that type inference rewrites the values: the dates gain a T separator and the booleans are capitalized.

id,date,boolean
515,2020-11-08T15:09:11,False
5009,2021-06-08T15:09:11,True

Tried this:

csvsort -c 1 -I input.csv

However, as expected, the sorting is no longer numerical: the ids are compared as text, so 5009 sorts before 515.

id,date,boolean
5009,2021-06-08 15:09:11,true
515,2020-11-08 15:09:11,false

How do I keep the numerical sort but preserve the values that --no-inference gets me?

And... as I was typing this up, I found that this does what I want:

sort -t, -n input.csv

brewingcode commented 3 years ago

I ran into this same thing: the type inference is wrecking the output, BUT the type inference is required for the sort to treat values as numeric. @niklaswallerstedt, thanks for posting the sort version. I just ended up re-ordering my CSV columns so that the field I wanted to sort by was first, and sort -nt, did the job. Bummer that csvsort couldn't do it.
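
For readers who would rather not reorder columns, the same effect can be sketched with Python's standard csv module: convert only the sort key to a number and write the rows back with their original text. This is a minimal sketch, not csvkit behavior. The function name sort_rows_numerically and its column parameter are hypothetical, and it assumes a header row and that every value in the chosen column parses as a number (anything else raises ValueError). The writer may also normalize quoting of fields that were quoted unnecessarily.

```python
import csv
import io


def sort_rows_numerically(text, column=0):
    """Sort CSV rows by one column as numbers, keeping every cell's text as read.

    Hypothetical helper for illustration; not part of csvkit.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Only the sort key is converted; the rows keep their original strings.
    body.sort(key=lambda row: float(row[column]))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()


sample = (
    "id,date,boolean\n"
    "5009,2021-06-08 15:09:11,true\n"
    "515,2020-11-08 15:09:11,false\n"
)
result = sort_rows_numerically(sample)
print(result, end="")
```

With the sample from this issue, 515 moves ahead of 5009 while the date and boolean columns keep the spelling they had in the input.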

niklaswallerstedt commented 3 years ago

@brewingcode I'm glad I could help.

I saw this mentioned in the docs:

If your file is large, try sort -t, file.csv instead.

but I agree it would be great to be able to pass --numerical in conjunction with -I (-n is already in use) or something similar to preserve the output.

jpmckinney commented 3 years ago

I think we would need to be able to enable inference on only specific columns, which is the topic of #151. Adding something like --numerical would essentially be the same thing. #151 has the advantage that it would build on existing code, rather than adding a specific code path just for csvsort.

Closing in favor of #151.
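
To illustrate the idea of enabling inference on only specific columns, here is a rough Python sketch in which the caller names the columns to convert and every other column stays a string. It is an assumption about what such a feature could look like, not csvkit's API and not the design discussed in #151. The names cast_selected and casters are hypothetical.

```python
import csv
import io


def cast_selected(text, casters):
    """Parse CSV text, applying a caster only to the named columns.

    Hypothetical helper for illustration; columns not listed in
    `casters` are left as the strings that were read.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name, cast in casters.items():
            row[name] = cast(row[name])
        rows.append(row)
    return rows


sample = (
    "id,date,boolean\n"
    "5009,2021-06-08 15:09:11,true\n"
    "515,2020-11-08 15:09:11,false\n"
)
# Treat only the id column as an integer, then sort on it.
typed = cast_selected(sample, {"id": int})
typed.sort(key=lambda row: row["id"])
```

Here only id becomes an integer, so the sort is numerical, while date and boolean remain exactly as they appeared in the file.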