wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

Custom delimiter for grouping option #1253

Closed lukaspirpamer closed 1 month ago

lukaspirpamer commented 4 months ago

When the grouping option is enabled, "," is used as the field delimiter and the delimiter of the input CSV file is ignored.
Would it be possible to use the input file's delimiter instead, whether it is specified explicitly or detected automatically?

https://github.com/wireservice/csvkit/blob/f73742fc0ec4c993b5f76809ee15dfab8a0cef10/csvkit/utilities/csvstack.py#L50

jpmckinney commented 4 months ago

Why?

lukaspirpamer commented 4 months ago

Hi James, thanks for your prompt reply!

Because the grouping option converts the delimiter to ",". For example, when I use csvstack on input files delimited by ";", the output file uses "," once the grouping option is applied. I would have expected the same behaviour as without the grouping option. What do you think?

Best, Lukas

jpmckinney commented 4 months ago

Can you provide a sample command, with sample input?

All csvkit commands except in2csv assume that the comma is the delimiter.

If you do the following, the semi-colons are preserved only because csvstack considers them to be part of the data, rather than considering them delimiters:

$ printf 'a;b;c\n1;2;3' | csvstack
a;b;c
1;2;3

You can set a custom delimiter with -d:

$ printf 'a;b;c\n1;2;3' | csvstack -d ';'
a,b,c
1,2,3

You'll see that csvstack now understands that ; is the delimiter, and therefore uses commas in the output.

To get output that uses a different delimiter, you must use csvformat.

The reason for this design decision is that all tools share a common format: only in2csv (along with options like -d) controls how the input format is read, and only csvformat controls the output format. This avoids having to reconfigure the input and output in every single command when piping output between commands.
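
As a rough sketch of that pipeline design, the commands below read semicolon-delimited input, select two columns, and write semicolons again. It assumes csvkit is installed, and that csvformat's -D option (output delimiter) is available in your version; the csvcut step is only there to stand in for "any intermediate command".

```shell
# Only the first command is told about the input delimiter (-d ';').
# csvcut in the middle reads and writes plain commas, with no extra flags.
# Only the last command, csvformat, sets the output delimiter (-D ';').
printf 'a;b;c\n1;2;3\n' | csvstack -d ';' | csvcut -c a,c | csvformat -D ';'
# Expected output:
# a;c
# 1;3
```

Note that -d (lowercase) describes the input and -D (uppercase) describes the output.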

jpmckinney commented 4 months ago

Basically, if you are currently doing csvstack a.csv lot.csv of.csv files.csv that.csv use.csv semicolons.csv, then you are effectively doing the same as cat ... on those files. Without -d, csvstack doesn't recognize the semicolons as delimiters; with -d it does, and the output then uses commas, as described above.
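
For the grouping case from the original question, a minimal sketch might look like the following. The file names first.csv and second.csv and the group labels x and y are made up for illustration; the sketch assumes csvkit is installed and that -g takes a comma-separated list of group values, one per input file.

```shell
# Create two small semicolon-delimited sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'a;b\n1;2\n' > "$tmp/first.csv"
printf 'a;b\n3;4\n' > "$tmp/second.csv"

# -d ';' tells csvstack how to parse the inputs; -g adds a grouping column.
# csvstack itself writes commas, so csvformat -D ';' converts the result
# back to semicolons at the end of the pipeline.
csvstack -d ';' -g x,y "$tmp/first.csv" "$tmp/second.csv" | csvformat -D ';'

# Clean up the scratch directory.
rm -r "$tmp"
```

The output should carry the extra grouping column alongside a and b, with every field separated by ";" rather than ",".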