wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

csvclean: Doesn't behave as expected if header row is too short #1237

Closed lamyergeier closed 2 months ago

lamyergeier commented 2 months ago

For example, if the delimiter is a comma:

a,b,c,d
e,f
g,h,i
j
k,l,m,n,o

I would like to get

a,b,c,d,
e,f,,,
g,h,i,,
j,,,,
k,l,m,n,o

This is necessary because I want to convert the CSV to JSON, but that doesn't work when the rows don't all have the same number of entries. I would also like the first column to have the header "name" and the remaining columns to be collected into a "tag" array.

In the above example, I want to produce the following for each row:

[
{
"name": "a",
"tag": [ "b", "c", "d" ]
}
]
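
For the name/tag conversion on its own, padding may not be strictly needed. The following is a minimal sketch using only the Python standard library, not a csvkit command; the function name `rows_to_records` is hypothetical, and it assumes every row (including the first) is data, as in the example above.

```python
import csv
import io
import json


def rows_to_records(text, delimiter=","):
    """Map each ragged row to {"name": first field, "tag": remaining fields}."""
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        records.append({"name": row[0], "tag": row[1:]})
    return records


# The ragged input from this issue, treated as data rows only (assumption).
sample = "a,b,c,d\ne,f\ng,h,i\nj\nk,l,m,n,o\n"
print(json.dumps(rows_to_records(sample), indent=2))
```

Because each row is sliced independently, a row with a single field simply gets an empty "tag" list, so the rows never have to be the same length.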

jpmckinney commented 2 months ago

See this example: https://csvkit.readthedocs.io/en/latest/scripts/csvjoin.html#examples

lamyergeier commented 2 months ago

@jpmckinney In the example above, each row has a different number of values, and I just want to append empty values (with delimiters) to the end of each row so that all rows have the same number of columns. I am not sure the example at the URL you gave would work in this case. Could you please write the command for the above example?

jpmckinney commented 2 months ago

If you manually add a comma to the first row, you can then do csvcut myfile.csv and it will fill in the missing commas in the other rows.

jpmckinney commented 2 months ago

I'm reopening this issue as csvclean should also be able to fix this, but there's a bug.

jpmckinney commented 2 months ago

csvclean is a streaming tool, but adding a missing column requires (in the worst case) reading the entire file once to find the widest row, and only then constructing the output. Since csvkit works with standard input, which can't be read a second time, we would need to keep the entire file in memory, which is something we try to avoid.

I'll add this to the docs.
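
To illustrate the trade-off described above, here is a minimal sketch of the padding step in plain Python. It is not how csvclean is implemented; `pad_rows` and `write_padded` are hypothetical names. Note that the whole input has to be buffered before the first output row can be written, because the widest row may be the last one.

```python
import csv
import io


def pad_rows(lines, delimiter=","):
    """Pad every row with empty fields up to the width of the widest row."""
    rows = list(csv.reader(lines, delimiter=delimiter))  # buffers the entire input
    width = max((len(row) for row in rows), default=0)   # known only after a full pass
    return [row + [""] * (width - len(row)) for row in rows]


def write_padded(text, delimiter=","):
    """Return the padded CSV as a string (hypothetical helper for illustration)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(pad_rows(io.StringIO(text), delimiter=delimiter))
    return out.getvalue()


# The ragged input from this issue; the widest row (5 fields) comes last.
sample = "a,b,c,d\ne,f\ng,h,i\nj\nk,l,m,n,o\n"
print(write_padded(sample), end="")
```

With a seekable file, one could instead make two passes and keep memory use constant, but that option is not available when the data arrives on standard input.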