wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License
5.98k stars 608 forks source link

added 'column numbers everywhere' to csvcut #1209

Closed athalhammer closed 12 months ago

athalhammer commented 1 year ago

I've run frequently into the problem that I'm facing wide CSV files where there is some interesting string/text somewhere in a long line of empty fields or numbers.

Example:

In which columns is ipsum (first row, after header) actually showing up?

field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4

We want to approach this without manually counting the commas.

The new addition provides this and in combination with grep the result will be highlighted:

$ csvcut -N test.csv | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,"45~evil, ipsum, with commas,",46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4

$ csvcut -c 45,66,74 test.csv 
field45,field66,field74
"evil, ipsum, with commas,",ipsum,ipsum

This is not a breaking change as it keeps all columns where they are. It alters all the fields but it's sole purpose is for exploratory reasons.

Also, please let me know if you know a better or any solution how the above can be achieved without this modification.

athalhammer commented 1 year ago

@jpmckinney I put it for csvcut as it was the most straightforward code change. Maybe agate would also be a good place to put it but it probably triggers more code changes for a feature that is not frequently sought after (see https://stackoverflow.com/questions/55253173/return-the-column-name-for-a-column-containig-a-text-value)

athalhammer commented 1 year ago

@jpmckinney fixed the flake8 issue

athalhammer commented 1 year ago

@jpmckinney for some reason SQLAlchemy 2.x.x is installed in the runner - that triggers the job fails. In the previous jobs it was SQLAlchemy 1.4.x

https://github.com/wireservice/csvkit/actions/runs/6371290078/job/17361039598#step:4:186

athalhammer commented 1 year ago

culprit here: https://github.com/wireservice/agate-sql/pull/40

athalhammer commented 1 year ago

@jpmckinney I pinned agate-sql to <0.6.0 to make the tests run for this PR. I believe we should address the migration to agate-sql 0.6.0 and SQLAlchemy>=2.x.x in a separate PR.

coveralls commented 1 year ago

Coverage Status

coverage: 88.239% (-0.2%) from 88.443% when pulling 61eed27a72b785da4bf26cd858f70855aec9119f on athalhammer:master into fa9e0db1cdd30757835912e48467d6ab2be02c14 on wireservice:master.

coveralls commented 1 year ago

Coverage Status

coverage: 88.362% (-0.08%) from 88.443% when pulling 61eed27a72b785da4bf26cd858f70855aec9119f on athalhammer:master into fa9e0db1cdd30757835912e48467d6ab2be02c14 on wireservice:master.

athalhammer commented 1 year ago

Sorry @jpmckinney , should have added a test on this feature earlier. Coverage should go up again with the last commit

jpmckinney commented 12 months ago

Hmm, what about putting this in csvlook?

athalhammer commented 12 months ago

I think having the option to combine it with large tables is intriguing, e.g. csvcut -c 700-800 -N ... The output can always be piped to csvlook.

One way how I have come to love this feature is by "collecting" relevant columns in the following way:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -i ipsum

In that way I can quickly collect all relevant "ipsum" columns in a long and wide table and eventually just leave the -N away and switch -C for -c

(see above)

$ csvcut -c 45,66,74 test.csv
jpmckinney commented 12 months ago

As implemented, this seems more like a new utility – like csvprefix – than a logical addition to csvcut. (Changing prefix to cut is the only additional work when switching between commands.) However, csvkit isn't accepting new utilities.

For your example, you can achieve the same with existing tools:

$ csvgrep -m ipsum -c 1-75 -a test.csv | tail +2 | csvgrep -n | grep ipsum
 45: evil, ipsum, with commas,
 66: ipsum
 74: ipsum
athalhammer commented 12 months ago

I agree with you, my example was not comprehensive enough (single line only):

Let's extend to the following (still not big enough, but I hope it transmits the point)

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

Then I can see in one "glimpse" (without working through the data line by line with tail +x) which numbers pop up in the first x (head -n x) lines:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4
1~,2~,3~,4~,5~ipsum,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~4,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~41,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.98,41~,42~34,43~,44~,46~,47~lorem,48~,49~,50~lorem,51~,52~,53~,54~ipsum,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~,67~,68~,69~,70~,71~,72~3,73~,74~lorem,75~5

Then I can easily add 66,74,5,54 to my -C and I execute again (on a file with > 1000s of lines) to see whether I caught the "bulk of columns where 'ipsum' may show up". Empty result basically means: you have all the "ipsums" and then I can continue working with some "lorems" etc. to make sure I have identified all the columns where "a user gave negative feedback" (or something like that)

athalhammer commented 12 months ago

Maybe it fits more naturally to csvgrep as a parameter which indicates: "match any character but remove columns that don't have a match in any line"... But then you want be able to specify e.g. the primary key column and others to be always there (even though there is no match). It will also work only with -a, so something like:

csvgrep -a -m ipsum -c 1-75 --remove-non-matching-except 1,2
jpmckinney commented 12 months ago

csvgrep streams the data, so it can't remove non-matching columns.

To work on multiple lines, you can add xargs. I use \x1e (record separator character) in case the cells contain newlines.

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum
 54: ipsum
athalhammer commented 12 months ago

okay, point taken, so the conclusion would be to add a new tool named csvprefix that has the option to use column names or numbers as prefixes in cells and maybe some user-provided string arguments and a separator parameter (default ~) and maybe a "--include-header" parameter?

csvprefix --column-numbers  --include-header test.csv
...

csvprefix --column-numbers --separator "=" test.csv
...

csvprefix --column-names test.csv

csvprefix --prefix "asdf" test.csv
...
jpmckinney commented 12 months ago

Is there an issue with using xargs? csvkit's philosophy is to compose commands with pipes and to not invent new commands if existing ones are fine.

csvkit is in maintenance mode, so only existing commands are being maintained. Additional commands would have to be created as independent Python packages.

athalhammer commented 12 months ago

Hm, I'm not an expert and probably there's a way to fix it but I could not get it run with:

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,"fdsa

 ipsum",,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

At the same time, it revealed that it may not be the best idea either as it tries to execute content as code:

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
,,,,ipsum,,,,,,,,,,4,,,,,,,,,,41,,,,fdsa
sh: 3: ipsum,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5: not found
StopIteration: 
jpmckinney commented 12 months ago

You just need to quote the replstr in xargs:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
athalhammer commented 12 months ago

That again returns invalid results for me

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum

(the file only has 75 columns)

jpmckinney commented 12 months ago

Ah, yeah, need to use single quotes:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c "echo '_' | csvcut -n" | grep ipsum
jpmckinney commented 12 months ago

Even more robust:

csvgrep -m ipsum -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep ipsum
athalhammer commented 12 months ago

The last one does the trick. Thanks for adding it to the docs - I will need to look this up now every time.

Still believe that the -N option (for the 10 loc it adds) was more elegant, has probably many more use cases (in addition to the examined one), and is not completely off to hang out in csvcut (one could ask why -n is there and not in some csvinfo tool, too)

jpmckinney commented 12 months ago

This would have been better as an issue, but I'll close this PR for now, and perhaps if there is any interest from other users they will still find the PR discussion.

athalhammer commented 12 months ago

I agree with you, @jpmckinney, that this would have been better as an issue. Thanks for your time!