Closed athalhammer closed 12 months ago
@jpmckinney I put it in csvcut as it was the most straightforward code change. Maybe agate would also be a good place for it, but that would probably trigger more code changes for a feature that is not frequently sought after (see https://stackoverflow.com/questions/55253173/return-the-column-name-for-a-column-containig-a-text-value).
@jpmckinney fixed the flake8 issue.
@jpmckinney for some reason SQLAlchemy 2.x.x is installed in the runner, which causes the job to fail. In the previous jobs it was SQLAlchemy 1.4.x:
https://github.com/wireservice/csvkit/actions/runs/6371290078/job/17361039598#step:4:186
The culprit is here: https://github.com/wireservice/agate-sql/pull/40
@jpmckinney I pinned agate-sql to <0.6.0 to make the tests run for this PR. I believe we should address the migration to agate-sql 0.6.0 and SQLAlchemy>=2.x.x in a separate PR.
Sorry @jpmckinney, I should have added a test for this feature earlier. Coverage should go up again with the last commit.
Hmm, what about putting this in csvlook?
I think having the option to combine it with large tables is intriguing, e.g. `csvcut -c 700-800 -N ...`. The output can always be piped to csvlook.
One way I have come to love this feature is by "collecting" relevant columns in the following way:
$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -i ipsum
In that way I can quickly collect all relevant "ipsum" columns in a long and wide table, and eventually just leave the `-N` off and switch `-C` for `-c` (see above):
$ csvcut -c 45,66,74 test.csv
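The column-collecting step above can be sketched in plain stdlib Python. This is an illustration only, not csvkit code; `columns_containing` and the toy data are hypothetical:

```python
import csv
import io

# Toy stand-in for a wide CSV (hypothetical data, not the file from the thread).
data = """field1,field2,field3,field4
,ipsum,,lorem
ipsum,,,3
"""

def columns_containing(text, needle):
    """Return the sorted 1-based indices of columns in which any cell
    contains `needle` (case-insensitive)."""
    hits = set()
    for row in csv.reader(io.StringIO(text)):
        for i, cell in enumerate(row, 1):
            if needle.lower() in cell.lower():
                hits.add(i)
    return sorted(hits)

print(columns_containing(data, "ipsum"))  # → [1, 2]
```

This is what the `-N` prefixes let you read off directly from the grep output, without a second tool.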
As implemented, this seems more like a new utility – like csvprefix – than a logical addition to csvcut. (Changing `prefix` to `cut` is the only additional work when switching between commands.) However, csvkit isn't accepting new utilities.
For your example, you can achieve the same with existing tools:
$ csvgrep -m ipsum -c 1-75 -a test.csv | tail +2 | csvgrep -n | grep ipsum
45: evil, ipsum, with commas,
66: ipsum
74: ipsum
I agree with you, my example was not comprehensive enough (a single line only). Let's extend it to the following (still not big enough, but I hope it gets the point across):
$ cat test.csv
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5
Then I can see at one glance (without working through the data line by line with `tail +x`) which column numbers pop up in the first x (`head -n x`) lines:
$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4
1~,2~,3~,4~,5~ipsum,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~4,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~41,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.98,41~,42~34,43~,44~,46~,47~lorem,48~,49~,50~lorem,51~,52~,53~,54~ipsum,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~,67~,68~,69~,70~,71~,72~3,73~,74~lorem,75~5
Then I can easily add 66,74,5,54 to my `-C` and execute again (on a file with thousands of lines) to see whether I caught the bulk of columns where "ipsum" may show up. An empty result basically means: you have all the "ipsums". Then I can continue working with some "lorems" etc. to make sure I have identified all the columns where "a user gave negative feedback" (or something like that).
Maybe it fits more naturally into csvgrep, as a parameter which indicates: "match any character but remove columns that don't have a match in any line"... But then you want to be able to specify e.g. the primary key column and others that should always be there (even when there is no match). It would also only work with `-a`, so something like:
csvgrep -a -m ipsum -c 1-75 --remove-non-matching-except 1,2
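That proposed semantics can be sketched in stdlib Python. The function and argument names below just mirror the suggestion and are hypothetical; notably, the sketch has to buffer every row before it knows which columns survive, which hints at why a streaming tool would struggle with it:

```python
import csv
import io

def drop_nonmatching_columns(text, needle, always_keep=()):
    """Keep only the columns in which `needle` matches in at least one row,
    plus the 1-based column indices in `always_keep` (e.g. a primary key).
    Note: this buffers all rows before deciding which columns to keep."""
    rows = list(csv.reader(io.StringIO(text)))
    keep = set(always_keep)
    for row in rows[1:]:  # skip the header when looking for matches
        for i, cell in enumerate(row, 1):
            if needle.lower() in cell.lower():
                keep.add(i)
    cols = sorted(keep)
    return [[row[i - 1] for i in cols] for row in rows]

data = "id,a,b,c\n1,,ipsum,\n2,x,,\n"
for row in drop_nonmatching_columns(data, "ipsum", always_keep=(1,)):
    print(",".join(row))  # prints: id,b / 1,ipsum / 2,
```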
csvgrep streams the data, so it can't remove non-matching columns.
To work on multiple lines, you can add xargs. I use \x1e (the record separator character) as the delimiter in case the cells contain newlines.
$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
46: ipsum
69: ipsum
77: ipsum
5: ipsum
54: ipsum
Okay, point taken, so the conclusion would be to add a new tool named csvprefix that has the option to use column names or numbers as prefixes in cells, maybe some user-provided string arguments, a separator parameter (default `~`), and maybe an `--include-header` parameter?
csvprefix --column-numbers --include-header test.csv
...
csvprefix --column-numbers --separator "=" test.csv
...
csvprefix --column-names test.csv
csvprefix --prefix "asdf" test.csv
...
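For illustration, the core of such a hypothetical csvprefix would only be a few lines of stdlib Python. Nothing below exists in csvkit; the function and option names simply mirror the suggestion above:

```python
import csv
import io

def prefix_cells(reader, writer, separator="~", column_names=False,
                 include_header=False):
    """Prefix every cell with its 1-based column number (or its header
    name, if `column_names` is True) followed by `separator`."""
    header = next(reader)
    labels = header if column_names else [str(i) for i in range(1, len(header) + 1)]
    if include_header:
        writer.writerow(header)
    for row in reader:
        writer.writerow(f"{label}{separator}{cell}"
                        for label, cell in zip(labels, row))

data = "a,b\nx,\n,y\n"
out = io.StringIO()
prefix_cells(csv.reader(io.StringIO(data)), csv.writer(out))
print(out.getvalue())  # 1~x,2~ then 1~,2~y
```

Using `csv.reader`/`csv.writer` keeps quoting correct, so cells containing commas (like the "evil, ipsum, with commas," cell above) stay intact.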
Is there an issue with using xargs? csvkit's philosophy is to compose commands with pipes and to not invent new commands when existing ones suffice.
csvkit is in maintenance mode, so only existing commands are being maintained. Additional commands would have to be created as independent Python packages.
Hm, I'm not an expert and there's probably a way to fix it, but I could not get it to run with:
$ cat test.csv
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,"fdsa
ipsum",,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5
At the same time, it revealed that it may not be the best idea either, as it tries to execute cell content as code:
$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
46: ipsum
69: ipsum
77: ipsum
,,,,ipsum,,,,,,,,,,4,,,,,,,,,,41,,,,fdsa
sh: 3: ipsum,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5: not found
StopIteration:
You just need to quote the replstr in xargs:
csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
That again returns invalid results for me:
$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
46: ipsum
69: ipsum
77: ipsum
5: ipsum
(the file only has 75 columns)
Ah, yeah, you need to use single quotes:
csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c "echo '_' | csvcut -n" | grep ipsum
Even more robust:
csvgrep -m ipsum -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep ipsum
The last one does the trick. Thanks for adding it to the docs – I will need to look this up every time now.
I still believe that the `-N` option (for the ~10 LOC it adds) was more elegant, probably has many more use cases (in addition to the one examined here), and is not completely out of place in csvcut (one could ask why `-n` is there and not in some csvinfo tool, too).
This would have been better as an issue, but I'll close this PR for now, and perhaps if there is any interest from other users they will still find the PR discussion.
I agree with you, @jpmckinney, that this would have been better as an issue. Thanks for your time!
I've frequently run into the problem of facing wide CSV files where there is some interesting string/text somewhere in a long line of empty fields or numbers.
Example:
We want to approach this without manually counting the commas.
The new addition provides this, and in combination with grep the result will be highlighted. This is not a breaking change, as it keeps all columns where they are. It alters all the fields, but its sole purpose is exploratory.
Also, please let me know if you know a better (or any) solution for achieving the above without this modification.