Open wesley-dean-flexion opened 2 months ago
Thank you for your interest!
There is an open issue to change csvclean's behavior somewhat, so you might have different ideas on how to integrate with that proposal (or you might have new proposals): https://github.com/wireservice/csvkit/issues/195#issuecomment-275879192 I think this issue would simplify your integration - let me know what you think and/or if any other changes would help.
@jpmckinney Very nice! Yes, those would be terrific.
I'm reticent to add more work to your plate and the idea of changing an interface that others may be reliant upon makes me nervous. That's why I was thinking of a wrapper as opposed to changing the interface.
But yeah, absolutely, those changes would be huge.
I'm reticent to add more work to your plate and the idea of changing an interface that others may be reliant upon makes me nervous. That's why I was thinking of a wrapper as opposed to changing the interface.
Well, #195 changes the interface, so it will require a major version – if there are any other interface changes we might as well do them at the same time, as I won't be making another major version in a long time, I suspect.
I saw that there's an image (looks like a telephone pole) that only seems to exist in the repository configuration and not in the repository itself. For MegaLinter, one can provide a link to a logo; is there a link to an image you would like used for the MegaLinter documentation (link to docs)?
I think https://avatars.githubusercontent.com/u/17111824 is all we have. Thank you :)
Okay, 2.0.0 is ready to be released. You can read the changes at https://csvkit.readthedocs.io/en/latest/changelog.html#unreleased
If no other suggestions, I'll date the changelog and release.
I think https://avatars.githubusercontent.com/u/17111824 is all we have. Thank you :)
Absolutely perfect. Thank you! :smile:
from the (un)release notes:
csvclean no longer fixes errors by default. Opt in to the original behavior using the --join-short-rows option.
So I understand -- and this is me being dense and is not a reflection on you -- the following will be true:
[basename]_out.csv
while leaving the base file unchanged--dry-run
--join-short-rows
is passedAgain, so I understand correctly...
A. if modifying options other than --join-short-rows
(e.g., --header-normalize-space
, --fill-short-rows
, etc.) are used, do those trigger writing?
B. I didn't get into agate's reader / writer functionality implementation, so.. I'm lazy. Are the entire contents of the CSV file read into memory and the file handle closed before the reader returns? I'm wondering because I'm curious to see if something like this would be problematic:
filename="/path/to/the/file.csv"
csvclean "$filename" --join-short-rows > "$filename"
I'm thinking that it would be safer to do something like:
filename="/path/to/the/file.csv"
tmp_fileame="$(mktemp)"
csvclean "$filename" --join-short-rows > "$tmp_filename"
mv -f "$tmp_filename" "$filename"
(so the new file writes don't clobber the original file if it hasn't been completely read)
C. will running csvclean
"normally" (e.g., no flags) will still write out errors that are detected (i.e., like 1.5.0's --dry-run
)?
If so, does it still use the previous format? (sample follows)
$ csvclean --dry-run app/test_data.csv
Line 1: Expected 4 columns, found 5 columns
Line 2: Expected 4 columns, found 5 columns
Line 3: Expected 4 columns, found 5 columns
Line 4: Expected 4 columns, found 5 columns
Line 5: Expected 4 columns, found 5 columns
Line 6: Expected 4 columns, found 5 columns
(yes, I know you know what the output looked like.. I'm putting this more for my own reference and for anyone who comes along after wanting to see the history of the change)
D. follow-up, can I still use this regex to capture errors?:
^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)
(\1
would be the line number, \2
would be the error message)
I was going to ask question E but decided not to because it can be done outside of csvclean
. Question E was going to be, "can we add the filename after the line number and before the colon?
filename="/path/to/filename.csv"
csvclean --join-short-rows "$filename" | sed -Ene "s|^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)|Line \1 $filename : \2|p'
..just in case anyone else's IDE gets confused when parsing warnings..
csvclean no longer fixes errors by default. Opt in to the original behavior using the --join-short-rows option.
- the 1.5.0 default was to fix errors
- when fixing errors in 1.5.0, it would write to
[basename]_out.csv
while leaving the base file unchanged- the 2.0.0 default is not to fix errors (i.e., don't write anything)
- because the default in 2.0.0 is not to write anything, there's no need for
--dry-run
- csvclean will only fix errors it encounters when
--join-short-rows
is passed
... or when --header-normalize-space
or --fill-short-rows
are passed.
- csvclean writes updated contend to STDOUT and errors to STDERR; the original files -- even if flawed -- are not touched even when correcting errors
Yes, all correct. This brings csvclean in line with other csvkit tools, which all write to standard output.
Again, so I understand correctly...
A. if modifying options other than
--join-short-rows
(e.g.,--header-normalize-space
,--fill-short-rows
, etc.) are used, do those trigger writing?
Those also only write to standard output. The only thing that writes to a file is when using in2csv with --write-sheets
(since each sheet needs to be written separately).
B. I didn't get into agate's reader / writer functionality implementation, so.. I'm lazy. Are the entire contents of the CSV file read into memory and the file handle closed before the reader returns?
Tools read line by line wherever possible. The tools (or scenarios) that do or don't are listed here: https://csvkit.readthedocs.io/en/latest/contributing.html#streaming-versus-buffering
I'm wondering because I'm curious to see if something like this would be problematic:
filename="/path/to/the/file.csv" csvclean "$filename" --join-short-rows > "$filename"
Yes, that would be problematic.
I'm thinking that it would be safer to do something like:
filename="/path/to/the/file.csv" tmp_fileame="$(mktemp)" csvclean "$filename" --join-short-rows > "$tmp_filename" mv -f "$tmp_filename" "$filename"
Yes, I think you have to do it that way. There are a bunch of options at https://backreference.org/2011/01/29/in-place-editing-of-files/ https://serverfault.com/q/135507 and https://unix.stackexchange.com/a/204378 but I think the simple mv
is easiest to understand and maintain.
C. will running
csvclean
"normally" (e.g., no flags) will still write out errors that are detected (i.e., like 1.5.0's--dry-run
)?
It will write out line length errors, like before. --empty-columns
opts in to additional errors about empty columns (it's not included by default, because #962 and #426 did not have a lot of demand. I can maybe add a --all-checks
option to opt-in to all checks (only 2 now, but maybe more later).
If so, does it still use the previous format? (sample follows)
$ csvclean --dry-run app/test_data.csv Line 1: Expected 4 columns, found 5 columns Line 2: Expected 4 columns, found 5 columns Line 3: Expected 4 columns, found 5 columns Line 4: Expected 4 columns, found 5 columns Line 5: Expected 4 columns, found 5 columns Line 6: Expected 4 columns, found 5 columns
No, it uses the format that was used for the _err.csv
file that it used to write (where a,b,c are the columns from the original file):
line_number,msg,a,b,c
1,"Expected 3 columns, found 2 columns",1,cat
2,"Expected 3 columns, found 2 columns",dog,c
D. follow-up, can I still use this regex to capture errors?:
^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)
(
\1
would be the line number,\2
would be the error message)
No, you'll have to adjust to something like ^(\d+),"([^"]+)"
can we add the filename after the line number and before the colon?
I can't get your command to run (sed
on macOS is not the same as on Linux), but if you mean "can the filename be added to standard error", then yes, we can add an option like that.
csvstack, for example, has --filenames
to add the names of the CSV files that were stacked. For csvclean, it only works on one file at a time, but we can maybe add --filename
which accepts a value (in the case the command receives stdin, or where the user for some reason wants to use a different value), and if that value is -
then it defaults to the filename (or stdin
if it's standard input).
I can't get your command to run
yeah, sorry about that.
can the filename be added
Rock on. That would be nice. Thank you.
... or when --header-normalize-space or --fill-short-rows are passed.
Cool. I got the --join-short-rows
from the release notes; I suspected your response was the case but wanted to make sure
Tools read line by line wherever possible.
Yeah, I suspected that as well (hence the mktemp
swap in a subsequent comment). Slurping the whole thing into memory seemed to be unlikely and option-limiting. Thank you for confirming.
I can maybe add a --all-checks option to opt-in to all checks (only 2 now, but maybe more later).
Yes, please. That would be most excellent.
No, it uses the format that was used for the _err.csv
Cool. Got it. Thank you.
No, you'll have to adjust to something like
^(\d+),"([^"]+)"
Yuppers.
--enable-all-checks
and --label
added.
I made some other changes. Please see the updated draft release notes.
I made some other changes
Thank you so, so much. Your dedication to csvkit and open source is admirable and exemplary.
@jpmckinney I just read through the 2.0.0 release notes and I'm very excited.
If someone would have told me when I was a little kid that I would get this excited about a release of a tool that cleans up files on a computer that represent data as a series of newline-terminated rows with individual values separated into columns by commas...
Yay :D 2.0.0 is released! 🎉
Hello!
First of, a great many thanks for your work with csvkit! I appreciate you and your contributions to making this world a better place! (seriously!!)
I'm working on adding
csvclean
into MegaLinter as a new linter. MegaLinter already includes a bunch of linters for various programming languages, tools, and file formats, including JSON, TOML, XML, and YAML. The effort is being tracked in oxsecurity/megalinter#3493Typically, when MegaLinter runs, it'll run linters to find issues and report the findings. For linters that support it, it can also run with
APPLY_FIXES
where a tool (e.g., shfmt for reformatting Bash scripts) will update the source file and push the results back to the repository either as a commit on a branch or in a new PR.I'm planning on incorporating that functionality with the use of
--dry-run
when not fixing things (e.g., whenAPPLY_FIXES
is false, usecsvclean --dry-run
; when true, usecsvclean
without the extra flag).When
csvclean
is run with--dry-run
on a file with known issues, it looks like this:(all sent to STDOUT)
When run without
--dry-run
, it looks like this:with the errors sent to
acronyms_err.csv
and the updated file written toacronyms_out.csv
. This is all expected given the documentation.When MegaLinter runs, it can look for errors with a regex (e.g.,
(?i)^line[[:space:]]*([[:digit:]]*):
). I could use a regex that would match either (e.g.,(?i)^(line[[:space:]]*([[:digit:]]*):|([[:digit:]]*)[[:space:]]*errors?[[:space:]]logged)
on STDOUT (i.e., with or without--dry-run
would still match if errors were found). In all cases, regardless of whether there were errors or not,csvclean
return a0
response.I can wrap
csvclean
with some extra shell magic:(return non-zero if
csvclean
throws an error; if it's successful, return non-zero when the_err.csv
file exists (so, if no_err.csv
file exists, return 0))But really where I'm focusing is having a unified output (i.e., the same regardless of the presence of
--dry-run
) that I could easily parse on the other side. Something like:(I just put that together in this issue -- I don't know if it's syntactically correct or if it'll even run, but hopefully that's sufficient to get the point across)
So... my questions...