Integrating with MegaLinter

wesley-dean-flexion commented 2 months ago

Hello!

First of, a great many thanks for your work with csvkit! I appreciate you and your contributions to making this world a better place! (seriously!!)

I'm working on adding csvclean into MegaLinter as a new linter. MegaLinter already includes a bunch of linters for various programming languages, tools, and file formats, including JSON, TOML, XML, and YAML. The effort is being tracked in oxsecurity/megalinter#3493

Typically, when MegaLinter runs, it'll run linters to find issues and report the findings. For linters that support it, it can also run with APPLY_FIXES where a tool (e.g., shfmt for reformatting Bash scripts) will update the source file and push the results back to the repository either as a commit on a branch or in a new PR.

I'm planning on incorporating that functionality with the use of --dry-run when not fixing things (e.g., when APPLY_FIXES is false, use csvclean --dry-run ; when true, use csvclean without the extra flag).

When csvclean is run with --dry-run on a file with known issues, it looks like this:

$ csvclean --dry-run acronyms.csv
Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns
Line 1241: Expected 4 columns, found 8 columns
Line 1242: Expected 4 columns, found 2 columns
Line 1307: Expected 4 columns, found 3 columns

(all sent to STDOUT)

When run without --dry-run, it looks like this:

$ csvclean acronyms.csv
5 errors logged to acronyms_err.csv

with the errors sent to acronyms_err.csv and the updated file written to acronyms_out.csv . This is all expected given the documentation.

When MegaLinter runs, it can look for errors with a regex (e.g., (?i)^line[[:space:]]*([[:digit:]]*):). I could use a regex that would match either (e.g., (?i)^(line[[:space:]]*([[:digit:]]*):|([[:digit:]]*)[[:space:]]*errors?[[:space:]]logged) on STDOUT (i.e., with or without --dry-run would still match if errors were found). In all cases, regardless of whether there were errors or not, csvclean return a 0 response.

I can wrap csvclean with some extra shell magic:

csvclean filename.csv && [ ! -e filename_err.csv ]

(return non-zero if csvclean throws an error; if it's successful, return non-zero when the _err.csv file exists (so, if no _err.csv file exists, return 0))

But really where I'm focusing is having a unified output (i.e., the same regardless of the presence of --dry-run) that I could easily parse on the other side. Something like:


my_csvclean() {

  declare -i return_value=0

  APPLY_FIXES="${APPLY_FIXES:-1}"

  filename="$1"
  shift

  if [ -z "$filename" ] ; then
    echo "'${filename}' doesn't exist" 1>&2
    return 1
  fi

  # get the errors, if any, and store the result code for later
  csvclean --dry-run "$filename" | grep -Eie '^line[[:space:]]+[[:digit:]]*:'
  return_value=$?

  # if we're asked to fix stuff and there's stuff to fix, then
  # apply the fixes and replace the original with the fixed version
  if [ $APPLY_FIXES -eq 0 ] \
  && [ $return_value  - eq 0 ] ; then
    csvclean "$filename"
    mv -f "${filename%%.*}_out.csv" "${filename}"
  fi

  return $return_value
}

(I just put that together in this issue -- I don't know if it's syntactically correct or if it'll even run, but hopefully that's sufficient to get the point across)

So... my questions...

Do you have any thoughts or reactions?
Do you have any suggestions on how to (more-) cleanly integrate this with MegaLinter?
I saw that there's an image (looks like a telephone pole) that only seems to exist in the repository configuration and not in the repository itself. For MegaLinter, one can provide a link to a logo; is there a link to an image you would like used for the MegaLinter documentation (link to docs)?

jpmckinney commented 2 months ago

Thank you for your interest!

There is an open issue to change csvclean's behavior somewhat, so you might have different ideas on how to integrate with that proposal (or you might have new proposals): https://github.com/wireservice/csvkit/issues/195#issuecomment-275879192 I think this issue would simplify your integration - let me know what you think and/or if any other changes would help.

wesley-dean-flexion commented 2 months ago

@jpmckinney Very nice! Yes, those would be terrific.

I'm reticent to add more work to your plate and the idea of changing an interface that others may be reliant upon makes me nervous. That's why I was thinking of a wrapper as opposed to changing the interface.

But yeah, absolutely, those changes would be huge.

jpmckinney commented 2 months ago

I'm reticent to add more work to your plate and the idea of changing an interface that others may be reliant upon makes me nervous. That's why I was thinking of a wrapper as opposed to changing the interface.

Well, #195 changes the interface, so it will require a major version – if there are any other interface changes we might as well do them at the same time, as I won't be making another major version in a long time, I suspect.

jpmckinney commented 2 months ago

I saw that there's an image (looks like a telephone pole) that only seems to exist in the repository configuration and not in the repository itself. For MegaLinter, one can provide a link to a logo; is there a link to an image you would like used for the MegaLinter documentation (link to docs)?

I think https://avatars.githubusercontent.com/u/17111824 is all we have. Thank you :)

jpmckinney commented 2 months ago

Okay, 2.0.0 is ready to be released. You can read the changes at https://csvkit.readthedocs.io/en/latest/changelog.html#unreleased

If no other suggestions, I'll date the changelog and release.

wesley-dean-flexion commented 2 months ago

I think https://avatars.githubusercontent.com/u/17111824 is all we have. Thank you :)

Absolutely perfect. Thank you! :smile:

wesley-dean-flexion commented 2 months ago

from the (un)release notes:

csvclean no longer fixes errors by default. Opt in to the original behavior using the --join-short-rows option.

So I understand -- and this is me being dense and is not a reflection on you -- the following will be true:

the 1.5.0 default was to fix errors
when fixing errors in 1.5.0, it would write to [basename]_out.csv while leaving the base file unchanged
the 2.0.0 default is not to fix errors (i.e., don't write anything)
because the default in 2.0.0 is not to write anything, there's no need for --dry-run
csvclean will only fix errors it encounters when --join-short-rows is passed
csvclean writes updated contend to STDOUT and errors to STDERR; the original files -- even if flawed -- are not touched even when correcting errors

Again, so I understand correctly...

A. if modifying options other than --join-short-rows (e.g., --header-normalize-space, --fill-short-rows, etc.) are used, do those trigger writing? B. I didn't get into agate's reader / writer functionality implementation, so.. I'm lazy. Are the entire contents of the CSV file read into memory and the file handle closed before the reader returns? I'm wondering because I'm curious to see if something like this would be problematic:

filename="/path/to/the/file.csv"
csvclean "$filename" --join-short-rows > "$filename"

I'm thinking that it would be safer to do something like:

filename="/path/to/the/file.csv"
tmp_fileame="$(mktemp)"
csvclean "$filename" --join-short-rows > "$tmp_filename"
mv -f "$tmp_filename" "$filename"

(so the new file writes don't clobber the original file if it hasn't been completely read)

C. will running csvclean "normally" (e.g., no flags) will still write out errors that are detected (i.e., like 1.5.0's --dry-run)?

If so, does it still use the previous format? (sample follows)

$ csvclean --dry-run app/test_data.csv 
Line 1: Expected 4 columns, found 5 columns
Line 2: Expected 4 columns, found 5 columns
Line 3: Expected 4 columns, found 5 columns
Line 4: Expected 4 columns, found 5 columns
Line 5: Expected 4 columns, found 5 columns
Line 6: Expected 4 columns, found 5 columns

(yes, I know you know what the output looked like.. I'm putting this more for my own reference and for anyone who comes along after wanting to see the history of the change)

D. follow-up, can I still use this regex to capture errors?:

^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)

(\1 would be the line number, \2 would be the error message)

wesley-dean-flexion commented 2 months ago

I was going to ask question E but decided not to because it can be done outside of csvclean. Question E was going to be, "can we add the filename after the line number and before the colon?

filename="/path/to/filename.csv"
csvclean --join-short-rows "$filename" | sed -Ene "s|^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)|Line \1 $filename : \2|p'

..just in case anyone else's IDE gets confused when parsing warnings..

jpmckinney commented 2 months ago

csvclean no longer fixes errors by default. Opt in to the original behavior using the --join-short-rows option.

the 1.5.0 default was to fix errors

when fixing errors in 1.5.0, it would write to [basename]_out.csv while leaving the base file unchanged

the 2.0.0 default is not to fix errors (i.e., don't write anything)

because the default in 2.0.0 is not to write anything, there's no need for --dry-run

csvclean will only fix errors it encounters when --join-short-rows is passed

... or when --header-normalize-space or --fill-short-rows are passed.

csvclean writes updated contend to STDOUT and errors to STDERR; the original files -- even if flawed -- are not touched even when correcting errors

Yes, all correct. This brings csvclean in line with other csvkit tools, which all write to standard output.

Again, so I understand correctly...

A. if modifying options other than --join-short-rows (e.g., --header-normalize-space, --fill-short-rows, etc.) are used, do those trigger writing?

Those also only write to standard output. The only thing that writes to a file is when using in2csv with --write-sheets (since each sheet needs to be written separately).

B. I didn't get into agate's reader / writer functionality implementation, so.. I'm lazy. Are the entire contents of the CSV file read into memory and the file handle closed before the reader returns?

Tools read line by line wherever possible. The tools (or scenarios) that do or don't are listed here: https://csvkit.readthedocs.io/en/latest/contributing.html#streaming-versus-buffering

I'm wondering because I'm curious to see if something like this would be problematic:
filename="/path/to/the/file.csv"
csvclean "$filename" --join-short-rows > "$filename"

Yes, that would be problematic.

I'm thinking that it would be safer to do something like:

filename="/path/to/the/file.csv"
tmp_fileame="$(mktemp)"
csvclean "$filename" --join-short-rows > "$tmp_filename"
mv -f "$tmp_filename" "$filename"

Yes, I think you have to do it that way. There are a bunch of options at https://backreference.org/2011/01/29/in-place-editing-of-files/ https://serverfault.com/q/135507 and https://unix.stackexchange.com/a/204378 but I think the simple mv is easiest to understand and maintain.

C. will running csvclean "normally" (e.g., no flags) will still write out errors that are detected (i.e., like 1.5.0's --dry-run)?

It will write out line length errors, like before. --empty-columns opts in to additional errors about empty columns (it's not included by default, because #962 and #426 did not have a lot of demand. I can maybe add a --all-checks option to opt-in to all checks (only 2 now, but maybe more later).

If so, does it still use the previous format? (sample follows)

$ csvclean --dry-run app/test_data.csv 
Line 1: Expected 4 columns, found 5 columns
Line 2: Expected 4 columns, found 5 columns
Line 3: Expected 4 columns, found 5 columns
Line 4: Expected 4 columns, found 5 columns
Line 5: Expected 4 columns, found 5 columns
Line 6: Expected 4 columns, found 5 columns

No, it uses the format that was used for the _err.csv file that it used to write (where a,b,c are the columns from the original file):

line_number,msg,a,b,c
1,"Expected 3 columns, found 2 columns",1,cat
2,"Expected 3 columns, found 2 columns",dog,c

D. follow-up, can I still use this regex to capture errors?:
^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)
(\1 would be the line number, \2 would be the error message)

No, you'll have to adjust to something like ^(\d+),"([^"]+)"

jpmckinney commented 2 months ago

can we add the filename after the line number and before the colon?

I can't get your command to run (sed on macOS is not the same as on Linux), but if you mean "can the filename be added to standard error", then yes, we can add an option like that.

csvstack, for example, has --filenames to add the names of the CSV files that were stacked. For csvclean, it only works on one file at a time, but we can maybe add --filename which accepts a value (in the case the command receives stdin, or where the user for some reason wants to use a different value), and if that value is - then it defaults to the filename (or stdin if it's standard input).

wesley-dean-flexion commented 2 months ago

I can't get your command to run

yeah, sorry about that.

can the filename be added

Rock on. That would be nice. Thank you.

... or when --header-normalize-space or --fill-short-rows are passed.

Cool. I got the --join-short-rows from the release notes; I suspected your response was the case but wanted to make sure

Tools read line by line wherever possible.

Yeah, I suspected that as well (hence the mktemp swap in a subsequent comment). Slurping the whole thing into memory seemed to be unlikely and option-limiting. Thank you for confirming.

I can maybe add a --all-checks option to opt-in to all checks (only 2 now, but maybe more later).

Yes, please. That would be most excellent.

No, it uses the format that was used for the _err.csv

Cool. Got it. Thank you.

No, you'll have to adjust to something like ^(\d+),"([^"]+)"

Yuppers.

jpmckinney commented 2 months ago

--enable-all-checks and --label added.

I made some other changes. Please see the updated draft release notes.

wesley-dean-flexion commented 2 months ago

I made some other changes

Thank you so, so much. Your dedication to csvkit and open source is admirable and exemplary.

wesley-dean-flexion commented 2 months ago

@jpmckinney I just read through the 2.0.0 release notes and I'm very excited.

If someone would have told me when I was a little kid that I would get this excited about a release of a tool that cleans up files on a computer that represent data as a series of newline-terminated rows with individual values separated into columns by commas...

jpmckinney commented 2 months ago

Yay :D 2.0.0 is released! 🎉

wireservice / csvkit

Integrating with MegaLinter #1239