wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License
5.9k stars 605 forks source link

Add "\N" as a null value #1208

Closed steve-estes closed 8 months ago

steve-estes commented 9 months ago

The handling / detection of nulls within csvkit (not sure if it's properly within Agate or not) includes a list of values that are interpreted as nulls. You can see them in the parameters list for --blank, which turns off the default behavior that

coerce[s] empty, "na", "n/a", "none", "null", "." strings to NULL values

I would like this list to include "\N", which is something commonly done in old DB systems, and sometimes written out to CSV when the producers of the CSV wish to distinguish between empty strings and actual nulls. This could be as simple as a one-liner in the code, I don't know, but either way it'd be handy.

jpmckinney commented 9 months ago

This is easy to do if using agate directly. Is that an option for you?

Otherwise, we would need a command-line argument to specify null values.

tacman commented 8 months ago

Indeed, "\N" is used in the movie database in the IMDB datasets: https://datasets.imdbws.com/

My current solution is to use sed as a pre-processor:

sed -i  "s/\\\N//g" $filename

But as you mention, that doesn't allow for distinguishing between nulls and blanks. And it's an extra step, so certainly having csvkit recognize it would be better.

steve-estes commented 8 months ago

Thanks for such a fast turnaround - and so flexible a solution.