wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License
5.9k stars 605 forks source link

Add "Values are unique" to --csv #1217

Closed tacman closed 8 months ago

tacman commented 8 months ago

It would be valuable to me to know if the all the values are unique. There is a number of unique values, and at the of the csvstat command there is

 Row count: 1856
csvstat data/subtitles_day.tsv 
  1. "IDSubtitle"

    Type of data:          Number
    Contains null values:  False
    Unique values:         1856
    Smallest value:        9,747,231
    Largest value:         9,749,339
    Sum:                   18,092,851,467
    Mean:                  9,748,303.592
    Median:                9,748,352.5
    StDev:                 628.279
    Most common values:    9,747,231 (1x)
                           9,747,232 (1x)
                           9,747,233 (1x)
                           9,747,234 (1x)
                           9,747,235 (1x)

That's lost with --csv, along with the frequency, so there's no easy way to know if the values are unique. For my purposes, I'm trying to find the primary key from a set of files, so knowing that the values are unique would be enormously helpful.

csvstat data/subtitles_day.tsv --csv | csvjson | jq

 {
    "column_id": 1,
    "column_name": "IDSubtitle",
    "type": "Number",
    "nulls": false,
    "unique": 1856,
    "min": "9,747,231",
    "max": "9,749,339",
    "sum": 18092851467,
    "mean": 9748303.592,
    "median": 9748352.5,
    "stdev": 628.279,
    "len": null,
    "freq": "9747231, 9747232, 9747233, 9747234, 9747235"
  },

If the frequency count were included in the "freq" key, I could parse that and see if the top one was just 1, but adding "Values are unique" would be better. Of course, to determine primary key I'd also check "Contains null values".

jpmckinney commented 8 months ago

In HEAD, I instead added a "Non-null values" statistic (also appears in the --csv output). This information is useful for this use case as well as others.

You can thus compare non-null values to unique values. Note that if the column contains nulls, then NULL counts as one additional unique value.

tacman commented 8 months ago

Thanks!

I've been installing this via "sudo apt install csvkit" but I think I need a ppm in order to get the latest version. Is one available?

I've had trouble following the installation instructions on Ubuntu via pip.

jpmckinney commented 8 months ago

I only manage the PyPI package. Packages in Linux distributions are created independently.

I'll make a new release of the PyPI package shortly.