add --json option to csvstats with frequency details

wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.

https://csvkit.readthedocs.io

MIT License

5.98k stars 608 forks source link

add --json option to csvstats with frequency details #1216

Closed tacman closed 11 months ago

tacman commented 11 months ago

First, a big thanks for publishing csvkit! I only wish I had discovered it earlier.

I like the --csv option for csvstats, and would love to see a --json (or --format=json|csv|text) option. That would allow piping the results to another script, since importing JSON is much easier and faster than importing csv.

I know there are various scripts out there for csvtojson, but as I'm integrating this toolkit into our workflow I'd like to avoid having another tool to install.

Thanks for your consideration.

tacman commented 11 months ago

DOH! This is what I was looking for:

csvstat data.csv --csv | csvjson | jq
[
  {
    "column_id": 1,
    "column_name": "state",
    "type": "Text",
    "nulls": false,
    "unique": 1,
    "min": null,
    "max": null,
    "sum": null,
    "mean": null,
    "median": null,
    "stdev": null,
    "len": 2,
    "freq": "NE"
  },
  {
    "column_id": 2,
    "column_name": "county",
    "type": "Text",

tacman commented 11 months ago

Actually, I'm going to reopen this, and ask that you consider adding a --json option that includes the frequency stats.

csvstat data/subtitles_day.tsv 

15. "movieKind"

    Type of data:          Text
    Contains null values:  False
    Unique values:         2
    Longest value:         5 characters
    Most common values:    tv (1443x)
                           movie (413x)

csvstat data/subtitles_day.tsv | csvjson | jq 

{
    "column_id": 15,
    "column_name": "movieKind",
    "type": "Text",
    "nulls": false,
    "unique": 2,
    "min": null,
    "max": null,
    "sum": null,
    "mean": null,
    "median": null,
    "stdev": null,
    "len": 5,
    "freq": "tv, movie"
  },
  {

So we lost the frequency data that exists in the text report. Yeah, it's a pain to add that in CSV, but it'd be natural in JSON, e.g.

{ "freq": { "tv": 1443, "movie": 413 }}

jpmckinney commented 11 months ago

Added

csvstat --json examples/realdata/ks_1033_data.csv -i 2

...
  {
    "column_id": 14,
    "column_name": "federal_supply_class_name",
    "type": "Text",
    "nulls": false,
    "nonnulls": 1575,
    "unique": 30,
    "len": 63,
    "freq": [
      {
        "value": "Guns, through 30 mm",
        "count": 1427
      },
      {
        "value": "Trucks and Truck Tractors, Wheeled",
        "count": 33
      },
      {
        "value": "Vehicular Cab, Body, and Frame Structural Components",
        "count": 28
      },
      {
        "value": "Surface Use Explosive Ordnance Disposal Tools and Equipment",
        "count": 14
      },
      {
        "value": "Medical and Surgical Instruments, Equipment, and Supplies",
        "count": 12
      }
    ]
  }
]