Support "Integer" data types when data has no fractional numbers (floats, decimals)

tacman commented 1 year ago

When the values are all integer, it would be nice to know that, instead of just "Number"

The daily subtitle file from opensubtitles.org is less than 100k is size, about 2000 lines, and is a nice dataset for showing off csvkit.

curl "http://dl.opensubtitles.org/addons/export/subtitles_day.txt.gz" |  gunzip -c > subtitles.txt

csvcut -c IDSubtitle,MovieYear subtitles.txt -t | csvstat -y 0
  1. "IDSubtitle"

    Type of data:          Number
    Contains null values:  False
    Unique values:         1856
    Smallest value:        9,747,231
    Largest value:         9,749,339
    Sum:                   18,092,851,467
    Mean:                  9,748,303.592
    Median:                9,748,352.5
    StDev:                 628.279
    Most common values:    9,747,231 (1x)
                           9,747,232 (1x)
                           9,747,233 (1x)
                           9,747,234 (1x)
                           9,747,235 (1x)

  2. "MovieYear"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         76
    Smallest value:        1,931
    Largest value:         2,023
    Sum:                   3,619,369
    Mean:                  2,016.362
    Median:                2,022
    StDev:                 14.175
    Most common values:    2,023 (861x)
                           2,016 (115x)
                           2,015 (95x)
                           2,021 (93x)
                           2,019 (85x)

Row count: 1856

csvsql subtitles.txt 
CREATE TABLE subtitles (
    "IDSubtitle" DECIMAL NOT NULL, 
    "MovieName" VARCHAR NOT NULL, 
    "MovieYear" DECIMAL, 
    "LanguageName" VARCHAR NOT NULL, 
    "ISO639" VARCHAR NOT NULL, 
    "SubAddDate" TIMESTAMP, 
    "ImdbID" DECIMAL NOT NULL, 
    "SubFormat" VARCHAR NOT NULL, 
    "SubSumCD" DECIMAL NOT NULL, 
    "MovieReleaseName" VARCHAR, 
    "MovieFPS" DECIMAL NOT NULL, 
    "SeriesSeason" DECIMAL, 
    "SeriesEpisode" DECIMAL, 
    "SeriesIMDBParent" DECIMAL, 
    "MovieKind" VARCHAR NOT NULL, 
    "URL" VARCHAR NOT NULL
);

IDSubtitle, ImdbID and MovieYear would be better represented in the database as integer values, rather than decimal.

Thanks for your consideration.

jpmckinney commented 1 year ago

Transferred to agate-sql, as it's responsible for the csvsql database schema (DECIMAL vs INTEGER).

However, agate has only a number type, which doesn't track decimals observed.

So, agate probably needs to change first.

Ah, so looking at agate's history, I see that there had been an IntColumn, but it was removed https://github.com/wireservice/agate/issues/64 The explanation is in https://github.com/wireservice/agate/issues/35. Some of the relevant discussion:

I think it makes more sense to just have NumberColumn treat everything as Decimal type

The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.

I am sensitive to the "no unnecessary rounding/conversion" issues, however, I'm less worried about that with a purely analytical library than I would be if there was a presentation component to this. Eliminating the extra type has one major benefit: you couldn't specify the wrong type when computing a new column. So for instance, if you create a column by dividing two integers, you need to make sure you specify the new column is Decimal. I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally as decimals and it's only at presentation time that something is "int-ifed".

the only reason I've been able to come up with not to eliminate IntColumn is that sometimes it might be more performant. That's not a good enough reason, so I'll probably pull the trigger on this today.

So, I think you'll just have to run an ALTER TABLE if you want INTEGER columns.

Reintroducing an integer type in agate is too major. As described in the original issue, the main downside could be performance, which in SQL is that DECIMAL takes up more space than INTEGER, so if you're doing huge computations, a DECIMAL-based table might not fit in memory where an INTEGER-based table can.

But... csvkit is designed for relatively small data. For big data, you don't want to use Python at all (unless you're handing all the computation to C, like numpy does, etc.). qsv uses Rust and has a to command to convert to SQL. I can't tell if it handles integers, but it might be worth looking into if you do need the columns to be integers.

tacman commented 1 year ago

Bummer. :-(

What about another boolean (like "Contains nulls"), that is "Contains fractions" that would only be true if type were numeric and there was at least one data value wasn't a whole number. Then users can create their own SQL, but at least know which fields were actually integers.

Combined with a "Unique" boolean, the developer could determine the appropriate primary key (and unique indexes).

jpmckinney commented 1 year ago

Sure, agate has a MaxPrecision aggregation, so I now added that to the output.

If it's 0, then there are only whole numbers.

jpmckinney commented 1 year ago

Looks like there was an earlier open issue here: https://github.com/wireservice/csvkit/issues/1070

wireservice / agate-sql

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43