Closed tacman closed 1 year ago
Transferred to agate-sql, as it's responsible for the csvsql
database schema (DECIMAL
vs INTEGER
).
However, agate has only a number type, which doesn't track decimals observed.
So, agate probably needs to change first.
Ah, so looking at agate's history, I see that there had been an IntColumn, but it was removed https://github.com/wireservice/agate/issues/64 The explanation is in https://github.com/wireservice/agate/issues/35. Some of the relevant discussion:
I think it makes more sense to just have NumberColumn treat everything as Decimal type
The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.
I am sensitive to the "no unnecessary rounding/conversion" issues, however, I'm less worried about that with a purely analytical library than I would be if there was a presentation component to this. Eliminating the extra type has one major benefit: you couldn't specify the wrong type when computing a new column. So for instance, if you create a column by dividing two integers, you need to make sure you specify the new column is Decimal. I can see people getting this wrong and it creating errors.
As best I can discern, Excel, OO, Google, etc all treat numbers internally as decimals and it's only at presentation time that something is "int-ifed".
the only reason I've been able to come up with not to eliminate IntColumn is that sometimes it might be more performant. That's not a good enough reason, so I'll probably pull the trigger on this today.
So, I think you'll just have to run an ALTER TABLE
if you want INTEGER columns.
Reintroducing an integer type in agate is too major. As described in the original issue, the main downside could be performance, which in SQL is that DECIMAL takes up more space than INTEGER, so if you're doing huge computations, a DECIMAL-based table might not fit in memory where an INTEGER-based table can.
But... csvkit is designed for relatively small data. For big data, you don't want to use Python at all (unless you're handing all the computation to C, like numpy does, etc.). qsv uses Rust and has a to
command to convert to SQL. I can't tell if it handles integers, but it might be worth looking into if you do need the columns to be integers.
Bummer. :-(
What about another boolean (like "Contains nulls"), that is "Contains fractions" that would only be true if type were numeric and there was at least one data value wasn't a whole number. Then users can create their own SQL, but at least know which fields were actually integers.
Combined with a "Unique" boolean, the developer could determine the appropriate primary key (and unique indexes).
Sure, agate has a MaxPrecision aggregation, so I now added that to the output.
If it's 0, then there are only whole numbers.
Looks like there was an earlier open issue here: https://github.com/wireservice/csvkit/issues/1070
When the values are all integer, it would be nice to know that, instead of just "Number"
The daily subtitle file from opensubtitles.org is less than 100k is size, about 2000 lines, and is a nice dataset for showing off csvkit.
IDSubtitle, ImdbID and MovieYear would be better represented in the database as integer values, rather than decimal.
Thanks for your consideration.