Open yanniskatsaros opened 5 years ago
Since dataclasses are available as a Python package, users are not forced to upgrade to Python 3.7 :beers:.
@Glyphack good point! Glad to know that this change won't limit users to only Python 3.7+
I have a question that may help me resolve this issue. I have not worked with pandas, so what does pandas have to do with dataclasses or namedtuples? Is there any data structure in pandas that can be replaced with these?
Here's some background on pandas and the DataFrame object:
The pandas.DataFrame object is essentially a way to represent tabular ("tidy") data that can be accessed by column name, filtered by a particular value, etc. It's very popular in the data science community for working with tabular data in memory to explore, manipulate, and visualize it.
Right now, using a DataFrame is convenient for faro because pandas has great support for I/O via its read_csv, read_json, etc. parsers, which was the main reason I chose it to begin with. However, one of the main purposes of this project is to build a package that provides an interface to easily manipulate tabular data using SQL (not some pseudo, SQL-like syntax) on a Python object (via an SQLite in-memory database) instead of the syntax, mentality, and operations that pandas imposes on users.
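To make the "real SQL on a Python object" idea concrete, here is a minimal sketch using only the standard library's sqlite3 module. The `people` table, its rows, and the `query` helper are made up for illustration; this is not faro's actual API.

```python
import sqlite3

# Hypothetical rows standing in for data parsed from a file.
rows = [("alice", 34, 58.5), ("bob", 27, 72.0), ("carol", 41, 65.2)]

# An in-memory SQLite database lives only as long as the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, weight REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)


def query(sql, params=()):
    """Run a SQL statement and return all result rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


# Plain SQL instead of DataFrame-style filtering.
older = query("SELECT name, age FROM people WHERE age > ? ORDER BY age", (30,))
```

The equivalent pandas filter would be written in pandas' own indexing syntax; here the filter is just a WHERE clause.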
My proposed solution (I just haven't had time to work on it) is to develop a simple but hopefully robust parser for I/O with data from files (such as delimited, .xlsx, or JSON) that will easily map into SQL tables (with their correct types). There is likely a lot of overlap here with @derrickturk's project antibiotics.
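A rough sketch of what such a parser could look like for delimited text, using only the standard library. The function names (`infer_sql_type`, `load_delimited`), the type-inference rule (INTEGER, then REAL, then TEXT), and the sample data are all assumptions for illustration, not an existing faro or antibiotics interface.

```python
import csv
import io
import sqlite3


def infer_sql_type(values):
    """Return the narrowest SQLite type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"  # nothing to infer from, fall back to TEXT

    def all_parse(cast):
        try:
            for v in non_empty:
                cast(v)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "REAL"
    return "TEXT"


def load_delimited(conn, table, text, delimiter=","):
    """Parse delimited text (first row is the header) straight into a SQL table."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    records = [rec for rec in reader if rec]  # skip blank lines
    # Transpose rows into columns so each column's type can be inferred.
    columns = list(zip(*records)) if records else [() for _ in header]
    types = [infer_sql_type(col) for col in columns]
    casts = {"INTEGER": int, "REAL": float, "TEXT": str}

    col_defs = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')

    placeholders = ", ".join("?" for _ in header)
    converted = [
        # Empty fields become NULL; everything else is cast to the column type.
        tuple(None if v == "" else casts[t](v) for v, t in zip(rec, types))
        for rec in records
    ]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', converted)
    return dict(zip(header, types))


conn = sqlite3.connect(":memory:")
schema = load_delimited(conn, "wells", "name,depth,rate\nA-1,1200,35.5\nB-2,980,\n")
```

The data goes from text to typed SQL rows in one pass, with no intermediate tabular object. A real parser would also need to handle ragged rows, dates, and quoting of unusual column names.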
Currently, faro's implementation for adding a table to a faro.Database simply hands off the hard work to the pandas parsers (see: faro.Database.add_table). The parsers are good, but there are a few issues with using them:
- The data is first parsed into a DataFrame, which then has to be transferred once again into SQL. There are too many intermediate steps. Furthermore, their parsers make different assumptions about the types they have to parse than what I want for faro.
- It adds a dependency on pandas, which is a very bloated package overall.

Hopefully this helps explain and frame the problem a little better.
In order to minimize "bloat" in the library, it is possible to make faro a "pure-Python" package by removing the pandas dependency for the underlying operations and instead opt for customized data structures such as namedtuple or dataclass for Python 3.7. This would mainly affect the underlying implementation of the faro.Table class.

This decision would affect the direction of the package in two major ways.
- It would limit users to Python 3.7+ (if using dataclasses)
- It would remove pandas-dependent operations.

Conversion from a faro.Table to a numpy.ndarray or a pandas.DataFrame would still be supported, but with optional dependencies for the user.
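As a sketch of how this could look, the class below stores rows as namedtuples and only imports numpy or pandas inside the conversion methods, so neither is required at install time. The class layout and the method names (`to_numpy`, `to_dataframe`) are assumptions for illustration, not the current faro.Table implementation, and column names are assumed to be valid Python identifiers.

```python
from collections import namedtuple


class Table:
    """Hypothetical pure-Python stand-in for faro.Table (not faro's real API)."""

    def __init__(self, name, columns, rows):
        self.name = name
        # One namedtuple type per table; fields are the column names.
        self.Row = namedtuple("Row", columns)
        self.rows = [self.Row(*r) for r in rows]

    @property
    def columns(self):
        return self.Row._fields

    def to_numpy(self):
        # Imported lazily so numpy stays an optional dependency.
        try:
            import numpy as np
        except ImportError as exc:
            raise ImportError("to_numpy() requires the optional numpy dependency") from exc
        return np.array(self.rows)

    def to_dataframe(self):
        # Imported lazily so pandas stays an optional dependency.
        try:
            import pandas as pd
        except ImportError as exc:
            raise ImportError("to_dataframe() requires the optional pandas dependency") from exc
        return pd.DataFrame(self.rows, columns=list(self.columns))


table = Table("people", ["name", "age"], [("alice", 34), ("bob", 27)])
first_age = table.rows[0].age  # rows are accessible by column name
```

Users who never call the conversion methods never need numpy or pandas installed. A dataclass-based row type would work similarly on Python 3.7+.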