yanniskatsaros / faro

An SQL-focused data analysis library for Python.
MIT License

Remove underlying `pandas` dependency. #6

Open yanniskatsaros opened 5 years ago

yanniskatsaros commented 5 years ago

To minimize "bloat" in the library, faro could become a "pure-Python" package by removing the pandas dependency for the underlying operations and instead opting for customized data structures such as namedtuple or dataclass (the latter is in the standard library from Python 3.7). This would mainly affect the underlying implementation of the faro.Table class.

This decision would affect the direction of the package in two major ways.

  1. It would restrict users to Python >= 3.7 (due to the use of dataclasses).
  2. It would require a rewrite of all pandas-dependent operations.

Conversion from a faro.Table to a numpy.ndarray or a pandas.DataFrame would still be supported, but with optional dependencies for the user.
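
As a rough illustration of the idea, here is a minimal sketch of a pandas-free table. The `Table` class, its fields, and its method names are hypothetical and are not faro's actual `faro.Table` implementation. Rows are stored as plain tuples, and pandas is imported only inside the conversion method so that it stays an optional dependency:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Table:
    """Hypothetical pure-Python table: column names plus a list of row tuples."""
    name: str
    columns: Tuple[str, ...]
    rows: List[Tuple[Any, ...]] = field(default_factory=list)

    def column(self, col: str) -> List[Any]:
        """Return all values of one column, looked up by name."""
        idx = self.columns.index(col)
        return [row[idx] for row in self.rows]

    def to_dicts(self) -> List[dict]:
        """Return each row as a {column name: value} dictionary."""
        return [dict(zip(self.columns, row)) for row in self.rows]

    def to_pandas(self):
        """Convert to a pandas.DataFrame; pandas is imported lazily."""
        try:
            import pandas as pd
        except ImportError as exc:
            raise ImportError("pandas is required for Table.to_pandas()") from exc
        return pd.DataFrame(self.rows, columns=list(self.columns))


people = Table("people", ("name", "age"), [("Ada", 36), ("Grace", 45)])
ages = people.column("age")
```

With this layout, only users who call `to_pandas()` need pandas installed; everything else runs on the standard library.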

Glyphack commented 5 years ago

Since dataclasses are also available as a Python package, users are not forced to upgrade to Python 3.7 :beers:.
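
For illustration, assuming the package is built with setuptools, the backport could be declared as a conditional requirement using an environment marker. The excerpt below is a hypothetical fragment, not faro's actual setup.py. Note that the `dataclasses` package on PyPI is a backport for Python 3.6, so this would lower the minimum version to 3.6 rather than remove it:

```python
# Hypothetical excerpt for a setup.py; the list would be passed to
# setuptools.setup(install_requires=...).
install_requires = [
    # Installed only on interpreters that lack the standard-library module.
    'dataclasses; python_version < "3.7"',
]
```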

yanniskatsaros commented 5 years ago

@Glyphack good point! Glad to know that this change won't limit users to only Python 3.7+

Glyphack commented 5 years ago

I have a question that may help me resolve this issue. I have not worked with pandas, so what does pandas have to do with dataclasses or namedtuples? Is there any data structure in pandas that can be replaced with these?

yanniskatsaros commented 5 years ago

Here's some background on pandas and the DataFrame object: the pandas.DataFrame object is essentially a way to represent tabular ("tidy") data that can be accessed by column name, filtered by a particular value, etc. It's very popular in the data science community for working with tabular data in memory to explore, manipulate, and visualize it.

Right now, using a DataFrame is convenient for faro because pandas has great support for I/O via its read_csv, read_json, etc. parsers, which was the main reason I chose it to begin with. However, one of the main purposes of this project is to build a package that provides an interface for easily manipulating tabular data on a Python object using SQL (not some pseudo, SQL-like syntax), via an SQLite in-memory database, instead of the syntax, mentality, and operations that pandas imposes on users.
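
To make the goal concrete, here is a minimal sketch that uses only the standard-library `sqlite3` module. The `staff` table and its data are made up for the example, and this is not how faro itself is implemented. Plain Python rows go into an in-memory database and are then queried with ordinary SQL:

```python
import sqlite3

# Hypothetical in-memory data: one tuple per row.
rows = [
    ("Ada", "eng", 120),
    ("Grace", "eng", 135),
    ("Linus", "ops", 110),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", rows)

# Real SQL takes the place of a pandas groupby/aggregate chain.
result = conn.execute(
    "SELECT dept, AVG(salary) FROM staff GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()
```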

My proposed solution (I just haven't had time to work on it) is to develop a simple but hopefully robust parser for I/O with data from files (such as delimited, .xlsx, or JSON) that maps easily into SQL tables (with their correct types). There is likely a lot of overlap here with @derrickturk's project antibiotics.
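
A naive sketch of such a parser for delimited text is shown below. The function names `infer_sql_type` and `load_delimited` are hypothetical, and the inference rule (try `int`, then `float`, else text) is an assumption made for the example, not a settled design. It also assumes every row has as many fields as the header:

```python
import csv
import io
import sqlite3


def infer_sql_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # empty fields are treated as NULL, not as evidence
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "INTEGER"
    if fits(float):
        return "REAL"
    return "TEXT"


def load_delimited(conn, table, fileobj, delimiter=","):
    """Read delimited text and create a typed SQLite table from it."""
    reader = csv.reader(fileobj, delimiter=delimiter)
    header = next(reader)
    rows = list(reader)
    types = [infer_sql_type([r[i] for r in rows]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    # Convert each field to its inferred Python type; empty strings become NULL.
    casts = {"INTEGER": int, "REAL": float, "TEXT": str}
    cleaned = [
        [None if v == "" else casts[t](v) for v, t in zip(r, types)]
        for r in rows
    ]
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', cleaned)
    return dict(zip(header, types))


conn = sqlite3.connect(":memory:")
data = io.StringIO("name,age,score\nAda,36,9.5\nGrace,45,\n")
schema = load_delimited(conn, "people", data)
stored = conn.execute("SELECT name, age, score FROM people").fetchall()
```

The data goes from the file straight into SQLite with no intermediate DataFrame. A real implementation would also need to handle dates, booleans, ragged rows, and identifier escaping.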

Currently, faro's implementation for adding a table to a faro.Database simply hands off the hard work to the pandas parsers (see: faro.Database.add_table). The parsers are good, but there are a few issues with using them:

  1. They read the data into a DataFrame, which then has to be transferred once again into SQL. There are too many intermediate steps. Furthermore, their parsers make different assumptions about the types they have to parse than what I want for faro. (pandas
  2. Using them imposes the use of pandas, which is a very bloated package overall.

Hopefully this helps explain and frame the problem a little better.