xanderyzwich / datapy

File based database without external dependencies.
Apache License 2.0
3 stars 2 forks source link

SQL to pandas conversions #10

Open drkennetz opened 4 years ago

drkennetz commented 4 years ago

Create an SQL to Pandas converter so that SQL statements can be directly applied to pandas dataframes.

A simple example would be:

SQL:

select table.column from schema.table where table.colum = "Funny"

Pandas:

df[df['column'] == "Funny"]
xanderyzwich commented 4 years ago

this is something that seems to be quite desirable with Pandas users. This could be a big help!

ericwburden commented 4 years ago

Serious question: Is it really necessary to convert SQL queries to pandas syntax (or vice versa)? Since the sqlite3 library is included in base Python (reference), you can spin up an in-memory sqlite database on command (conn = sqlite3.connect(':memory:')), and you can pretty seamlessly pull database tables into a dataframe (pandas.read_sql()) or write them to a dataframe (df.to_sql()), it seems like we wouldn't need to re-invent the wheel for this.

ericwburden commented 4 years ago

I wrote up a POC implementation to read data from a .csv file into an in-memory (or on file) sqlite database and select it back out again. Theoretically, you could read in a number of tables, do your selects/inserts/joins, etc, then write your results back out the the .csv files. It introduces a bit of overhead, but you do get the option to use an on-disk sqlite file (for larger datasets) and you don't need the pandas dependency.

drkennetz commented 4 years ago

I like the idea of leveraging sqlite3 here. Good thinking @ericwburden

xanderyzwich commented 4 years ago

If that's as easy as you make it sound then we may be able to get that in during October for the first release.

ericwburden commented 4 years ago

Should I PR the POC so you can check it out?

xanderyzwich commented 4 years ago

I'd like some thoughts from @drkennetz on that

drkennetz commented 4 years ago

I think it would be good to do a PR for the POC. I checked out the repo and it looks solid. We can always review

drkennetz commented 4 years ago

@ericwburden could we please open a PR for this to go into development. If this is going to be an implementation we use then I cannot begin future work until this PR is in place.

ericwburden commented 4 years ago

@drkennetz Just did. Thanks for the reminder.

drkennetz commented 4 years ago

@ericwburden could you pull into dev?

drkennetz commented 4 years ago

@xanderyzwich @ericwburden began having some fun with this this morning and realized that I do believe we can utilize sqlite3 to create tables, but you cannot create a schema with sqlite3 https://stackoverflow.com/questions/30897377/python-sqlite3-create-a-schema-without-having-to-use-a-second-database.

However, the idea of a schema in our context is simply a top level directory, and I do think it is possible to make our own schema object.

xanderyzwich commented 4 years ago

@drkennetz that sounds good to me.

drkennetz commented 4 years ago

@xanderyzwich @ericwburden to comment further, I can see how these will play together. I think @ericwburden's code will be very beneficial once we have already created a schema and data in tables as it is accessing the data, but those need to be in place first. I'll be working on a schema class and a table class which inherits the schema class.