yhat / pandasql

sqldf for pandas
MIT License

sqldf does not support large data sets #80

Open dqaudithillstone opened 4 years ago

dqaudithillstone commented 4 years ago

My dataframe has more than 120,000 records. When I query it with sqldf and add LIMIT 20000, the program runs without error, but if I remove the LIMIT clause it fails.

Source code:

```python
df = pd.read_csv('./dataAnalyse/ht.csv', encoding='utf_8_sig')  # supports Chinese paths
q = """
select a.[合同编号] as htbh1, b.[合同编号] as htbh2,
       a.[合同名称] as htmc1, b.[合同名称] as htmc2
from df a
left join df b
  on a.年度 = b.年度
 and a.公司名称 = b.公司名称
 and a.[合同编号] <> b.[合同编号]
limit 20000;
"""
df0 = pysqldf(q)
print(df0.info())
```

The error message is as follows:

```
PS D:\python\myworks> & C:/python/Python37-32/python.exe d:/python/myworks/DataAnalyse/xsd.py
Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3206, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3173, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError
```

The above exception was the direct cause of the following exception:

```
Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 61, in __call__
    result = read_sql(query, conn)
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 438, in read_sql
    chunksize=chunksize,
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 1231, in read_query
    data = result.fetchall()
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3212, in fetchall
    self.cursor, self.context)
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 1843, in _handle_dbapi_exception
    from e
sqlalchemy.exc.OperationalError: (OperationalError) None None
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "d:/python/myworks/DataAnalyse/xsd.py", line 20, in <module>
    df0 = pysqldf(q)
  File "d:/python/myworks/DataAnalyse/xsd.py", line 7, in pysqldf
    return sqldf(q, globals())
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf
    return PandaSQL(db_uri)(query, env)
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 63, in __call__
    raise PandaSQLException(ex)
pandasql.sqldf.PandaSQLException: (OperationalError) None None
```
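A possible workaround, not from the thread, just a sketch: the traceback shows the failure happens inside `fetchall()`, which tries to materialize the entire join result at once. Running the same query against SQLite directly via the stdlib `sqlite3` module and fetching in bounded chunks avoids holding every row in memory. The table schema and data below are hypothetical stand-ins for the contracts dataframe (year, company, contract id):

```python
import sqlite3

# Hypothetical stand-in data: 200 contracts sharing one (year, company) group,
# so the self-join produces 200 * 199 = 39800 candidate pairs.
rows = [(2020, "A", f"C{i}") for i in range(200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (year INTEGER, company TEXT, contract_id TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

query = """
SELECT a.contract_id, b.contract_id
FROM df a
LEFT JOIN df b
  ON a.year = b.year
 AND a.company = b.company
 AND a.contract_id <> b.contract_id
"""

cur = conn.execute(query)
total = 0
while True:
    chunk = cur.fetchmany(5000)  # fetch at most 5000 rows at a time
    if not chunk:
        break
    total += len(chunk)          # process each chunk here instead of just counting

print(total)
```

Processing chunk by chunk keeps peak memory proportional to the fetch size rather than the full result, which matters especially on a 32-bit Python build like the one in the traceback.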

zbrookle commented 3 years ago

@dqaudithillstone Part of the performance problem with this package is that it uses SQLite as a backend: every query round-trips your data through SQLite, so you lose one of the main benefits of pandas, which is in-memory computation. As a solution to this problem, a few months back I created a package called dataframe_sql, which addresses it by parsing the SQL and translating it into native pandas operations. Hope this helps!
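For reference, the self-join in the question can also be written by hand with native pandas `DataFrame.merge`, no SQL layer involved. This is a minimal sketch with simplified English column names (`year`, `company`, `contract`) standing in for 年度/公司名称/合同编号 and invented sample data:

```python
import pandas as pd

# Hypothetical stand-in for the contracts dataframe
df = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2021],
    "company":  ["A", "A", "B", "A"],
    "contract": ["c1", "c2", "c3", "c4"],
})

# Equivalent of:
#   from df a left join df b
#   on a.year = b.year and a.company = b.company and a.contract <> b.contract
pairs = df.merge(df, on=["year", "company"], how="left", suffixes=("_a", "_b"))
pairs = pairs[pairs["contract_a"] != pairs["contract_b"]]

print(pairs[["contract_a", "contract_b"]].to_string(index=False))
```

One caveat: filtering after the merge drops rows that have no partner, whereas the SQL LEFT JOIN would keep them with NULLs on the b side; add them back if that distinction matters for your audit.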