yhat / pandasql

sqldf for pandas
MIT License

pandasql does not work in a function #42

Closed cardosan closed 8 years ago

cardosan commented 9 years ago

Hi, I made a function like the one below to query a pandas dataframe, but it tells me 'Exception: UNECE_csv not found'. I cannot understand where the problem is, since if I run the same script outside the function it works perfectly.

def import_csv_and_query():

    filepath = 'C:\\DATA\\Work\\.....csv'
    import pandas as pd
    from pandasql import sqldf

    UNECE_csv = pd.read_csv(filepath, dtype=object, na_filter=False)
    q = """

    MY LONG QUERY
    """

    UNECE_clean = sqldf(q, globals())

    UNECE_colums = UNECE_clean.columns.tolist()
    UNECE_array = UNECE_clean.values.tolist()
    UNECE = [UNECE_colums] + UNECE_array
    return UNECE
stonebig commented 9 years ago

maybe try:

UNECE_clean=sqldf(q, dict(globals(),**locals()))
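
In the original function that would look roughly like this (an untested sketch keeping the reporter's names; the file path is taken as a parameter and the query is a stand-in):

import pandas as pd
from pandasql import sqldf

def import_csv_and_query(filepath):
    UNECE_csv = pd.read_csv(filepath, dtype=object, na_filter=False)
    q = "SELECT * FROM UNECE_csv"  # stand-in for the long query in the report
    # inside the function UNECE_csv lives in locals(), not globals(),
    # so merge both namespaces before handing them to sqldf
    return sqldf(q, dict(globals(), **locals()))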
figgytime commented 8 years ago

@cardosan @stonebig Hello, I am loading data via a Flask 'request' and get the same error when using pandasql inside a function. Tried stonebig's suggestion but still same error. Does anyone have a solution to this?

stonebig commented 8 years ago

you may try to pass your "target" variables to the function, let's say 'var1' and 'var2', and then do this in the function:

UNECE_clean=sqldf(q, {'var1':var1,'var2':var2})
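
For the original report, that would mean passing the dataframe under the name the query uses (a sketch with the reporter's names, assuming UNECE_csv and q are already defined as in the function above):

UNECE_clean = sqldf(q, {'UNECE_csv': UNECE_csv})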

or, even trickier, merge everything into one namespace (the dict-literal unpacking below needs Python 3.5+):

UNECE_clean = sqldf(q, {**globals(), **locals(), 'var1': var1, 'var2': var2})
figgytime commented 8 years ago

Thank you for the ideas. The first suggestion does not raise an error when the Python file runs (the second one does). However, when I test the output of the Flask response, I get a similar error: "global name 'customer_id' is not defined".

Would you please clarify what the "target" variables would be in my case? The data is loaded as JSON via a Flask request, converted to a pandas dataframe, and then the columns are formatted.

@app.route('/results', methods=['POST'])
def load():
    data = request.get_json()
    df = pd.io.json.json_normalize(data)
    df.columns = df.columns.map(lambda x: x.split(".")[-1])

    q = """ Select * from df """
    query = sqldf(q, {'customer_id': customer_id, 'verified': verified, 'token': token, 'version': version})

    resp = make_response(query.to_json())
    resp.mimetype = 'application/json'
    return resp

JSON payload:

{
    "response": {
        "version": "1.1",
        "token": "dsfgf",
        "body": {
            "customer": {
                "customer_id": "1234567",
                "verified": "true"
            }
        }
    }
}
stonebig commented 8 years ago

Target variables are the ones you want to query. Example:

import pandas as pd
from pandasql import sqldf

global_df = pd.DataFrame({'A': pd.Series(1, index=list(range(888)), dtype='float32'),
                          'B': pd.Series(1, index=list(range(888)), dtype='float32')})

def rien(q, ext_df):
    local_df = pd.DataFrame({'B': pd.Series(1, index=list(range(888)), dtype='float32'),
                             'D': pd.Series(1, index=list(range(0, 888, 2)), dtype='float32')})
    # map the table names used in the query to the dataframes explicitly,
    # instead of relying on globals()/locals()
    return sqldf(q, {'ext_df': ext_df, 'local_df': local_df})

rien("select * FROM ext_df INNER JOIN local_df ON ext_df.B = local_df.B LIMIT 3", global_df)
figgytime commented 8 years ago

Merci beaucoup! Thank you very much, you did it. I am very grateful.

As a follow-up, I am still unsure why we have to create this additional function [rien(q, ext_df)], especially since in my case I only have one dataframe and no joining; and why does the query select from ext_df rather than from our dataframe?

stonebig commented 8 years ago

Maybe this simpler solution would work for your simple case:

UNECE_clean = sqldf(q, locals())

The problem is that the behavior of globals() and locals() is subtle and frequently doesn't match what we want as soon as we try to use pandasql inside a bigger function. (In the example, ext_df is just the parameter name inside rien(); the keys of the dict passed to sqldf are the table names the query can see, which is why the query selects from ext_df.)
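
Applied to the Flask handler above, the simple case could look roughly like this (an untested sketch; only the sqldf call changes):

from flask import Flask, request, make_response
import pandas as pd
from pandasql import sqldf

app = Flask(__name__)

@app.route('/results', methods=['POST'])
def load():
    data = request.get_json()
    df = pd.io.json.json_normalize(data)
    df.columns = df.columns.map(lambda x: x.split(".")[-1])

    q = "SELECT * FROM df"
    # df is a local variable of load(), so pass locals()
    # (or an explicit mapping such as {'df': df})
    query = sqldf(q, locals())

    resp = make_response(query.to_json())
    resp.mimetype = 'application/json'
    return resp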

glamp commented 8 years ago

Going to close as this isn't an issue with pandasql itself.

TheGlobalist commented 4 years ago

Sorry for posting on a four-year-old issue, but I just wanted to highlight that stonebig's example above (the rien(q, ext_df) function that passes an explicit {'ext_df': ext_df, 'local_df': local_df} mapping to sqldf) also worked for me with Sanic. Leaving this here so that others know.