zilliztech / GPTCache

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.
https://gptcache.readthedocs.io
MIT License

[Bug]: langchain, postgresql and pgvector looks up answer using distance, not id - always returns None #511

Closed: RayceRossum closed this issue 11 months ago

RayceRossum commented 11 months ago

When using GPTCache with LangChain, PostgreSQL and pgvector, I'm running into an issue where the correct answer is being found, but the search in the data manager is looking up the data by the distance (0.0) instead of by the id.

import langchain
from langchain.cache import GPTCache

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager

# get_hashed_name, get_input and settings come from elsewhere in my code


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    onnx = Onnx()
    cache_base = CacheBase("postgresql", sql_url=settings.DATABASE_URL)
    vector_base = VectorBase(
        "pgvector",
        url=settings.DATABASE_URL,
        collection_name="gptcache_" + hashed_llm,
        dimension=onnx.dimension,
    )
    data_manager = get_data_manager(cache_base, vector_base)
    init_similar_cache(
        cache_obj=cache_obj,
        data_manager=data_manager,
        pre_func=get_input,
        embedding=onnx,
    )

langchain.llm_cache = GPTCache(init_gptcache)
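
For context, the cache gets exercised through normal LangChain calls, along these lines (illustrative only; the model and prompt are just placeholders):

from langchain.llms import OpenAI

llm = OpenAI(temperature=0)  # requires OPENAI_API_KEY in the environment

llm("What is GPTCache?")  # first call: cache miss, the answer gets stored
llm("What is GPTCache?")  # second call: should be served from the semantic cache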

res_data is passed into the data manager's get_scalar_data as a <class 'sqlalchemy.engine.row.Row'> containing:

(<gptcache.manager.vector_data.pgvector._get_model_and_index.<locals>.VectorStoreTable object at 0x7f917a4f2ce0>, 0.0)

get_scalar_data uses res_data[1], which is the distances column value added by the previous search_data step in the adapter, not the id:

cache_data = self.s.get_data_by_id(res_data[1])

https://github.com/zilliztech/GPTCache/blob/8f7d36a97ca5cd55930a73323ea67c6fbd607e4f/gptcache/manager/data_manager.py#L341C56-L341C56
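
To make the shape mismatch concrete (the values below are made up; only the tuple layout matters):

# faiss-style search result: (distance, id) - index 1 is a usable primary key
faiss_row = (0.12, 7)
print(faiss_row[1])          # 7 -> get_data_by_id works

# pgvector search result: (VectorStoreTable instance, distance)
class FakeVectorStoreRow:    # stand-in for the SQLAlchemy model instance
    id = 7

pgvector_row = (FakeVectorStoreRow(), 0.0)
print(pgvector_row[1])       # 0.0 -> the distance, so get_data_by_id receives a float
print(pgvector_row[0].id)    # 7   -> where the id actually lives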

A simple fix that shows the correct answer in my debugger is:

cache_data = self.s.get_data_by_id(res_data[0].id)

but I imagine this has additional impacts, so there is likely something that needs to be done with the pgvector search function instead.

UPDATE: Taking a quick look, I can see faiss returns list(zip(dist[0], ids)) from its search function, which makes sense, whereas the pgvector search function doesn't quite line up. Likely just an edit here is needed to match up the formats:

with self._session() as session:
    session.execute(text(f"SET LOCAL ivfflat.probes = {self.index_params['params']['probes'] or 10};"))
    search_result = self._query(session).add_columns(
        similarity.label("distances")
    ).order_by(
        similarity
    ).limit(top_k).all()

return search_result

https://github.com/zilliztech/GPTCache/blob/8f7d36a97ca5cd55930a73323ea67c6fbd607e4f/gptcache/manager/vector_data/pgvector.py#L145

SimFG commented 11 months ago

@RayceRossum Thanks for your feedback. I will try to fix it.

SimFG commented 11 months ago

@RayceRossum 0.1.39 has been released; you can give it a try. Looking forward to your feedback.

RayceRossum commented 11 months ago

@SimFG, this is still broken. I believe it is backwards: the id should be res_data[1], but right now you have res_data[0] as the id.

            ).order_by(
                similarity
            ).limit(top_k).all()
            search_result = [(r[0].id, r[1]) for r in search_result]

        return search_result
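
Something like this (a sketch, not a tested patch) would put the id back at index 1, matching the faiss format:

search_result = [(r[1], r[0].id) for r in search_result]  # (distance, id)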

RayceRossum commented 11 months ago

I've tested this on my end and lookup seems to be working correctly. The remaining problem is that, when using GPTCache, LangChain generates Generation objects instead of ChatGeneration objects, and then puts those Generations into a ChatResult when it should be putting ChatGenerations into the ChatResult.

https://github.com/langchain-ai/langchain/pull/8603

SimFG commented 11 months ago

@RayceRossum I'm so sorry. It was my carelessness.

SimFG commented 11 months ago

@RayceRossum I have fixed it. Thanks for your feedback; I will close the issue.