zilliztech / GPTCache

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.
https://gptcache.readthedocs.io
MIT License

[Enhancement]: Return different cached output based on system prompt used #482

Open transhapHigsn opened 1 year ago

transhapHigsn commented 1 year ago

What would you like to be added?

Support specifying a handler, similar to pre_embedding_func, that returns a version / unique tag for the system prompt used in the request, which is then stored alongside the cache data and embedding data. This version / unique tag should also be taken into account when searching the embedding data and fetching the final cached data.
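
For illustration, a rough sketch of what such a handler could look like. The `prompt_tag_func` name and the way it would be registered are hypothetical; only `pre_embedding_func` exists in GPTCache today.

```python
# Hypothetical handler in the spirit of pre_embedding_func: it receives the
# request kwargs and returns a tag identifying the system prompt in use.
import hashlib

def system_prompt_tag(data, **_):
    """Return a stable tag for whichever system prompt this request carries."""
    messages = data.get("messages", [])
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    # Any deterministic mapping works; a short hash avoids storing the full prompt.
    return hashlib.sha256(system.encode("utf-8")).hexdigest()[:8]

# Proposed wiring (prompt_tag_func is NOT a real GPTCache parameter yet):
#   cache.init(pre_embedding_func=last_content,
#              prompt_tag_func=system_prompt_tag, ...)
# The tag would be stored next to the embedding and required to match on lookup.
```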

Why is this needed?

I am working on a project where different types of information are extracted from the same user prompt depending on the task at hand. For instance, one API call is used to identify the user's intent, and the next API call is used to generate a YAML document from the intent identified in the previous request. In this scenario the user prompt is the same both times, but the system prompt is different. Since the system prompt is much larger than the user prompt, a very high threshold needs to be set for similarity evaluation (the cache embedding is calculated over the combined system prompt + user prompt text), which often performs much worse than evaluating similarity on just the user prompt.
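
To make the scenario concrete, a minimal sketch of the two calls; the model name and prompt texts are placeholders, and it assumes the GPTCache OpenAI adapter is used.

```python
# Both requests share the user prompt; only the much longer system prompt differs,
# so an embedding computed over system + user text is dominated by the system prompt.
from gptcache.adapter import openai  # GPTCache drop-in for the openai client

INTENT_SYSTEM_PROMPT = "You identify the user's intent. Possible intents are: ..."  # long in practice
YAML_SYSTEM_PROMPT = "You generate a YAML spec for the given intent. The schema is: ..."  # long in practice

user_prompt = "Create a nightly backup job for the reports database"

intent_resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": INTENT_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
)

yaml_resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": YAML_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
)
```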

Anything else?

No response

SimFG commented 1 year ago

Hi @transhapHigsn, are there many different system prompts that you use?

transhapHigsn commented 1 year ago

Yes, generally I use two sets of system prompts, and each has a different response structure defined: one for identifying the user's intent (there are a lot of actions that can be performed, and I can't include all of that information in a single system prompt, so I'm using this approach), and the other for performing the actual task, that is, generating YAML from the user prompt.

Also, there is the scenario of fixing issues in a system prompt. In such cases, I have to remove everything from the vector DB to make sure it doesn't return invalid model responses.

Edit:

SimFG commented 1 year ago

> The solution that we came up with to counter this is to use multiple cache instances based on objectives.

This is also a temporary method I can think of.

I will plan how to develop this feature.
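
For reference, a minimal sketch of that multiple-cache workaround with the current API; the storage backends, directory names, and model are illustrative choices.

```python
# One Cache object per objective/system prompt; each keeps its own scalar and
# vector storage, so identical user prompts never collide across objectives.
from gptcache import Cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.processor.pre import last_content
from gptcache.similarity_evaluation import SearchDistanceEvaluation

onnx = Onnx()

def make_cache(data_dir: str) -> Cache:
    c = Cache()
    c.init(
        pre_embedding_func=last_content,   # embed only the last (user) message
        embedding_func=onnx.to_embeddings,
        data_manager=manager_factory(
            "sqlite,faiss",
            data_dir=data_dir,
            vector_params={"dimension": onnx.dimension},
        ),
        similarity_evaluation=SearchDistanceEvaluation(),
    )
    return c

intent_cache = make_cache("./intent_cache")
yaml_cache = make_cache("./yaml_cache")

# Route each request to the cache matching its objective via cache_obj.
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Create a nightly backup job"}],
    cache_obj=intent_cache,
)
```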

transhapHigsn commented 1 year ago

Thank you for considering this. I will be happy to help out here if you can guide me on how you want this to be developed.

wwzeng1 commented 1 year ago

Hi! I'm one of the founders of Sweep, a GitHub app that solves issues by writing pull requests. This looks like a good issue for Sweep (https://github.com/sweepai/sweep) to try. We have onboarding instructions here; I'm also happy to help you onboard directly :)

grski commented 1 year ago

Hey, we have a somewhat similar situation: depending on the locale, which we only learn when the user makes a request, we need to dynamically decide which context to fetch from Qdrant (each locale has its own context that is significantly different, think e.g. tax rates per country). A few approaches I'm considering:

  1. A decorator-based approach (which I'm experimenting with a bit) that reinitialises the cache each time we process a request and dynamically changes the collection name in the Qdrant store.
  2. Pre-allocating a cache for every locale and keeping them all in memory permanently, each with a different collection name (sketched below).
  3. The most hacky/buggy one, I think: adding a hash of the locale in front of the request so that the overall similarity goes down; even though the entries would live in one collection, they wouldn't get fetched.

For us the dream solution would be to be able to dynamically change the collection_name that gptcache uses for lookup in Qdrant.

I'm happy to help with the implementation and share my findings.
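
For example, a rough sketch of option 2, following the same multiple-cache pattern as above; the "qdrant" store name and the collection_name/dimension keys passed through vector_params are assumptions about the Qdrant integration available in your GPTCache version, and the locales and model are illustrative.

```python
# Pre-allocate one cache per locale at startup; each points at its own Qdrant
# collection, so lookups never cross locales.
from gptcache import Cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.processor.pre import last_content

onnx = Onnx()

def build_locale_cache(locale: str) -> Cache:
    c = Cache()
    c.init(
        pre_embedding_func=last_content,
        embedding_func=onnx.to_embeddings,
        data_manager=manager_factory(
            "sqlite,qdrant",  # assumed store name for the Qdrant vector base
            data_dir=f"./cache_{locale}",
            vector_params={
                "collection_name": f"gptcache_{locale}",  # assumed parameter name
                "dimension": onnx.dimension,
            },
        ),
    )
    return c

# Kept in memory for the lifetime of the service (option 2).
locale_caches = {loc: build_locale_cache(loc) for loc in ["en_GB", "de_DE", "pl_PL"]}

def cached_completion(locale: str, messages: list):
    # Route the request to the cache that owns this locale's Qdrant collection.
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages, cache_obj=locale_caches[locale]
    )
```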

SimFG commented 1 year ago

> For us the dream solution would be to be able to dynamically change the collection_name that gptcache uses for lookup in Qdrant.

This is similar to multiple cache objects.

Of course, I am also looking forward to your implementation plan.