
Semantic cache for LLMs. Fully integrated with LangChain and llama_index.
https://gptcache.readthedocs.io
MIT License

GPTCache: A Library for Creating Semantic Cache for LLM Queries

Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x ⚡


🎉 GPTCache has been fully integrated with 🦜️🔗 LangChain! Here are detailed usage instructions.

🐳 The GPTCache server Docker image has been released, which means that any language can now use GPTCache!

📔 This project is under rapid development, and the API may change at any time. For the most up-to-date information, please refer to the latest documentation and release notes.

Quick Install

pip install gptcache

🚀 What is GPTCache?

ChatGPT and various large language models (LLMs) boast incredible versatility, enabling the development of a wide range of applications. However, as your application grows in popularity and encounters higher traffic levels, the expenses related to LLM API calls can become substantial. Additionally, LLM services might exhibit slow response times, especially when dealing with a significant number of requests.

To tackle this challenge, we have created GPTCache, a project dedicated to building a semantic cache for storing LLM responses.

😊 Quick Start

Note:

dev install

# clone GPTCache repo
git clone -b dev https://github.com/zilliztech/GPTCache.git
cd GPTCache

# install the repo
pip install -r requirements.txt
python setup.py install

example usage

These examples will help you understand how to use exact and similar matching with caching. You can also run the examples on Colab. For more examples, refer to the Bootcamp.

Before running the example, make sure the OPENAI_API_KEY environment variable is set by executing echo $OPENAI_API_KEY.

If it is not already set, you can set it with export OPENAI_API_KEY=YOUR_API_KEY on Unix/Linux/macOS systems or set OPENAI_API_KEY=YOUR_API_KEY on Windows systems.

Note that this only sets the variable for the current shell session. For a permanent effect, add it to your shell's environment configuration file; on a Mac, for instance, you can modify the file located at /etc/profile.
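As a quick sanity check before running the examples, you can also verify the variable from Python (a minimal sketch that only reads the environment; it does not set the key for you):

```python
import os

# Fail fast if the key is missing so the examples below don't error mid-run.
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the examples.")
print("OPENAI_API_KEY is set ({} characters)".format(len(api_key)))
```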

#### OpenAI API original usage

```python
import os
import time

import openai


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


question = "what's chatgpt"

# OpenAI API original usage
openai.api_key = os.getenv("OPENAI_API_KEY")
start_time = time.time()
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {
            'role': 'user',
            'content': question
        }
    ],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')
```

#### OpenAI API + GPTCache, exact match cache

> If you ask ChatGPT the exact same question twice, the answer to the second question will be obtained from the cache without requesting ChatGPT again.

```python
import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
```

#### OpenAI API + GPTCache, similar search cache

> After obtaining an answer from ChatGPT in response to several similar questions, the answers to subsequent questions can be retrieved from the cache without the need to request ChatGPT again.

```python
import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
```

#### OpenAI API + GPTCache, use temperature

> You can always pass a `temperature` parameter when requesting the API service or model.
>
> The range of `temperature` is [0, 2]; the default value is 0.0.
>
> A higher temperature means a higher probability of skipping the cache search and requesting the large model directly. When temperature is 2, the cache is always skipped and the request goes straight to the large model; when temperature is 0, the cache is searched before the large model service is requested.
>
> The default `post_process_messages_func` is `temperature_softmax`. In this case, refer to the [API reference](https://gptcache.readthedocs.io/en/latest/references/processor.html#module-gptcache.processor.post) to learn how `temperature` affects the output.

```python
import time

from gptcache import cache, Config
from gptcache.manager import manager_factory
from gptcache.embedding import Onnx
from gptcache.processor.post import temperature_softmax
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.adapter import openai

cache.set_openai_key()

onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=temperature_softmax
)
# cache.config = Config(similarity_threshold=0.2)

question = "what's github"

for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=1.0,  # Change temperature here
        messages=[{
            "role": "user",
            "content": question
        }],
    )
    print("Time elapsed:", round(time.time() - start, 3))
    print("Answer:", response["choices"][0]["message"]["content"])
```

To use GPTCache exclusively, only the following lines of code are required, and there is no need to modify any existing code.

from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()

More Docs:

🎓 Bootcamp

😎 What can this help with?

GPTCache offers two primary benefits: it reduces LLM API expenses by serving repeated or similar queries from the cache instead of the paid API, and it improves response speed, since a cache lookup is far faster than waiting on an LLM service handling a large number of requests.

🤔 How does it work?

Online services often exhibit data locality, with users frequently accessing popular or trending content. Cache systems take advantage of this behavior by storing commonly accessed data, which in turn reduces data retrieval time, improves response times, and eases the burden on backend servers. Traditional cache systems typically utilize an exact match between a new query and a cached query to determine if the requested content is available in the cache before fetching the data.

However, using an exact match approach for LLM caches is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this issue, GPTCache adopts alternative strategies like semantic caching. Semantic caching identifies and stores similar or related queries, thereby increasing cache hit probability and enhancing overall caching efficiency.

GPTCache employs embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This process allows GPTCache to identify and retrieve similar or related queries from the cache storage, as illustrated in the Modules section.
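To make this flow concrete, here is a minimal, self-contained sketch of the idea in plain Python: embed the query, run a nearest-neighbor search over the cached queries, and fall back to the LLM on a miss. The `embed` and `call_llm` functions are hypothetical stand-ins, and the brute-force search replaces a real vector store, so this illustrates the mechanism rather than GPTCache's actual implementation.

```python
import math

# Toy semantic cache: embed the query, do a nearest-neighbor search over
# cached queries, and only call the LLM on a miss. embed() and call_llm()
# are hypothetical stand-ins for a real embedding model and a real LLM call.

def embed(text):
    # Character-frequency vector, normalized; purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def call_llm(question):
    return f"(LLM answer for: {question})"

cache_store = []  # list of (embedding, answer) pairs

def ask(question, threshold=0.9):
    query_vec = embed(question)
    # Brute-force similarity search; a real cache would use a vector store here.
    best = max(cache_store, key=lambda item: cosine(query_vec, item[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]           # cache hit: reuse the stored answer
    answer = call_llm(question)  # cache miss: query the LLM and store the result
    cache_store.append((query_vec, answer))
    return answer

print(ask("what's github"))    # miss: goes to the LLM
print(ask("what is github"))   # similar enough to hit the cache
```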

Featuring a modular design, GPTCache makes it easy for users to customize their own semantic cache. The system offers various implementations for each module, and users can even develop their own implementations to suit their specific needs.
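For instance, the components used in the examples above can be swapped independently when initializing the cache. The sketch below assembles an ONNX embedding model, SQLite cache storage, a FAISS vector store, and a distance-based similarity evaluation; switching to other supported backends follows the same pattern.

```python
from gptcache import cache
from gptcache.adapter import openai  # requests made through this adapter check the cache first
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Pick an embedding model, a scalar + vector storage combination,
# and a similarity evaluation strategy; each is an independent module.
onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,                 # embedding module
    data_manager=data_manager,                         # cache storage + vector store
    similarity_evaluation=SearchDistanceEvaluation(),  # similarity evaluation module
)
cache.set_openai_key()
```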

In a semantic cache, you may encounter false positives during cache hits and false negatives during cache misses. GPTCache offers three metrics (hit ratio, latency, and recall) to gauge its performance, which are helpful for developers to optimize their caching systems.

A sample benchmark is included to help users get started with assessing the performance of their semantic cache.
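As a back-of-the-envelope illustration, hit ratio and latency can also be tracked around the example calls above with nothing more than a request log. This is a generic sketch with made-up numbers and a hypothetical log format, separate from the bundled benchmark.

```python
from statistics import mean

# Hypothetical per-request log: (was_cache_hit, latency_in_seconds).
# In practice these would be collected around the adapter calls shown above.
request_log = [
    (True, 0.02),
    (False, 1.85),
    (True, 0.03),
    (False, 2.10),
    (True, 0.02),
]

hit_ratio = sum(1 for hit, _ in request_log if hit) / len(request_log)
avg_hit_latency = mean(latency for hit, latency in request_log if hit)
avg_miss_latency = mean(latency for hit, latency in request_log if not hit)

print(f"hit ratio: {hit_ratio:.2%}")
print(f"avg latency on hits: {avg_hit_latency:.2f}s, on misses: {avg_miss_latency:.2f}s")
```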

🤗 Modules

[Diagram: GPTCache module structure]

😇 Roadmap

Coming soon! Stay tuned!

😍 Contributing

We are extremely open to contributions, be it through new features, enhanced infrastructure, or improved documentation.

For comprehensive instructions on how to contribute, please refer to our contribution guide.