zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

[FIX] Error: "Initial token count exceeds token limit" #1941

Open DvdNss opened 4 months ago

DvdNss commented 4 months ago

Hey there,

I've seen several issues reported regarding the error mentioned above, so I wanted to share the fix I found.

SPECS:

* private-gpt version: 0.5.0
* LLM used: Mistral 7B Instruct v0.2 OR Mistral 7B Instruct v0.1
* Store type: Postgres DB

ERROR ENCOUNTERED: When questioning the LLM about very long documents, it returns an empty result along with an error message on the Gradio UI: "Initial token count exceeds token limit."

ROOT CAUSE: From examining the code of both private-gpt and llama_index (in the Poetry cache), it appears that llama_index does not account for sliding-window attention (Mistral used this mechanism in its models last year but dropped it this year). Also note that the memory buffer allocated to your context is sized from the context_window parameter in your settings-xxx.yaml file: if you set context_window to 1000 and then pass a context of 1001 tokens to the buffer, it will fail.

```python
# ~/.cache/pypoetry/virtualenvs/private-gpt-{your-cache-id-here}/lib/python3.11/site-packages/llama_index/core/context.py
# line 79
memory = memory or ChatMemoryBuffer.from_defaults(
    chat_history=chat_history, token_limit=llm.metadata.context_window - 256  # <-- context_window is the value from your settings-xxx.yaml file
)
```
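For illustration, here is a minimal sketch of that failure path outside private-gpt, assuming the llama-index-core `ChatMemoryBuffer` API (`from_defaults` and `get(initial_token_count=...)`); the module path and signature may differ slightly between llama_index versions, and the numbers just mirror the 1000/1001 example above:

```python
# Minimal sketch reproducing the "Initial token count exceeds token limit" error.
# Assumes llama-index-core is installed.
from llama_index.core.memory import ChatMemoryBuffer

context_window = 1000  # what you would set in settings-xxx.yaml
memory = ChatMemoryBuffer.from_defaults(token_limit=context_window - 256)

# The chat engine passes the token count of the prompt/context it has already built;
# if that count alone exceeds token_limit, llama_index gives up before generating anything.
try:
    memory.get(initial_token_count=1001)  # 1001 > 744, so the buffer refuses
except ValueError as err:
    print(err)  # -> Initial token count exceeds token limit
```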

EDIT: Confirmed, sliding-window attention is not supported in llama_index; see https://github.com/ggerganov/llama.cpp/issues/3377

SOLUTION:

* **If you are using Mistral 7B Instruct v0.1:**
  This LLM uses a sliding-window attention mechanism, where the attention window repeats (or 'slides') across the context. According to the paper, this model's sliding-window size is 4096. **However, its actual context window size is 8192** (see the screenshot below). Therefore, the fix is to increase the **context_window** value in your settings-xxx.yaml file to 8192. Note that the theoretical attention span of this model is 131K according to the paper, so you can increase the value further, but generation gets slower and results degrade as the context size grows.

  [screenshot: Mistral 7B Instruct v0.1 configuration showing an 8192-token context window]

* **If you are using Mistral 7B Instruct v0.2 (the default with PGPT 0.5.0 and the local setup):**
  This LLM does not use a sliding-window attention mechanism. In this case, simply increase the **context_window** value in your settings-xxx.yaml file to 32,000 (see https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2; it should be 32,768 = 2^15, but I couldn't find a paper with the exact value, so I went with 32,000).

If you are trying to pass a context larger than the LLM's maximum, I'm afraid the only options are to split the documents or to raise context_window even further; worst case you get unexpected errors, best case the model simply ignores whatever lies beyond its maximum attention span.
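As a rough way to tell which case you are in, you can count a document's tokens before querying it. A hedged sketch using the Hugging Face tokenizer for the model (assumes the `transformers` library is installed and you can download the tokenizer, which may require a Hugging Face account; the file name is just an example):

```python
# Rough pre-check: does this document fit in the prompt budget?
# Swap in whichever tokenizer matches your model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

context_window = 32000  # the value you set in settings-xxx.yaml
reserve = 256           # llama_index holds this back (see the snippet above)

with open("my_long_document.txt", encoding="utf-8") as f:  # example file name
    n_tokens = len(tokenizer.encode(f.read()))

if n_tokens > context_window - reserve:
    print(f"{n_tokens} tokens will not fit: split the document or raise context_window")
else:
    print(f"{n_tokens} tokens should fit")
```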

Hope this helps.

marikan114 commented 3 months ago

Thank you for the simple solution. I also used the following to get an early indication of the new token limit.

https://github.com/zylon-ai/private-gpt/issues/1701#issuecomment-2027469953

anamariaUIC commented 3 months ago

@DvdNss Thank you so much for this post. Can you please let me know whether the max_new_tokens value has to match the context_window value? I'm using Mistral 7B Instruct v0.2.


DvdNss commented 2 months ago

@anamariaUIC AFAIK max_new_tokens defines the max number of tokens in the model's output, so it doesn't have to match the context_window value.
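To make the relationship concrete, here is a small sketch of the budget arithmetic (the numbers are placeholders, and the settings keys assumed are private-gpt's llm.context_window and llm.max_new_tokens; the point is that the prompt and the generated answer share the same context window):

```python
# The context window is shared between everything sent to the model
# (system prompt, retrieved chunks, chat history) and the tokens it generates.
context_window = 8192   # llm.context_window in settings-xxx.yaml
max_new_tokens = 512    # llm.max_new_tokens: upper bound on the answer length

prompt_budget = context_window - max_new_tokens
print(f"Tokens available for prompt + retrieved context + history: {prompt_budget}")
```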

anamariaUIC commented 2 months ago

@DvdNss Thank you so much. Which values would you recommend for max_new_tokens and context_window when querying CSV files?

Right now I have them set to max_new_tokens: 8000 and context_window: 13000.

And I am getting very poor results. Even the most basic questions, like how many rows or columns are in the file, can't be answered, and basic summary-statistics questions come back completely wrong. Any advice?
