zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
Apache License 2.0

gpt_tokenize: unknown token '' #180

Closed rohanrichards closed 7 months ago

rohanrichards commented 1 year ago

Windows 10, Python 3.10. After ingesting and writing my first prompt, "what can you tell me about the state of the union address", I get the following output, followed by an extremely long wait during which it uses ~30% of CPU and RAM usage keeps increasing:

gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
haris525 commented 1 year ago

Hello, yes, I am getting the same issue. Python 3.10.11, Windows 10 Pro.

In the .env file my model type is MODEL_TYPE=GPT4All

After running the ingest.py file, I run the privateGPT.py script. At the prompt I enter the text "what can you tell me about the state of the union address", and I get the following:

gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''
gpt_tokenize: unknown token ''

Help is appreciated. Thank you

rohanrichards commented 1 year ago

To be clear, I do eventually get output; it's just taking an extremely long time. I'm thinking these messages are actually just warnings and it's working as intended, albeit extremely slowly for some reason.

haris525 commented 1 year ago

> To be clear, I do eventually get output; it's just taking an extremely long time. I'm thinking these messages are actually just warnings and it's working as intended, albeit extremely slowly for some reason.

Let me give it more time. I have been waiting 10 minutes and nothing happens after those warning messages. I will wait a bit longer and see what happens.

haris525 commented 1 year ago

You are right, I get the response, but it is very slow and uses around 18-19 GB of memory. Running another query gives me this memory-related error:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 8264657744, available 8257513008)
Process finished with exit code -1073741819 (0xC0000005)
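
To put the numbers in that error in perspective, here is a small sketch that works out how far short the memory pool fell and shows that the two exit codes printed are the same value. It only does arithmetic on the figures quoted above, and the variable names are made up for illustration. The pool came up under 7 MiB short of a request of roughly 7.7 GiB, and 0xC0000005 is the Windows status code for an access violation.

```python
needed = 8264657744      # bytes requested, copied from the error message
available = 8257513008   # bytes left in the pool, copied from the error message

shortfall = needed - available
print(shortfall)                    # 7144736
print(round(shortfall / 2**20, 2))  # shortfall in MiB
print(round(needed / 2**30, 2))     # request size in GiB

# The exit code is printed as a signed 32-bit integer...
exit_code = -1073741819
# ...which has the same bit pattern as the unsigned hex value next to it.
as_unsigned = exit_code & 0xFFFFFFFF
print(hex(as_unsigned))             # 0xc0000005
```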

petragom commented 1 year ago

I have a very similar issue, but I get: gpt_tokenize: unknown token '?' (that just keeps repeating).

The first time I did use a question mark, so I exited and tried again without it; same error.

Intel iMac, Python 3. Tried with the provided State of the Union text, so not my own file.

ernestp commented 1 year ago

Here is my output on Windows with default data

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'

loganrussell48 commented 1 year ago

I am getting similar results to @ernestp

Enter a query: who gave the state of the union address speech in 2023?
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
gpt_tokenize: unknown token 'Γ'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
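
A note on the characters in the two outputs above: the triplets 'Ô' 'Ç' 'Ö' and 'Γ' 'Ç' 'Ö' are exactly what the three UTF-8 bytes of the right single quotation mark (U+2019) look like when they are decoded as code page 850 or code page 437. The sketch below checks that. The further guess, that the tokenizer is reporting a curly apostrophe from the ingested text one byte at a time and the console is rendering each byte in its own code page, is an assumption and is not confirmed anywhere in this thread.

```python
# The UTF-8 bytes of U+2019 (right single quotation mark).
apostrophe = "\u2019"
raw = apostrophe.encode("utf-8")  # b'\xe2\x80\x99'

# Decode the same three bytes with two legacy console code pages.
as_cp850 = raw.decode("cp850")
as_cp437 = raw.decode("cp437")

print(list(as_cp850))  # ['Ô', 'Ç', 'Ö']
print(list(as_cp437))  # ['Γ', 'Ç', 'Ö']
```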
prbrody commented 1 year ago

Same issue... has anyone managed to fix it?

maneaionut0 commented 1 year ago

Same issue. After 5-10 min I get the response for a simple query.

PC: AMD Ryzen 5 5600X, 32GB RAM, GPU: Nvidia GTX 1060 6GB

PulpCattel commented 1 year ago

This seems the same as https://github.com/imartinez/privateGPT/issues/13 and https://github.com/imartinez/privateGPT/issues/107

I think it would be better to keep only one issue open, otherwise it just makes it harder to debug with info spread all over multiple issues.

wiwomu commented 1 year ago

Has a root cause or solution been found? Just tried it last night and I get the string of '?' Unknown symbol errors. :(

maozdemir commented 1 year ago

> Has a root cause or solution been found? Just tried it last night and I get the string of '?' Unknown symbol errors. :(

You can ignore them. There will be output.

oldbuilding commented 1 year ago

The process ends with "Killed" every time.

type GPT4All
path ggml-gpt4all-j-v1.3-groovy.bin
ctx 1000
Windows 10

python --version
Python 3.10.6

which python
~/miniconda3/envs/privateGPT/bin/python

pip list

Package Version
aiohttp 3.8.4
aiosignal 1.3.1
alacritty-colorscheme 1.0.1
anyio 3.6.2
argilla 1.7.0
async-timeout 4.0.2
asyncio 3.4.3
attrs 23.1.0
backoff 2.2.1
beautifulsoup4 4.12.2
bracex 2.3.post1
certifi 2023.5.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 3.1.0
chromadb 0.3.23
click 8.1.3
clickhouse-connect 0.5.25
cmake 3.26.3
colorclass 2.2.2
commonmark 0.9.1
compressed-rtf 1.0.6
cryptography 40.0.2
dataclasses-json 0.5.7
Deprecated 1.2.13
duckdb 0.8.0
easygui 0.98.3
ebcdic 1.1.1
et-xmlfile 1.1.0
extract-msg 0.41.1
fastapi 0.95.2
filelock 3.12.0
frozenlist 1.3.3
fsspec 2023.5.0
ghp-import 2.1.0
gpt4all 0.2.3
greenlet 1.1.3.post0
h11 0.14.0
hnswlib 0.7.0
httpcore 0.16.3
httptools 0.5.0
httpx 0.23.3
huggingface-hub 0.14.1
idna 3.4
IMAPClient 2.3.1
Jinja2 3.1.2
joblib 1.2.0
jq 1.4.1
langchain 0.0.171
lark-parser 0.12.0
lit 16.0.5
llama-cpp-python 0.1.49
lxml 4.9.2
lz4 4.3.2
Markdown 3.3.7
MarkupSafe 2.1.2
marshmallow 3.19.0
marshmallow-enum 1.5.1
mergedeep 1.3.4
monotonic 1.6
mpmath 1.3.0
msg-parser 1.2.0
msgpack 1.0.4
msoffcrypto-tool 5.0.1
multidict 6.0.4
mypy-extensions 0.4.3
natsort 8.1.0
networkx 3.1
nltk 3.8.1
numexpr 2.8.4
numpy 1.23.5
nvidia-cublas-cu11
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11
nvidia-cufft-cu11
nvidia-curand-cu11
nvidia-cusolver-cu11
nvidia-cusparse-cu11
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
olefile 0.46
oletools 0.60.1
openapi-schema-pydantic 1.2.4
openpyxl 3.1.2
packaging 23.1
pandas 1.5.3
pandoc 2.3
pcodedmp 1.2.6
pdfminer.six 20221105
Pillow 9.5.0
pip 23.0.1
plumbum 1.8.1
ply 3.11
posthog 3.0.1
pycparser 2.21
pydantic 1.10.8
Pygments 2.12.0
pygpt4all 1.1.0
pygptj 2.0.3
pyllamacpp 2.3.0
pymdown-extensions 9.5
pynvim 0.4.3
pypandoc 1.11
pyparsing 2.4.7
python-dateutil 2.8.2
python-docx 0.8.11
python-dotenv 1.0.0
python-magic 0.4.27
python-pptx 0.6.21
python-slugify 6.1.2
pytz 2023.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0
pyyaml_env_tag 0.1
red-black-tree-mod 1.20
regex 2023.5.5
requests 2.31.0
rfc3986 1.5.0
rich 13.0.1
RTFDE 0.0.2
ruamel.yaml 0.16.13
scikit-learn 1.2.2
scipy 1.10.1
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 66.0.0
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
SQLAlchemy 2.0.15
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.2
termcolor 1.1.0
text-unidecode 1.3
threadpoolctl 3.1.0
tokenizers 0.13.3
torch 2.0.1
torchvision 0.15.2
tqdm 4.65.0
transformers 4.29.2
triton 2.0.0
typed-argument-parser 1.7.2
typer 0.9.0
typing_extensions 4.6.1
typing-inspect 0.8.0
tzdata 2023.3
tzlocal 4.2
unstructured 0.6.6
urllib3 2.0.2
uvicorn 0.22.0
uvloop 0.17.0
watchdog 2.1.9
watchfiles 0.19.0
wcmatch 8.4
websockets 11.0.3
wheel 0.38.4
wrapt 1.14.1
XlsxWriter 3.1.1
yarl 1.9.2
zstandard 0.21.0

conda list

packages in environment at ~/miniconda3/envs/privateGPT:

Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
anyio 3.6.2 pypi_0 pypi
argilla 1.7.0 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
backoff 2.2.1 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0
certifi 2023.5.7 pypi_0 pypi
cffi 1.15.1 pypi_0 pypi
chardet 5.1.0 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
chromadb 0.3.23 pypi_0 pypi
click 8.1.3 pypi_0 pypi
clickhouse-connect 0.5.25 pypi_0 pypi
cmake 3.26.3 pypi_0 pypi
colorclass 2.2.2 pypi_0 pypi
commonmark 0.9.1 pypi_0 pypi
compressed-rtf 1.0.6 pypi_0 pypi
cryptography 40.0.2 pypi_0 pypi
dataclasses-json 0.5.7 pypi_0 pypi
deprecated 1.2.13 pypi_0 pypi
duckdb 0.8.0 pypi_0 pypi
easygui 0.98.3 pypi_0 pypi
ebcdic 1.1.1 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
extract-msg 0.41.1 pypi_0 pypi
fastapi 0.95.2 pypi_0 pypi
filelock 3.12.0 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.5.0 pypi_0 pypi
gpt4all 0.2.3 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
hnswlib 0.7.0 pypi_0 pypi
httpcore 0.16.3 pypi_0 pypi
httptools 0.5.0 pypi_0 pypi
httpx 0.23.3 pypi_0 pypi
huggingface-hub 0.14.1 pypi_0 pypi
idna 3.4 pypi_0 pypi
imapclient 2.3.1 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
jq 1.4.1 pypi_0 pypi
langchain 0.0.171 pypi_0 pypi
lark-parser 0.12.0 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
lit 16.0.5 pypi_0 pypi
llama-cpp-python 0.1.49 pypi_0 pypi
lxml 4.9.2 pypi_0 pypi
lz4 4.3.2 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
marshmallow 3.19.0 pypi_0 pypi
marshmallow-enum 1.5.1 pypi_0 pypi
monotonic 1.6 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msg-parser 1.2.0 pypi_0 pypi
msoffcrypto-tool 5.0.1 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
numexpr 2.8.4 pypi_0 pypi
numpy 1.23.5 pypi_0 pypi
nvidia-cublas-cu11 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 pypi_0 pypi
nvidia-cufft-cu11 pypi_0 pypi
nvidia-curand-cu11 pypi_0 pypi
nvidia-cusolver-cu11 pypi_0 pypi
nvidia-cusparse-cu11 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
olefile 0.46 pypi_0 pypi
oletools 0.60.1 pypi_0 pypi
openapi-schema-pydantic 1.2.4 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 1.1.1t h7f8727e_0
packaging 23.1 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
pandoc 2.3 pypi_0 pypi
pcodedmp 1.2.6 pypi_0 pypi
pdfminer-six 20221105 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.0.1 py310h06a4308_0
plumbum 1.8.1 pypi_0 pypi
ply 3.11 pypi_0 pypi
posthog 3.0.1 pypi_0 pypi
pycparser 2.21 pypi_0 pypi
pydantic 1.10.8 pypi_0 pypi
pygpt4all 1.1.0 pypi_0 pypi
pygptj 2.0.3 pypi_0 pypi
pyllamacpp 2.3.0 pypi_0 pypi
pypandoc 1.11 pypi_0 pypi
pyparsing 2.4.7 pypi_0 pypi
python 3.10.11 h7a1cb2a_2
python-docx 0.8.11 pypi_0 pypi
python-dotenv 1.0.0 pypi_0 pypi
python-magic 0.4.27 pypi_0 pypi
python-pptx 0.6.21 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pytz-deprecation-shim 0.1.0.post0 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0
red-black-tree-mod 1.20 pypi_0 pypi
regex 2023.5.5 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
rich 13.0.1 pypi_0 pypi
rtfde 0.0.2 pypi_0 pypi
scikit-learn 1.2.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
sentence-transformers 2.2.2 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 66.0.0 py310h06a4308_0
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.4.1 pypi_0 pypi
sqlalchemy 2.0.15 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
starlette 0.27.0 pypi_0 pypi
sympy 1.12 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tenacity 8.2.2 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.3 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
torchvision 0.15.2 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
transformers 4.29.2 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
typer 0.9.0 pypi_0 pypi
typing-extensions 4.6.1 pypi_0 pypi
tzdata 2023.3 pypi_0 pypi
tzlocal 4.2 pypi_0 pypi
unstructured 0.6.6 pypi_0 pypi
urllib3 2.0.2 pypi_0 pypi
uvicorn 0.22.0 pypi_0 pypi
uvloop 0.17.0 pypi_0 pypi
watchfiles 0.19.0 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
wheel 0.38.4 py310h06a4308_0
wrapt 1.14.1 pypi_0 pypi
xlsxwriter 3.1.1 pypi_0 pypi
xz 5.4.2 h5eee18b_0
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_0
zstandard 0.21.0 pypi_0 pypi

Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: Is this a test?
gpt_tokenize: unknown token '�'
gpt_tokenize: unknown token '�'
...
gpt_tokenize: unknown token '�'
Killed
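
For anyone trying to estimate how much RAM this needs, the figures in the gptj_model_load lines above can be cross-checked against each other. The sketch below reproduces n_mem and memory_size from the printed hyperparameters, assuming (the log does not say this) that memory_size is a key cache plus a value cache at 2 bytes per element. Under that assumption the model size plus the cache lands within 0.07 MB of the reported ggml ctx size, so roughly 4.5 GB is claimed at load time alone. Note that the log reports n_ctx = 2048 even though the settings listed above say ctx 1000. Whether running out of memory is what causes "Killed" here is a guess.

```python
# Hyperparameters copied from the gptj_model_load lines in the log.
n_ctx = 2048
n_embd = 4096
n_layer = 28

# One cache slot per layer per context position.
n_mem = n_layer * n_ctx
print(n_mem)  # 57344

# Assumption: one key cache and one value cache, 2 bytes per element.
bytes_per_element = 2
memory_bytes = 2 * n_mem * n_embd * bytes_per_element
memory_mb = memory_bytes / 2**20
print(memory_mb)  # 896.0

# Compare weights plus cache with the reported ggml ctx size.
model_size_mb = 3609.38  # "model size" from the log
ctx_size_mb = 4505.45    # "ggml ctx size" from the log
gap_mb = ctx_size_mb - (model_size_mb + memory_mb)
print(round(gap_mb, 2))  # 0.07
```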