weaviate / Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
BSD 3-Clause "New" or "Revised" License

'NoneType' object has no attribute 'tokenize' #53

Closed: micuentadecasa closed this issue 1 week ago

micuentadecasa commented 9 months ago

I'm using Cohere and unstructured, and I'm receiving that error when trying to load a PDF. It works fine with the simple reader, but not with the PDF-specific readers.

This is the log:

```
ℹ Received Data to Import: READER(PDFReader, Documents 1, Type Documentation) CHUNKER (TokenChunker, UNITS 250, OVERLAP 50), EMBEDDER (MiniLMEmbedder)
✔ Loaded ai-03-00057.pdf
✔ Loaded 1 documents
Chunking documents: 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 37.20it/s]
✔ Chunking completed
Vectorizing document chunks:   0%| | 0/1 [00:00<?, ?it/s]
✘ Loading data failed
'NoneType' object has no attribute 'tokenize'
```

Regards.

thomashacker commented 9 months ago

Thanks for the issue! It looks like you're using the SentenceTransformer MiniLM model to embed the chunks. Is that intended? There might be some missing dependencies. Are you running Verba in a fresh Python environment?

micuentadecasa commented 9 months ago

I tried all the possibilities; using MiniLM was just one attempt.

This is the log I got on another try:

```
ℹ Received Data to Import: READER(UnstructuredPDF, Documents 1, Type Documentation) CHUNKER (SentenceChunker, UNITS 3, OVERLAP 2), EMBEDDER (CohereEmbedder)
✔ Loaded xxx.pdf
✔ Loaded 1 documents
Chunking documents: 100%|██████████| 1/1 [00:00<00:00, 28.90it/s]
✔ Chunking completed
ℹ (1/1) Importing document xxxx.pdf with 2 batches
✘ {'errors': {'error': [{'message': 'update vector: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY'}]}, 'status': 'FAILED'}
Importing batches: 100%|██████████| 2/2 [00:03<00:00, 1.80s/it]
✘ Loading data failed Document 09a44f39-fb85-4182-b853-b0990925f7fc not found None
```

It seems it is trying to use the OpenAI vectorizer even though the Cohere embedder is selected.
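
One way to check whether the Weaviate schema is still configured with the OpenAI vectorizer rather than Cohere (a minimal sketch, assuming a local Weaviate instance and the v3 Python client; the URL is an assumption):

```python
import weaviate

# Assumes a local Weaviate instance; adjust the URL to your deployment.
client = weaviate.Client("http://localhost:8080")

schema = client.schema.get()
for cls in schema.get("classes", []):
    # If a document class still reports "text2vec-openai" here, imports will
    # call OpenAI for vectorization regardless of the embedder picked in Verba.
    print(cls["class"], cls.get("vectorizer"))
```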

Regards.

thomashacker commented 9 months ago

Thanks for the insights! I'll look into fixing this 👍

thomashacker commented 9 months ago

We merged some fixes, are you still getting these errors?

f0rmiga commented 8 months ago

I was getting the same error and found out that it was due to:

```
⚠ Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
```

Perhaps adding accelerate as a direct dependency of Verba would be desirable?
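
Until then, a hypothetical early check (not Verba's actual code) could surface the missing dependency as a clear error instead of the opaque `'NoneType'` one:

```python
# Hypothetical guard, not Verba's actual code: fail with a clear message
# if accelerate is missing instead of the opaque tokenizer error.
try:
    import accelerate  # noqa: F401  (imported only to verify availability)
except ImportError as exc:
    raise ImportError(
        "Loading MiniLM with `low_cpu_mem_usage=True` or a `device_map` "
        "requires Accelerate: pip install accelerate"
    ) from exc
```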

rayliuca commented 8 months ago

https://github.com/weaviate/Verba/blob/1c9d4b49385315883ba0027ac1772a8b448f6204/goldenverba/components/embedding/MiniLMEmbedder.py#L26-L42

`device_map` should be of type `str` or `dict`, not `torch.device`:

https://github.com/huggingface/transformers/blob/edb170238febf7fc3e3278ed5b9ca0b2c40c70e3/src/transformers/tools/base.py#L460-L461
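
A minimal sketch of the corresponding fix (assuming the `sentence-transformers/all-MiniLM-L6-v2` checkpoint; the exact model name in Verba may differ):

```python
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint

# Pass a str (e.g. "auto") or a dict, not a torch.device object; transformers
# only accepts those types for device_map, and using it at all requires
# accelerate to be installed (see the warning above).
model = AutoModel.from_pretrained(MODEL, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)
```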

moncefarajdal commented 6 months ago

I was getting the same error when using MiniLMEmbedder on my Mac, which doesn't have a CUDA GPU. So I tried @f0rmiga's solution and updated my code like this:

```python
from accelerate import Accelerator

accelerator = Accelerator()
```

After `self.device = get_device()` I added:

```python
self.device = accelerator.device  # let Accelerate pick the device
```

Now MiniLMEmbedder works fine and the document's chunks are being vectorized.

thomashacker commented 4 months ago

This should be fixed in the new v1.0.0 release!

sbhadana commented 3 months ago

With AdaEmbedder on Azure OpenAI the issue still persists:

```
✘ {'errors': {'error': [{'message': "update vector: unmarshal response body: invalid character '<' looking for beginning of value"}]}, 'status': 'FAILED'}
```

I am using goldenverba version 1.0.1. Also, inside schema_generation.py, `"text2vec-openai": {"deploymentId": ..., "resourceName": ..., "baseURL": ...}` is defined and correct.

thomashacker commented 3 months ago

Which openai version do you have installed?

sbhadana commented 3 months ago

I have version 0.27.9 installed; I also tried 1.30.1 and got the same error.

thomashacker commented 3 months ago

Make sure to use version 0.27.9; I'll take a closer look at the Azure implementation.