zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0
53.54k stars 7.19k forks

Suggestions for speeding up ingestion? #10

Closed pinballelectronica closed 1 year ago

pinballelectronica commented 1 year ago

I presume I must be doing something wrong, as it is taking hours to ingest a 500 KB text file on an i9-12900 with 128 GB of RAM. In fact, it's not even done yet. I'm using the recommended models.

Help?

Thanks

Some output:

    llama_print_timings:        load time =   674.34 ms
    llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
    llama_print_timings: prompt eval time = 12526.78 ms /   152 tokens (   82.41 ms per token)
    llama_print_timings:        eval time =   157.46 ms /     1 runs   (  157.46 ms per run)
    llama_print_timings:       total time = 12715.48 ms

ayushGHub commented 1 year ago

I am facing the same issue; it has been running continuously for more than half an hour on the text data from state_of_the_union. Did you resolve it?

imartinez commented 1 year ago

It depends on your hardware. On an M1 the state_of_the_union took around 1h for me.

pinballelectronica commented 1 year ago

Should be able to use cuBLAS.
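
Something like this might work with llama-cpp-python built against cuBLAS (just a sketch; the model path and layer count are placeholders):

```python
# Hypothetical sketch: assumes llama-cpp-python was built with cuBLAS, e.g.
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path to your quantized model
    n_gpu_layers=32,   # number of layers to offload to the GPU; tune for your VRAM
    n_ctx=1024,
)
out = llm("Q: Why is ingestion slow? A:", max_tokens=32)
print(out["choices"][0]["text"])
```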

walking-octopus commented 1 year ago

The best thing to speed up ingestion would be to abandon the idea of using LLaMA for embeddings. Just like using full GPT-3 davinci to generate embeddings is costlier and less accurate than BERT, the same applies here.

imartinez commented 1 year ago

Thanks. Any suggestions for an alternative embeddings model?

ottodevs commented 1 year ago

> It depends on your hardware. On an M1 the state_of_the_union took around 1h for me.

It took me exactly ~~52~~ 29 minutes with an Intel Core i7-4790 from 2013 and 16GB DDR3. How is it possible that it takes less time than an M1 released 2-3 years ago?

Edit: I noticed the total time already includes the prompt eval time, so it was actually almost half the time I initially thought.

walking-octopus commented 1 year ago

> Thanks. Any suggestions for an alternative embeddings model?

Some version of BERT, I guess; something large enough and fine-tuned on question answering.

Instructor-XL is also very promising, since it can generate task-specific embeddings with just prompting.

Additionally, splitting text into more separate documents may make it take longer to process or search through. Also, higher-dimensionality embeddings can be more accurate up to a point, but will take longer to generate and compare.

Since people are concerned about multi-lingual support, Multilingual-MiniLM-L12-H384 might be a good choice.
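
As a rough illustration (not this project's actual code), swapping the LLaMA embeddings for a small sentence-transformers model in a LangChain-style setup could look something like this; the model names are just examples of the ones mentioned above:

```python
# Hedged sketch: assumes `langchain` and `sentence-transformers` are installed.
from langchain.embeddings import HuggingFaceEmbeddings

# A small BERT-family model; a multilingual or instruction-tuned model such as
# "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" or "hkunlp/instructor-xl"
# could be substituted depending on the use case.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What did the president say about the economy?")
print(len(vector))  # 384 dimensions for MiniLM-L6, versus 4096 for a 7B LLaMA model
```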

anybam commented 1 year ago

This limitation makes this software useless.

Jeru2023 commented 1 year ago

Ingestion is ridiculously slow. We don't have to use LLaMA embeddings; maybe try another sentence embedding model from Hugging Face.

ishu121992 commented 1 year ago

It's super slow. For reference, I embedded 3500 pages' worth of PDF text in under 1 minute using sentence transformers on my GPU. Using the CPU for embedding is hara-kiri. This model is generating <100 tokens in a few seconds; it will take an eternity to really encode anything.

In case anyone is interested in my version, please use the notebook at the link below: https://github.com/ishu121992/Semantic-Search
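
For reference, batch-encoding on a GPU with sentence transformers is roughly this simple (a sketch, not the notebook itself; the chunks are placeholders):

```python
# Sketch of GPU batch embedding with sentence-transformers (assumed installed).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # use "cpu" if no GPU
chunks = ["first text chunk...", "second text chunk..."]        # placeholder document chunks
embeddings = model.encode(chunks, batch_size=128, show_progress_bar=True)
print(embeddings.shape)  # (number_of_chunks, 384)
```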

ShJavokhir commented 1 year ago

Took about 15 minutes on an i7-13700K.

su77ungr commented 1 year ago

> It's super slow. For reference, I embedded 3500 pages' worth of PDF text in under 1 minute using sentence transformers on my GPU. Using the CPU for embedding is hara-kiri. This model is generating <100 tokens in a few seconds; it will take an eternity to really encode anything.
>
> In case anyone is interested in my version, please use the notebook at the link below: https://github.com/ishu121992/Semantic-Search

You did not get the concept of an air-gapped system, aye. I don't know what you're talking about; just use llama.cpp's convert.py on your ggml model to get a ggjt v1 model. I'm getting real-time responses with my i5-9600K.

ishu121992 commented 1 year ago

> You did not get the concept of an air-gapped system, aye. I don't know what you're talking about; just use llama.cpp's convert.py on your ggml model to get a ggjt v1 model. I'm getting real-time responses with my i5-9600K.

Lemme give it a try.

ishu121992 commented 1 year ago

> You did not get the concept of an air-gapped system, aye. I don't know what you're talking about; just use llama.cpp's convert.py on your ggml model to get a ggjt v1 model. I'm getting real-time responses with my i5-9600K.

The model I am using is ggml-model-q4_0.bin, and it's already converted and 4-bit quantized. Why would you use convert.py again? If you have code, please share.

su77ungr commented 1 year ago

> The model I am using is ggml-model-q4_0.bin, and it's already converted and 4-bit quantized. Why would you use convert.py again? If you have code, please share.

Try running the Vicuna model as the embeddings model too. Also change the parameters to increase the batch size.

I'm using [this model](https://huggingface.co/datasets/dnato/ggjt-v1-vic7b-uncensored-q4_0.bin/resolve/main/ggjt-v1-vic7b-uncensored-q4_0.bin).

And this is my repo
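
Something along these lines, assuming you go through LangChain's llama.cpp embeddings wrapper (paths and values are illustrative, not my exact config):

```python
# Hedged sketch: a llama.cpp (ggjt) model used for embeddings with a larger batch size.
from langchain.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(
    model_path="./models/ggjt-v1-vic7b-uncensored-q4_0.bin",  # placeholder path
    n_ctx=1024,
    n_batch=512,    # larger prompt-eval batches can speed up ingestion
    n_threads=8,    # roughly match your physical core count
)
print(len(embeddings.embed_query("hello world")))  # 4096 dimensions for a 7B model
```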

pinballelectronica commented 1 year ago

Cool concept, but I think at the moment using (local) vector stores and GPT-3 or 3.5 may be more practical (I know, not air-gapped; whatever, write a proxy to fuzz the embeddings if you're worried about security). But you really do need a lot of power to build anything even marginally useful. I have the highest-end (consumer) GPUs, CPUs, and memory money can buy, and I limp along like everyone else. Even 65B can't hold a torch to GPT-3, forget about 4. I'll be a happy boy when they open up the GPT-4 API to more folks.

walking-octopus commented 1 year ago

Just using BERT and perhaps fine-tuning some LLaMA-based models for document-based Q&A and generation would be good enough for basic usage. These models are improving very quickly, not in parameter counts, but in datasets. They are already capable of basic conversational instruction-following, and the entry bar is quite low, given that LoRA makes fine-tuning relatively affordable and ChatGPT can itself generate very high-quality datasets by chatting with itself.

Privacy and local inference are among the only advantages we have that no big-tech company is capable of reproducing yet, and even if they could, that would mean we can do so too.

And about "fuzzing the embeddings": to generate the embeddings or send the relevant documents, you by definition have to send those documents to OpenAI in clear text for processing. You can't have your cake and eat it too with this one.

If you look at enough prior history, you'd see that whenever your whole product is a thin client around a big-tech API, sooner or later, if it's useful, they are just going to do the same thing themselves, except better and more integrated. Congregating around needlessly large models trained on datasets too large to check is bound to result in subtle defects and very visible monopolies.

Let's just give open-access LMs a chance. Until then, you can just replace two lines of code to use GPT-3.5 or 4. This project has many weird and needless inefficiencies, so it's not like it's the gold standard of local document Q&A.

yveshaag commented 1 year ago

I created embeddings with Instructor-XL. Now I get the error "Dimensionality of (4096) does not match index dimensionality (768)". Where do I need to adjust for the new embeddings?
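
I assume the mismatch means the existing vector index was built with a different embedding dimension and has to be rebuilt with the new embeddings. A rough sketch of what that might look like (model name, texts, and directory are placeholders, not my exact setup):

```python
# Rough sketch: rebuild the vector store after switching embedding models, since an index
# built with 4096-dim LLaMA vectors cannot be queried with 768-dim Instructor vectors.
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
db = Chroma.from_texts(
    ["chunk one...", "chunk two..."],     # placeholder: your re-split documents
    embedding=embeddings,
    persist_directory="db_instructor",    # fresh directory so the old index is not reused
)
db.persist()
```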

pseudotensor commented 1 year ago

Even just speeding up file handling is important. E.g., PDF consumption is single-threaded, so you can use joblib etc. See:

https://github.com/h2oai/h2ogpt

https://github.com/h2oai/h2ogpt/blob/main/gpt_langchain.py#L421
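
A minimal sketch of the joblib idea (the loader class and glob pattern are assumptions, not privateGPT's actual ingestion code):

```python
# Hedged sketch: parallelize single-threaded per-file PDF loading across processes.
from glob import glob
from joblib import Parallel, delayed
from langchain.document_loaders import PyPDFLoader  # any per-file loader would do

def load_one(path):
    # Each worker loads one PDF and returns its list of Document chunks.
    return PyPDFLoader(path).load()

pdf_paths = glob("source_documents/*.pdf")
results = Parallel(n_jobs=-1)(delayed(load_one)(p) for p in pdf_paths)
documents = [doc for docs in results for doc in docs]  # flatten the per-file lists
print(f"Loaded {len(documents)} document chunks from {len(pdf_paths)} PDFs")
```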

pinballelectronica commented 1 year ago

I'm not trying to be harsh. We are clearly in a Renaissance Bubble, as I call it. Look at the GitHub stats for anything connected to or forked from HF, Meta, etc. It's insane.

What I meant about fuzzing the embeddings is what I do now: I proxy the traffic both ways and run some pipelines to look for PII, etc. This is just for fun, not for production. Yes, the caveat is that the gist must be there for it to reply correctly. Really it's about censoring bad behavior; in the business environment I'm in, we do it to protect users from themselves. E.g., "GPT, here's a spreadsheet full of PII, sort it for me and list the person that makes the most money." GPT is off limits where I work, as I presume it is at many other places.

Honestly, the gpt4-faiss-langchain-chroma GitHub code works great. I run it on 3.5 because I'm not lucky enough to have the 4.0 API, and honestly I don't think I'd even see much of a difference. I mean, if you step back, it's shockingly good quality for what it requires of your system. Fact is, OpenAI is the teat most of us are sucking from. I think people are getting sick of it fast, and that's going to continue to accelerate the open-sourcing of all of this stuff. If not soon, or already happening... it's coming.

I can't help but respond to your model/complexity point. I cannot agree more. Who needs a Ferrari to buy a gallon of milk at a store two blocks away? We don't need the model to know the history of cats; that's more like showing off just to possess a variety of information. We need purpose-built transformers, ideally segmented enough that you can easily blend the model for your use case. Am I really querying a model with 186 billion parameters to ask it how to install CUDA 12.1 in WSL 2 from some book it ingested? What a waste.

As you suggested, almost all of the innovation happens right here.