weaviate / Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
BSD 3-Clause "New" or "Revised" License

Instruction: How to add BAAI/bge-m3 embedder #128

Open bakongi opened 5 months ago

bakongi commented 5 months ago

Hi everyone. This is a working example of how to add the BAAI/bge-m3 embedder to Verba.

  1. Create a copy of the MiniLMEmbedder.py file in goldenverba/components/embedding and rename it to "BGEM3Embedder.py".
  2. Make changes in the file: rename the MiniLMEmbedder class to BGEM3Embedder, and so on (a sketch of the embedding call that the "..." below elides is shown after the step list):

    from tqdm import tqdm
    from wasabi import msg
    from weaviate import Client

    from goldenverba.components.embedding.interface import Embedder
    from goldenverba.components.reader.document import Document


    class BGEM3Embedder(Embedder):
        """
        BGEM3Embedder for Verba.
        """

        def __init__(self):
            super().__init__()
            self.name = "BGEM3Embedder"
            self.requires_library = ["torch", "transformers"]
            self.description = "Embeds and retrieves objects using SentenceTransformer's BAAI/bge-m3 model"
            self.vectorizer = "BAAI/bge-m3"
            self.model = None
            self.tokenizer = None
            try:
                import torch
                from transformers import AutoModel, AutoTokenizer

                def get_device():
                    if torch.cuda.is_available():
                        return torch.device("cuda")
                    elif torch.backends.mps.is_available():
                        return torch.device("mps")
                    else:
                        return torch.device("cpu")

                self.device = get_device()

                self.model = AutoModel.from_pretrained(
                    "BAAI/bge-m3", device_map=self.device
                )
                self.tokenizer = AutoTokenizer.from_pretrained(
                    "BAAI/bge-m3", device_map=self.device
                )
                self.model = self.model.to(self.device)

    ...

3. In manager.py in goldenverba/components/embedding, make these changes:

    from goldenverba.components.embedding.MiniLMEmbedder import MiniLMEmbedder
    from goldenverba.components.embedding.BGEM3Embedder import BGEM3Embedder
    from goldenverba.components.reader.document import Document

    class EmbeddingManager:
        def __init__(self):
            self.embedders: dict[str, Embedder] = {
                "MiniLMEmbedder": MiniLMEmbedder(),
                "BGEM3Embedder": BGEM3Embedder(),
                "ADAEmbedder": ADAEmbedder(),
                "CohereEmbedder": CohereEmbedder(),
            }

...

4. Make changes in goldenverba/components/schema/schema_generation.py:

    VECTORIZERS = {"text2vec-openai", "text2vec-cohere"}  # Needs to match with Weaviate modules
    EMBEDDINGS = {"MiniLM", "BAAI/bge-m3"}  # Custom Vectors



5. Done! Start Verba!
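For reference, here is a minimal sketch of what the embedding call elided by the "..." in step 2 could look like, assuming the torch/transformers setup loaded in __init__ above. The method name vectorize_texts, the batching, and the 8192-token limit are illustrative and not Verba's actual Embedder interface; BGE models conventionally use the [CLS] hidden state followed by L2 normalization:

    # Illustrative only: a possible method inside BGEM3Embedder, not Verba's actual interface
    def vectorize_texts(self, texts: list[str], batch_size: int = 32) -> list[list[float]]:
        import torch

        embeddings: list[list[float]] = []
        self.model.eval()
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i : i + batch_size]
                inputs = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=8192,
                    return_tensors="pt",
                ).to(self.device)
                outputs = self.model(**inputs)
                # BGE models use the [CLS] token's last hidden state as the sentence embedding
                cls_embeddings = outputs.last_hidden_state[:, 0]
                cls_embeddings = torch.nn.functional.normalize(cls_embeddings, p=2, dim=1)
                embeddings.extend(cls_embeddings.cpu().tolist())
        return embeddings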

P.S.
If you want to use an English-specific model like "BAAI/bge-large-en", just use "BAAI/bge-large-en" instead of "BAAI/bge-m3" and use appropriate names for the files.

thomashacker commented 4 months ago

Great work! We'll look into this for the next update

moncefarajdal commented 3 months ago

@bakongi I've done the same as you but I can't figure out where to choose this custom embedder in the frontend of Verba. Any suggestions please?

bakongi commented 3 months ago

> @bakongi I've done the same as you but I can't figure out where to choose this custom embedder in the frontend of Verba. Any suggestions please?

How did you install Verba - pip or from source?

moncefarajdal commented 3 months ago

@bakongi I installed Verba using pip install goldenverba as shown in the documentation

bakongi commented 3 months ago

> @bakongi I installed Verba using pip install goldenverba as shown in the documentation

OK. Where did you make the changes (folder path)? I think you should make the changes in the Python shared library folder where Verba is installed.

moncefarajdal commented 3 months ago

@bakongi I made the changes exactly in the files that you mentioned. "I think you should make the changes in the Python shared library folder where Verba is installed" - can you please elaborate?

moncefarajdal commented 3 months ago

@bakongi One more thing: the new embedding model that I added doesn't seem to be downloaded from Hugging Face. My guess is that an API key should be configured, or does sentence_transformers do the whole job? Thank you

bakongi commented 3 months ago

> @bakongi One more thing: the new embedding model that I added doesn't seem to be downloaded from Hugging Face. My guess is that an API key should be configured, or does sentence_transformers do the whole job? Thank you

The location of the Python shared library folder where installed libraries are stored depends on your operating system and the environment in which Python is running. Here are the typical locations for different environments:

On Unix-like systems (Linux, macOS): typically /usr/lib/python3.X/site-packages or ~/.local/lib/python3.X/site-packages; inside a virtual environment, <venv>/lib/python3.X/site-packages.

On Windows: typically C:\PythonXY\Lib\site-packages or %APPDATA%\Python\PythonXY\site-packages; inside a virtual environment, <venv>\Lib\site-packages.

Checking the location programmatically:

You can also check the location of installed libraries programmatically using Python:

import site
import sys

# List all site-packages directories
print(site.getsitepackages())

# List user-specific site-packages directory
print(site.getusersitepackages())

# List all paths where Python looks for packages
print(sys.path)

This code will print the paths where Python searches for libraries, including the site-packages directories.
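As a more direct check (purely illustrative, assuming goldenverba is importable in the environment you are inspecting), you can also print where the installed package itself lives:

    # locate the directory of the installed goldenverba package
    import os

    import goldenverba

    print(os.path.dirname(goldenverba.__file__))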

moncefarajdal commented 3 months ago

I see. I've installed Verba with pip install goldenverba in a virtual environment created using python venv, and it's located in the project directory. Is this correct?

bakongi commented 3 months ago

> I see. I've installed Verba with pip install goldenverba in a virtual environment created using python venv, and it's located in the project directory. Is this correct?

When you install a Python package in a virtual environment, the package is installed within the directory structure of the virtual environment itself. This ensures that the package dependencies are isolated from the global Python environment and any other virtual environments you might have.

Here's a typical structure of a virtual environment:

    <project>/
    ├── <venv>/
    │   ├── bin/              # Executables and scripts (Linux/macOS) or Scripts/ (Windows)
    │   ├── lib/              # Libraries (Linux/macOS) or Lib/ (Windows)
    │   │   └── pythonX.Y/
    │   │       └── site-packages/
    │   │           └── goldenverba/
    ├── your_project_files/
    └── ...

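If you are unsure which copy of goldenverba you are actually running, a quick sanity check (illustrative, not part of Verba) is to compare the package location with the active environment's prefix:

    # confirm that goldenverba resolves to the active virtual environment
    import sys

    import goldenverba

    print("goldenverba loaded from:", goldenverba.__file__)
    print("active environment prefix:", sys.prefix)
    print("inside this environment:", goldenverba.__file__.startswith(sys.prefix))
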
moncefarajdal commented 3 months ago

So what should I do in this case for the project to run correctly?

bakongi commented 3 months ago

> So what should I do in this case for the project to run correctly?

Go to \Lib\site-packages\goldenverba and make the necessary changes to the files in the "components" folder and its subfolders,

or, if you downloaded the source files and made your changes there, just run

pip install -e .

in your virtual env.

luc42ei commented 2 months ago

not sure if this is your problem @moncefarajdal, but I think you need to run pip install goldenverba[huggingface]

luc42ei commented 2 months ago

> Hi everyone. This is a working example of how to add the BAAI/bge-m3 embedder to Verba. …

for this to show up in Verba, you also need to adjust goldenverba/components/embedding/manager.py accordingly
