xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

how to do embeddings? #203

Closed putuoka closed 11 months ago

putuoka commented 11 months ago

I want to create an AI assistant for my personal website using Node.js. While I can easily create it using OpenAI embeddings, their API costs are prohibitively expensive. Therefore, I am looking for an alternative method and wondering how I can perform embeddings using a CSV file. Can you advise me on how to do this?


import { Configuration, OpenAIApi } from "openai";

// OpenAI Node SDK (v3) client; expects OPENAI_API_KEY in the environment
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

async function getEmbeddings(tokens) {
  console.log("start getEmbeddings");

  let response;
  try {
    console.log("initiating openai api call");
    response = await openai.createEmbedding({
      model: "text-embedding-ada-002",
      input: tokens,
    });
  } catch (e) {
    console.error("Error calling OpenAI API getEmbeddings:", e?.response?.data);
    throw new Error("Error calling OpenAI API getEmbeddings");
  }

  return response.data.data;
}
putuoka commented 11 months ago

Error


let inputs = await model.tokenizer(texts);
                             ^

TypeError: model.tokenizer is not a function
// embeddings.js

import { Pipeline } from "@xenova/transformers";

const texts = [
    "Text 1",
    "Text 2",
    "Text 3"
];

// Function to generate embeddings
async function generateEmbeddings(model, texts) {

    // Attempted fix: call the tokenizer from the pipeline
    let inputs = await model.tokenizer(texts);

    return await model(inputs);

}

// Embedder class
class Embedder {

    async init() {
        this.model = await new Pipeline("tokenizer", "embeddings", "Xenova/all-MiniLM-L6-v2");
    }

    async embed(texts) {
        return await generateEmbeddings(this.model, texts);
    }

}

// Main program
async function run() {

    const embedder = new Embedder();
    await embedder.init();

    const embeddings = await embedder.embed(texts);

    console.log(embeddings);

}

run();
putuoka commented 11 months ago
throw Error(`Unsupported pipeline: ${task}. Must be one of [${Object.keys(SUPPORTED_TASKS)}]`)
              ^

Error: Unsupported pipeline: text-embedding. Must be one of [text-classification,token-classification,question-answering,fill-mask,summarization,translation,text2text-generation,text-generation,zero-shot-classification,automatic-speech-recognition,image-to-text,image-classification,image-segmentation,zero-shot-image-classification,object-detection,feature-extraction]
    at pipeline
import { Document } from 'langchain/document';
import { Embeddings } from "langchain/embeddings/base";
import { pipeline } from '@xenova/transformers';

export class MiniLMEmbeddings extends Embeddings {

    constructor() {
        super({});
        this.model = null;
    }

    async initModel() {
        this.model = await pipeline("text-embedding", "Xenova/all-MiniLM-L6-v2");
    }

    async embedDocuments(texts) {
        await this.initModel();

        const embeddings = [];

        for (const text of texts) {
            const doc = new Document({
                pageContent: text
            });

            const embedding = await this.embedQuery(text);
            embeddings.push(embedding);
        }

        return embeddings;
    }

    async embedQuery(text) {
        return this.model(text);
    }

}

export const embedder = new MiniLMEmbeddings();

const texts = [
    "Hello world",
    "How are you today?"
];

(async () => {

    console.log('Loading model...');

    const embeddings = await embedder.embedDocuments(texts);

    console.log('Embeddings:', embeddings);

})();
xenova commented 11 months ago

Hi there, please refer to this section of the documentation for instructions on how to use transformers.js for embeddings.

Here is some example code:

import { pipeline } from '@xenova/transformers';

let extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
let result = await extractor('This is a simple test.', { pooling: 'mean', normalize: true });
console.log(result);
// Tensor {
//     type: 'float32',
//     data: Float32Array [0.09094982594251633, -0.014774246141314507, ...],
//     dims: [1, 384]
// }
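
If you need a plain JavaScript array (e.g. to store the vector alongside your own data), here is a minimal sketch converting the Tensor from the snippet above (result.data is the Float32Array shown in the output):

// Convert the 384-dimensional embedding to a regular array
let embedding = Array.from(result.data);
console.log(embedding.length); // 384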
putuoka commented 11 months ago

> Hi there, please refer to this section of the documentation for instructions on how to use transformers.js for embeddings.
>
> Here is some example code:
>
> import { pipeline } from '@xenova/transformers';
>
> let extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
> let result = await extractor('This is a simple test.', { pooling: 'mean', normalize: true });
> console.log(result);
> // Tensor {
> //     type: 'float32',
> //     data: Float32Array [0.09094982594251633, -0.014774246141314507, ...],
> //     dims: [1, 384]
> // }

Thank you very much. This is very helpful for me. I need a question-answering pipeline. Apparently the model also affects it. I thought I could use any model lol. Do you know any question-answering models that support multiple languages? Or is the embedding technique better than Q&A? I want to make an AI Assistant for my website and the AI should only be able to answer questions related to the website's content.

Before I found your repo, my initial idea was to use cosine similarity to find related corpus data from the web, then send that text together with the question to OpenAI to get a natural answer. That way it only plays with the prompt, because OpenAI embeddings consume a huge number of expensive tokens.

xenova commented 11 months ago

> Apparently the model also affects it. I thought I could use any model lol.

That's right! Check out the MTEB leaderboard to help you choose. The models which perform well for question-answering (based on embeddings) are specifically trained to map queries and answers to the same points in the latent space, so that similarity checks between the embeddings can retrieve the right answers. For example, the multilingual-e5-* models use special "query:" and "passage:" prefixes to guide this process. See here for more information about this.
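
A minimal sketch of how those prefixes might be used with the feature-extraction pipeline (the prefix strings come from the e5 model cards; the example query/passage text is illustrative):

import { pipeline, cos_sim } from '@xenova/transformers';

let extractor = await pipeline('feature-extraction', 'Xenova/multilingual-e5-base');

// e5 models expect "query: " and "passage: " prefixes on their inputs
let output = await extractor([
    'query: how much protein should a female eat',
    "passage: As a general guideline, the CDC's average protein requirement for women ages 19 to 70 is 46 grams per day.",
], { pooling: 'mean', normalize: true });

// Compare the two embeddings with cosine similarity
let [queryEmbedding, passageEmbedding] = output;
console.log(cos_sim(queryEmbedding.data, passageEmbedding.data));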

> Do you know any question-answering models that support multiple languages?

Here are some I have already converted: https://huggingface.co/models?library=transformers.js&sort=trending&search=multilingual-e5, but feel free to search the HF Hub for other models (which you will need to convert yourself).

> Or is the embedding technique better than Q&A?

It depends on what types of answers you want to give. The embedding technique is useful if you already have the list of available answers (the user can then input any text, and the closest relevant answer is selected). If you have a single passage you want to find the answer in, you can use a question-answering pipeline, but this will only return spans of the text as the answer. See this demo for an example.
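
For reference, a minimal sketch of the question-answering pipeline (adapted from the docs; the score shown is illustrative):

import { pipeline } from '@xenova/transformers';

let answerer = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad');
let output = await answerer('Who was Jim Henson?', 'Jim Henson was a nice puppet.');
console.log(output);
// { answer: 'a nice puppet', score: 0.57... }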

The other option, as you've already mentioned, is to use a large language model: feed in the context and the question, and let the model synthesize a response.

I would also recommend using the unquantized versions of these models.
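
A minimal sketch of opting out of quantization, assuming the quantized option at pipeline creation:

import { pipeline } from '@xenova/transformers';

// Load the full-precision weights instead of the default quantized ones
let extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { quantized: false });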

putuoka commented 11 months ago

I tried using the multilingual Q&A model above and got an error. Please help

 for (let j = 0; j < output.start_logits.dims[0]; ++j) {
                                                ^
TypeError: Cannot read properties of undefined (reading 'dims')

Here is the code:


import { pipeline } from '@xenova/transformers';

let question = 'apa nama lain amazon dalam bahasa inggris?';
let context = `The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.`

let answerer = await pipeline('question-answering', 'Xenova/multilingual-e5-base');
let outputs = await answerer(question, context);
console.log(outputs);
xenova commented 11 months ago

That is because Xenova/multilingual-e5-base is NOT a question-answering model (it is a feature-extraction model): it does not have a final layer that maps each token to start/end span scores.


See here for the list of pre-converted QA models: https://huggingface.co/models?pipeline_tag=question-answering&library=transformers.js&sort=trending

ashgansh commented 11 months ago

hi @xenova

first great job on the library 🔥

Context

I'm currently experimenting with generating embeddings in the browser to perform sentence similarity.

I have modified your example app and included a sentence-similarity example here.

Preview

[screenshot of the sentence-similarity demo]

Sample output of demo

// sample output
{"sentence":"You're going woof woof","score":0.2039476412419157},
{"sentence":"There's room to explore","score":0.15026012732862984},
{"sentence":"I always get there first","score":null},
{"sentence":"You're my favorite, Lion.","score":0.24164886611839995},

Code Snippet used in demo

import { pipeline, cos_sim } from '@xenova/transformers';

async function getSentenceEmbeddings(sentences) {
    const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    return Promise.all(sentences.map(sentence => extractor(sentence)));
}

async function sentence_similarity(data) {
    const { query, sentences } = data;

    const queryVector = await getSentenceEmbeddings([query]);
    const sentencesVectors = await getSentenceEmbeddings(sentences);

    const scores = sentencesVectors.map((vector, index) => {
        const simScore = cos_sim(queryVector[0].data, vector.data)
        return {
            sentence: sentences[index],
            score: simScore
        };
    });
    console.log(scores)

    self.postMessage({
        type: 'complete',
        target: data.elementIdToUpdate,
        data: scores.sort((a, b) => b.score - a.score)
    });

}

Question

Is feature-extraction a good way to recreate a similar behavior to the one described by OP?

In a perfect world I'd love to recreate an output that mimics the output of the following (albeit with diff vector size)

openai.createEmbedding({
      model: "text-embedding-ada-002",
      input: tokens,
    });

OpenAI embeddings return vectors of fixed length and my current implementation does not, which makes it hard to do cos_sim and returns null in certain cases.

xenova commented 11 months ago

Hi there 👋

> Is feature-extraction a good way to recreate a similar behavior to the one described by OP?

It certainly is 👍 Any sentence-similarity functionality implemented by libraries like sentence-transformers does feature-extraction behind the scenes.

> OpenAI embeddings return vectors of fixed length and my current implementation does not

Yes, because you are missing the "pooling and normalization" layer at the end of the feature-extraction. You can include it as follows:

let result = await extractor('This is a simple test.', { pooling: 'mean', normalize: true });

or in your case,

return Promise.all(sentences.map(sentence => extractor(sentence, { pooling: 'mean', normalize: true })));

On that note, you should be able to pass the array of sentences in one go as follows:

return await extractor(sentences, { pooling: 'mean', normalize: true })

Another problem I can see with your code is that you create the pipeline every time you call getSentenceEmbeddings. You should rather create a single pipeline which is reused throughout your program.
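
One common pattern is a lazy singleton that creates the pipeline on first use (a sketch; the names here are illustrative):

import { pipeline } from '@xenova/transformers';

let extractorPromise = null;
function getExtractor() {
    if (extractorPromise === null) {
        // Create the pipeline once; later calls reuse the same promise
        extractorPromise = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    }
    return extractorPromise;
}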

ashgansh commented 11 months ago

@xenova thanks, that's really helpful. I will try that out (and as for the pipeline, I separated it just for now; I'll create a reusable one later :) )

ashgansh commented 11 months ago

Alright, got it to work and refactored the pipeline implementation 🐝

async function sentence_similarity(data) {
    const { query, sentences } = data;
    let extractor = await SentenceSimilarityPipelineFactory.getInstance();
    const vectors = await extractor([query, ...sentences], { pooling: 'mean', normalize: true });
    const [queryVector, ...sentencesVectors] = vectors;

    const scores = sentencesVectors.map((vector, index) => {
        const simScore = cos_sim(queryVector.data, vector.data)
        return {
            sentence: sentences[index],
            score: simScore
        };
    });

    self.postMessage({
        type: 'complete',
        target: data.elementIdToUpdate,
        data: scores.sort((a, b) => b.score - a.score)
    });

}

class SentenceSimilarityPipelineFactory extends PipelineFactory {
    static task = 'feature-extraction';
    static model = 'Xenova/all-MiniLM-L6-v2';
}
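
For context, the PipelineFactory base class from the example app is roughly a lazy singleton along these lines (a sketch, not the exact demo code):

import { pipeline } from '@xenova/transformers';

class PipelineFactory {
    static task = null;
    static model = null;
    static instance = null;

    static async getInstance(progress_callback = null) {
        // Create the pipeline for this subclass's task/model once, then reuse it
        if (this.instance === null) {
            this.instance = pipeline(this.task, this.model, { progress_callback });
        }
        return this.instance;
    }
}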

Happy to do a PR with the sentence-similarity example inside the demo if that helps you in any way.

xenova commented 11 months ago

> Alright, got it to work and refactored the pipeline implementation 🐝

Nice! 🚀

Currently, the demo is meant for existing pipelines, and since there is no sentence-similarity pipeline in the Python library, I haven't added one here (yet, maybe? 👀). So, don't worry about that :)

Utopiah commented 1 week ago

It's a detail, but I find @ashgansh's example clearer than the existing documentation, simply because it uses cos_sim() on the result. It's obvious once you've done it once, but since cos_sim is also part of transformers.js (cf. utils/maths.js) and this is such a typical use case, I believe newcomers to the library would benefit from seeing it.
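
For newcomers landing here, a minimal end-to-end sketch of that pattern (cos_sim is exported from @xenova/transformers; the example sentences and score are illustrative):

import { pipeline, cos_sim } from '@xenova/transformers';

let extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Embed two sentences in one batch, then compare them
let [a, b] = await extractor(
    ['I love dogs', 'I adore puppies'],
    { pooling: 'mean', normalize: true }
);
console.log(cos_sim(a.data, b.data)); // e.g. ~0.8 for closely related sentences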