whitphx / transformers.js.py

Sample code request #214

Closed · mostaphaRoudsari closed this 4 months ago

mostaphaRoudsari commented 4 months ago

Hi @whitphx, this is not really an issue! I didn't know where else to post this question, so I'm creating an issue for it.

I'm trying to translate this code to transformers.js.py so that I can run it with Pyodide, but I'm struggling to map the code from the sentence_transformers library.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(programs):
    return [model.encode(p, convert_to_tensor=True) for p in programs]

def calculate_similarities(room_name, _programs_embedding):
    input_embedding = model.encode(room_name, convert_to_tensor=True)
    ranking = {
        count: util.pytorch_cos_sim(input_embedding, program)[0][0]
        for count, program in enumerate(_programs_embedding)
    }
    data = sorted(ranking.items(), key=lambda item: item[1], reverse=True)
    return data[0]

if __name__ == '__main__':
    programs = ['Bedroom', 'Bathroom']
    pe = get_embedding(programs)
    index, uncertainty = calculate_similarities('Men', pe)
    assert programs[index] == 'Bathroom'

Two questions:

  1. Is it possible? Are sentence_transformers models also accessible through this library?
  2. If the answer is yes, is there a sample file that I can use to learn from?

Thanks.

whitphx commented 4 months ago

This library, transformers.js.py, doesn't support SentenceTransformer directly: it only proxies function calls from Python to transformers.js, and SentenceTransformer is built on top of transformers, not transformers.js. However, since SentenceTransformer uses transformers internally, you can recreate its encode() with transformers.js.py by referring to the original implementation.

Here is a sample I created by referring to https://github.com/UKPLab/sentence-transformers/blob/c0fc0e8238f7f48a1e92dc90f6f96c86f69f1e02/sentence_transformers/SentenceTransformer.py#L405, though it doesn't fully implement the original method.

import scipy.spatial  # import the subpackage explicitly so scipy.spatial.distance.cosine is available

from transformers_js_py import AutoModel, AutoTokenizer

# Re-implement encode() using `transformers.js.py`
model_name = "sentence-transformers/all-MiniLM-L6-v2"
options = {
    "quantized": False
}

tokenizer = await AutoTokenizer.from_pretrained(model_name, options)
model = await AutoModel.from_pretrained(model_name, options)

async def encode(sentences, output_value="sentence_embedding"):
    model_inputs = tokenizer(sentences)
    embeddings = await model(**model_inputs)
    output = embeddings[output_value].to_numpy()
    output = output[0]  # Make it 1-D
    return output

# Modify its callers as well:
async def get_embedding(programs):
    return [await encode(p) for p in programs]

async def calculate_similarities(room_name, _programs_embedding):
    input_embedding = await encode(room_name)
    ranking = {
        count: 1 - scipy.spatial.distance.cosine(input_embedding, program)
        for count, program in enumerate(_programs_embedding)
    }
    data = sorted(ranking.items(), key=lambda item: item[1], reverse=True)
    return data[0]

programs = ['I love Transformers', 'It was raining yesterday']
pe = await get_embedding(programs)
index, uncertainty = await calculate_similarities('What was the weather?', pe)

assert programs[index] == 'It was raining yesterday'
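
For reference, the original encode() does more than this sample: among other things, it applies attention-mask-weighted mean pooling over the token embeddings (when the model returns raw token states rather than an already-pooled vector) and L2-normalizes the result. A minimal sketch of that pooling step, assuming the tokenizer output tensors also expose to_numpy() and that this model export returns last_hidden_state (both are assumptions, not verified):

import numpy as np

async def encode_pooled(sentence):
    # Hypothetical helper (not part of transformers.js.py): mean-pool the
    # token embeddings with the attention mask, then L2-normalize, mirroring
    # the Pooling and Normalize modules of the original model.
    model_inputs = tokenizer(sentence)
    output = await model(**model_inputs)
    token_embeddings = output["last_hidden_state"].to_numpy()[0]  # (tokens, dim)
    mask = model_inputs["attention_mask"].to_numpy()[0]           # (tokens,)
    pooled = (token_embeddings * mask[:, None]).sum(axis=0) / mask.sum()
    return pooled / np.linalg.norm(pooled)  # unit length, so cosine == dot product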

You can test it here: https://edit.share.stlite.net/#!ChBzdHJlYW1saXRfYXBwLnB5Eq0KChBzdHJlYW1saXRfYXBwLnB5EpgKCpUKaW1wb3J0IHNjaXB5Cgpmcm9tIHRyYW5zZm9ybWVyc19qc19weSBpbXBvcnQgcGlwZWxpbmUsIEF1dG9Nb2RlbCwgQXV0b1Rva2VuaXplcgoKCm1vZGVsX25hbWUgPSAic2VudGVuY2UtdHJhbnNmb3JtZXJzL2FsbC1NaW5pTE0tTDYtdjIiCm9wdGlvbnMgPSB7CiAgICAicXVhbnRpemVkIjogRmFsc2UKfQoKdG9rZW5pemVyID0gYXdhaXQgQXV0b1Rva2VuaXplci5mcm9tX3ByZXRyYWluZWQobW9kZWxfbmFtZSwgb3B0aW9ucyk7Cm1vZGVsID0gYXdhaXQgQXV0b01vZGVsLmZyb21fcHJldHJhaW5lZChtb2RlbF9uYW1lLCBvcHRpb25zKTsKCmFzeW5jIGRlZiBlbmNvZGUoc2VudGVuY2VzLCBvdXRwdXRfdmFsdWU9InNlbnRlbmNlX2VtYmVkZGluZyIpOgogICAgbW9kZWxfaW5wdXRzID0gdG9rZW5pemVyKHNlbnRlbmNlcyk7CiAgICBlbWJlZGRpbmdzID0gYXdhaXQgbW9kZWwoKiptb2RlbF9pbnB1dHMpOwogICAgb3V0cHV0ID0gZW1iZWRkaW5nc1tvdXRwdXRfdmFsdWVdLnRvX251bXB5KCkKICAgIG91dHB1dCA9IG91dHB1dFswXSAgIyBNYWtlIGl0IDEtRAogICAgcmV0dXJuIG91dHB1dAoKCmFzeW5jIGRlZiBnZXRfZW1iZWRkaW5nKHByb2dyYW1zKToKICAgIHJldHVybiBbYXdhaXQgZW5jb2RlKHApIGZvciBwIGluIHByb2dyYW1zXQoKCmFzeW5jIGRlZiBjYWxjdWxhdGVfc2ltaWxhcml0aWVzKHJvb21fbmFtZSwgX3Byb2dyYW1zX2VtYmVkZGluZyk6CiAgICBpbnB1dF9lbWJlZGRpbmcgPSBhd2FpdCBlbmNvZGUocm9vbV9uYW1lKQogICAgcmFua2luZyA9IHsKICAgICAgICBjb3VudDogMSAtIHNjaXB5LnNwYXRpYWwuZGlzdGFuY2UuY29zaW5lKGlucHV0X2VtYmVkZGluZywgcHJvZ3JhbSkKICAgICAgICBmb3IgY291bnQsIHByb2dyYW0gaW4gZW51bWVyYXRlKF9wcm9ncmFtc19lbWJlZGRpbmcpCiAgICB9CiAgICBkYXRhID0gbGlzdChzb3J0ZWQocmFua2luZy5pdGVtcygpLCBrZXk9bGFtYmRhIGl0ZW06IGl0ZW1bMV0sIHJldmVyc2U9VHJ1ZSkpCiAgICByZXR1cm4gZGF0YVswXQoKCnByb2dyYW1zID0gWydJIGxvdmUgVHJhbnNmb3JtZXJzJywgJ0l0IHdhcyByYWluaW5nIHllc3RlcmRheSddCnBlID0gYXdhaXQgZ2V0X2VtYmVkZGluZyhwcm9ncmFtcykKaW5kZXgsIHVuY2VydGFpbnR5ID0gYXdhaXQgY2FsY3VsYXRlX3NpbWlsYXJpdGllcygnV2hhdCB3YXMgdGhlIHdlYXRoZXI_JywgcGUpCgphc3NlcnQgcHJvZ3JhbXNbaW5kZXhdID09ICdJdCB3YXMgcmFpbmluZyB5ZXN0ZXJkYXknCgppbXBvcnQgc3RyZWFtbGl0IGFzIHN0CnN0LndyaXRlKHByb2dyYW1zW2luZGV4XSkaEnRyYW5zZm9ybWVyc19qc19weRoFc2NpcHk,
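
As an aside, if you don't need to mirror the sentence_transformers internals, transformers.js also exposes a feature-extraction pipeline that handles pooling and normalization for you. A minimal sketch, assuming the same dict-for-options calling convention used with from_pretrained above (the pooling and normalize option names come from the transformers.js docs):

from transformers_js_py import pipeline

extractor = await pipeline("feature-extraction", "sentence-transformers/all-MiniLM-L6-v2")

async def encode_with_pipeline(sentence):
    output = await extractor(sentence, {"pooling": "mean", "normalize": True})
    return output.to_numpy()[0]  # 1-D, unit-length embedding

Since these embeddings come back normalized, the cosine similarity in calculate_similarities() reduces to a plain dot product.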

mostaphaRoudsari commented 4 months ago

Thank you, @whitphx, for taking the time to provide a working example. This should do what I need, and I can take a closer look at the original implementation. I did a couple of quick tests, and it worked very well. Cheers.