xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Embeddings: Expose the prompt templates found in config_sentence_transformers.json, when available #810

Open benjamincburns opened 1 week ago

benjamincburns commented 1 week ago

Feature request

Many embedding models are trained with task-specific prefixes, and these prefixes are often listed in a file named config_sentence_transformers.json (example: see the 'prompts' object). It would be nice if the contents of that object were exposed in some way.
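For readers who have not seen the file, here is a minimal sketch of reading the 'prompts' object from it. It assumes the object maps a prompt name to a prefix string; the `parsePrompts` helper and the prefix values in the usage note are hypothetical and not taken from any particular model.

```typescript
// Assumed shape of config_sentence_transformers.json; only the fields
// relevant to this issue are typed, everything else is ignored.
interface SentenceTransformersConfig {
  prompts?: { [name: string]: unknown } | null;
  default_prompt_name?: string | null;
}

// Hypothetical helper: pull the name -> prefix map out of the raw JSON text.
// Entries whose value is not a string are skipped rather than trusted.
function parsePrompts(jsonText: string): { [name: string]: string } {
  const config = JSON.parse(jsonText) as SentenceTransformersConfig;
  const raw = config.prompts || {};
  const prompts: { [name: string]: string } = {};
  for (const name of Object.keys(raw)) {
    const prefix = raw[name];
    if (typeof prefix === "string") {
      prompts[name] = prefix;
    }
  }
  return prompts;
}
```

With an illustrative input such as `{"prompts": {"query": "query: "}}`, `parsePrompts` returns a map with the single key `query`; a file without a 'prompts' object yields an empty map.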

Motivation

Without this, there's no way for me to choose a task-specific model at runtime and still get the performance I need from it. It limits the set of models that my code can run against, because I have to look up the task-specific prefixes out of band.

I'm currently working on a simple text classifier that uses sentence embedding models, but I've also encountered this while working on a RAG pipeline. I'd like to be able to initialize a feature-extraction pipeline and inspect it to determine whether I need to add a task-specific prefix to the text that I'm about to embed.

Alternatively, I'd like the pipeline itself to apply the task-specific prefix automatically. I imagine, however, that this would require the user to identify the task to the pipeline on creation, which may be a more complicated change than allowing me to inspect the config.

Your contribution

If the maintainers would provide some guidance for how this feature should be implemented, I'd be happy to submit a PR.

xenova commented 1 week ago

Hi there 👋 Since this is related to sentence-transformers and not transformers, I believe this feature is out of scope for transformers.js. After all, you can fetch and parse the JSON file yourself and prepend the prompts, if there are any, to your text before passing it to Transformers.js. Maybe it could be useful when sentence-transformers.js comes around though? 😉
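The fetch-and-parse workaround suggested above could look roughly like the sketch below. It is not part of Transformers.js: `promptsUrl`, `loadPrompts` and `withPrompt` are hypothetical helpers, and the URL layout (`/<model>/resolve/<revision>/<file>`) is an assumption about how the Hugging Face Hub serves repository files. The fetch function is passed in so the helper works in the browser, in Node, or with a stub.

```typescript
type PromptMap = { [name: string]: string };

// Minimal subset of the fetch API that this sketch relies on.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Assumption: Hub files are served at /<model>/resolve/<revision>/<file>.
function promptsUrl(modelId: string, revision: string = "main"): string {
  return `https://huggingface.co/${modelId}/resolve/${revision}/config_sentence_transformers.json`;
}

// Fetch the config and return its 'prompts' object, or an empty map when
// the file is missing or declares no prompts.
async function loadPrompts(modelId: string, fetchImpl: FetchLike): Promise<PromptMap> {
  const response = await fetchImpl(promptsUrl(modelId));
  if (!response.ok) {
    return {};
  }
  const config = (await response.json()) as { prompts?: PromptMap | null } | null;
  return (config && config.prompts) || {};
}

// Prepend the prefix registered for `task`; leave the text unchanged when
// the model declares no prefix under that name.
function withPrompt(prompts: PromptMap, task: string, text: string): string {
  const hasPrefix = Object.prototype.hasOwnProperty.call(prompts, task);
  return hasPrefix ? prompts[task] + text : text;
}
```

In use, one would call `loadPrompts(modelId, fetch)` once per model, then pass `withPrompt(prompts, "query", text)` to the feature-extraction pipeline instead of the raw text. The task name `"query"` is only an example; the available names depend on what each model's config declares.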