xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Is it possible to use instructor without instructions? #35

Closed jlia0 closed 1 year ago

jlia0 commented 1 year ago

How will the model perform if no instructions are provided, or if only general instructions are given without the 1) domain, 2) text type, and 3) text objective? Which of these is the most important?

Harry-hash commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

The model may degrade to a general encoder if no instruction is provided. The instruction components are important, but you may skip one or two parts if the context is ambiguous and hard to determine.

Feel free to add any further questions or comments!
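As a minimal sketch of skipping one or two parts of the instruction: the helper below assumes the "Represent the {domain} {text type} for {objective}:" wording described in the project README. The function name `build_instruction` and its defaults are hypothetical and are not part of the library.

```python
from typing import Optional


def build_instruction(text_type: str = "text",
                      domain: Optional[str] = None,
                      objective: Optional[str] = None) -> str:
    """Compose an instruction, leaving out any component that is unknown.

    Assumed template (from the project README, not from this thread):
    "Represent the {domain} {text_type} for {objective}:"
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if objective:
        parts.append(f"for {objective}")
    return " ".join(parts) + ":"


# All three components are known.
full = build_instruction("document", domain="Wikipedia", objective="retrieval")
# Domain and objective are ambiguous, so only the text type is kept.
partial = build_instruction("sentence")

print(full)     # Represent the Wikipedia document for retrieval:
print(partial)  # Represent the sentence:
```

The resulting string would then be paired with the text as `[instruction, text]`, which is the input shape the README passes to `model.encode`. How much quality is lost with the shorter instruction is not measured here.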

jlia0 commented 1 year ago

Thank you @Harry-hash! Is there a way to include few-shot examples on the instruction side? Or is that future work?

hongjin-su commented 1 year ago

Currently, we do not have few-shot examples in the instruction. We may leave it to future work!

syonfox commented 1 year ago

Is there a way to have another model generate the instructions based on the text?

hongjin-su commented 1 year ago

Yes, I think it is possible to prompt ChatGPT or GPT-4 to summarize the domain, task type, and task objective for the instruction.

syonfox commented 1 year ago

Could this perform similarly, on the back end, to a summarize-then-embed approach?

That is, are the instructions embedded, and is the distance reduced because the output becomes more similar / regular? My question is whether you know how important it is to match the instruction schema.

Could wildly verbose instructions be used to increase similarity?

But yeah, GPT-3.5 can hopefully do that in a few tokens. It is probably hard-codable as well, since there are only so many options, but I suspect they include:

"What is the domain and important relevant information? Generate a JSON object with keys {domain, summary, questions, extrapolation, resources} with the subject domain or field, a summary of the data, any questions people might ask about this, an extrapolation of this based on training data, and any additional resources you recommend looking up for more info and facts."

syonfox commented 1 year ago

Sample from GPT-3.5 at temperature 0.5:

What is the domain and crucial relevant information? Generate a JSON object with keys {domain, summary, questions, extrapolation, resources} with the subject domain or field, an overview of the data, any questions people might ask about this, an extrapolation of this based on training data, and any additional resources you recommend looking up for more info and facts.

Query:

Where is the most impactful regulation of this project?

Json Lines: --You must return only valid json lines below this line.

{ "domain": "Project Regulations", "summary": "This JSON object pertains to regulations related to a specific project.", "questions": [ "What are the regulations that impact this project?", "What is the most impactful regulation for this project?", "How do these regulations affect the project's timeline and budget?", "What are the consequences of not adhering to these regulations?" ], "extrapolation": "Based on training data, it is crucial for project managers to identify and comply with the most impactful regulations to avoid potential delays, penalties, and legal issues. It is also important to regularly review and update the project's compliance plan as regulations may change over time.", "resources": [ "https://www.epa.gov/laws-regulations", "https://www.osha.gov/laws-regs", "https://www.fda.gov/regulatory-information/search-fda-guidance-documents" ] }
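One way such output might be turned into an instruction is sketched below. It reads only the `domain` key from a JSON line like the one above and falls back to a domain-free instruction when the line cannot be parsed. The function name, the default text type, and the default objective are all assumptions made for illustration; the "Represent the ... for ...:" wording follows the project README rather than anything stated in this thread.

```python
import json


def instruction_from_json_line(line: str,
                               text_type: str = "question",
                               objective: str = "retrieving supporting documents") -> str:
    """Build an instruction from one JSON line produced by an LLM.

    Only the "domain" key is used; the other keys (summary, questions, ...)
    are ignored in this sketch.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = {}  # malformed output: fall back to no domain
    domain = record.get("domain", "") if isinstance(record, dict) else ""
    domain = domain.strip() if isinstance(domain, str) else ""
    words = ["Represent the", domain, text_type, f"for {objective}"]
    # Drop the empty domain slot, if any, before joining.
    return " ".join(w for w in words if w) + ":"


# Shortened version of the sample output above.
sample = '{"domain": "Project Regulations", "summary": "Regulations related to a specific project."}'

print(instruction_from_json_line(sample))
# Represent the Project Regulations question for retrieving supporting documents:
print(instruction_from_json_line("not valid json"))
# Represent the question for retrieving supporting documents:
```

Whether an instruction generated this way matches the training distribution closely enough to help is an open question in this thread; the sketch only shows the plumbing.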

hongjin-su commented 1 year ago

You may refer to Table 5 in the paper for examples of words in each field.

hongjin-su commented 1 year ago

Feel free to re-open the issue if you have any further questions or comments!