xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.78k stars 131 forks

Representing metadata and conditionally encoding text for a specific task #71

Closed bkamapantula closed 11 months ago

bkamapantula commented 11 months ago

Thanks for the work on Instructor and making it accessible via open source. I enjoyed reading your research work!

I prototyped an implementation for a single book and it works fine. I'd like to extend it to several books, so I have a few related questions.

  1. What's the best way to tag metadata related to the chunks that are embedded?

Each chunk of text comes from a specific page, chapter, and section, and also has a source language. This sort of metadata doesn't fall into any of the proposed instruction attributes (text type, text objective, domain).

One possibility is to prepend the metadata to each chunk, in the form "Chapter: NAME Page number: NUMBER Section: NAME" followed by the chunk text.

  2. Is it fair to assume that each chunk is embedded for all 70 tasks? If so, can I conditionally encode it for a specific task type (ex: instruction for retrieval) thereby bringing down the compute time?

hongjin-su commented 11 months ago

Hi, thanks a lot for your interest in INSTRUCTOR!

  1. The page number does not seem to provide meaningful information. For the chapter and section, you could try turning their names into the domain description of the instruction.
  2. Yes, you can encode conditionally: each text chunk only needs to be encoded for your specific task, using a customized instruction.
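
A minimal sketch of both suggestions, under stated assumptions: the `Chunk` record, the helper names, and the exact instruction wording are hypothetical and only follow the "Represent the (domain) (text type) for (task objective):" pattern. The `INSTRUCTOR(...)` constructor and `model.encode([[instruction, text]])` call follow the usage shown in the project README, and the model name is just one of the published checkpoints. Page number and language are kept next to the chunk rather than inside the embedded text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A text chunk plus the book metadata it came from (hypothetical schema)."""
    text: str
    chapter: str
    section: str
    page: int                  # stored for lookup only, never embedded
    language: str = "English"  # stored for lookup only, never embedded


def build_instruction(chunk: Chunk, objective: str = "retrieval") -> str:
    """Fold the chapter and section names into the domain part of the instruction."""
    domain = f"{chunk.chapter}, {chunk.section}"
    return f"Represent the {domain} passage for {objective}:"


def to_pairs(chunks, objective: str = "retrieval"):
    """Build one [instruction, text] pair per chunk, for a single task only."""
    return [[build_instruction(c, objective), c.text] for c in chunks]


def embed(chunks, model_name: str = "hkunlp/instructor-large"):
    """Encode every chunk once, using its task-specific instruction."""
    # Imported lazily so the helpers above run without the package installed.
    from InstructorEmbedding import INSTRUCTOR

    model = INSTRUCTOR(model_name)
    return model.encode(to_pairs(chunks))


chunks = [
    Chunk("Mitochondria produce ATP.", "Cell Biology", "Organelles", page=42),
]
pairs = to_pairs(chunks)
print(pairs[0][0])  # the instruction that would accompany the first chunk
```

Calling `embed(chunks)` would then produce one vector per chunk for the retrieval task alone; whether chapter and section names help as a domain description is something to check on your own data.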

bkamapantula commented 11 months ago

Thanks for your response. Makes sense.