xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.79k stars 132 forks

file downloads on package import #49

Closed: greenpau closed this issue 1 year ago

greenpau commented 1 year ago

When first importing the package, I noticed a number of downloads happening.

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

I am deploying the package in an environment where I don't have Internet access.

Is there a way to download all the required files ahead of time and tell the INSTRUCTOR to use the files instead of downloading them at runtime?

hongjin-su commented 1 year ago

Hi, you can cache the downloaded model in a local path ahead of time and load it later without Internet access.

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large', cache_folder="local_path")

The download only happens the first time you run the script; subsequent runs will not require Internet access.
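To make that concrete, here is a minimal sketch of the two-phase workflow, with a small helper to verify that the cache actually persisted before the machine goes offline. The cache path and the `cache_has_weights` helper are illustrative assumptions, not part of the library's API:

```python
import os

CACHE_DIR = "/models/instructor"  # assumption: any persistent path you control

# Phase 1 (machine with Internet access) -- populate the cache once:
#     from InstructorEmbedding import INSTRUCTOR
#     INSTRUCTOR("hkunlp/instructor-large", cache_folder=CACHE_DIR)
#
# Phase 2 (offline machine) -- the identical call now reads CACHE_DIR
# without touching the network.

def cache_has_weights(cache_dir: str) -> bool:
    """Heuristic sanity check (not part of the library): a usable snapshot
    should contain the model weights and config somewhere under cache_dir."""
    wanted = {"pytorch_model.bin", "config.json"}
    found = set()
    for _, _, files in os.walk(cache_dir):
        found.update(files)
    return wanted <= found
```

Running `cache_has_weights(CACHE_DIR)` on the offline host before deployment catches a deleted or half-populated cache early, instead of failing at encode time.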

Feel free to add any questions or comments!

eshaanagarwal commented 1 year ago

I want to download it ahead of time, since the cache might get deleted.

hongjin-su commented 1 year ago

Hi, you can point the cache_folder parameter at any local path that will not be deleted.

hiranya911 commented 1 year ago

Is there any way to load the model from the git clone? For instance, with AutoModel or AutoTokenizer I can do this:

$ git clone https://huggingface.co/hkunlp/instructor-large

Then in the code:

from transformers import AutoTokenizer

AutoTokenizer.from_pretrained('./instructor-large')

But this doesn't seem to work with the INSTRUCTOR class. Any way to get this to work?

hongjin-su commented 1 year ago

Hi, would you like to share your code? The following works for me:

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('./instructor-large')

hiranya911 commented 1 year ago

Strange. That's what I tried too. But I get this error:

load INSTRUCTOR_Transformer
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 415, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, 'v'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 94, in __init__
    modules = self._load_sbert_model(model_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/InstructorEmbedding/instructor.py", line 474, in _load_sbert_model
    module = module_class.load(os.path.join(model_path, module_config['path']))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/InstructorEmbedding/instructor.py", line 306, in load
    return INSTRUCTOR_Transformer(model_name_or_path=input_path, **config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/InstructorEmbedding/instructor.py", line 240, in __init__
    self._load_model(self.model_name_or_path, config, cache_dir, **model_args)
  File "/opt/homebrew/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py", line 47, in _load_model
    self._load_t5_model(model_name_or_path, config, cache_dir)
  File "/opt/homebrew/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py", line 55, in _load_t5_model
    self.auto_model = T5EncoderModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2429, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_utils.py", line 420, in load_state_dict
    raise OSError(
OSError: You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run `git lfs install` followed by `git lfs pull` in the folder you cloned.

I'm on Mac btw.

hongjin-su commented 1 year ago

Have you installed git-lfs? You may also try to load the state dict directly and see whether there is an error.
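The `invalid load key, 'v'` in the traceback above is the classic symptom of a git-lfs pointer stub: without git-lfs, the clone contains small text files that begin with the word `version` instead of the real weight files. A quick sketch for detecting this (the `is_lfs_pointer` helper is illustrative, not part of any library):

```python
def is_lfs_pointer(path: str) -> bool:
    # A git-lfs pointer stub is plain text beginning with
    # "version https://git-lfs.github.com/spec/v1"; the leading 'v'
    # is exactly the "invalid load key, 'v'" in the pickle traceback.
    try:
        with open(path, "rb") as f:
            return f.read(7) == b"version"
    except OSError:
        return False
```

If this returns True for `instructor-large/pytorch_model.bin`, install git-lfs (e.g. `brew install git-lfs` on macOS), then run `git lfs install` followed by `git lfs pull` inside the clone, and retry loading the model.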

hiranya911 commented 1 year ago

I don't think I have git-lfs. Just have the usual git client. Is git-lfs a hard requirement for this use case?