xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Evaluation settings of INSTRUCTOR #67

Open EliverQ opened 1 year ago

EliverQ commented 1 year ago

Hello! I have a question that puzzles me. Since your model is fine-tuned with instructions, why aren't instructions used during benchmark evaluations (e.g., MTEB)?

hongjin-su commented 1 year ago

Hi, the instructions are included in the evaluation. You may refer to Table 1 in our paper.
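
For reference, the intended usage (following the repository README) passes each input to encode() as an [instruction, text] pair, so the instruction is embedded together with the text:

    from InstructorEmbedding import INSTRUCTOR

    # each input to encode() is an [instruction, text] pair
    model = INSTRUCTOR('hkunlp/instructor-large')
    instruction = "Represent the Science title:"
    sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
    embeddings = model.encode([[instruction, sentence]])
    print(embeddings.shape)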

EliverQ commented 1 year ago

As far as I understand, when evaluating on MTEB, your code uses the following lines:

    model = INSTRUCTOR(args.model_name, cache_folder=args.cache_dir)
    evaluation = MTEB(tasks=[args.task_name], task_langs=["en"])
    evaluation.run(model, output_folder=args.output_dir, eval_splits=[args.split], args=args, overwrite_results=True)

During evaluation.run(), INSTRUCTOR.encode() is used to encode the input sentences. However, when I print the sentences passed to INSTRUCTOR.encode() before tokenization, it appears that the corresponding task instructions are not added to them.

I'm not sure if my understanding and evaluation method are correct. I would greatly appreciate it if you could provide me with answers. Thank you very much.

hongjin-su commented 1 year ago

Hi, could you share the script you use to print out the sentences? Also, make sure you have correctly installed the InstructorEmbedding library.
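
As a quick sanity check (illustrative only, not part of the library), you can confirm which installation is actually being imported:

    # illustrative check: verify the package imports and see where it is installed from
    import InstructorEmbedding
    from InstructorEmbedding import INSTRUCTOR
    print(InstructorEmbedding.__file__)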

EliverQ commented 1 year ago

I just use the source code from GitHub:

https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L478-L565

    def encode(self, sentences,
               batch_size: int = 32,
               show_progress_bar: bool = None,
               output_value: str = 'sentence_embedding',
               convert_to_numpy: bool = True,
               convert_to_tensor: bool = False,
               device: str = None,
               normalize_embeddings: bool = False):
        """
        Computes sentence embeddings

        :param sentences: the sentences to embed
        :param batch_size: the batch size used for the computation
        :param show_progress_bar: Output a progress bar when encode sentences
        :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
        :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
        :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
        :param device: Which torch.device to use for the computation
        :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.

        :return:
           By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
        """
        self.eval()
        if show_progress_bar is None:
            show_progress_bar = False

        if convert_to_tensor:
            convert_to_numpy = False

        if output_value != 'sentence_embedding':
            convert_to_tensor = False
            convert_to_numpy = False

        input_was_string = False
        if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
            sentences = [sentences]
            input_was_string = True

        if device is None:
            device = self._target_device

        self.to(device)

        all_embeddings = []
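        # if each item is an [instruction, text] pair, sort the inputs by the length of the text part only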
        if isinstance(sentences[0],list):
            lengths = []
            for sen in sentences:
                lengths.append(-self._text_length(sen[1]))
            length_sorted_idx = np.argsort(lengths)
        else:
            length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
        sentences_sorted = [sentences[idx] for idx in length_sorted_idx]

        for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
            sentences_batch = sentences_sorted[start_index:start_index+batch_size]
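            # self.tokenize is expected to handle both plain strings and [instruction, text] pairs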
            features = self.tokenize(sentences_batch)
            features = batch_to_device(features, device)

            with torch.no_grad():
                out_features = self.forward(features)

                if output_value == 'token_embeddings':
                    embeddings = []
                    for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                        last_mask_id = len(attention)-1
                        while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                            last_mask_id -= 1

                        embeddings.append(token_emb[0:last_mask_id+1])
                elif output_value is None:  #Return all outputs
                    embeddings = []
                    for sent_idx in range(len(out_features['sentence_embedding'])):
                        row =  {name: out_features[name][sent_idx] for name in out_features}
                        embeddings.append(row)
                else:   #Sentence embeddings
                    embeddings = out_features[output_value]
                    embeddings = embeddings.detach()
                    if normalize_embeddings:
                        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

                    # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                    if convert_to_numpy:
                        embeddings = embeddings.cpu()

                all_embeddings.extend(embeddings)

        all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]

I print the sentences inside encode() at around line 512.
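
To be concrete, the check was just a dump of the raw inputs right before tokenization, roughly like the illustrative lines below (exact placement approximate):

    # illustrative debug line, added just before features = self.tokenize(sentences_batch)
    print(sentences_batch[:3])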

hongjin-su commented 1 year ago

For the case of MTEB, make sure that the library is correctly installed by following https://github.com/HKUNLP/instructor-embedding#mteb.

EliverQ commented 1 year ago

Sorry, I previously tried to install it, but the installation failed with this message:

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json'

So I could only pip install mteb following the instructions at https://github.com/embeddings-benchmark/mteb.

Maybe this is the reason for the problem?

hongjin-su commented 1 year ago

Yes, you should install the customized mteb package for correct evaluation.

EliverQ commented 1 year ago

Thanks! I've corrected my evaluation method using your customized mteb package. My replication of INSTRUCTOR has improved, but its performance is still lower than yours. I still have a few detailed questions:

  1. Do you consistently exclude (mask out) the instruction token embeddings during both the training and evaluation processes?
  2. Do you consistently use mean pooling as the pooling method during both the training and evaluation processes?

Thank you again for your patient response.

hongjin-su commented 1 year ago

Yes, we use mean pooling in both the training and evaluation processes.
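
For concreteness, here is a minimal sketch of masked mean pooling; it is only an illustration under the assumption that the pooling mask keeps the text tokens (and, as question 1 above describes, drops instruction and padding tokens), not the library's actual internals:

    import torch

    def masked_mean_pool(token_embeddings, pool_mask):
        # token_embeddings: (batch, seq_len, hidden)
        # pool_mask: (batch, seq_len); 1 for tokens to average over, 0 otherwise
        mask = pool_mask.unsqueeze(-1).float()
        summed = (token_embeddings * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts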

ashokrajab commented 1 year ago
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json'

I guess this issue might be caused by the exit() statement at https://github.com/HKUNLP/instructor-embedding/blob/main/evaluation/MTEB/setup.py#L42. @hongjin-su, could you kindly check this?

hongjin-su commented 9 months ago

Hi, you may check the permissions of /tmp or /tmp/tmp73vjuhzp.