xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Issue with Evaluation ArguAna #89

Open yeliusf opened 1 year ago

yeliusf commented 1 year ago

Thanks for publishing this nice work. I want to evaluate ArguAna by following your command:

    python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

But I face the following issue:

  File "[/export/home/Instructor/mteb/abstasks/AbsTaskRetrieval.py](https://github.com/xlang-ai/instructor-embedding/blob/654f34cbd777d0dcb0d401ea3d7ccdbeeb3b259c/evaluation/MTEB/mteb/abstasks/AbsTaskRetrieval.py#L745C26-L745C26)", line 746, in <listcomp>
    [instruction, (corpus["title"][i] + self.sep + corpus["text"][i]).strip()]
TypeError: 'list' object is not callable
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I tried a fix by changing your code to:

            sentences = [
                [instruction,
                 (corpus["title"][i] + self.sep + corpus["text"][i]).strip()
                 if "title" in corpus else corpus["text"][i].strip()]
                for i in range(len(corpus["text"]))
            ]

But then I hit a new issue:

  File "/opt/conda/envs/instructor/lib/python3.8/site-packages/datasets/table.py", line 1059, in from_file
    table = _memory_mapped_arrow_table_from_file(filename)
  File "/opt/conda/envs/instructor/lib/python3.8/site-packages/datasets/table.py", line 66, in _memory_mapped_arrow_table_from_file
    pa_table = opened_stream.read_all()
  File "pyarrow/ipc.pxi", line 699, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Expected to be able to read 5197224 bytes for message body, got 5197216
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Could you take a look into this? Thanks!

hongjin-su commented 1 year ago

Hi, could you print out corpus["title"][i] and self.sep, along with their types? Both are expected to be strings.
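A minimal sketch of such a debugging pass. The `describe` helper and the toy `corpus` and `sep` values are hypothetical stand-ins; in the real run the values would come from the batch passed into AbsTaskRetrieval.py and from `self.sep`.

```python
def describe(name, value):
    """Format a value together with its type name for print-debugging."""
    return f"{name} = {value!r} (type: {type(value).__name__})"


# Toy stand-ins (assumption): a batch shaped as a dict of parallel lists.
corpus = {"title": ["", "A title"], "text": ["first passage", "second passage"]}
sep = " "  # stands in for self.sep

for i in range(len(corpus["text"])):
    print(describe(f'corpus["title"][{i}]', corpus["title"][i]))
print(describe("self.sep", sep))
```

If either value shows a type other than `str`, the concatenation itself would be the problem; if both are strings, the error has to come from somewhere else in the expression.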

ParishadBehnam commented 1 year ago

Hello. I have the same problem.

sentences = [
                [instruction, (corpus["title"][i] + self.sep + corpus["text"][i]).strip()]
                (corpus["title"][i] + self.sep + corpus["text"][i]).strip()
                if "title" in corpus
                else corpus["text"][i].strip()
                for i in range(len(corpus["text"]))
            ]

This code snippet leads to an error. Could you please double-check if it is written as you intended?
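A minimal sketch of why the snippet above fails, as far as I can tell: inside the brackets of a comprehension, line breaks are ignored, so the list literal on the first line is immediately followed by a parenthesized expression and Python parses that as a call on the list. The check below only inspects the syntax tree of the quoted source. `build_sentences` is a hypothetical standalone version of what the code presumably intends (it mirrors the fix tried earlier in this thread, not necessarily what the maintainers will merge), and the toy corpus is made up for illustration.

```python
import ast

# The comprehension quoted above, kept as source text so it is parsed, not run.
BROKEN = '''
[
    [instruction, (corpus["title"][i] + self.sep + corpus["text"][i]).strip()]
    (corpus["title"][i] + self.sep + corpus["text"][i]).strip()
    if "title" in corpus
    else corpus["text"][i].strip()
    for i in range(len(corpus["text"]))
]
'''

tree = ast.parse(BROKEN.strip(), mode="eval")
element = tree.body.elt               # the "X if cond else Y" expression
outer_call = element.body             # <something>.strip()
inner_call = outer_call.func.value    # [instruction, ...](...)
callee = inner_call.func              # the thing being called

# The callee is a list literal, hence "'list' object is not callable".
print(type(inner_call).__name__, "on a", type(callee).__name__)


def build_sentences(corpus, instruction, sep=" "):
    """Hypothetical intended behavior: one [instruction, text] pair per document."""
    has_title = "title" in corpus
    return [
        [
            instruction,
            (corpus["title"][i] + sep + corpus["text"][i]).strip()
            if has_title
            else corpus["text"][i].strip(),
        ]
        for i in range(len(corpus["text"]))
    ]


# Toy data (assumption), with and without a "title" column.
with_title = build_sentences({"title": ["T"], "text": ["body"]}, "inst")
without_title = build_sentences({"text": [" body "]}, "inst")
print(with_title)
print(without_title)
```

The two adjacent lines in the quoted snippet look like an old and a new version of the same element left side by side, which would explain the accidental call.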

wenhaoy-0428 commented 1 year ago

@ParishadBehnam @yeliusf I encountered the same issue. The easy workaround is to NOT install the mteb package with pip install -e . and to use pip install mteb instead.
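To confirm which copy of mteb actually gets imported after switching installs, a small check like the following may help. It only uses the standard library; the `module_origin` helper is a hypothetical name, and what the path looks like depends on your environment.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# A path inside the cloned repo suggests the editable install is still in use;
# a path under site-packages suggests the PyPI package is being picked up.
print(module_origin("mteb"))
```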

Xavier1999-Chen commented 1 year ago

> @ParishadBehnam @yeliusf I encountered the same issue, the easy workaround is to DO NOT use pip install -e . to install the mteb package. Instead, use pip install mteb.

It works for me :)

ashokrajab commented 1 year ago

@all, the following error is caused by a recent change that came from my contribution:

    [instruction, (corpus["title"][i] + self.sep + corpus["text"][i]).strip()]
TypeError: 'list' object is not callable
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I have opened PR #92 to fix it.

I am waiting for the repo maintainers to respond and merge the PR. Until it is merged, as a temporary fix, please include the commit from the PR in your branch to resolve the issue.