xsc1234 / Search-in-the-Chain

Code for Search-in-the-Chain: Towards Accurate, Credible and Traceable Large Language Models for Knowledge-intensive Tasks

Indexed corpus #7

Open Iriseve opened 1 month ago

Iriseve commented 1 month ago

Hi, I find that the retrieval corpus used for HotpotQA seems to differ from the one used for the other datasets mentioned in the paper. I have obtained the pre-processed Wikipedia 2017 corpus from the other issues.

May I ask what the difference is between the retrieval corpus used for the other datasets and this corpus, and whether the relevant data could be provided?

(image attached)

Besides, when I use python ColBERT/index.py to index enwiki-20171001-pages-meta-current-withlinks-abstracts.tsv, it seems to take a very long time. It has been a few hours since it started the first iteration. However, GPU memory occupancy and utilization are very low, and it seems to be stuck here. Are there any solutions? Thank you very much!
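For reference, my understanding is that ColBERT/index.py boils down to a standard ColBERTv2 indexing run roughly like the sketch below; the checkpoint path, experiment name, and doc_maxlen here are only my guesses, not values taken from this repo.

# Rough sketch of what I believe the indexing step amounts to (not the repo's
# actual index.py); checkpoint, experiment name and doc_maxlen are placeholders.
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="hotpotqa_wiki")):
        config = ColBERTConfig(nbits=2, doc_maxlen=180)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(
            name="hotpotqa_wiki.nbits_whole=2",
            collection="enwiki-20171001-pages-meta-current-withlinks-abstracts.tsv",
        )

The log from my run: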

[Sep 22, 12:24:12] #> Loading collection...
0M 1M 2M 3M 4M 5M
/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
/Search-in-the-Chain/ColBERT/colbert/utils/amp.py:12: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler()
[Sep 22, 12:24:35] [0]           # of sampled PIDs = 400959      sampled_pids[:3] = [3494860, 85305, 2505172]
[Sep 22, 12:24:37] [0]           #> Encoding 400959 passages..
/Search-in-the-Chain/ColBERT/colbert/utils/amp.py:15: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
[Sep 22, 12:31:14] [0]           avg_doclen_est = 60.502357482910156     len(local_sample) = 400,959
[Sep 22, 12:31:21] [0]           Creaing 262,144 partitions.
[Sep 22, 12:31:21] [0]           *Estimated* 316,628,741 embeddings.
[Sep 22, 12:31:21] [0]           #> Saving the indexing plan to /Search-in-the-Chain/ColBERT/experiments/hotpotqa_wiki/indexes/hotpotqa_wiki.nbits_whole=2/plan.json ..
/Search-in-the-Chain/ColBERT/colbert/indexing/collection_indexer.py:243: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sub_sample = torch.load(sub_sample_path)
Clustering 24208964 points in 128D to 262144 clusters, redo 1 times, 20 iterations
  Preprocessing in 1.98 s
Iteration 0 (8805.82 s, search 8799.72 s): objective=7.09834e+06 imbalance=1.725 nsplit=0  
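One more data point: the "Clustering 24208964 points ..." line above is FAISS k-means. My guess is that if only faiss-cpu is installed, this step runs entirely on the CPU, which would match the low GPU utilization and the almost 2.5 hours spent on the first iteration. A quick check (unverified on my side):

# Guess, not a confirmed diagnosis: ColBERT hands the partition clustering to
# FAISS, so a CPU-only FAISS build would make this step very slow.
import torch
import faiss

print("torch sees CUDA:", torch.cuda.is_available())
# faiss-cpu builds do not expose get_num_gpus at all.
print("faiss GPUs:", faiss.get_num_gpus() if hasattr(faiss, "get_num_gpus") else 0)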
xmanners commented 3 days ago

Maybe a tip for the corpus: in the Citation part, the author mentioned that the corpus for the other datasets is the same as the one used in Dense Passage Retrieval and another paper. You can try searching for DPR on GitHub; that repo provides downloads for its passage corpus. (But I'm not sure it's the same.)
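If it really is the DPR corpus, something like the sketch below might work for fetching DPR's Wikipedia passage split and reshaping it into the pid / passage / title TSV layout that ColBERT expects. The download URL and the column layout (header row id, text, title) are from my memory of the facebookresearch/DPR README, so please double-check there first.

# Sketch only: download DPR's psgs_w100 split and convert it to a ColBERT-style
# collection TSV (pid \t passage \t title). URL and column order are assumptions
# based on my recollection of the DPR repo.
import csv
import gzip
import urllib.request

DPR_URL = "https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz"
urllib.request.urlretrieve(DPR_URL, "psgs_w100.tsv.gz")

with gzip.open("psgs_w100.tsv.gz", "rt", encoding="utf-8") as src, \
        open("dpr_collection.tsv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the header row (id, text, title)
    for pid, row in enumerate(reader):
        doc_id, text, title = row[0], row[1], row[2]
        writer.writerow([pid, text, title])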