xsc1234 / Search-in-the-Chain

Code for Search-in-the-Chain: Towards Accurate, Credible and Traceable Large Language Models for Knowledge-intensive Tasks

Indexed corpus #7

Open Iriseve opened 1 month ago

Iriseve commented 1 month ago

Hi, I find that the retrieval corpus used for HotpotQA seems to differ from the one used for the other datasets mentioned in the paper. I have obtained the pre-processed Wikipedia 2017 corpus from the other issues.

May I ask what the difference is between the retrieval corpus used for the other datasets and this corpus, and whether the relevant data could be provided?

(image attached)

Besides, when I use python ColBERT/index.py to index enwiki-20171001-pages-meta-current-withlinks-abstracts.tsv, it seems to take a very long time. It has been a few hours since it started the first iteration. However, GPU memory occupancy and utilization are very low, and it seems to be stuck here. Are there any solutions? Thank you very much!
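For reference, my understanding is that ColBERT/index.py boils down to a standard ColBERTv2 indexing run roughly like the sketch below; the checkpoint path, experiment name, and doc_maxlen here are only my guesses, not values taken from this repo.

# Rough sketch of what I believe the indexing step amounts to (not the repo's
# actual index.py); checkpoint, experiment name and doc_maxlen are placeholders.
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="hotpotqa_wiki")):
        config = ColBERTConfig(nbits=2, doc_maxlen=180)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(
            name="hotpotqa_wiki.nbits_whole=2",
            collection="enwiki-20171001-pages-meta-current-withlinks-abstracts.tsv",
        )

The log from my run: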

[Sep 22, 12:24:12] #> Loading collection...
0M 1M 2M 3M 4M 5M
/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
/Search-in-the-Chain/ColBERT/colbert/utils/amp.py:12: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler()
[Sep 22, 12:24:35] [0]           # of sampled PIDs = 400959      sampled_pids[:3] = [3494860, 85305, 2505172]
[Sep 22, 12:24:37] [0]           #> Encoding 400959 passages..
/Search-in-the-Chain/ColBERT/colbert/utils/amp.py:15: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
[Sep 22, 12:31:14] [0]           avg_doclen_est = 60.502357482910156     len(local_sample) = 400,959
[Sep 22, 12:31:21] [0]           Creaing 262,144 partitions.
[Sep 22, 12:31:21] [0]           *Estimated* 316,628,741 embeddings.
[Sep 22, 12:31:21] [0]           #> Saving the indexing plan to /Search-in-the-Chain/ColBERT/experiments/hotpotqa_wiki/indexes/hotpotqa_wiki.nbits_whole=2/plan.json ..
/Search-in-the-Chain/ColBERT/colbert/indexing/collection_indexer.py:243: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sub_sample = torch.load(sub_sample_path)
Clustering 24208964 points in 128D to 262144 clusters, redo 1 times, 20 iterations
  Preprocessing in 1.98 s
Iteration 0 (8805.82 s, search 8799.72 s): objective=7.09834e+06 imbalance=1.725 nsplit=0  
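One more data point: the "Clustering 24208964 points ..." line above is FAISS k-means. My guess is that if only faiss-cpu is installed, this step runs entirely on the CPU, which would match the low GPU utilization and the almost 2.5 hours spent on the first iteration. A quick check (unverified on my side):

# Guess, not a confirmed diagnosis: ColBERT hands the partition clustering to
# FAISS, so a CPU-only FAISS build would make this step very slow.
import torch
import faiss

print("torch sees CUDA:", torch.cuda.is_available())
# faiss-cpu builds do not expose get_num_gpus at all.
print("faiss GPUs:", faiss.get_num_gpus() if hasattr(faiss, "get_num_gpus") else 0)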
xmanners commented 3 days ago

Maybe a tip for the corpus: in the Citation part, the author mentioned that the corpus for the other datasets is the same as the one used in Dense Passage Retrieval and another paper. You can try searching for DPR on GitHub; that repo provides downloads for its passage corpus. (But I'm not sure it's the same.)
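If it really is the DPR corpus, something like the sketch below might work for fetching DPR's Wikipedia passage split and reshaping it into the pid / passage / title TSV layout that ColBERT expects. The download URL and the column layout (header row id, text, title) are from my memory of the facebookresearch/DPR README, so please double-check there first.

# Sketch only: download DPR's psgs_w100 split and convert it to a ColBERT-style
# collection TSV (pid \t passage \t title). URL and column order are assumptions
# based on my recollection of the DPR repo.
import csv
import gzip
import urllib.request

DPR_URL = "https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz"
urllib.request.urlretrieve(DPR_URL, "psgs_w100.tsv.gz")

with gzip.open("psgs_w100.tsv.gz", "rt", encoding="utf-8") as src, \
        open("dpr_collection.tsv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the header row (id, text, title)
    for pid, row in enumerate(reader):
        doc_id, text, title = row[0], row[1], row[2]
        writer.writerow([pid, text, title])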