westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Downloading pretraining dataset from huggingface #43

Open rubenweitzman opened 3 months ago

rubenweitzman commented 3 months ago

Hi, thanks for providing the pre-training database with Foldseek tokens! I'm having difficulty downloading the dataset and using it with Hugging Face functions. I'm trying:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("westlake-repl/AF2_UniRef50")

# Load the train split of the dataset
train_dataset = dataset["train"]

but I get this error:

if not module_name:
    raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in westlake-repl/AF2_UniRef50

What is the proper way to load the dataset from Hugging Face, then?

LTEnjoy commented 3 months ago

Hi,

AF2_UniRef50 is stored in LMDB format, which is why load_dataset finds no supported data files. To load it, first download the files and then open them with the lmdb package.
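For the download step, one option is snapshot_download from the huggingface_hub package. The following is only a sketch: it assumes the files sit in a dataset repo under the id used in the question, and the helper names download_af2_uniref50 and split_dir, as well as the default local folder, are made up for illustration. The download can take a while.

```python
import os


def download_af2_uniref50(local_dir="AF2_UniRef50"):
    """Download the raw LMDB files from the Hub and return the local folder."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # Assumption: the files live in a *dataset* repo under this id.
    return snapshot_download(
        repo_id="westlake-repl/AF2_UniRef50",
        repo_type="dataset",
        local_dir=local_dir,
    )


def split_dir(local_dir, split="train"):
    """Path of one split; the 'train' folder name follows the example path in this thread."""
    return os.path.join(local_dir, split)


if __name__ == "__main__":
    root = download_af2_uniref50()
    print(split_dir(root))
```

The path returned by split_dir is what you would then pass to lmdb.open.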

Here is an example of how to read samples:

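import json
import lmdb

lmdb_dir = "/your/path/to/AF2_UniRef50/train"
with lmdb.open(lmdb_dir, readonly=True).begin() as txn:
    # The number of samples is stored under the key "length".
    length = int(txn.get("length".encode()).decode())
    for i in range(length):
        # Keys are bytes, so the integer index is converted to a string first.
        data_str = txn.get(str(i).encode()).decode()
        data = json.loads(data_str)
        print(data)
        break  # stop after the first sample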
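If you want to go through more than the first sample, the loop above can be wrapped in a generator so that records are decoded one at a time. This is a rough sketch, not the repository's own loader: the names index_key, decode_record and iter_records are hypothetical, and it assumes each sample is stored as JSON under its index written as a string.

```python
import json


def index_key(i):
    """Assumed key layout: the sample index as a decimal string, encoded to bytes."""
    return str(i).encode()


def decode_record(raw):
    """Turn the raw bytes stored in LMDB into a Python object."""
    return json.loads(raw.decode())


def iter_records(lmdb_dir, limit=None):
    """Yield decoded samples one at a time, stopping after `limit` if given."""
    import lmdb  # third-party package, imported lazily

    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            length = int(txn.get(b"length").decode())
            n = length if limit is None else min(limit, length)
            for i in range(n):
                yield decode_record(txn.get(index_key(i)))
    finally:
        # Release the environment even if iteration stops early.
        env.close()
```

For example, `for rec in iter_records("/your/path/to/AF2_UniRef50/train", limit=3): print(rec)` would print the first three samples.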

Hope this resolves your problem :)