No DB_clu.tsv in PepMD Dataset

yaledeus / FBM

Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides

MIT License


Thanks for the great work!

I am trying to curate the PepMD dataset, but the post_process step fails to run.

https://github.com/yaledeus/FBM/blob/0d9f1e3065cc0a37839fa2f253ed843e57546369/data/download.py#L172

The failure comes from the command "mmseqs createtsv ./tmp/DB ./tmp/DB ./tmp/DB_clu ./tmp/DB_clu.tsv" in https://github.com/yaledeus/FBM/blob/0d9f1e3065cc0a37839fa2f253ed843e57546369/data/cluster.py#L40, which outputs:

No datafile could be found for ./tmp/DB_clu!
createtsv ./tmp/DB ./tmp/DB ./tmp/DB_clu ./tmp/DB_clu.tsv

MMseqs Version:                   15.6f452
First sequence as representative  false
Target column                     1
Add full header                   false
Sequence source                   0
Database output                   false
Threads                           64
Compressed                        0
Verbosity                         3

Also, how much compute does the MD simulation require?

Thanks for your recognition!

Honestly, I have not encountered the issue you mention. If the following code (lines 32-33 in cluster.py) executes successfully, the file ./tmp/DB_clu will be created.

cmd = f'mmseqs cluster {db} {db_clustered} {tmp_dir} --min-seq-id 0.6'  # similarity > 0.6 in the same cluster
res = exec_mmseq(cmd)
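If the cluster step fails silently, the later createtsv call is the first place an error shows up. A minimal sketch of a fail-fast wrapper is below. Note that run_step is a hypothetical helper written for illustration; it is not the repository's exec_mmseq, whose implementation is not shown in this thread.

```python
import shlex
import subprocess


def run_step(cmd: str) -> str:
    """Run one shell-style command and return its stdout.

    Hypothetical helper (not the repo's exec_mmseq): raises RuntimeError
    with stderr attached when the exit code is non-zero, so a failed step
    is reported where it happens rather than in a later command.
    """
    proc = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"command failed ({proc.returncode}): {cmd}\n{proc.stderr}"
        )
    return proc.stdout
```

With this, something like run_step(f'mmseqs cluster {db} {db_clustered} {tmp_dir} --min-seq-id 0.6') would stop the script immediately if clustering did not finish, assuming mmseqs signals failure through its exit code.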

Before you run download.py, please check that you have downloaded the PDB sequence file from here and unzipped it to obtain pdb_seqres.fasta.
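A quick way to confirm the input file is in place before running the pipeline is sketched below. check_fasta is a hypothetical helper, and the location of pdb_seqres.fasta is an assumption; adjust the path to wherever download.py expects it.

```python
from pathlib import Path


def check_fasta(path: str) -> None:
    """Raise a descriptive error unless `path` looks like a usable FASTA file.

    Hypothetical pre-flight check: the file must exist and its first line
    must be a FASTA header (starting with '>').
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p} not found; download and unzip it first")
    with p.open("r") as fh:
        first = fh.readline()
    if not first.startswith(">"):
        raise ValueError(f"{p} does not start with a FASTA header line ('>')")
```

Calling check_fasta("pdb_seqres.fasta") before download.py would catch a missing, empty, or still-compressed file early.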

For the MD simulations, each peptide takes around 1 h on one GPU (less than 500 MB of CUDA memory is occupied). So if you want to reproduce the training and test sets used in our work (136 + 14 = 150 peptides), you'll need about a week to generate the data. Of course, if you run the simulations in parallel on multiple CPUs, it might be much faster (I haven't tested this yet).
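The estimate above is simple arithmetic, sketched here so it can be adapted to other setups. The 1 h per peptide figure and the 150-peptide count come from the reply; the n_workers parameter and the assumption of perfect parallel scaling are illustrative only and untested.

```python
import math


def wall_clock_hours(n_peptides: int, hours_per_peptide: float,
                     n_workers: int = 1) -> float:
    """Estimate wall-clock time, assuming each worker runs one simulation
    at a time and all simulations take the same time (idealized scaling)."""
    batches = math.ceil(n_peptides / n_workers)
    return batches * hours_per_peptide


total_hours = wall_clock_hours(136 + 14, 1.0)  # 150.0 hours on one GPU
total_days = total_hours / 24                  # 6.25 days, i.e. about a week
```

Under the same idealized assumption, four workers would need wall_clock_hours(150, 1.0, n_workers=4), which is 38 hours.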

Wish you the best!


No DB_clu.tsv in PepMD Dataset #1