zjshi / gt-pro

MIT License
23 stars 7 forks

Build customized database problem #43

Closed wjj666 closed 2 years ago

wjj666 commented 2 years ago

Hi,

As mentioned in the GT-Pro paper: "Database building ran for about 20 d with the following peak use of computer resources: CPU, 72 cores; RAM, 976 GB; and storage, 15 TB." It really consumes a lot of computing resources and takes a long time.

I want to build a large GT-Pro database (about 1,000 species, 180,000 genomes). Is it possible to split all species into several parts, run the GT_Pro build command for each part on a batch submission system, and then merge all the db.bin files together?

zjshi commented 2 years ago

Hi wjj666, thanks for using GT-Pro! Unfortunately, the split-and-merge strategy may not work because of a specificity-checking step in which GT-Pro screens each k-mer from a species against the k-mers from all other species. Splitting the species set would create some "blind spots" for specificity checking.
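The specificity-checking idea can be sketched as follows. This is a minimal illustration of the principle, not GT-Pro's actual code: a k-mer is species-specific only if no other species carries it, so two independently built batches cannot screen against each other.

```python
# Minimal sketch (not GT-Pro's actual implementation) of specificity
# checking: keep, per species, only the k-mers seen in no other species.
# If species are split into independent batches, each batch can only
# screen against itself, creating the "blind spots" described above.

def specific_kmers(kmers_by_species):
    """Return, for each species, the k-mers unique to that species."""
    # Count how many species each k-mer occurs in.
    seen = {}
    for sp, kmers in kmers_by_species.items():
        for k in set(kmers):
            seen[k] = seen.get(k, 0) + 1
    return {
        sp: {k for k in kmers if seen[k] == 1}
        for sp, kmers in kmers_by_species.items()
    }

full = {"A": ["ACGT", "TTTT"], "B": ["ACGT", "GGGG"]}
print(specific_kmers(full))  # "ACGT" is shared, so it is dropped from both A and B
```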

zjshi commented 2 years ago

Building a large customized database is not a trivial task, but it is doable. The documented resource figures were peaks in our case; actual use could be lower with planned runs and careful data handling. If you are interested, please let me know what hardware resources you have access to. I am happy to provide step-by-step help.

wjj666 commented 2 years ago

Hi zjshi, thanks for your reply. I want to run GT_Pro build on a batch submission system. The big-memory node has 1 TB of RAM and 128 CPUs, and the storage space is sufficient, but this node can only be used for a maximum of 7 days at a time. Is it possible to split the GT_Pro build task into sub-tasks so that each sub-task takes less than 7 days?

zjshi commented 2 years ago

GT_Pro build has 3 steps: k-mer extraction, uniqueness validation, and data structure building. The first two steps can be run per species in parallel, and the last step is resource-efficient. So it is possible to break the building process down to run on a batch submission system, but that requires minor modifications to the src/build_db.py script. I am considering providing a new script to enable running each step separately, so please stay tuned.
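The batching idea could be sketched like this. It is purely illustrative, assuming the per-species steps can be invoked separately (which, as noted above, would require modifying src/build_db.py): partition the species list into chunks, each submitted as one cluster job that fits the wall-time limit.

```python
# Hypothetical sketch: partition the species list into batches so that the
# per-species steps (k-mer extraction, uniqueness validation) can each be
# submitted as a separate cluster job within the 7-day wall time. The
# final data-structure-building step would then run once over all results.

def batches(species, per_job):
    """Split the species list into chunks of at most per_job species."""
    return [species[i:i + per_job] for i in range(0, len(species), per_job)]

jobs = batches([f"species_{i}" for i in range(10)], per_job=4)
print(len(jobs))  # 3 jobs, of sizes 4, 4, and 2
```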

wjj666 commented 2 years ago

Thank you so much and looking forward to your updates.

wjj666 commented 2 years ago

Hi zjshi. Now I am at the merge_build(path_objs, output_dir, dbname) step. After running the command db_build 'ls ./final_build/*allowed*tsv | sort ' > ./my_db.bin, it generated the bin file my_db.bin but reported this error:

ERROR:  Multispecies kmers found!   There is possibly a problem with the upstream sckmerdb_allowed.tsv generator.
db_build: ./src/db_build.cpp:538: void multi_btc64(int, const char**): Assertion `false && "multispecies kmers present in input"' failed.
Aborted

Is there any problem with my data or the previous building steps? Could you please help me fix this error? Thank you very much.

zjshi commented 2 years ago

Would it be possible to share the allowed tsv files with me? Thanks!

wjj666 commented 2 years ago

This is the Google Drive link for the 23 allowed tsv files I used: https://drive.google.com/drive/folders/1Sq8SOlPGZV_oqnrZe8o1Iq1xv5MeNWk7?usp=sharing Thanks for your help.

zjshi commented 2 years ago

I performed a quick check on your files and did find 11,914 multispecies k-mers. Here is a complete list of these k-mers: https://drive.google.com/file/d/1Vnu0h571GZgA7jDtf3TEowiAzPmReGEn/view?usp=sharing. So the easy fix is to exclude them from each of the sckmer_allowed.tsv files.
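The suggested fix could be scripted roughly as follows. This is a minimal sketch with assumed file layouts: it treats the first tab-separated column of each sckmer_allowed.tsv as the k-mer, and assumes the blacklist file lists one offending k-mer per line.

```python
# Hypothetical helper to drop blacklisted multispecies k-mers from a
# sckmer_allowed.tsv file. Assumes the k-mer is the first tab-separated
# column; adjust the column index if the real layout differs.

def filter_tsv(tsv_path, blacklist_path, out_path):
    """Write tsv rows whose k-mer is not blacklisted; return rows kept."""
    with open(blacklist_path) as f:
        bad = {line.strip() for line in f if line.strip()}
    kept = 0
    with open(tsv_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.split("\t", 1)[0] not in bad:
                fout.write(line)
                kept += 1
    return kept
```

Running it over all 23 files and rebuilding with db_build should then avoid the multispecies-kmer assertion.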

zjshi commented 2 years ago

To help me further understand this issue, could you please also share your genome files? I was also wondering how you ran each step, but never mind if the workflow was run as a whole. Thanks a lot!

wjj666 commented 2 years ago

Thanks for your help.

There are still other sckmer_allowed.tsv files; could you please share the code you used to identify multispecies k-mers?

Here are the genome files corresponding to the 23 sckmer_allowed.tsv files: https://drive.google.com/drive/folders/1caEbHxrwKsDuFl8PDULf0a1eMBOEMnja?usp=sharing

The input files for GT_Pro build were generated by CallM with the command: CallM genomes --fna-dir {} --out-dir {} --threads {} --snp-freq 0.01 --min-prev 0.9 --keep-redundancy

The GT_Pro build command: GT_Pro build --in build.list --out {} --dbname {} --threads {} --overwrite

zjshi commented 2 years ago

Thanks for providing all the genomes! They led me to a small bug disrupting uniqueness validation. The bug is now fixed. Please give it another try and let me know if it also works on your end.

wjj666 commented 2 years ago

Thank you very much! The database customization job has been completed. But I have some other questions:

  1. How can I classify all SNPs as bi-, tri-, or quad-allelic, as well as synonymous or nonsynonymous?
  2. Are the SNPs in the customized database all biallelic?
  3. Is there anything else I need to do before metagenotyping samples?

The GT_Pro optimize command outputs four files:

 mydb_optimized_db_snps.bin
 mydb_optimized_db_kmer_index.bin
 mydb_optimized_db_mmer_bloom_36.bin
 mydb_optimized_db_lmer_index_32.bin

but with some warnings and errors:

[WARNING] found zero hits for the 0 reads input from /dev/stdin
[ERROR] Error writing output /dev/stdout cl 1067
[ERROR] Failed to write to output file
*** Failed for ALL 1 input files. ***

Could you please help me fix the errors?

zjshi commented 2 years ago

Excited that you got it to work!

  1. How can I classify all SNPs as bi-, tri-, or quad-allelic, as well as synonymous or nonsynonymous? GT-Pro does not classify SNPs; it detects alleles of the SNPs in the database. You may classify SNPs using the VCF file.

  2. Are the SNPs in the customized database all biallelic? Yes, that is correct.

  3. Is there anything else I need to do before metagenotyping samples? Nothing extra, but make sure the tutorial is followed. In your case, you will also want to insert your customized database into the command lines wherever a database is required.

but with some warnings and errors: Could you please help me reproduce the error by letting me know which command lines you ran?
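Classifying SNPs by allele count from a VCF, as suggested above, could look roughly like this. The column layout follows the VCF specification (REF, then comma-separated ALT alleles); calling SNPs synonymous or nonsynonymous additionally requires mapping each position onto gene coding frames with a separate annotator, which is not shown here.

```python
# Rough sketch: classify a VCF record by allele count. In VCF, one REF
# allele plus N comma-separated ALT alleles gives N+1 alleles in total.

def allele_class(ref, alt):
    """Label a VCF record as bi-, tri-, or quad-allelic."""
    n = 1 + len(alt.split(","))  # REF plus each comma-separated ALT allele
    return {2: "biallelic", 3: "triallelic", 4: "quadallelic"}.get(n, "other")

print(allele_class("A", "G"))      # biallelic
print(allele_class("A", "G,T"))    # triallelic
print(allele_class("A", "G,T,C"))  # quadallelic
```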

wjj666 commented 2 years ago

I tested the optimize command with the database customized with the previous 23 species clusters. The optimize command: GT_Pro optimize --db /***/my_db22/db_2 --in /***/my_db22/SRR413665_2.fastq.gz --tmp /***/my_db22/optimize_tmp

The log:

[OK] start initial optimization
[OK] database found
[OK] optimize from /***/my_db22/db_2.bin
[OK] initial optimization done
[OK] finalize optimization with a break-in test

Error: the following returned non-zero status: 'gt_pro -d /***/my_db22/db_2 /***/snp/my_db22/SRR413665_2.fastq.gz -o /***/my_db22/optimize_tmp//breakin ':

b'gt_pro\t/***/my_db22/db_2\t20\tno_overwrite\n1643952548921:  [Info] Starting to load DB: /***/my_db22/db_2\n1643952548922:  [Info] MMAPPING /***/my_db22/db_2_optimized_db_snps.bin\n1643952548961:  [Info] MMAPPING /***/my_db22/db_2_optimized_db_kmer_index.bin\n1643952549132:  [Info] Using -l 32 -m 36 as optimal for system RAM\n1643952549132:  [Info] MMAPPING /***/my_db22/db_2_optimized_db_mmer_bloom_36.bin\n1643952551311:  [Info] MMAPPING /***/my_db22/db_2_optimized_db_lmer_index_32.bin\n1643952556341:  [Info] Done with init for optimized DB with 109063704 kmers.  That took 7 seconds.\n1643952556551:  [Info] Waiting for all readers to quiesce\n1643952558708:  [Done] searching is completed for the 600000 reads input from /***/my_db22/SRR413665_2.fastq.gz\n1643952558723:  [Stats] 20960 snps, 600000 reads, 1.08 hits/snp, for /***/my_db22/SRR413665_2.fastq.gz\n1643952558738:  [ERROR] Error writing output /***/my_db22/optimize_tmp//breakin.tsv.gz cl 1067\n1643952558745:  0.6 million reads were scanned after 2 seconds\n*** Failed for ALL 1 input files. ***\n1643952558747:  Totally done: 2 seconds elapsed processing reads, after DB was loaded.\n'

There was a breakin.err file in the optimize_tmp dir:

 $ cat breakin.err
[ERROR] Failed to write to output file.
zjshi commented 2 years ago

Thanks! Could you please try this slightly different command line to see what will happen:

GT_Pro optimize --db /***/my_db22/db_2 --in <(gunzip -dc /***/my_db22/SRR413665_2.fastq.gz) --tmp /***/my_db22/optimize_tmp

wjj666 commented 2 years ago
 $ GT_Pro optimize --db /***/snp/my_db22/db_2 --in <(gunzip -dc /***/my_db22/SRR413665_2.fastq.gz) --tmp /***/my_db22/optimize_tmp
[OK] start initial optimization
[OK] database found
[OK] optimize from /***/my_db22/db_2.bin
[OK] initial optimization done
[OK] finalize optimization with a break-in test
Traceback (most recent call last):
  File "***/gt-pro-master/scripts/optimize.py", line 99, in <module>
    main()
  File "***/gt-pro-master/scripts/optimize.py", line 95, in main
    breakin_test(dbname, inpath, tmp_dir)
  File "***/gt-pro-master/scripts/optimize.py", line 69, in breakin_test
    assert os.path.isfile(inpath)
AssertionError

I tried printing the inpath: it was /dev/fd/63
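This explains the assertion failure: bash process substitution (`<(...)`) hands the script a pipe path like /dev/fd/63, and os.path.isfile() returns False for anything that is not a regular file, so the assert in scripts/optimize.py trips. A small demonstration using a named pipe (FIFO):

```python
# Demonstration: os.path.isfile() is False for a pipe, even though the
# path exists. This is why optimize.py's assert fails on /dev/fd/63.
import os
import tempfile

d = tempfile.mkdtemp()
fifo = os.path.join(d, "pipe")
os.mkfifo(fifo)  # create a named pipe (POSIX only)

print(os.path.exists(fifo))   # True: the path is there
print(os.path.isfile(fifo))   # False: it is a FIFO, not a regular file

os.remove(fifo)
```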

zjshi commented 2 years ago

Hmm... Which OS and shell do you use? Could you please then try the following instead? Thanks.

gunzip -dc /***/my_db22/SRR413665_2.fastq.gz > /***/my_db22/SRR413665_2.fastq
GT_Pro optimize --db /***/my_db22/db_2 --in /***/my_db22/SRR413665_2.fastq --tmp /***/my_db22/optimize_tmp

wjj666 commented 2 years ago

Sorry, errors still occurred during gt_pro -d db_2 SRR413665_2.fastq -o optimize_tmp//breakin

Running GT_Pro optimize --db db_2 --in SRR413665_2.fastq --tmp optimize_tmp after gunzip -dc SRR413665_2.fastq.gz > SRR413665_2.fastq gave:

[OK] start initial optimization
[OK] database found
[OK] optimize from db_2.bin
[OK] initial optimization done
[OK] finalize optimization with a break-in test,SRR413665_2.fastq

Error: the following returned non-zero status: 'gt_pro -d db_2 SRR413665_2.fastq -o optimize_tmp//breakin ':

Then I tried re-running the command as gt_pro genotype -d db_2 SRR413665_2.fastq -o optimize_tmp//breakin; the log is:

gt_pro  db_2    20      no_overwrite
1644023025500:  [Info] Starting to load DB: db_2
1644023025501:  [Info] MMAPPING ./db_2_optimized_db_snps.bin
1644023025534:  [Info] MMAPPING ./db_2_optimized_db_kmer_index.bin
1644023025689:  [Info] Using -l 32 -m 36 as optimal for system RAM
1644023025689:  [Info] MMAPPING ./db_2_optimized_db_mmer_bloom_36.bin
1644023027648:  [Info] MMAPPING ./db_2_optimized_db_lmer_index_32.bin
1644023032856:  [Info] Done with init for optimized DB with 109063704 kmers.  That took 7 seconds.
1644023032973:  [Info] Waiting for all readers to quiesce
1644023032974:  [ERROR] Failed to read past position 0 in presumed FASTQ file genotype
genotype: No such file or directory
1644023032974:  [Done] searching is completed for the 0 reads input from genotype
1644023033973:  [Done] searching is completed for the 600000 reads input from SRR413665_2.fastq
1644023033982:  [Stats] 20960 snps, 600000 reads, 1.08 hits/snp, for SRR413665_2.fastq
1644023033983:  [ERROR] Error writing output optimize_tmp//breakin.tsv cl 1067
1644023033992:  0.6 million reads were scanned after 1 seconds
*** Failed for ALL 2 input files. ***
1644023033993:  Totally done: 1 seconds elapsed processing reads, after DB was loaded.

The fastq file was downloaded from https://github.com/zjshi/gt-pro/raw/master/test/SRR413665_2.fastq.gz

$ cat /etc/issue
CentOS release 6.8 (Final)
 $ gcc --version
gcc (GCC) 5.2.0
 $ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
zjshi commented 2 years ago

The program gt_pro does not take a sub-command like genotype, unlike the main program GT_Pro. So could you please type gt_pro -d db_2 SRR413665_2.fastq -o optimize_tmp//breakin instead?

wjj666 commented 2 years ago

Sorry, it's the same write error:

 $ gt_pro -d db_2 SRR413665_2.fastq -o optimize_tmp//breakin
gt_pro  db_2    20      no_overwrite
1644049266788:  [Info] Starting to load DB: db_2
1644049266793:  [Info] MMAPPING ./db_2_optimized_db_snps.bin
1644049267028:  [Info] MMAPPING ./db_2_optimized_db_kmer_index.bin
1644049267577:  [Info] Using -l 32 -m 36 as optimal for system RAM
1644049267577:  [Info] MMAPPING ./db_2_optimized_db_mmer_bloom_36.bin
1644049272659:  [Info] MMAPPING ./db_2_optimized_db_lmer_index_32.bin
1644049325127:  [Info] Done with init for optimized DB with 109063704 kmers.  That took 58 seconds.
1644049325258:  [Info] Waiting for all readers to quiesce
1644049326501:  [Done] searching is completed for the 600000 reads input from SRR413665_2.fastq
1644049326513:  [Stats] 20960 snps, 600000 reads, 1.08 hits/snp, for SRR413665_2.fastq
1644049326515:  [ERROR] Error writing output optimize_tmp//breakin.tsv cl 1067
1644049326525:  0.6 million reads were scanned after 1 seconds
*** Failed for ALL 1 input files. ***
1644049326526:  Totally done: 1 seconds elapsed processing reads, after DB was loaded.
zjshi commented 2 years ago

OK, three questions:

  1. Could you find the actual output file even after seeing the error message — in your case, optimize_tmp//breakin.tsv?
  2. Are you running it on an HPC?
  3. Would you please try it one more time with gt_pro -d db_2 -t 1 SRR413665_2.fastq -o optimize_tmp//breakin?

Sorry for asking you to do so many trials! The problem is that I couldn't reproduce the error on my end, on either Ubuntu or CentOS.

wjj666 commented 2 years ago
  1. breakin.tsv was not in the optimize_tmp dir. There was only a breakin.err in optimize_tmp:
    $ cat optimize_tmp/breakin.err
    [ERROR] Failed to write to output file.
  2. Yes, it was running on an HPC. The job was submitted via the qsub command.
  3. gt_pro -d db_2 -t 1 SRR413665_2.fastq -o optimize_tmp//breakin also reported the same error.

Many thanks for your time and patience.

zjshi commented 2 years ago

Thanks for all the information! I am getting a clearer picture now. In this case (Error writing output *** cl 1067), GT-Pro completed metagenotyping and closed the output writer with no apparent error, but then threw the error message because the file it had just finished writing did not exist. I just tested it on an HPC but still failed to reproduce the error. I am wondering whether it is due to some specific OS/disk settings on the HPC where you have been running GT-Pro. Sorry to ask, but would it be possible for you to try another computing environment?

zjshi commented 2 years ago

Hi wjj666, after working with another user, I found a plausible cause and solution: https://github.com/zjshi/gt-pro/issues/44. I will close this issue for now. Feel free to reopen it if the problem persists.