zjshi / gt-pro

MIT License
23 stars 7 forks source link

GT-pro optimize returned Error #44

Closed boaty closed 2 years ago

boaty commented 2 years ago

GT-pro version: 1.0.1 python version: 3.10.1

Hi,

While running GT-pro optimise with example fastq file SRR413665_2.fastq.gz, program returned those error messages as below.

bash /home/Desktop/tools/gt-pro/GT_Pro optimize --db /data/GTpro_database/20190723_881species --in /home/Desktop/tools/gt-pro/test/SRR413665_2.fastq.gz

[OK] start initial optimization [OK] database found [OK] optimize from /data/GTpro_database/20190723_881species_optimized_db_kmer_index.bin and /data/GTpro_database/20190723_881species_optimized_db_snps.bin

Error: the following returned non-zero status: 'gt_pro -d /data/GTpro_database/20190723_881species </dev/null':

b"gt_pro\t/data/GTpro_database/20190723_881species\t160\tno_overwrite\n1641794108700: [Info] Starting to load DB: /data/GTpro_database/20190723_881species\n1641794108700: [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_snps.bin\n1641794108825: [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_kmer_index.bin\n1641794109266: [Info] Using -l 32 -m 36 as optimal for system RAM\n1641794109266: [ERROR] Failed to MMAP /data/GTpro_database/20190723_881species_optimized_db_mmer_bloom_36.bin. This is fine, but init will be slower as we recreate this file.\n1641794112949: [ERROR] Failed to MMAP /data/GTpro_database/20190723_881species_optimized_db_lmer_index_32.bin. This is fine, but init will be slower as we recreate this file.\n1641794128674: Recomputing bloom index and/or filter.\n1641794140181: Processed almost 20 million kmers (0.7 percent). Expect to complete processing in 74.4 more minutes.\n1641794150333: Processed almost 70 million kmers (2.4 percent). Expect to complete processing in 27.6 more minutes.\n1641794161663: Processed almost 130 million kmers (4.5 percent). Expect to complete processing in 18.5 more minutes.\n1641794172327: Processed almost 180 million kmers (6.3 percent). Expect to complete processing in 15.7 more minutes.\n1641794183205: Processed almost 230 million kmers (8 percent). Expect to complete processing in 14.1 more minutes.\n1641794194369: Processed almost 280 million kmers (9.8 percent). Expect to complete processing in 13.1 more minutes.\n1641794206090: Processed almost 330 million kmers (11.5 percent). Expect to complete processing in 12.4 more minutes.\n1641794218254: Processed almost 380 million kmers (13.3 percent). Expect to complete processing in 11.8 more minutes.\n1641794228453: Processed almost 420 million kmers (14.7 percent). Expect to complete processing in 11.5 more minutes.\n1641794239143: Processed almost 460 million kmers (16.1 percent). Expect to complete processing in 11.3 more minutes.\n1641794249676: Processed almost 500 million kmers (17.5 percent). Expect to complete processing in 11 more minutes.\n1641794260581: Processed almost 540 million kmers (18.9 percent). Expect to complete processing in 10.8 more minutes.\n1641794271693: Processed almost 580 million kmers (20.3 percent). Expect to complete processing in 10.6 more minutes.\n1641794282176: Processed almost 620 million kmers (21.7 percent). Expect to complete processing in 10.4 more minutes.\n1641794293146: Processed almost 660 million kmers (23.1 percent). Expect to complete processing in 10.2 more minutes.\n1641794304270: Processed almost 700 million kmers (24.5 percent). Expect to complete processing in 10 more minutes.\n1641794316749: Processed almost 750 million kmers (26.2 percent). Expect to complete processing in 9.7 more minutes.\n1641794328345: Processed almost 790 million kmers (27.6 percent). Expect to complete processing in 9.5 more minutes.\n1641794339726: Processed almost 830 million kmers (29 percent). Expect to complete processing in 9.3 more minutes.\n1641794350905: Processed almost 870 million kmers (30.4 percent). Expect to complete processing in 9.2 more minutes.\n1641794362199: Processed almost 910 million kmers (31.8 percent). Expect to complete processing in 9 more minutes.\n1641794373799: Processed almost 950 million kmers (33.2 percent). Expect to complete processing in 8.8 more minutes.\n1641794385248: Processed almost 990 million kmers (34.6 percent). Expect to complete processing in 8.6 more minutes.\n1641794396813: Processed almost 1030 million kmers (36 percent). Expect to complete processing in 8.5 more minutes.\n1641794407824: Processed almost 1070 million kmers (37.4 percent). Expect to complete processing in 8.3 more minutes.\n1641794419620: Processed almost 1110 million kmers (38.8 percent). Expect to complete processing in 8.1 more minutes.\n1641794430999: Processed almost 1150 million kmers (40.2 percent). Expect to complete processing in 7.9 more minutes.\n1641794442342: Processed almost 1190 million kmers (41.6 percent). Expect to complete processing in 7.7 more minutes.\n1641794454087: Processed almost 1230 million kmers (43 percent). Expect to complete processing in 7.6 more minutes.\n1641794465671: Processed almost 1270 million kmers (44.4 percent). Expect to complete processing in 7.4 more minutes.\n1641794477049: Processed almost 1310 million kmers (45.8 percent). Expect to complete processing in 7.2 more minutes.\n1641794488488: Processed almost 1350 million kmers (47.2 percent). Expect to complete processing in 7 more minutes.\n1641794500294: Processed almost 1390 million kmers (48.6 percent). Expect to complete processing in 6.8 more minutes.\n1641794511983: Processed almost 1430 million kmers (50 percent). Expect to complete processing in 6.7 more minutes.\n1641794524046: Processed almost 1480 million kmers (51.8 percent). Expect to complete processing in 6.4 more minutes.\n1641794534856: Processed almost 1520 million kmers (53.2 percent). Expect to complete processing in 6.2 more minutes.\n1641794546391: Processed almost 1560 million kmers (54.6 percent). Expect to complete processing in 6 more minutes.\n1641794557507: Processed almost 1600 million kmers (56 percent). Expect to complete processing in 5.8 more minutes.\n1641794569147: Processed almost 1640 million kmers (57.4 percent). Expect to complete processing in 5.6 more minutes.\n1641794580714: Processed almost 1680 million kmers (58.8 percent). Expect to complete processing in 5.5 more minutes.\n1641794592404: Processed almost 1720 million kmers (60.2 percent). Expect to complete processing in 5.3 more minutes.\n1641794604387: Processed almost 1760 million kmers (61.6 percent). Expect to complete processing in 5.1 more minutes.\n1641794615273: Processed almost 1800 million kmers (63 percent). Expect to complete processing in 4.9 more minutes.\n1641794626564: Processed almost 1840 million kmers (64.4 percent). Expect to complete processing in 4.7 more minutes.\n1641794638114: Processed almost 1880 million kmers (65.8 percent). Expect to complete processing in 4.5 more minutes.\n1641794649993: Processed almost 1920 million kmers (67.2 percent). Expect to complete processing in 4.3 more minutes.\n1641794661917: Processed almost 1960 million kmers (68.6 percent). Expect to complete processing in 4.2 more minutes.\n1641794672954: Processed almost 2000 million kmers (70 percent). Expect to complete processing in 4 more minutes.\n1641794684261: Processed almost 2040 million kmers (71.4 percent). Expect to complete processing in 3.8 more minutes.\n1641794696084: Processed almost 2080 million kmers (72.8 percent). Expect to complete processing in 3.6 more minutes.\n1641794708069: Processed almost 2120 million kmers (74.2 percent). Expect to complete processing in 3.4 more minutes.\n1641794718497: Processed almost 2160 million kmers (75.6 percent). Expect to complete processing in 3.2 more minutes.\n1641794729248: Processed almost 2200 million kmers (77 percent). Expect to complete processing in 3 more minutes.\n1641794740934: Processed almost 2240 million kmers (78.4 percent). Expect to complete processing in 2.8 more minutes.\n1641794753155: Processed almost 2280 million kmers (79.8 percent). Expect to complete processing in 2.7 more minutes.\n1641794764648: Processed almost 2320 million kmers (81.2 percent). Expect to complete processing in 2.5 more minutes.\n1641794776426: Processed almost 2360 million kmers (82.6 percent). Expect to complete processing in 2.3 more minutes.\n1641794788664: Processed almost 2400 million kmers (84 percent). Expect to complete processing in 2.1 more minutes.\n1641794800753: Processed almost 2440 million kmers (85.4 percent). Expect to complete processing in 1.9 more minutes.\n1641794812489: Processed almost 2480 million kmers (86.8 percent). Expect to complete processing in 1.7 more minutes.\n1641794824488: Processed almost 2520 million kmers (88.2 percent). Expect to complete processing in 1.5 more minutes.\n1641794836717: Processed almost 2560 million kmers (89.6 percent). Expect to complete processing in 1.4 more minutes.\n1641794848585: Processed almost 2600 million kmers (91 percent). Expect to complete processing in 1.2 more minutes.\n1641794860498: Processed almost 2640 million kmers (92.4 percent). Expect to complete processing in 1 more minutes.\n1641794872190: Processed almost 2680 million kmers (93.8 percent). Expect to complete processing in 0.8 more minutes.\n1641794884226: Processed almost 2720 million kmers (95.2 percent). Expect to complete processing in 0.6 more minutes.\n1641794896562: Processed almost 2760 million kmers (96.6 percent). Expect to complete processing in 0.4 more minutes.\n1641794908700: Processed almost 2800 million kmers (98 percent). Expect to complete processing in 0.2 more minutes.\n1641794920836: Processed almost 2840 million kmers (99.4 percent). Expect to complete processing in 0 more minutes.\n1641794929357: Done writing /data/GTpro_database/20190723_881species_optimized_db_mmer_bloom_36.bin. That took 3 more seconds.\n1641794945786: Done writing /data/GTpro_database/20190723_881species_optimized_db_lmer_index_32.bin. That took 16 more seconds.\n1641794945786: [Info] Done with init for optimized DB with 2856121626 kmers. That took 837 seconds.\n1641794945786: [Info] Will input reads from stdin and output snps to stdout.\n1641794945786: [Info] Output will appear only after stdin reaches EOF.\nterminate called after throwing an instance of 'std::length_error'\n what(): vector::_M_default_append\nAborted (core dumped)\n"

It seems to be some issue about read length. My fastq files are merged reads for length of 150nt.

I also tired to run genotype with fastq file for testing, but I noticed that for same fastq file, sometimes GT-pro gave me a error file with those words inside.

returns [error] Failed to write to output file.

sometimes it worked well with genotype results . But when it failed, it shows in stout

[ERROR] Error writing output L1EEC240295_SH201905_199_S20_L004_R1_001_kneaddata_pairedgtpro20190723_881species.tsv.gz cl 1067

En plus, in step-by-step tutorial (https://github.com/zjshi/gt-pro/blob/master/ExampleTutorial.md) and GT-pro optimiser's help, attribute: --out is still included, but the GT-pro.sh does not seem to recognize it.

Thanks

zjshi commented 2 years ago

Hmm... GT_Pro is designed to handle any short reads that is no longer than 500nt, so it should work with 150nt reads. It looks like you are experiencing a quite similar problem as https://github.com/zjshi/gt-pro/issues/42, which is new to us. Could you please share your input file (SRR413665_2.fastq.gz) if that is realistic. Thanks!

boaty commented 2 years ago

Thanks,

Sure. We got Metagenome fastq file. The raw fastq was processed by QC(KneadData)+host removal(bowtie2 against human genome) and merged by microbiome helper(function concat_paried_end.pl)

After those treatments, we got a filtered and joined fastq, here's some example.

And noticed that some fastq files were randomly successful while running GT-PRO, means GT-PRO worked sometimes but failed at other times. I do not know how it happened.

@A00133:159:H35K7DSXY:4:1101:5032:1000:N:0:ATTATC#0/1
ATAATGCCGTGAAGTTTACCGGGGAAGGCAGCATTGAGTTCGGGTTTGATGTGCGCGAAGACGGTTTCCTGCATTTCTATGTCACCGACACCGGTTGCGGGATACCGGAGGAACGGTTGGAAGAAATCTTCGGGAATTTTGTCAAACTCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFF

@A00133:159:H35K7DSXY:4:1101:7988:1391:N:0:ATTATC#0/1
AAAAGGATTGGGTAAGGGACTTGATTCATTAATTACAGACAAGGTATCAAAGCCGGTAAAACCAAAGAGCAATCATGCAGCGGATGCTGTAATGATAGATATTAATA
+
FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF,FFFFFFFFFFF,FFF

@A00456:422:HFV2TDSXY:2:1101:1208:8124:N:0:GCTTGG#0/1
ATATACTGCTTGAGCCACTAATTGTTTGTAGTCCTTCATCACTAATATCCTCCTTTAAAAATAAAAACGTCTCTTGCATGGCGTTAAATAACACTTTGCAAGAGACGAGAAATTCCCGCGGTACCACTCTATTTGATCAATATCCACTCAAGAGCAACACAGAACAAAAAGAGCGATTTCACACCCGCCGCGCCCATGCGGGCAAATGGCGGGCAGCTGAAGGAAGCGCGGGCGACCGCGGTATCCTGCGAGTGGGATCCGGCGAAATAGAAGTACAAGGAAAGGAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FF:FFFFF,FFFFFFF,,FFFFF:F:F:FFFFF,FFFFFFFFFFFFFFFFFFFFFFFF,F:FFF:FF:FFFFFFFFFFFFFF,F:FFFFFF,F:FF:FFFFFFFFF:FFF,FF::FFFFFFF:FFFF::F:
zjshi commented 2 years ago

Thank you boaty! I tried a couple of ways but I couldn't reproduce the error you were seeing with three sequences provided. However, I found two empty lines (Line no. 5 and 10) in your FASTQ file, perhaps that is the reason preventing GT_Pro from behaving properly?

boaty commented 2 years ago

Thanks Shi,

I retried GT-pro with the example file SRR413665_2.fastq.gz from GitHub, I noticed that sometimes GT-pro gave me a error with error code 1067.

/home/Desktop/tools/gt-pro/GT_Pro genotype -d /data/GTpro_database/20190723_881species -C ./test SRR413665_2.fastq.gz -t 10 -f
gt_pro  /data/GTpro_database/20190723_881species    10  force_overwrite
1644481038613:  [Info] Starting to load DB: /data/GTpro_database/20190723_881species
1644481038613:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_snps.bin
1644481038868:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_kmer_index.bin
1644481039784:  [Info] Using -l 32 -m 36 as optimal for system RAM
1644481039784:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_mmer_bloom_36.bin
1644481040451:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_lmer_index_32.bin
1644481042738:  [Info] Done with init for optimized DB with 2856121626 kmers.  That took 4 seconds.
1644481042742:  [Info] Forcing recompute for input: SRR413665_2.fastq.gz
1644481042863:  [Info] Waiting for all readers to quiesce
1644481045743:  [Done] searching is completed for the 600000 reads input from SRR413665_2.fastq.gz
1644481045786:  [Stats] 102624 snps, 600000 reads, 1.28 hits/snp, for SRR413665_2.fastq.gz
1644481045796:  [ERROR] Error writing output SRR413665_2__gtpro__20190723_881species.tsv.gz cl 1067
1644481045798:  0.6 million reads were scanned after 3 seconds
*** Failed for ALL 1 input files. ***
1644481045811:  Totally done: 3 seconds elapsed processing reads, after DB was loaded.

then I tried again and finally it returned with results

/home/Desktop/tools/gt-pro/GT_Pro genotype -d /data/GTpro_database/20190723_881species -C ./test SRR413665_2.fastq.gz -t 10 -f
gt_pro  /data/GTpro_database/20190723_881species    10  force_overwrite
1644481164346:  [Info] Starting to load DB: /data/GTpro_database/20190723_881species
1644481164346:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_snps.bin
1644481164600:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_kmer_index.bin
1644481165534:  [Info] Using -l 32 -m 36 as optimal for system RAM
1644481165534:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_mmer_bloom_36.bin
1644481166240:  [Info] MMAPPING /data/GTpro_database/20190723_881species_optimized_db_lmer_index_32.bin
1644481168554:  [Info] Done with init for optimized DB with 2856121626 kmers.  That took 4 seconds.
1644481168641:  [Info] Waiting for all readers to quiesce
1644481171566:  [Done] searching is completed for the 600000 reads input from SRR413665_2.fastq.gz
1644481171616:  [Stats] 102624 snps, 600000 reads, 1.28 hits/snp, for SRR413665_2.fastq.gz
1644481171627:  0.6 million reads were scanned after 3 seconds
1644481171627:  Successfully processed 1 input files containing 600000 reads.
1644481171642:  Totally done: 3 seconds elapsed processing reads, after DB was loaded.

I have no idea why it happened, is it the last one's result usable?

Thanks

zjshi commented 2 years ago

Hi boaty, just a quick question: did you happen to run this case on an HPC environment, if not, which OS and version did you use?

boaty commented 2 years ago

Hi shi, I run the program on our remote server, it's a single HPC not cluster. The server has Linux Mint 20 Ulyana installed with max 160 threads and max 1000G mem.

System: Kernel: 5.4.0-65-generic x86_64 bits: 64 compiler: gcc v: 9.3.0 Desktop: Xfce 4.14.2 tk: Gtk 3.24.13 wm: xfwm4 dm: LightDM Distro: Linux Mint 20 Ulyana base: Ubuntu 20.04 focal

zjshi commented 2 years ago

Hi boaty, thanks for the information! Error cl 1067 is an uncharacterized I/O failure. This is likely an issue specific to some OS/disk settings or status. If GT-Pro does not complain, the results should clean or reliable. In your case, it is good to go. The problem is related to this issue: https://github.com/zjshi/gt-pro/issues/43. As I am still gathering information to reproduce the error on my side, would you please let me know if you changed any of the parameters when you ran it the second time?

boaty commented 2 years ago

Hi Shi,

Thanks for your response. I did not change any parameters for the second run.

Moreover, I re-run the test file, I found somehow the problem is link to "-C" attribute. if I just gave GT-pro the fastq path directly without "-C" to specify a folder, it returned no error.

zjshi commented 2 years ago

Aha, thanks. You helped me a lot, boaty! I think I now understand what the issue is: GT-Pro will by default output to the same directory of input file, so it will raise the error when the input file is in a write restricted location if -C is not specified to redirect the output. So two work-arounds: 1. put input file in the directory where write is allowed and 2. using -C to assign output directory.