yfukasawa / LongQC

LongQC is a tool for the data quality control of the PacBio and ONT long reads.
MIT License

longqc #49

Open chaikhi-soumaya opened 2 years ago

chaikhi-soumaya commented 2 years ago

While running LongQC I got the error below. Can someone tell me what the problem is?

ValueError: truncated quality string in [my path to the fastq file]

yfukasawa commented 2 years ago

Hi, you also raised an issue in #6, correct?

Did you run the runqc subcommand? Can you provide the command you executed? The runqc subcommand will soon be obsolete for PacBio (PacBio will stop providing scraps.bam, which runqc requires). If that is the case, I recommend running the sampleqc subcommand on your fastq file with an appropriate profile (the -x option).

Yoshinori

chaikhi-soumaya commented 2 years ago

Hello, thanks for your reply. Please see the file or the screenshot below: it shows the command I used and the problem I am having.



yfukasawa commented 2 years ago

I can't find any attachment??

chaikhi-soumaya commented 2 years ago

Maybe that is because I replied by email. I will post the full log and the command I used in the next comment.

chaikhi-soumaya commented 2 years ago

(python3) soumayachaikhi@MacBook-de-Soumaya LongQC % python longQC.py sampleqc -x ont-rapid -o ../assemble_quality ../3_2_GB.fastq

longQC:2022-06-21 16:41:05,726:169:INFO:Cmd: longQC.py sampleqc -x ont-rapid -o ../assemble_quality ../3_2_GB.fastq
longQC:2022-06-21 16:41:05,726:233:INFO:Preset "ont-rapid" was applied. Options --pb(--ont) is overwritten.
longQC:2022-06-21 16:41:07,766:306:INFO:Computation of the low complexity region started for a chunk 0
lq_mask:2022-06-21 16:41:09,427:111:INFO:New job was submitted: in->../assemble_quality/analysis/tmp_0.fastq, out->../assemble_quality/analysis/tmp_0.txt
longQC:2022-06-21 16:41:09,435:311:INFO:Adapter search is starting for a chunk 0.
longQC:2022-06-21 16:41:09,436:327:INFO:Computation of the GC fraction started for a chunk 0
lq_utils:2022-06-21 16:41:21,436:380:INFO:list for subsample is not initialized. Initializing now.
lq_adapt:2022-06-21 16:41:22,948:77:INFO:9744 reads were skipped due to their short lengths.
lq_adapt:2022-06-21 16:41:22,949:97:INFO:Adapter Sequence: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA, max identity:0.905660 and the number of trimmed reads: 436
longQC:2022-06-21 16:41:28,389:338:INFO:Adapter search has done for a chunk 0.
longQC:2022-06-21 16:41:28,390:342:INFO:subsample finished for chunk 0.
longQC:2022-06-21 16:41:29,927:306:INFO:Computation of the low complexity region started for a chunk 1
lq_mask:2022-06-21 16:41:33,241:111:INFO:New job was submitted: in->../assemble_quality/analysis/tmp_1.fastq, out->../assemble_quality/analysis/tmp_1.txt
longQC:2022-06-21 16:41:33,246:311:INFO:Adapter search is starting for a chunk 1.
longQC:2022-06-21 16:41:33,246:327:INFO:Computation of the GC fraction started for a chunk 1
lq_adapt:2022-06-21 16:41:45,053:77:INFO:4534 reads were skipped due to their short lengths.
lq_adapt:2022-06-21 16:41:45,056:97:INFO:Adapter Sequence: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA, max identity:0.905660 and the number of trimmed reads: 516
longQC:2022-06-21 16:41:51,539:338:INFO:Adapter search has done for a chunk 1.
longQC:2022-06-21 16:41:51,558:342:INFO:subsample finished for chunk 1.
longQC:2022-06-21 16:41:53,345:306:INFO:Computation of the low complexity region started for a chunk 2
lq_mask:2022-06-21 16:41:57,141:111:INFO:New job was submitted: in->../assemble_quality/analysis/tmp_2.fastq, out->../assemble_quality/analysis/tmp_2.txt
longQC:2022-06-21 16:41:57,141:311:INFO:Adapter search is starting for a chunk 2.
longQC:2022-06-21 16:41:57,142:327:INFO:Computation of the GC fraction started for a chunk 2
lq_adapt:2022-06-21 16:42:09,077:77:INFO:4823 reads were skipped due to their short lengths.
lq_adapt:2022-06-21 16:42:09,078:97:INFO:Adapter Sequence: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA, max identity:0.920000 and the number of trimmed reads: 719
longQC:2022-06-21 16:42:15,219:338:INFO:Adapter search has done for a chunk 2.
longQC:2022-06-21 16:42:15,237:342:INFO:subsample finished for chunk 2.
longQC:2022-06-21 16:42:16,765:306:INFO:Computation of the low complexity region started for a chunk 3
lq_mask:2022-06-21 16:42:20,208:111:INFO:New job was submitted: in->../assemble_quality/analysis/tmp_3.fastq, out->../assemble_quality/analysis/tmp_3.txt
longQC:2022-06-21 16:42:20,208:311:INFO:Adapter search is starting for a chunk 3.
longQC:2022-06-21 16:42:20,209:327:INFO:Computation of the GC fraction started for a chunk 3
lq_adapt:2022-06-21 16:42:31,980:77:INFO:4626 reads were skipped due to their short lengths.
lq_adapt:2022-06-21 16:42:31,982:97:INFO:Adapter Sequence: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA, max identity:0.920000 and the number of trimmed reads: 578
longQC:2022-06-21 16:42:38,062:338:INFO:Adapter search has done for a chunk 3.
longQC:2022-06-21 16:42:38,085:342:INFO:subsample finished for chunk 3.
Traceback (most recent call last):
  File "longQC.py", line 956, in <module>
    main(args)
  File "longQC.py", line 62, in main
    args.handler(args)
  File "longQC.py", line 299, in command_sample
    for (reads, n_seqs, n_bases) in open_seq_chunk(args.input, file_format_code, chunk_size=args.mem*1024**3, is_upper=True):
  File "/Users/soumayachaikhi/Bioinfo/Assemblyproject/LongQC/lq_utils.py", line 65, in open_seq_chunk
    yield from parse_fastx_chunk(fn, chunk_size, is_upper=is_upper)
  File "/Users/soumayachaikhi/Bioinfo/Assemblyproject/LongQC/lq_utils.py", line 269, in parse_fastx_chunk
    for e in f:
  File "pysam/libcfaidx.pyx", line 653, in pysam.libcfaidx.FastxFile.next
ValueError: truncated quality string in ../3_2_GB.fastq
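
The last frame of the traceback is in pysam's FASTQ reader, not in LongQC's own analysis code, which suggests (without proving) that the input file contains a record whose quality string is shorter than its sequence. A minimal sketch for locating such a record is shown below. It is not part of LongQC: the helper names `open_text` and `first_bad_fastq_record` are made up for illustration, and it assumes plain four-line FASTQ records with no line-wrapped sequences.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def first_bad_fastq_record(path):
    """Return (record_number, reason) for the first malformed record, or None.

    Assumes four lines per record: header, sequence, separator, quality.
    """
    with open_text(path) as handle:
        record_number = 0
        while True:
            # Read the next four lines; fewer than four means the file ended early.
            lines = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not lines or lines == [""]:
                return None  # clean end of file (optionally one trailing blank line)
            record_number += 1
            if len(lines) < 4:
                return record_number, "record has fewer than 4 lines"
            header, seq, plus, qual = lines
            if not header.startswith("@"):
                return record_number, "header does not start with '@'"
            if not plus.startswith("+"):
                return record_number, "separator line does not start with '+'"
            if len(seq) != len(qual):
                return (
                    record_number,
                    f"sequence length {len(seq)} != quality length {len(qual)}",
                )
```

Calling `first_bad_fastq_record("../3_2_GB.fastq")` would return `None` for a well-formed file, or the 1-based record number and a short reason for the first record that fails one of the checks. If a bad record turns up at the very end of the file, an interrupted copy or download is a likely explanation.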

mbeavitt commented 1 year ago

I also had the same issue!

Wondering if there are any insights into this? Was testing a pipeline on some online data (SRR15206231).

Should be noted that this pipeline worked on another dataset - SRS17583785.

Looks like it's related to a chunking process - could this be due to some kind of memory limitation? The hifi reads comprise 102.2Gbases (~60Gbytes). The previous test was 10x smaller.

(running in nextflow, containerised)

Redacted@Redacted:~/HDD/test_run$ nextflow run asm_pipeline.nf -with-report -with-trace -with-timeline -with-dag dag.png --accession_id SRR15206231
N E X T F L O W ~ version 23.04.4
Launching asm_pipeline.nf [elegant_hypatia] DSL2 - revision: d62de36e54
executor > local (4)
[- ] process > FASTQC (FASTQC on SRR15206231) -
[56/ffa7bb] process > LONGQC (LONGQC on SRR15206231) [100%] 1 of 1, failed: 1 ✘
[- ] process > NANOPLOT (NANOPLOT on SRR15206231) -
[- ] process > HIFIADAPT (HIFIADAPT on SRR15206231) -
[- ] process > HIFIASM -
ERROR ~ Error executing process > 'LONGQC (LONGQC on SRR15206231)'

Caused by: Process LONGQC (LONGQC on SRR15206231) terminated with an error exit status (1)

Command executed:

/opt/LongQC/longQC.py sampleqc --index 400M --ncpu 8 -m 2 -x pb-hifi -o longqc_SRR15206231_output SRR15206231_subreads.fastq.gz

Command exit status: 1

Command output: (empty)

Command error:
longQC:2023-10-03 16:32:45,888:170:INFO:Cmd: /opt/LongQC/longQC.py sampleqc --index 400M --ncpu 8 -m 2 -x pb-hifi -o longqc_SRR15206231_output SRR15206231_subreads.fastq.gz
longQC:2023-10-03 16:32:45,888:234:INFO:Preset "pb-hifi" was applied. Options --pb(--ont) is overwritten.
Traceback (most recent call last):
  File "/opt/LongQC/longQC.py", line 957, in <module>
    main(args)
  File "/opt/LongQC/longQC.py", line 63, in main
    args.handler(args)
  File "/opt/LongQC/longQC.py", line 300, in command_sample
    for (reads, n_seqs, n_bases) in open_seq_chunk(args.input, file_format_code, chunk_size=args.mem*1024**3, is_upper=True):
  File "/opt/LongQC/lq_utils.py", line 65, in open_seq_chunk
    yield from parse_fastx_chunk(fn, chunk_size, is_upper=is_upper)
  File "/opt/LongQC/lq_utils.py", line 269, in parse_fastx_chunk
    for e in f:
  File "pysam/libcfaidx.pyx", line 651, in pysam.libcfaidx.FastxFile.next
ValueError: truncated quality string in SRR15206231_subreads.fastq.gz

Work dir: /media/Redacted/Redacted/Redacted/test_run/work/56/ffa7bb44ea6c6e9ca147d0f918b7f0

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

-- Check '.nextflow.log' file for details

mbeavitt commented 1 year ago

I did four things at once and the problem was fixed. Anyone with the same issue could try any of them, but some are specific to nextflow/pipelines.

1) I changed my LongQC -x argument from pb-hifi to pb-sequel.
2) I used a local file rather than pulling directly from SRA (the 'fromSRA' channel in nextflow).
3) I increased threads (--ncpu) from 8 to 24.
4) I ran my pipeline serially, so that LongQC was the only major process running on the machine.

Hope this helps!
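
Because four changes were made at once, the thread does not establish which one mattered. If an incomplete download is the cause (an assumption, not something confirmed here), a compressed input can be checked before LongQC is run by decompressing it to the end. The sketch below uses only the Python standard library; the function name `gzip_is_complete` is hypothetical.

```python
import gzip
import zlib


def gzip_is_complete(path, block_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error.

    A truncated file raises EOFError part-way through; a corrupt one raises
    gzip.BadGzipFile (a subclass of OSError) or zlib.error. A missing or
    unreadable file also raises OSError, so it is reported as not complete.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data block by block to keep memory use flat.
            while handle.read(block_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

For example, `gzip_is_complete("SRR15206231_subreads.fastq.gz")` returning `False` would point to a problem with the file itself rather than with the LongQC options. A `True` result only shows that the compression layer is intact; the FASTQ records inside could still be malformed.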