yhwu / idemp

Barcode demultiplex for Illumina I1, R1, R2 fastq.gz files
GNU General Public License v2.0

Empty sequence leads to corrupt demultiplexed fastq file #14

Open skembel opened 5 years ago

skembel commented 5 years ago

Hi, I am having a problem where some fastq files produced by idemp are corrupted when the input contains blank sequences (no nucleotides and no PHRED quality scores) for one of the paired ends. This situation arises, for example, when using cutadapt to trim Illumina adapters and heterogeneity spacers from fastq files produced by MiSeq, prior to demultiplexing with idemp. Either read of a pair may contain no nucleotides after cutadapt trimming.

In the input R1 fastq file, the problem sequence looks like the following:

grep "@M02360:10:000000000-AD1YJ:1:1111:19743:10766 1:N:0:1" *.fastq -A 4
R1reads_cut.fastq:@M02360:10:000000000-AD1YJ:1:1111:19743:10766 1:N:0:1
R1reads_cut.fastq-
R1reads_cut.fastq-+
R1reads_cut.fastq-
R1reads_cut.fastq-@M02360:10:000000000-AD1YJ:1:1111:15610:10766 1:N:0:1
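To see how many input records are affected, something like the following sketch could be used. It assumes strict 4-line FASTQ records (no wrapped sequence lines); the helper names `open_fastq` and `find_empty_reads` are hypothetical and not part of idemp or cutadapt.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def find_empty_reads(lines):
    """Yield the header of every 4-line FASTQ record whose sequence line is empty.

    Assumes strict 4-line records; raises ValueError on a truncated record.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            break  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record: %r" % (record,))
        header, seq = record[0], record[1]
        if seq == "":
            yield header
```

For example, `with open_fastq("R1reads_cut.fastq") as fh: print(sum(1 for _ in find_empty_reads(fh)))` would print the number of empty reads in the R1 file.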

But then in the resulting fastq file after demultiplexing by idemp, this sequence is malformed (missing line 4 with PHRED quality scores).

grep "@M02360:10:000000000-AD1YJ:1:1111:19743:10766 1:N:0:1" *.fastq -A 3
R1reads_cut.fastq_Sutton.B3.fastq:@M02360:10:000000000-AD1YJ:1:1111:19743:10766 1:N:0:1
R1reads_cut.fastq_Sutton.B3.fastq-
R1reads_cut.fastq_Sutton.B3.fastq-+
R1reads_cut.fastq_Sutton.B3.fastq-@M02360:10:000000000-AD1YJ:1:1111:21734:10769 1:N:0:1

Because the fourth line with the PHRED scores is missing after demultiplexing, the resulting fastq file is corrupt and prevents downstream analysis (e.g. it crashes dada2). A workaround seems to be to use sed to replace these blank lines with Ns prior to demultiplexing by idemp, but this seems like a potential bug?
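For reference, the padding workaround can also be sketched in Python rather than sed. This is only an illustration under stated assumptions: records are strictly 4 lines each, qualities use Phred+33 encoding (so `!` means quality 0), and the function name `pad_empty_reads` is hypothetical. Each empty read is replaced by a single `N` with a matching one-character quality string, so sequence and quality lengths stay equal.

```python
from itertools import islice


def pad_empty_reads(lines, base="N", qual="!"):
    """Yield FASTQ lines (without newlines), padding empty reads.

    An empty sequence is replaced by `base` and its quality line by `qual`
    ("!" is quality 0 under the assumed Phred+33 encoding).
    Assumes strict 4-line records; raises ValueError on a truncated record.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            break  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record: %r" % (record,))
        header, seq, plus, quality = record
        if seq == "":
            # Keep the record, but give it one placeholder base so that
            # neither the sequence line nor the quality line is blank.
            seq, quality = base, qual
        yield from (header, seq, plus, quality)
```

To apply it, the padded lines would be written to a new file, each followed by `"\n"`, for R1 and R2 before running idemp. Since no records are dropped, the reads stay in step with the I1 index file.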

yhwu commented 5 years ago

I have never seen a space as a bp from MiSeq. In idemp, I use bwa's internal reader and writer to read and write the reads, to make sure both input and output follow common standards. You could also demultiplex first and then cut the adapters.