yezhengSTAT / mHiC

MIT License
22 stars 10 forks source link

Empty sam file after step 1 #14

Closed bgbrink closed 4 years ago

bgbrink commented 4 years ago

I am running your pipeline on 22 files, for 4 of them I get an empty file after step 1 and I have troubles figuring out why. I ran the pipeline while keeping all the temporary files. This is the output:

Rep1_1.sai                                    2.3 GB    
Rep1_1_raw.sam                                32.2 GB
Rep1_unmapped_1.sam                           8.8 GB
Rep1_unmapped_1.fastq                         7.3 GB
Rep1_unmapped_trim_filter_1.fastq             0 bytes
Rep1_unmapped_trim_filter_1.sai               68 bytes
Rep1_unmapped_trim_filter_1.sam               2.4KB
Rep1_unmapped_trim_filter_noheader_1.sam      0 bytes

As you can see, the problem occurs somewhere during the generation of the _trim_filter file. The weird thing is that the other 18 files run just fine. Do you have any insight for me how I could investigate further?

Edit: Also, this is the summary file generated after S1, showing 0 reads align in the second stage.

First stage alignment aligned reads total 1: 63648247
First stage alignment aligned reads total 2: 61334947
First stage alignment aligned multi-reads 1: 13918786
First stage alignment aligned multi-reads 2: 13351390
Second stage alignment aligned reads total 1: 0
Second stage alignment aligned reads total 2: 0
Second stage alignment aligned multi-reads 1: 0
Second stage alignment aligned multi-reads 2: 0
yezhengSTAT commented 4 years ago

Hello, Thanks for using mHi-C. It seems to me that the problem may happen at line 73 in s1_bwaAlignment.sh:

awk -v minLen=$seqLength -v maxLen=$readL 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= minLen && length(seq) < maxLen) {print header, seq, qheader, qseq}}' < $resultsDir/$name\_unmapped_$i.fastq >$resultsDir/$name\_unmapped_trim_filter_$i.fastq

Can you check the value that you gave to variable seqLength? By default it is set to 25bp. Also, we assume the read length, readL, is the same for all the reads in the fastq file so readL will take the value based on the first read in your fastq file. Can you check if it is larger than seqLength? If everything looks normal, can you run the above code in linux? Otherwise, do you notice anything abnormal in the print out messages?

Hope it it helpful.

Ye

bgbrink commented 4 years ago

Thank you for your quick reply.

Also, we assume the read length, readL, is the same for all the reads in the fastq file so readL will take the value based on the first read in your fastq file.

I assume this is problem. In my files, not all reads have the same length. The vast majority of reads is 76bp, but some are shorter. I assume, in those 4 files the first read happen to be shorter than 25bp, while in the other files it was fine. May I ask what the reasoning behind this assumption is? I assume by hardcoding maxLen=76 this line of code will remove all reads shorter than 76bp from the analysis, correct?

Edit: Actually, I found a bug in your code. When you define readL, you are using cat on the zipped file (line 47): readL="$(cat $fastqDir/$name\_R$i.fq.gz | head -2 | awk '{if(NR%4==2) print length($1)}')" You need to use zcat instead.

yezhengSTAT commented 4 years ago

Hello, readL is collected to filter and save chimeric reads. In other words, after trimming until the ligation sites, if the reads are shorter than original read length, such reads are chimeric and will be align again. By hardcoding maxLen=76 should also work for your data.

My code on line 47 is readL="$(cat $fastqDir/$name\_$i.fastq | head -2 | awk '{if(NR%4==2) print length($1)}')" and I don't think I used zipped file. I am not sure where did you find that command......but as long as it works, feel free to adjust it on your side. :)

Ye

bgbrink commented 4 years ago

Ah you are right, I changed your code a while ago to work directly on gzipped data instead of having to unzip everything first. And now I forgot about it. Sorry for the confusion. I will close the issue since step 1 is running okay with readL=76. Thanks for your help!