Closed bgbrink closed 4 years ago
Hello, Thanks for using mHi-C. It seems to me that the problem may happen at line 73 in s1_bwaAlignment.sh:
awk -v minLen=$seqLength -v maxLen=$readL 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= minLen && length(seq) < maxLen) {print header, seq, qheader, qseq}}' < $resultsDir/$name\_unmapped_$i.fastq >$resultsDir/$name\_unmapped_trim_filter_$i.fastq
Can you check the value that you gave to variable seqLength
? By default it is set to 25bp. Also, we assume the read length, readL
, is the same for all the reads in the fastq file so readL will take the value based on the first read in your fastq file. Can you check if it is larger than seqLength
? If everything looks normal, can you run the above code in linux? Otherwise, do you notice anything abnormal in the print out messages?
Hope it it helpful.
Ye
Thank you for your quick reply.
Also, we assume the read length, readL, is the same for all the reads in the fastq file so readL will take the value based on the first read in your fastq file.
I assume this is problem. In my files, not all reads have the same length. The vast majority of reads is 76bp, but some are shorter. I assume, in those 4 files the first read happen to be shorter than 25bp, while in the other files it was fine.
May I ask what the reasoning behind this assumption is? I assume by hardcoding maxLen=76
this line of code will remove all reads shorter than 76bp from the analysis, correct?
Edit:
Actually, I found a bug in your code. When you define readL
, you are using cat on the zipped file (line 47):
readL="$(cat $fastqDir/$name\_R$i.fq.gz | head -2 | awk '{if(NR%4==2) print length($1)}')"
You need to use zcat instead.
Hello,
readL
is collected to filter and save chimeric reads. In other words, after trimming until the ligation sites, if the reads are shorter than original read length, such reads are chimeric and will be align again. By hardcoding maxLen=76 should also work for your data.
My code on line 47 is
readL="$(cat $fastqDir/$name\_$i.fastq | head -2 | awk '{if(NR%4==2) print length($1)}')"
and I don't think I used zipped file. I am not sure where did you find that command......but as long as it works, feel free to adjust it on your side. :)
Ye
Ah you are right, I changed your code a while ago to work directly on gzipped data instead of having to unzip everything first. And now I forgot about it. Sorry for the confusion. I will close the issue since step 1 is running okay with readL=76
.
Thanks for your help!
I am running your pipeline on 22 files, for 4 of them I get an empty file after step 1 and I have troubles figuring out why. I ran the pipeline while keeping all the temporary files. This is the output:
As you can see, the problem occurs somewhere during the generation of the _trim_filter file. The weird thing is that the other 18 files run just fine. Do you have any insight for me how I could investigate further?
Edit: Also, this is the summary file generated after S1, showing 0 reads align in the second stage.