Open pna059 opened 1 year ago
Hi Pavla, So sorry for the late reply! I have just come back from a trip. Yes, you can simply merge the .validPairs files! Any step before binning the genome should work.
Thanks, Ye
Hi, I simply concatenated the .validPairs files in s3 and resumed the pipeline from step 4. However, it seems as if the duplicates are being removed there. I thought that if I merge replicates, I should do so after removing duplicates, to avoid removing identical pairs originating from two separate samples (false duplicates). Is it OK to proceed as I do anyway?
Thank you Pavla
Hello Pavla, Usually, we only remove duplicates within each replicate. Across replicates, considering they are separate experimental runs, it is likely that there are genuinely distinct fragments sharing the same genomic location. Therefore, I think you are generally safe to proceed with what you are doing, but if you have concerns, nothing stops you from checking the duplicate rate between the two replicates. I guess it won't be high anyway; if it is extremely high, you can consider removing them.
Thanks, Ye
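That cross-replicate duplicate check could be sketched roughly as below. This is a toy example with made-up file contents, assuming a pair is keyed by the chromosome, position, and strand of both ends (columns 2, 3, 4 and 7, 8, 9 of a tab-separated .validPairs file):

```shell
# Toy check of the cross-replicate duplicate rate (hypothetical data).
# Assumption: duplicates are keyed on columns 2,3,4 and 7,8,9 of .validPairs.
tmp=$(mktemp -d)
printf 'r1\tchr1\t100\t+\tf1\t5000\tchr1\t900\t-\tf2\t5000\td\tUNI\n' > "$tmp/rep1.validPairs"
{
  printf 'r9\tchr1\t100\t+\tf1\t5000\tchr1\t900\t-\tf2\t5000\td\tUNI\n'
  printf 'r8\tchr2\t300\t-\tf3\t5000\tchr2\t800\t+\tf4\t5000\td\tUNI\n'
} > "$tmp/rep2.validPairs"

# Extract the dedup key columns and keep unique keys per replicate.
for rep in rep1 rep2; do
  cut -f2,3,4,7,8,9 "$tmp/$rep.validPairs" | sort -u > "$tmp/$rep.keys"
done

# Keys present in both replicates would be flagged as cross-replicate duplicates.
shared=$(comm -12 "$tmp/rep1.keys" "$tmp/rep2.keys" | wc -l)
total=$(wc -l < "$tmp/rep2.keys")
echo "shared pairs: $shared of $total"
```

Here the single shared key between the toy replicates would be the "false duplicate" Pavla is worried about losing.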
Hello Ye, working with more of my samples, I am once again in doubt about how to merge replicates. Based on our discussion, I previously merged both the .validPairs and .uniMulti files from the s3 directory and resumed the pipeline from step 4 using these merged files. However, step 4 does both duplicate removal and binning, so I am at risk of removing false duplicates. I thought that I should merge AFTER removing duplicates and BEFORE binning, which is difficult, as it would have to be done in the middle of step 4. Am I missing something here? Sorry for being meticulous, but with my previous sample I got fewer interactions than with HiC-Pro without considering multi-mappers, so I am being extremely careful not to miss something with my next sample.
Thank you, Pavla
Hello Pavla, Yes, it makes more sense to remove the duplicates first and then merge the independent replicates. You can run step 4 first and merge the bin-pairs. Merging the bin-pairs is a little like https://github.com/yezhengSTAT/mHiC#64-post-mhic-processing. I haven't tested the following code, but the idea is to sum the counts wherever the bin-pair coordinates match and then sort the result by coordinate:

```shell
cat $rep1.binpair $rep2.binpair \
  | awk -v OFS="\t" '{a[$1 OFS $2 OFS $3 OFS $4]+=$5} END{for (i in a) print i, a[i]}' \
  | sort -k1,1V -k2,2n -k3,3V -k4,4n > $merged.binpair
```
Hope it helps.
P.S. If you observe fewer interactions after adding multi-mapping reads, maybe you should check whether the uni-mapping read counts have changed, or visualize the matrix to see whether the background signal changes (the background would be higher with more multi-mapping reads).
Thanks, Ye
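A toy run of the merge idea above, with hypothetical bin-pair files and counts, shows identical bin pairs being summed:

```shell
# Hypothetical 5-column bin-pair files: chrA binA chrB binB count.
tmp=$(mktemp -d)
printf 'chr1\t5000\tchr1\t15000\t3\n' > "$tmp/rep1.binpair"
printf 'chr1\t5000\tchr1\t15000\t2\nchr2\t5000\tchr2\t25000\t4\n' > "$tmp/rep2.binpair"

# Sum counts where the bin-pair coordinates match, then sort by coordinate.
cat "$tmp/rep1.binpair" "$tmp/rep2.binpair" \
  | awk -v OFS='\t' '{a[$1 OFS $2 OFS $3 OFS $4]+=$5} END{for (k in a) print k, a[k]}' \
  | sort -k1,1V -k2,2n -k3,3V -k4,4n > "$tmp/merged.binpair"
```

The shared chr1 bin pair comes out with count 3 + 2 = 5, and the chr2 pair passes through unchanged.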
Hi Ye, after running step 4, I have these files in the s4 directory:

```
HiChIP_2.validPairs.binPair.Marginal
HiChIP_2.validPairs.binPair.uni
HiChIP_2.validPairs.UNI.binPair
HiChIP_2.validPairs.binPairCount.uni.KRnorm.biasAll
HiChIP_2.validPairs.MULTI.binPair.multi
HiChIP_2.validPairs.MULTI
HiChIP_2.validPairs.binPairCount.uni.KRnorm.bias
HiChIP_2.validPairs.MULTI.binPair.uni
HiChIP_2.validPairs.UNI
HiChIP_2.validPairs.binPairCount.uni.KRnorm
HiChIP_2.validPairs.MULTI.binPair.multiList
HiChIP_2.validPairs.MultiRMdupList
HiChIP_2.validPairs.binPairCount.uni
HiChIP_2.validPairs.MULTI.binPair
HiChIP_2.validPairs.MULTI.binPair.multiKeys
```
So to use your suggested code (to concatenate the replicates), should I use the file HiChIP_2.validPairs.binPairCount.uni in place of $rep2.binpair? After this merging step, how can I restart from step 4.3 (Uni-binPair contact matrix normalization, in preparation for step 5 prior building)? At least that is how I understand it... Or could I start directly from step 5?
Thanks, Pavla
Hi Ye, sorry to rush you, but I am about to merge the bin-pairs after deduplication (step 4) and want to be sure that I am concatenating the correct files and resuming the pipeline in the right way (should I modify the s4 script by deleting the preceding steps and redirecting the input to the merged file, for example?).
Thank you, Pavla
Hello Pavla, Sorry but can you be more specific?
I guess if you can prepare the input files for step5, you should be good to move on.
Thanks, Ye
Hi Ye, I was more specific in the message before, asking about the s4 outputs: which of them should be merged, and from which step exactly should I resume? We know that replicates should be merged AFTER deduplication, but you pointed out earlier that "Any step before binning the genome should work." If it should be BEFORE binning, is it OK to use your suggested code to merge the validPairs.binPairCount.uni replicate files and resume from step 4.3?
```shell
cat HiChIP1_validPairs.binPairCount.uni HiChIP2_validPairs.binPairCount.uni \
  | awk -v OFS="\t" '{a[$1 OFS $2 OFS $3 OFS $4]+=$5} END{for (i in a) print i, a[i]}' \
  | sort -k1,1V -k2,2n -k3,3V -k4,4n > HiChIPmerged_validPairs.binPairCount.uni
```

and use this merged file in place of $contactFile in step 4.3 of s4_bin.sh:

```shell
## **********************************************
## 4.3 Uni-binPair contact matrix normalization.
## In preparation for step 5 prior building.
## **********************************************
# Exclude low count contacts
if [ "$minCount" -gt "1" ]; then
    awk -v minC=$minCount -v OFS="\t" '$5>=minC {print $0}' HiChIPmerged_validPairs.binPairCount.uni \
        | sort -k1,1V -k2,2n -k3,3V -k4,4n -T $dir/sorttmp > $valid.binPairCount.uni.minCount$minCount
```
and continue to step 5.
Thank you! Pavla
Yes, it looks reasonable to me!
Best, Ye
Hi Ye, the merging step went fine. Now I am stuck on how to modify s4_bin.sh. I simply deleted everything before 4.3 and ran the modified file in place of s4_bin.sh, but that did not work. Would it be possible to split this step into two, so that the user can merge replicates after deduplication and before binning?
Thanks Pavla
I see your confusion. Yes, it could be quite messy to merge all kinds of intermediate files and modify the code to start in the middle of step 4. I think there may be another way out. The real need is to remove duplicates while still keeping read pairs from the same location but different replicates. Maybe you can consider adding a suffix such as "_rep1", "_rep2" to column 4 (a strand column) and column 9, and then concatenating the .validPairs files from the different replicates. Duplicates are removed based on columns 2, 3, 4, 7, 8 and 9, so with the additional suffix we should be able to keep the overlapping read pairs across replicates. Going forward, columns 4 and 9 are not used in forming the binPair files, so it should be fine to leave the suffix there. In other words, after concatenating the .validPairs files with an added replicate suffix, we should be able to run s4_bin.sh without adjustment.
Let me know if you run into further problems!
Hope this time, it works!
Thanks, Ye
OK, Ye. Let's try, but I have to make sure I understand well: adding a suffix such as "_rep1", "_rep2" in columns 4 and 9 (the strand columns) of the s3/*.validPairs files results in this:

```
A00703:100:HH7MVDRXY:1:1101:1371:1016 chr4H 231646110.0 +_rep1 HIC_chr4H_1809924 231650000 chr5H 346862795.0 -_rep1 HIC_chr5H_2675825 346870000 InterChrom UNI
A00703:100:HH7MVDRXY:1:1101:1696:1016 chr5H 190107396.0 -_rep1 HIC_chr5H_1463327 190110000 chr5H 321539437.5 +_rep1 HIC_chr5H_2471833 321530000 131432041.5 MULTI
```
Is this what I should do? Then I would concatenate the two resulting *.validPairs files into a new directory and restart the pipeline from step 4, pointing it to this directory and the merged file as input. Does it matter downstream to have the strand information changed this way?
Thank you for helping me to solve this :), Pavla
Yes, exactly.
Hi, I plan to use your tool to analyze HiChIP data (followed by the FitHiChIP tool). At what step, and how, is it possible to merge replicates (which have been checked for reproducibility with HiCRep)? I would do it after removing duplicates, but I am not sure if I can simply merge the .validPairs files. Thanks, Pavla