yezhengSTAT / mHiC

MIT License
22 stars · 10 forks

Merging replicates #19

Open pna059 opened 1 year ago

pna059 commented 1 year ago

Hi, I plan to use your tool to analyze HiChIP data (followed by the FitHiChIP tool). At what step, and how, is it possible to merge replicates (which have been checked for reproducibility with HiCRep)? I would do it after removing duplicates, but I am not sure if I can simply merge the .validPairs files. Thanks, Pavla

yezhengSTAT commented 1 year ago

Hi Pavla, So sorry for the late reply! I have just come back from traveling. Yes, you can simply merge the .validPairs files! Any step before binning the genome should work.
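For concreteness, the merge at that stage is plain concatenation. The snippet below is a toy sketch: the file names and the 13-column layout are stand-ins for the real per-replicate s3/*.validPairs outputs.

```shell
# Toy stand-ins for two replicates' s3/*.validPairs files
# (columns are illustrative: readID, chr/pos/strand/frag/bin for both ends, dist, UNI/MULTI).
printf 'read1\tchr1\t100\t+\tfragA\t1000\tchr2\t200\t-\tfragB\t2000\tdist\tUNI\n' > rep1.validPairs
printf 'read2\tchr1\t300\t+\tfragA\t1000\tchr2\t400\t-\tfragB\t2000\tdist\tUNI\n' > rep2.validPairs

# Merging replicates before binning is just concatenation;
# mHiC bins and deduplicates downstream in step 4.
mkdir -p s3_merged
cat rep1.validPairs rep2.validPairs > s3_merged/merged.validPairs
```

Note that step 4 will then deduplicate across the merged file, which is what the rest of this thread is about.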

Thanks, Ye

pna059 commented 1 year ago

Hi, I simply concatenated the .validPairs files in s3 and resumed the pipeline from step 4. However, it seems as if duplicates are being removed there. I thought that if I merge replicates, I should do so after removing duplicates, to avoid removing identical pairs originating from two separate samples (false duplicates). Is it OK to proceed as I do anyway?

Thank you Pavla

yezhengSTAT commented 1 year ago

Hello Pavla, Usually, we only remove duplicates within each replicate. Across replicates, considering they are separate experimental runs, it is legitimate for fragments to share the same genomic location. Therefore, generally, I think you are safe to proceed as you are, but if you have concerns, nothing stops you from checking the duplicate rate between the two replicates. I guess it won't be high anyway. If it is extremely high, you may want to consider removing them.
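One way to check that cross-replicate duplicate rate is sketched below. This is untested against real mHiC output; it assumes columns 2, 3, 4, 7, 8, 9 of the .validPairs files (chr/pos/strand of both ends) define a duplicate, matching the fields the dedup step compares, and the file names are placeholders.

```shell
# dup_rate repA.validPairs repB.validPairs
# Prints how many read-pair keys the two replicates share, out of all
# distinct keys, where a key is columns 2,3,4,7,8,9 (assumed dedup fields).
dup_rate() {
  key='{print $2"\t"$3"\t"$4"\t"$7"\t"$8"\t"$9}'
  awk "$key" "$1" | sort -u > repA.keys
  awk "$key" "$2" | sort -u > repB.keys
  shared=$(comm -12 repA.keys repB.keys | wc -l | tr -d ' ')  # keys in both replicates
  total=$(sort -u repA.keys repB.keys | wc -l | tr -d ' ')    # distinct keys overall
  echo "shared=$shared distinct=$total"
}
```

If `shared` is a tiny fraction of `distinct`, cross-replicate duplicates are negligible and merging before dedup loses little.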

Thanks, Ye

pna059 commented 1 year ago

Hello Ye, working with more of my samples, I am once again in doubt about how to merge replicates. Based on our discussion, I previously merged both the .validPairs and .uniMulti files from the s3 directory and resumed the pipeline from step 4 using these merged files. However, s4 does both duplicate removal and binning, so I am at risk of removing false duplicates. I thought that I should merge AFTER removing duplicates and BEFORE binning, which is difficult, as it would have to be done in the middle of step 4. Am I missing something here? Sorry for being meticulous, but with my previous sample I got fewer interactions than by using HiCPro without considering multi-mappers, so I am being extremely careful now not to miss something with my next sample.

Thank you, Pavla

yezhengSTAT commented 1 year ago

Hello Pavla, Yes, it makes more sense to remove the duplicates first and then merge the independent replicates. You can run step 4 first and merge the bin-pairs. Merging the bin-pairs is a little like: https://github.com/yezhengSTAT/mHiC#64-post-mhic-processing. I haven't tested the following code, but it is something like summing the counts where the bin-pair coordinates match, and then sorting by the bin-pair coordinates.

cat $rep1.binpair $rep2.binpair | awk -v OFS="\t" '{a[$1 OFS $2 OFS $3 OFS $4]+=$5} END{for (i in a) print i, a[i]}' | sort -k1,1V -k2,2n -k3,3V -k4,4n > $merged.binpair

Hope it helps.

P.S. If you observe fewer interactions after adding multi-mapping reads, maybe you should check whether the uni-mapping reads have changed, or visualize the data to see if the background signal changes (the background would be higher with more multi-mapping reads).

Thanks, Ye

pna059 commented 1 year ago

Hi Ye, after running step 4, I have these files in the s4 directory:

HiChIP_2.validPairs.binPair.Marginal                 HiChIP_2.validPairs.binPair.uni              HiChIP_2.validPairs.UNI.binPair
HiChIP_2.validPairs.binPairCount.uni.KRnorm.biasAll  HiChIP_2.validPairs.MULTI.binPair.multi      HiChIP_2.validPairs.MULTI
HiChIP_2.validPairs.binPairCount.uni.KRnorm.bias     HiChIP_2.validPairs.MULTI.binPair.uni        HiChIP_2.validPairs.UNI
HiChIP_2.validPairs.binPairCount.uni.KRnorm          HiChIP_2.validPairs.MULTI.binPair.multiList  HiChIP_2.validPairs.MultiRMdupList
HiChIP_2.validPairs.binPairCount.uni                 HiChIP_2.validPairs.MULTI.binPair            HiChIP_2.validPairs.MULTI.binPair.multiKeys

So, to use your suggested code (to concatenate the replicates), should I use the file "HiChIP_2.validPairs.binPairCount.uni" in place of $rep2.binpair? After this merging step, how can I restart from step 4.3 (Uni-binPair contact matrix normalization / preparation for step 5 prior building)? At least, that is how I understand it. Or could I start directly from step 5?

Thanks, Pavla

pna059 commented 1 year ago

Hi Ye, sorry to rush you, but I am about to merge the bin-pairs after deduplication (step 4) and want to be sure that I am concatenating the correct files and resuming the pipeline in the right way (should I modify the s4 script by deleting the preceding steps and redirecting the input to the merged file, for example?).

Thank you, Pavla

yezhengSTAT commented 1 year ago

Hello Pavla, Sorry but can you be more specific?

I guess if you can prepare the input files for step5, you should be good to move on.

Thanks, Ye

pna059 commented 1 year ago

Hi Ye, I was more specific in the previous message, asking about the s4 outputs: which of them should be merged, and from which step exactly should I resume? We know that replicates should be merged AFTER deduplication, but you pointed out earlier that "any step before binning the genome should work." If it should be BEFORE binning, is it OK to use your suggested code to merge the 'validPairs.binPairCount.uni' replicate files and resume from step 4.3?

cat HiChIP1_validPairs.binPairCount.uni HiChIP2_validPairs.binPairCount.uni | awk -v OFS="\t" '{a[$1 OFS $2 OFS $3 OFS $4]+=$5} END{for (i in a) print i, a[i]}' | sort -k1,1V -k2,2n -k3,3V -k4,4n > HiChIPmerged_validPairs.binPairCount.uni

and then use this merged file in place of $contactFile in step 4.3 of s4_bin.sh:

## **********************************************
## 4.3 Uni-binPair contact matrix normalization.
## In preparation for step 5 prior building.
## **********************************************

# Exclude low count contacts
if [ "$minCount" -gt "1" ];then

   awk -v minC=$minCount -v OFS="\t" '$5>=minC {print $0}' HiChIPmerged_validPairs.binPairCount.uni | sort -k1,1V -k2,2n -k3,3V -k4,4n -T $dir/sorttmp > $valid.binPairCount.uni.minCount$minCount

and continue to step 5.

Thank you! Pavla

yezhengSTAT commented 1 year ago

Yes, it looks reasonable to me!

Best, Ye

pna059 commented 1 year ago

Hi Ye, the merging step went fine. Now I am stuck on how to modify s4_bin.sh. I simply deleted everything before 4.3 and used the modified file instead of s4_bin.sh, but that did not work. Is it possible to split this step into two, so that the user can merge replicates after deduplication and before binning?

Thanks Pavla

yezhengSTAT commented 1 year ago

I see your confusion. Yes, it could be quite messy to merge all kinds of intermediate files and modify the code to start in the middle of step 4. I think there may be another way out. The underlying need is to remove duplicates while still keeping read pairs that share the same location but come from different replicates. Maybe you can consider adding a suffix such as "_rep1" or "_rep2" to column 4 (a strand column) and column 9, and then concatenating the .validPairs files from the different replicates. Duplicates are removed based on columns 2, 3, 4, 7, 8, 9, so with the additional suffix we should be able to keep the overlapping read pairs across replicates. Going forward, columns 4 and 9 are not used in forming the binPair files, so it should be fine to leave the suffix there. In other words, after concatenating the .validPairs files with the additional replicate suffix, we should be able to run s4_bin.sh without adjustment.
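That suffix trick could be sketched as below. This is untested against real mHiC output; the file names and the 13-column toy layout are placeholders, and it assumes the .validPairs columns are whitespace-delimited with strand in columns 4 and 9.

```shell
# Toy stand-ins for two replicates' s3/*.validPairs files, with an
# identical read pair in both (a "false duplicate" across replicates).
printf 'r1\tchr1\t100\t+\tf1\t1000\tchr2\t200\t-\tf2\t2000\td\tUNI\n' > rep1.validPairs
printf 'r1\tchr1\t100\t+\tf1\t1000\tchr2\t200\t-\tf2\t2000\td\tUNI\n' > rep2.validPairs

# Tag strand columns 4 and 9 with a per-replicate suffix so identical
# pairs from different replicates no longer match during dedup,
# then concatenate into one merged file.
awk -v OFS="\t" -v s="_rep1" '{$4=$4 s; $9=$9 s; print}' rep1.validPairs  > merged.validPairs
awk -v OFS="\t" -v s="_rep2" '{$4=$4 s; $9=$9 s; print}' rep2.validPairs >> merged.validPairs
```

After this, the two copies of the same pair differ in columns 4 and 9 (e.g. `+_rep1` vs `+_rep2`), so the dedup key (columns 2, 3, 4, 7, 8, 9) no longer collapses them.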

Let me know if you run into further problems!

Hope this time, it works!

Thanks, Ye

pna059 commented 1 year ago

OK, Ye. Let's try, but I have to make sure I understand correctly: adding a suffix such as "_rep1" or "_rep2" to columns 4 and 9 (the strand columns) of s3/*.validPairs results in this:

A00703:100:HH7MVDRXY:1:1101:1371:1016   chr4H   231646110.0     +_rep1       HIC_chr4H_1809924       231650000       chr5H   346862795.0     -_rep1      HIC_chr5H_2675825       346870000       InterChrom     UNI
A00703:100:HH7MVDRXY:1:1101:1696:1016   chr5H   190107396.0     -_rep1       HIC_chr5H_1463327       190110000       chr5H   321539437.5     +_rep1       HIC_chr5H_2471833       321530000       131432041.5    MULTI

Is this what I should do? Then concatenate the two resulting *.validPairs files into a new directory and restart the pipeline from step 4, pointing to this directory and the merged file as input? And does it matter for the downstream procedure that the strand information is changed this way?

Thank you for helping me to solve this :), Pavla

yezhengSTAT commented 1 year ago

Yes, exactly.