yezhengSTAT / mHiC

MIT License
22 stars 10 forks source link

Converting mHiC Output to other formats #17

Open willmonty14 opened 3 years ago

willmonty14 commented 3 years ago

Hi mHiC team,

I love your program! I think that it's a really unique solution to a complex problem, and the tutorial was very easy to conduct. However, I'm now running into some problems with the output data, namely that I am relatively new to manipulating Hi-C data, and I don't know which output file is the one that I should use to visualize my Hi-C data. I'd like to be able to convert the output from this program to a .hic file so that I can run Arrowhead and HICCUPs on it, but I'm not sure where to start. Can you suggest anything?

yezhengSTAT commented 3 years ago

Hello!

So happy to hear that mHi-C is helpful!

To convert into .hic format, follow the instruction at https://github.com/yezhengSTAT/mHiC#64-post-mhic-processing to merge uni-reads interaction bin-pairs and multi-mapping reads count out of mHi-C. It will give you the final bin-pair interaction frequency file (${name}.validPairs.binPairCount.uniMulti) in the format of

"chrA   midPointOfBinA   chrB   midPointOfBinB   interactionFrequency". 

This file provides all the required information needed for the short with score format at https://github.com/aidenlab/juicer/wiki/Pre#short-with-score-format. Reorganize the corresponding columns and Juicebox pre will be able to convert this file into .hic format.

Good luck! Let me know if it does/does not work.

Thanks, Ye

willmonty14 commented 3 years ago

Hello again! So thank you so much for the help, your advice did indeed work and I got a lovely Juicebox file from it. However, now that I'm using my own data I've run into a very odd problem. Whenever I run the pipeline, I get the following message:

step1.5 - Merge chimeric reads with mapped full-length reads.
s1_bwaAlignment.sh: line 44: 25852 Killed                  awk 'NR==FNR{a[$1]=$0;next;}a[$1]{$0=a[$1]}1' $resultsDir/$name\_unmapped_trim_filter_noheader_$i.sam $resultsDir/$name\_$i\_raw.sam > $resultsDir/$name\_$i.sam

I've changed the read length in both the s1 script and the pipeline caller, but for some reason this keeps happening. Just a thought on my part, but are the example files paired-end or single-end reads, and either way, are the two sample files replicates of each other or are they the different ended reads? Thanks again for all your help!

yezhengSTAT commented 3 years ago

Hi, Glad to hear that it works on the demo data!

The input sequencing data (fastq files) are actually spited into two ends files, hence for all the Hi-C input data, they are paired-ended. For the demo data, under your project directory, there is a folder called "fastqFiles" and you will see two fastq files there: TROPHOZOITES_1.fastq TROPHOZOITES_2.fastq. _1.fastq means it is for end 1 and _2.fastq is end 2.

If you download the data from GEO, you can use SRA tool fasts-dump to download and split into two ends files.

$sraDir/bin/fastq-dump -F --split-files 

The code you quoted is part of the function to re-align chimeric reads which span the ligation junction part. This code is to replace the original chimeric reads unmapped results by the trimmed chimeric reads mapped results. You probably can first check if the two input files exist under s1 subfolder and run this command in terminal to check if it works.

For example, this is the code I used to check what does the results look like:

awk 'NR==FNR{a[$1]=$0;next;}a[$1]{$0=a[$1]}1' TROPHOZOITES_unmapped_trim_filter_noheader_1.sam TROPHOZOITES_1_raw.sam | head -30

Let me know if the error remains.

Thanks, Ye