hifimeth is a deep neural network model for calling 5mc for PacBio HiFi reads. Currently it only works on CpG motif.
Install Pytorch with the default conda channel:
conda create -n hifimeth python=3.10
conda activate hifimeth
conda install cudatoolkit=11.7 -c nvidia
pip install --force-reinstall torch=1.13.1 torchvision=0.14.1 torchaudio=0.13.1
wget https://data.pyg.org/whl/torch-1.13.0+cu117.html
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f torch-1.13.0+cu117.html
Or with the tuna.tsinghua.edu.cn
channel:
conda create -n hifimeth python=3.10
conda activate hifimeth
conda install cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install --force-reinstall torch=1.13.1 torchvision=0.14.1 torchaudio=0.13.1
wget https://data.pyg.org/whl/torch-1.13.0+cu117.html
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f torch-1.13.0+cu117.html
hifimeth
git clone https://github.com/xiaochuanle/hifimeth.git
cd hifimeth
./install_cpp_tools.sh
cd ..
The hifimeth
is downloaded in directory
/data1/chenying/bs3/hifimeth/
The reads are found in the source of hifimeth
:
/data1/chenying/bs3/hifimeth/examples/chr1-reads.bam
Download the GRCh38
reference (pbmm2
does not recognise the .fna
format):
$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
$ gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
$ mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
$ pbmm2 index --preset CCS GCA_000001405.15_GRCh38_no_alt_analysis_set.fa GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.mmi
$ samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
Download a phased VCF:
$ wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/SupplementaryFiles/HG002_NA24385_GRCh38_1_22_v4.2_benchmark_phased_StrandSeq_whatshap.vcf.gz
$ tabix -p vcf HG002_NA24385_GRCh38_1_22_v4.2_benchmark_phased_StrandSeq_whatshap.vcf.gz
There are three using cases.
$ /data1/chenying/bs3/hifimeth/src/scripts/hifimeth.sh \
--src /data1/chenying/bs3/hifimeth \
--num_threads 48 \
--model_path /data1/chenying/bs3/hifimeth/models/hifimeth-mol-k400.ckpt \
--input /data1/chenying/bs3/hifimeth/examples/chr1-reads.bam \
--out 5mc-call
5mc-call/5mc-call.bam
5mc-call/5mc-call.txt
5mc-call/5mc-call.bam
adds calling results to the input bam /data1/chenying/bs3/hifimeth/examples/chr1-reads.bam
. 5mC base modification values are read from the MM and ML auxiliary tags which encode base modifications and confidence values. These tags are further described in the SAM tag specification document.
5mc-call/5mc-call.txt
lists calling results in human readable format:
m64008_201124_002822/104531456/ccs 8 2 15052 0.9861
m64008_201124_002822/104531456/ccs 8 12 15042 1.0000
m64008_201124_002822/104531456/ccs 8 80 14974 0.0026
m64008_201124_002822/104531456/ccs 8 170 14884 0.9995
m64008_201124_002822/104531456/ccs 8 181 14873 0.9998
m64008_201124_002822/104531456/ccs 8 213 14841 0.9860
m64008_201124_002822/104531456/ccs 8 273 14781 0.9969
m64008_201124_002822/104531456/ccs 8 281 14773 0.0308
m64008_201124_002822/104531456/ccs 8 288 14766 0.0100
m64008_201124_002822/104531456/ccs 8 311 14743 0.9862
Each calling result ocupies one line and contains five colummns:
Step 1: map the bam to a reference:
pbmm2 align --preset CCS --sort GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
/data1/chenying/bs3/hifimeth/examples/chr1-reads.bam chr1-reads-grch38.bam
Step 2: call 5mc with the mapped bam:
/data1/chenying/bs3/hifimeth/src/scripts/hifimeth.sh \
--src /data1/chenying/bs3/hifimeth \
--model_path /data1/chenying/bs3/hifimeth/models/hifimeth-mol-k400.ckpt \
--reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
--input chr1-reads-grch38.bam \
--out 5mc-call-grch38
5mc-call-grch38/5mc-call.bam
5mc-call-grch38/5mc-call.txt
Details of 5mc-call-grch38/5mc-call.bam
are found in Case 1.
5mc-call-grch38/5mc-call.txt
lists calling results in human readable format:
m64008_201124_002822/102303964/ccs 0 5069 12204 0.2486
m64008_201124_002822/102303964/ccs 0 5079 12194 0.7334
m64008_201124_002822/102303964/ccs 0 5097 12176 0.9999
m64008_201124_002822/102303964/ccs 0 5129 12144 chr1 1955103 + 1.0000
m64008_201124_002822/102303964/ccs 0 5140 12133 chr1 1955114 + 1.0000
m64008_201124_002822/102303964/ccs 0 5142 12131 chr1 1955116 + 1.0000
m64008_201124_002822/102303964/ccs 0 5157 12116 chr1 1955131 + 1.0000
m64008_201124_002822/102303964/ccs 0 5161 12112 chr1 1955135 + 1.0000
Each calling result ocupies one line and contains five or eight colummns. Details of a calling result containing five colummns are given in Case 1. In this situation, it means that this CpG site fail to be mapped to a CpG site on the reference.
A calling result containing eight colummns also contains mapping info:
+
for forward and -
for reverse) of this read mapped to the referenceStep 3: call modification frequencies
/data1/chenying/bs3/hifimeth/src/scripts/hifimeth-freq.sh \
--src /data1/chenying/bs3/hifimeth \
--reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
--num_threads 48 \
--bam 5mc-call-grch38/5mc-call.bam \
--out freq-call-grch38
freq-call-grch38/freq-call.count.txt
freq-call-grch38/freq-call.count.txt
shares the same format as BS-seq (it is actually a bed file):
chr1 1950011 1950012 1.0000 1 0
chr1 1950075 1950076 1.0000 1 0
chr1 1950137 1950138 1.0000 1 0
chr1 1950262 1950263 1.0000 1 0
chr1 1950276 1950277 1.0000 1 0
Step 1: map the bam to a reference:
pbmm2 align --preset CCS --sort GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
/data1/chenying/bs3/hifimeth/examples/chr1-reads.bam chr1-reads-grch38.bam
Step 2: call 5mc with the mapped bam:
/data1/chenying/bs3/hifimeth/src/scripts/hifimeth.sh \
--src /data1/chenying/bs3/hifimeth \
--model_path /data1/chenying/bs3/hifimeth/models/hifimeth-mol-k400.ckpt \
--reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
--input chr1-reads-grch38.bam \
--out 5mc-call-grch38
Same as Case 2.
Step 3: add haplotypes to the bam:
$ samtools index -@8 5mc-call-grch38/5mc-call.bam
$ whatshap haplotag --reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
--ignore-read-groups \
--o 5mc-call-grch38/5mc-call-tag.bam \
HG002_NA24385_GRCh38_1_22_v4.2_benchmark_phased_StrandSeq_whatshap.vcf.gz \
5mc-call-grch38/5mc-call.bam
Step 4: calling modification frequencies on the taged bam:
/data1/chenying/bs3/hifimeth/src/scripts/hifimeth-freq.sh \
--src /data1/chenying/bs3/hifimeth \
--reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
--num_threads 48 --bam 5mc-call-grch38/5mc-call-tag.bam \
--out freq-call-grch38-tag \
--phase
freq-call-grch38-tag/freq-call.count.txt
freq-call-grch38-tag/freq-call.hp1.count.txt
freq-call-grch38-tag/freq-call.hp2.count.txt