Allele-specific modeling of diploid Hi-C data
Create a conda env with ASHIC
installed:
conda env create -f environment.yml
Ye, T., & Ma, W. (2020). ASHIC: hierarchical Bayesian modeling of diploid chromatin contacts and structures. Nucleic acids research, 48(21), e123-e123. DOI: https://doi.org/10.1093/nar/gkaa872
> ashic -h
Usage: ashic [OPTIONS]
ASHIC: Hierarchical Bayesian modeling of diploid chromatin contacts and structures.
Example: ashic -i <INPUT> -o <OUTPUT>
Refer to README.md for <INPUT> format detail and how to generate with command `ashic-data`.
Options:
-i, --inputfile PATH Input file name. [required]
-o, --outputdir PATH Directory to save output files. [required]
--model [ASHIC-ZIPM|ASHIC-PM] Choose between the Poisson-multinomial
(ASHIC-PM) or the zero-inflated Poisson-
multinomial (ASHIC-ZIPM) model. [default:
ASHIC-ZIPM]
--diag INTEGER Number of diagonals ignored in the contact
matrix. [default: 1]
--max-iter INTEGER Maximum iterations allowed for EM algorithm.
[default: 30]
--tol FLOAT Minimum relative difference between the last
two iterations when EM algorithm converges.
[default: 0.0001]
--seed INTEGER Seed of the random number generator
initializing structures and gamma values.
[default: 0]
--gamma-share INTEGER Number of diagonals from the last share the
same gamma value. If not provided, each
diagonal will use different gamma value.
--init-x TEXT Method to initialize structure: 'random',
'MDS' or a TEXT file containing precomputed
structure. [default: MDS]
--max-func INTEGER Maximum iterations allowed for each structure
optimization by L-BFGS-B. [default: 200]
--separate / --no-separate Whether or not optimize each structure
separately then combine. [default: True]
--normalize Incorporate ICE bias into model. [default:
False]
--save-iter / --no-save-iter Whether or not save model file from each
iteration. [default: False]
-h, --help Show this message and exit.
Sample data can be found at examples/sample_data/GSM2863686_chrX_1000000.pickle
. The data file can be generated with ashic-data
command.
Run ASHIC-ZIPM
model on the example data:
ashic -i examples/sample_data/GSM2863686_chrX_1000000.pickle -o output --model ASHIC-ZIPM
Run ASHIC-PM
model on the example data:
ashic -i examples/sample_data/GSM2863686_chrX_1000000.pickle -o output --model ASHIC-PM
The output
directory includes:
We use the mapped read pairs file from Bonora et al. (2018) (GSM2863686) as an example to show the steps to generate ASHIC input data.
The mapped read pairs file should be tab-delimited text format and contains at least the following columns:
The allele of each read end should be some tag representing maternal-specific, paternal-specific, and allele-ambiguous reads. In the case of the example, we use 'ref' for maternal-specific reads, 'alt' for paternal-specific reads, and 'both-ref' for allele-ambiguous reads.
The preprocessing step would be performed using ashic-data
command.
First, we split the read pairs into allele-certain and allele-ambiguous read pairs per chromosome using ashic-data split2chrs
command.
ashic-data split2chrs --prefix=<PREFIX> <FILENAME> <OUTPUT>
Replace <FILENAME>
with the read pairs file name, and <OUTPUT>
with the destination directory to store split allele-certain and allele-ambiguous read pairs per chromosome.
Use <PREFIX>
to specify the prefix of the output files.
If you use input file format different from Bonora et al. (2018), you need to specify the allele tags used, and the index (1-based) of each column as listed before. Please refer to the following detailed explanation of options to do so. The default option values are for Bonora et al. (2018) data.
> ashic-data split2chrs -h
Usage: ashic-data split2chrs [OPTIONS] FILENAME OUTPUTDIR
Split contacts into intra-chromosomal allele-certain and allele-ambiguous contacts.
The input file will be split into the followings:
Allele-certain contacts:
<FILENAME>_<CHR>_<MAT>_<MAT>
<FILENAME>_<CHR>_<PAT>_<PAT>
<FILENAME>_<CHR>_<MAT>_<PAT>
Allele-ambiguous contacts:
<FILENAME>_<CHR>_<MAT>_<AMB>
<FILENAME>_<CHR>_<PAT>_<AMB>
<FILENAME>_<CHR>_<AMB>_<AMB>
Options:
--prefix TEXT Prefix of the output files. If not provided, the input
filename will be used.
--mat TEXT Allele flag of maternal-specific reads. [default: ref]
--pat TEXT Allele flag of paternal-specific reads. [default: alt]
--amb TEXT Allele flag of allele-ambiguous reads. [default: both-ref]
--chr1 INTEGER Column index (1-based) of chromosome of the 1st end.
[default: 3]
--allele1 INTEGER Column index (1-based) of allele of the 1st end.
[default: 5]
--chr2 INTEGER Column index (1-based) of chromosome of the 2nd end.
[default: 8]
--allele2 INTEGER Column index (1-based) of allele of the 2nd end.
[default: 10]
-h, --help Show this message and exit.
Then, bin the split read pairs files of a chromosome into contact matrices at chosen resolution.
ashic-data binning --res=<RES> --chrom=<CHROM> <PREFIX> <OUTPUT>
Replace <RES>
with the resolution in base pair, <CHROM>
with chosen chromosome, <OUTPUT>
with the destination directory to store binned '.npy' matrix files. <PREFIX>
is the common prefix of the split read pairs files with directory included, e.g. <PREFIX>_chrX_ref_ref
is the file containing both-end maternal-specific read pairs on chrX.
If you use input file format different from Bonora et al. (2018), you need to specify the allele tags used, and the index (1-based) of each column as listed before. If the reads were mapped use reference genome other than 'mm10', a two-column tab-delimited text file ('.chrom.sizes') containing each chromosome name and size in base pair should be provided. Please refer to the following detailed explanation of options to do so. The default option values are for Bonora et al. (2018) data.
> ashic-data binning -h
Usage: ashic-data binning [OPTIONS] PREFIX OUTPUTDIR
Bin the mapped read pairs of chromosome <CHROM> into contact matrices.
The input files containing allele-certain and allele-ambiguous read pairs
should have name in format of <PREFIX>_<CHROM>_<MAT>_<MAT> and etc. PREFIX
need to include the PATH to the files as well.
Options:
--res INTEGER Resolution in base pair of the binned contact matrices.
[default: 100000]
--chrom TEXT Chromosome to generate the binned contact matrices.
[required]
--genome TEXT Genome reference. Built-in supports 'mm10'. Other genome
can be supported by providing a '**.chrom.sizes' file.
[default: mm10]
--start INTEGER If provided, instead of binning the whole chromosome, only
region after <start> (inclusive) will be binned.
--end INTEGER If provided, instead of binning the whole chromosome, only
region before <end> (inclusive) will be binned.
--mat TEXT Allele flag of maternal-specific reads. [default: ref]
--pat TEXT Allele flag of paternal-specific reads. [default: alt]
--amb TEXT Allele flag of allele-ambiguous reads. [default: both-ref]
--c1 INTEGER Column index (1-based) of chromosome of the 1st end.
[default: 3]
--p1 INTEGER Column index (1-based) of coordinate of the 1st end.
[default: 4]
--a1 INTEGER Column index (1-based) of allele of the 1st end.
[default: 5]
--c2 INTEGER Column index (1-based) of chromosome of the 2nd end.
[default: 8]
--p2 INTEGER Column index (1-based) of coordinate of the 2nd end.
[default: 9]
--a2 INTEGER Column index (1-based) of allele of the 2nd end.
[default: 10]
-h, --help Show this message and exit.
Another approach to generate mapped allele specific read pairs from raw FASTQ files is to run HiC-Pro with allele specific mode.
It will generate validPairs files which contain one additional column with the allele assignment of each read pair.
We will generate the binned matrices from HiC-Pro validPairs files using the ashic-data bin-hicpro
command:
ashic-data bin-hicpro <VALIDPAIRS> <OUTPUT> --res=<RES> --region=<REGION> --genome=<CHROM.SIZES>
ashic-data bin-hicpro
has two arguments:
<VALIDPAIRS>
: path to the HiC-Pro validPairs file <OUTPUT>
: output directory to store binned matricesAnd three required options:
--res=<RES>
: resolution of binned matrices in base-pair--region=<REGION>
: chromosomes(s) or region(s) to generate the binned matrices e.g. chr1
or chr1:1000000-5000000
, if multiple regions, separate each region with comma ,
--genome=<CHROM.SIZES>
: path to the chrom.sizes
file that specify the chromosome lengths for a given genomeThe default values for other options are consistent with HiC-Pro validPairs file format. Please refer to the full description of each option below:
> ashic-data bin-hicpro -h
Usage: ashic-data bin-hicpro [OPTIONS] FILENAME OUTPUT
Options:
--prefix TEXT .npy files prefix. If not provide, use <filename>.
--res INTEGER Resolution in base pair of the binned contact matrices.
[required]
--region TEXT Chromosome(s) or region(s) to generate the binned contact
matrices. [required]
--genome PATH Genome reference (.chrom.sizes) file.
--mat TEXT Allele flag of maternal-specific reads. [default: 1]
--pat TEXT Allele flag of paternal-specific reads. [default: 2]
--amb TEXT Allele flag of allele-ambiguous reads. [default: 0]
--c1 INTEGER Column index (1-based) of chromosome of the 1st end.
[default: 2]
--p1 INTEGER Column index (1-based) of coordinate of the 1st end.
[default: 3]
--a1 INTEGER Column index (1-based) of allele of the 1st end.
[default: 13]
--c2 INTEGER Column index (1-based) of chromosome of the 2nd end.
[default: 5]
--p2 INTEGER Column index (1-based) of coordinate of the 2nd end.
[default: 6]
--a2 INTEGER Column index (1-based) of allele of the 2nd end.
[default: 13]
--sep TEXT Delimiter string to separate allele flags of 1st and 2nd ends
if they are in the same column. [default: -]
-h, --help Show this message and exit.
Finally, we pack the binned matrices into pickle data with some essential parameters (length of chromosome, distance decay exponent, etc.) and filter low mappability loci. The output pickle data can be used as input for ASHIC.
ashic-data pack --perc=<PERC> <INPUT> <OUTPUT>
Replace <INPUT>
with the directory containing the binned '.npy' matrix files of a chromosome, and <OUTPUT>
with the destination directory to store the pickle data.
Use <PERC>
to specify the percentage of rows or columns have the lowest contacts to filter out.
Specify the allele tags if different from the example.
> ashic-data pack -h
Usage: ashic-data pack [OPTIONS] DATADIR OUTPUTDIR
Pack allele-certain and allele-ambiguous binned matrices in DATADIR into
ASHIC input format. Filter and mask loci with low mappability in allele-
certain intra-chromosomal contact matrices.
DATADIR should contain .npy format matrices files with names in format of
*_<MAT>_<MAT>_*.npy,
*_<MAT>_<PAT>_*.npy,
*_<PAT>_<PAT>_*.npy,
*_<MAT>_<AMB>_*.npy,
*_<PAT>_<AMB>_*.npy,
*_<AMB>_<AMB>_*.npy
Options:
--diag INTEGER Number of diagonals ignored in the contact matrix.
[default: 1]
--perc FLOAT Percentage (%) of rows/columns that interact the least to
filter out. Range from 0 to 100. [default: 2]
--mat TEXT Allele flag of maternal-specific reads. [default: ref]
--pat TEXT Allele flag of paternal-specific reads. [default: alt]
--amb TEXT Allele flag of allele-ambiguous reads. [default: both-ref]
-h, --help Show this message and exit.