Open suyeonwy opened 4 years ago
Hello! May I know how did you download the data from GEO? Sometimes when you download a paired-ended sequencing data and split into two end fastq files, a suffix such as "-1" or "_1" may be appended to the read name to indicate this is end 1. In our case, we don't want this to happen and the read name should be exactly the same and in the same order.
The corresponding python codes to check the read name is as below, if they are not exactly the same, you will see the error message.
if r1.qname == r2.qname:
if r1.is_unmapped == True and r2.is_unmapped == True:
unmapped_count += 1
continue
elif r1.is_unmapped == False and r2.is_unmapped == False:
mapped_count += 1
if is_uniread(r1) == True and is_uniread(r2) == True:
uni_count += 1
(r1, r2) = sam_flag(r1, r2, readEnd1, readEnd2)
outfile.write(r1)
outfile.write(r2)
else:
singleton_count += 1
else:
print("Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.")
sys.exit()
Otherwise......may I see a fraction of your input fastq files?
Thanks, Ye
I downloaded the data from SRA using fastq-dump
and split 2 paired-end files using --split-files
option
fastq-dump --split-files SRR6691720
These are some parts of my input FASTQ files
yea.....looks fine to me as well. Can you also share the original complete log or screen output information with the error messages? Looks to me that this error may not appear in the training section but in the simulation sequence data processing section. In that case, there must be something wrong with the simulation.
Here is the original complete log information!
I had to simulate multiple Hi-C data sets, so I completed the training step first and simulate multiple sets with training=0
option
[ Training step log ]
projDir=/mss5/RACA2/programs/FreeHiC
fastqFile=/mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720
ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
outDir=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x
simuName=bosTau9_10x
summaryFile=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
bwa=bwa
samtools=samtools
bedtools=bedtools
train=1
simulate=1
postProcess=1
coreN=20
mismatchN=3
gapN=1
mismatchP=
gapP=
chimericP=
simuN=394259238
readLen=80
resolution=40000
lowerBound=80000
refragU=800
ligateSite=AAGCTAGCTT
Pre-checking the input parameters......
1. FreeHi-C package directory exist.
2. /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720 will be downloaded.
3. Reference genome file exists and is not empty.
4. Restriction enzyme fragment file exists and is not empty.
5. Simulation run name is provided.
6. Output directory is provided.
7. Summary file path is provided.
8. Path to BWA executable file is provided. Checking bwa help manual......
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
9. Path to samtools executable file is provided. Checking samtools help manual......
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
10. Path to bedtools executable file is provided. Checking bedtools manual......
bedtools is a powerful toolset for genome arithmetic.
Version: v2.28.0
About: developed in the quinlanlab.org and by many contributors worldwide.
Docs: http://bedtools.readthedocs.io/
Code: https://github.com/arq5x/bedtools2
Mail: https://groups.google.com/forum/#!forum/bedtools-discuss
Usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
shift Adjust the position of intervals.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistribute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.
11. Action is defined for training raw input Hi-C sequencing data.
12. Action is defined for simulate Hi-C sequencing data.
13. Action is defined for whether to process simulated Hi-C sequencing data.
14. Number of processing core is specified.
15. Number of maximum alignment mismatches allowed is specified.
16. Number of maximum alignment gaps allowed is specified.
17. No number for the percentage of simulated reads that have mismatches is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
18. No number for the percentage of simulated reads that have gap is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
19. No number for the percentage of simulated reads that are chimeric reads is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
20. Sequencing depth of simulation is specified.
21. Read length of simulation is specified. Please make sure it does not exceed the read length of input raw Hi-C reads.
22. Resolution for processing Hi-C reads is specified.
23. Lower bound for valid long-range interactions is specified.
24. Upper bound for the distance to the assigned restriction enyzme cutting site is specified.
25. Sequences at the ligation site is provided.
Input parameter checking is successfull!!
Start FreeHi-C......
I. Training: 1 - Alignment!
(We assume the reads length are the same within the same fastq file.)......
Start alignment of read end 1
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 271.55 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 187.60 sec
[bwa_aln_core] write to the disk... 0.04 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_1.fastq
[main] Real time: 12026.644 sec; CPU: 208603.968 sec
[bwa_aln_core] convert to sequence coordinate... 5.38 sec
[bwa_aln_core] refine gapped alignments... 1.06 sec
[bwa_aln_core] print alignments... 0.40 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 6.54 sec
[bwa_aln_core] refine gapped alignments... 1.55 sec
[bwa_aln_core] print alignments... 0.51 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_1.sai /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_1.fastq
[main] Real time: 8343.945 sec; CPU: 6698.826 sec
Step1.2 - Filter and get unmapped alignment sam file.
Step1.3 - Trim unmapped reads until the restriction enzyme cutting site.
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 73810881 reads
##Fastq file: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_1.fastq
##Restriction sites:
AAGCTAGCTT
##Output File: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_1.fastq
##Trimmed reads: 3676
Step1.4 - Rescue chimeric reads by re-aligning trimmed unmapped reads.
[bwa_aln_core] calculate SA coordinate... 6.69 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 2551 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.fastq
[main] Real time: 6.780 sec; CPU: 10.564 sec
[bwa_aln_core] convert to sequence coordinate... 4.56 sec
[bwa_aln_core] refine gapped alignments... 0.84 sec
[bwa_aln_core] print alignments... 0.00 sec
[bwa_aln_core] 2551 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.fastq
[main] Real time: 7.280 sec; CPU: 5.413 sec
step1.5 - Merge chimeric reads with mapped full-length reads.
Start alignment of read end 2
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 328.69 sec
[bwa_aln_core] write to the disk... 0.10 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.88 sec
[bwa_aln_core] refine gapped alignments... 0.95 sec
[bwa_aln_core] print alignments... 0.46 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_2.sai /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_2.fastq
[main] Real time: 8567.442 sec; CPU: 6501.996 sec
Step1.2 - Filter and get unmapped alignment sam file.
Step1.3 - Trim unmapped reads until the restriction enzyme cutting site.
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 89078768 reads
##Fastq file: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_2.fastq
##Restriction sites:
AAGCTAGCTT
##Output File: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_2.fastq
##Trimmed reads: 4339
Step1.4 - Rescue chimeric reads by re-aligning trimmed unmapped reads.
[bwa_aln_core] calculate SA coordinate... 6.39 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 3067 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.fastq
[main] Real time: 11.036 sec; CPU: 9.074 sec
[bwa_aln_core] convert to sequence coordinate... 3.82 sec
[bwa_aln_core] refine gapped alignments... 0.67 sec
[bwa_aln_core] print alignments... 0.01 sec
[bwa_aln_core] 3067 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.fastq
[main] Real time: 12.245 sec; CPU: 4.506 sec
step1.5 - Merge chimeric reads with mapped full-length reads.
I. Training: 2 - Joining read-ends!
Read 1 file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_1.sam
Read 2 file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_2.sam
Output file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720.sam
Summary of alignment results = True
Summary file path = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
Training = 1
Verbose = True
Trainning!
------------Begin joining ends from read 1 file and read 2 file------------
# of reads processed: 1000000
.
.
.
# of reads processed: 203000000
I. Training: 3 - Categorize read-pairs!
RE fragment = /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Paired-ended read alignment file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720.sam
Output directory = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s2_validPairs
RE cutting site distance lower bound = 50
RE cutting site distance upper bound = 800
Minimum intrachromosomal contact distance = 80000
Binning method = window
Binning resolution = 40000
Summary = True
SummaryFile = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
Verbose = True
## Loading Restriction File Intervals '/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed'...
dict_keys(['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr23', 'chr24', 'chr25', 'chr26', 'chr27', 'chr28', 'chr29', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr999'])
------------Begin reading paired-ended file------------
------------Categorize aligned read pairs------------
I. Training: 4 - Remove duplicates!
I. Training: 5 - Binning!
[Simulation & Post-process step log]
projDir=/mss5/RACA2/programs/FreeHiC
fastqFile=/mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720
ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
outDir=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x
simuName=bosTau9_05x
summaryFile=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
bwa=bwa
samtools=samtools
bedtools=bedtools
train=0
simulate=1
postProcess=1
coreN=20
mismatchN=3
gapN=1
mismatchP=
gapP=
chimericP=
simuN=157702285
readLen=80
resolution=40000
lowerBound=80000
refragU=800
ligateSite=AAGCTAGCTT
Pre-checking the input parameters......
1. FreeHi-C package directory exist.
2. /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720 will be downloaded.
3. Reference genome file exists and is not empty.
4. Restriction enzyme fragment file exists and is not empty.
5. Simulation run name is provided.
6. Output directory is provided.
7. Summary file path is provided.
8. Path to BWA executable file is provided. Checking bwa help manual......
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
9. Path to samtools executable file is provided. Checking samtools help manual......
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
10. Path to bedtools executable file is provided. Checking bedtools manual......
bedtools is a powerful toolset for genome arithmetic.
Version: v2.28.0
About: developed in the quinlanlab.org and by many contributors worldwide.
Docs: http://bedtools.readthedocs.io/
Code: https://github.com/arq5x/bedtools2
Mail: https://groups.google.com/forum/#!forum/bedtools-discuss
Usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
shift Adjust the position of intervals.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistribute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.
11. Action is defined for training raw input Hi-C sequencing data.
12. Action is defined for simulate Hi-C sequencing data.
13. Action is defined for whether to process simulated Hi-C sequencing data.
14. Number of processing core is specified.
15. Number of maximum alignment mismatches allowed is specified.
16. Number of maximum alignment gaps allowed is specified.
17. No number for the percentage of simulated reads that have mismatches is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
18. No number for the percentage of simulated reads that have gap is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
19. No number for the percentage of simulated reads that are chimeric reads is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
20. Sequencing depth of simulation is specified.
21. Read length of simulation is specified. Please make sure it does not exceed the read length of input raw Hi-C reads.
22. Resolution for processing Hi-C reads is specified.
23. Lower bound for valid long-range interactions is specified.
24. Upper bound for the distance to the assigned restriction enyzme cutting site is specified.
25. Sequences at the ligation site is provided.
Input parameter checking is successfull!!
Start FreeHi-C......
## *****************************************
## II. Simulating interaction read sequences
## *****************************************
II. Simulating!
Feature (chr23:52498684-52498764) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498577-52498657) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498553-52498633) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498591-52498671) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498556-52498636) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498609-52498689) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498652-52498732) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498611-52498691) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498651-52498731) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498551-52498631) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498666-52498746) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498547-52498627) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498640-52498720) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498584-52498664) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498544-52498624) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498663-52498743) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498648-52498728) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498580-52498660) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498570-52498650) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498594-52498674) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498647-52498727) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498599-52498679) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498578-52498658) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498553-52498633) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498598-52498678) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498629-52498709) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498651-52498731) beyond the length of chr23 size (52498615 bp). Skipping.
Feature (chr23:52498538-52498618) beyond the length of chr23 size (52498615 bp). Skipping.
check1
check2
check3
check4
RE fragment = /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Fragment interaction file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/rawDataTraining/s2_validPairs/SRR6691720.validPairs.fragFreq
Output directory = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence
Output file name = SRR6691720
Total number of interactions to simulate = 157702285
Read length to simulate = 80
Maximum distance away from restriction site = 800
Path to bedtools = bedtools
Path to reference genome fasta file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa
Path to summary file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Verbose = True
Read in restriction fragment dictionary!
Read in interaction frequency!
Convert dictionary and list into C structure!
Sample fragment interaction!
Read in trained parameters!
243797050
101747200
2285627
2097474
Simulate end 1!
0
.
.
.
157500000
Get fasta from the reference genome!
Add mutation!
500000
.
.
.
65500000
Add insertion!
Add deletion
Simulate end 2!
0
.
.
.
157500000
Get fasta from reference genome!
Add mutation!
500000
.
.
.
65500000
Add insertion!
Add deletion!
II. Simulation Post-processing: 1 - Alignment!
(We assume the reads length are the same within the same fastq file.)......
Start alignment of read end 1
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 53.10 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 40.82 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 41.735 sec; CPU: 383.532 sec
[bwa_aln_core] convert to sequence coordinate... 5.36 sec
[bwa_aln_core] refine gapped alignments... 0.93 sec
[bwa_aln_core] print alignments... 0.46 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.28 sec
[bwa_aln_core] refine gapped alignments... 0.62 sec
[bwa_aln_core] print alignments... 0.33 sec
[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_1.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 56.076 sec; CPU: 42.642 sec
Choose not to rescue chimeric reads or the chimeric reads have been trimmed before alignment!
Start alignment of read end 2
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 53.93 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 48.07 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 11505705 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq
[main] Real time: 161.632 sec; CPU: 2509.953 sec
[bwa_aln_core] convert to sequence coordinate... 8.88 sec
[bwa_aln_core] refine gapped alignments... 0.69 sec
[bwa_aln_core] print alignments... 0.45 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.38 sec
[bwa_aln_core] refine gapped alignments... 0.59 sec
[bwa_aln_core] print alignments... 0.40 sec
[bwa_aln_core] 11505705 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_2.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq
[main] Real time: 263.851 sec; CPU: 262.748 sec
Choose not to rescue chimeric reads or the chimeric reads have been trimmed before alignment!
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
II. Simulation Post-processing: 2 - Joining read-ends!
Read 1 file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_1.sam
Read 2 file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_2.sam
Output file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720.sam
Summary of alignment results = True
Summary file path = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Training = 0
Verbose = True
------------Begin joining ends from read 1 file and read 2 file------------
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
II. Simulation Post-processing: 3 - Categorize read-pairs!
RE fragment = /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Paired-ended read alignment file = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720.sam
Output directory = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s2_validPairs
RE cutting site distance lower bound = 50
RE cutting site distance upper bound = 800
Minimum intrachromosomal contact distance = 80000
Binning method = window
Binning resolution = 40000
Summary = True
SummaryFile = /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Verbose = True
## Loading Restriction File Intervals '/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed'...
dict_keys(['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr23', 'chr24', 'chr25', 'chr26', 'chr27', 'chr28', 'chr29', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr999'])
------------Begin reading paired-ended file------------
------------Categorize aligned read pairs------------
II. Simulation Post-processing: 4 - Binning!
If you need any other information, please ask me! Thanks, Sue
Can you check the number of reads in /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq and /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq ? It seems that more than 157500000 reads are simulated for each end by the following log.
Simulate end 1!
0
.
.
.
157500000
However, only 1763658 reads of end 1 are aligned by bwa aln. Similarly for end 2. Much fewer reads are processed by bwa.
[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 41.735 sec; CPU: 383.532 sec
If the fastq files do not contain 157702285 reads, then there should be something wrong with the simulation section. BTW, have you tried simulating using the demo data and demo runs?
Ye
There are 157,702,265 reads in simuSequence/SRR6691720_1.fastq file and 157,702,277 reads in simuSequence/SRR6691720_2.fastq file.
Yes, I have already run the demo data and got simulation data. It seems that the demo run did not have any problems in outputs.
Emm......that is indeed very weird....looks like the simulation is almost complete. Do you mind running the simulation section again with 10 times fewer number of simuN, say, 15,770,228, or even smaller number to see if it works successfully? One of the concerns is that the memory on your server may not be enough to convert this many reads into sequences at the same time. Sometimes, to accelerate the simulation speed, I may also split the simulation into 10 runs and merge the final bin-pairs instead of one big simulation.
Let me know if this helps.
Ye
Hi,
I ran the simulation section with 10 times fewer numbers (15,000,000) as you recommended
There are 14,999,999 reads in both _1.fastq and _2.fastq files, so the numbers of simulated reads in both files are equal.
However, the read ids were still not matched, and also the same error messages emerged.
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
These are the examples of final simulation data [SRR6691720_1.fastq]
[sy0819@europa00 simuSequence]$ head -n 12 SRR6691720_1.fastq
@0(-)
ggggtagaggggtgggatgaagagattgggattgacatatatatacatgattgatactattcataaaatagataagtaat
+
EAEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE/AEEEEEEEE//AEEEEAEEEEEEEE/EEEEEE/EE/
@2(+)
CTGTTGAACCATATGATGATTGTTGGCCTCCTAACCttcccccttaacagttcagcacttgtctccttagaacaagggct
+
AAAEEEEEE/EE6EEE/EEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEE/EEEE/EEEEEEEEEEEEEEEAAE/E
@3(-)
acatattaatcaaattaacaaagatcaaacacaaagaacaaatattaaaagcagcaagggaaaaacaacaaataacacac
+
AAEEAEEEEEE/EEEEEEEAEEEE//AEAEEEEEEAEEEEEEEAE6/EEEEEEEEE6EEEE/EEEEEEEEEEE<EAAAAE
[SRR6691720_2.fastq]
[sy0819@europa00 simuSequence]$ head -n 12 SRR6691720_2.fastq
@0(+)
agaaaatcaatgaaactaaaagttgcttctttgaaaagattaataaaattgataaatgtctagccactactaaaaaaaat
+
EA/AAAEEE6EEEEAEEAEEAEE6EEEEEEEEEE<EE/EEEEEEEEEEEE<AEEEEEEEEEEAEEEEEEAEEE<EAAAEA
@1(-)
ttgctggcatattgagtgcagcactttcacagcatcatctttcaggatttggaatagctcaactggaattccatcacctc
+
AEEEAEAEEEEEEEEEEEEEEAEEEEEEEEE/E/EEEEE/EEEEEEEEAEEEE//EEEEEE/EE/EE6EEEEE/EAEAA<
@3(+)
gcttccctcctggccctctcccctcactcctccacctgctcccattcctcaaacacaccatgctcccttccgcttcgaga
+
AE/EAEEEEEEEEAAEEEE6EEEEAAEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEE/AAEEEEE/AAEA
Do you think that I should run the simulation with much smaller numbers? Please let me know if you need other information
Thanks, Sue
I see!! I just realized that bedtools has updated its getfasta function in recent released versions!! A quick solution is that if you can download and install 2.25.0 version of bedtools: https://github.com/arq5x/bedtools2/releases/tag/v2.25.0 and still use current freehic software, it should solve the problem. Otherwise, I will update the freehic repository later this week to accommodate the new output from the latest version of bedtools.
Let me know how it goes! Thanks for bringing it up! Appreciate your time and tries!
Ye
I was using bedtools 2.28.0 version before and I downloaded bedtools 2.25.0 version as you advised
I corrected the bedtools path in the parameter files and re-ran the simulation
bedtools="/mss5/RACA2/programs/bedtools2/bin/bedtools"
[sy0819@europa00 bosTau9]$ /mss5/RACA2/programs/bedtools2/bin/bedtools -version
bedtools v2.25.0
But I still get the same error messages before and the numbers of reads in both files are still different
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
Total line counts in both reads and examples of FASTQ files are below
[sy0819@europa00 simuSequence]$ wc -l *
59999996 SRR6691720_1.fastq
60000000 SRR6691720_2.fastq
[sy0819@europa00 simuSequence]$ head SRR6691720_1.fastq
@0
TGAGCCTGTGTTATCCCCCTCTTCTCATGTGTCCCCCTGCTAGAGGTGCAGCTCTAACTAGATCACTTCTCCTTTACTGG
+
EAAEAA/EEE6EAA/EEEEEEEA6EEEEEEEAEAAEEEE//EE/EEEEEEEEEEEEEEEEEEEEEEAEE<E/E/EAA6AA
@2
acttctgcaaagaagacatatagatggccaataaacacatgaaaagattctcgacatcactaactattCAGTTCAGTTCa
+
AE6EAAEEEEEEE/EEEEEEEEEEEEEEEEEEE//EE/EEAEEEEE/EEEEEEEEEEEE/EEEEEAE/EEEE<EEAEAAA
@4
AATAATAATAACAGTGAAAGAATGATTCTTCACCATTGGTGTGGTGTTTTTTCCCTTTCATGccccaaacacatgtgtat
[sy0819@europa00 simuSequence]$ head SRR6691720_2.fastq
@1
gagaatcccatggaccgaggaacctggtaggctgcagtccatgaggtcgctaagagtcggacacgactgagagacttcac
+
AEA/AEEEEAEEAAEEEEAEEEEEEEEEEEEE/E<EEEEAEE/EEEEEAEE/EEE/E<EE/EEEEEAEEEEEEAEA/AAA
@2
AATAGAAAATGGAAAGTAATTGAGGAGACATGTCCATATATTGTCAAACTAAAAATAACTAACCCATGGATGCTTTGAAA
+
AA//AEEEEAEEEEEEEEEEEAE/E<EEEEEEE/E/E/E/EEEEEEEEAE6EEE/EAEEEEE/EEEE<EEEAEEEEEEAE
@3
AAAAATATACCTTTTAAATAAACATCTATAAAAGTCTAAAATAATTATAAGTATTGAAAAAAATCAGGAATTTTGCTTTC
I think bedtools
could affect the final results but read unmatching problem is not that case.
Could you check it again please?
Sue
Hello Sue, I tried many ways but I still cannot reproduce the problem you have using demo data and other data on my side. Do you mind sharing the following two datasets or tell me where you downloaded online so that I can re-run your data and see what is happening?
ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
BTW, I noticed that you claim to use 20 cores for parallel alignment. Can you confirm that the computing resources are available on your side?
Apologize for the trouble that delays your studies!
Ye
Hello Ye,
I used the bosTau9 chromosome level assembly which is soft-masked and changed the name of chrX to chr999.
[sy0819@europa00 FreeHiC]$ grep ">" bosTau9.soft-masked.rename.fa
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr20
>chr21
>chr22
>chr23
>chr24
>chr25
>chr26
>chr27
>chr28
>chr29
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr999
You could just download the data from the link below and filtering regular chromosome sequences only (chr1~29 + chrX).
After that, just rename chrX
into chr999
.
http://hgdownload.soe.ucsc.edu/goldenPath/bosTau9/bigZips/bosTau9.fa.gz
BTW, I renamed the chromosome X as I got some error messages while using FreeHiC because of chrX. It would be great if FreeHiC supports non-integer chromosome names including chromosome X or Y also :)
cow_hindiii.bed
was made by HiC-Pro's digest_genome.py script.
$PATH/HiC-Pro_2.11.1/bin/utils/digest_genome.py -r hindiii -o cow_hindiii.bed /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa
Or I could just share these data by email. So if you want me to send these data directly to you, please ask me!
Thanks, Sue
While I am running, if you are interested, you can also check out the lite version of FreeHiC that we recently developed. It is an approximation of the original FreeHi-C simulation pipeline but much faster. ;-)
Sounds great! I'll try the lite version also!
If you have any progress, please let me know! :)
Thanks, Sue
Hello, ye!
Did you reproduce the same results from mine with the data I told you before? It'll be great if I could know any progress about it!
Thanks, Sue
Apologize for the delay! I should be able to finish the runs later this week.
Ye
Hello Sue, I have tried on your data and simulated at 15000 and 15000000. Both works successfully on my side: I got the same number of reads in fastq files and matching read query name after alignment and sorting (instead of checking the simulated fastq file, it is more reliable to check the read id in sam file as reads in fastq file are not sorted by read id yet.)
>wc -l *fastq
60000000 cowData_demo_1.fastq
60000000 cowData_demo_2.fastq
120000000 total
>less cowData_demo_1.sam
@HD VN:1.6 SO:queryname
@SQ SN:chr1 LN:158534110
@SQ SN:chr2 LN:136231102
@SQ SN:chr3 LN:121005158
@SQ SN:chr4 LN:120000601
@SQ SN:chr5 LN:120089316
@SQ SN:chr6 LN:117806340
@SQ SN:chr7 LN:110682743
@SQ SN:chr8 LN:113319770
@SQ SN:chr9 LN:105454467
@SQ SN:chr10 LN:103308737
@SQ SN:chr11 LN:106982474
@SQ SN:chr12 LN:87216183
@SQ SN:chr13 LN:83472345
@SQ SN:chr14 LN:82403003
@SQ SN:chr15 LN:85007780
@SQ SN:chr16 LN:81013979
@SQ SN:chr17 LN:73167244
@SQ SN:chr18 LN:65820629
@SQ SN:chr19 LN:63449741
@SQ SN:chr20 LN:71974595
@SQ SN:chr21 LN:69862954
@SQ SN:chr22 LN:60773035
@SQ SN:chr23 LN:52498615
@SQ SN:chr24 LN:62317253
@SQ SN:chr25 LN:42350435
@SQ SN:chr26 LN:51992305
@SQ SN:chr27 LN:45612108
@SQ SN:chr28 LN:45940150
@SQ SN:chr29 LN:51098607
@SQ SN:chr999 LN:139009144
@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa-0.7.15/bwa samse cowData/data/bosTau9.soft-masked.rename.fa cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sai cowData/results/bosTau9_test/simuSequence/cowData_demo_1.fastq
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sam
0 0 chr2 101240873 37 80M * 0 0 GGAAACCCAAAGAGGTGAAGAAGGAAATAGCAACCATAAGGCAACAAATTAAAAAAAGAGGAGTCAGCATAAAGCCTGGA EAA<EEE/E/EEEEE/EEE/EAEEEEEEEEEEE/EEE6EEEEEEEEEEE6EEEEEEAEEEEEEAEEEEEEEEEEEAA/AA XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:55T24
1 16 chr9 31279648 37 81M * 0 0 GCTCTTCCCAGAATCAGCTTTAGCCTTGGCTACATCTTCCTGAGCGTTTCTCTTCTGGCTTTGGACCGTCTGTCAAATGAA AEAEEA/EEEE6EEEAEEE6EEEA/EEEEEE/EEEEEEEEEEE6/EEEEEEEEEEEEAE<EEEEAAEAEE/EEEEEEEAAA XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:28A51C0
2 0 chr3 29509100 37 80M * 0 0 TCTACATCTGTTTCACTTTGGTGGTACATTTCACAGGGTCTTTAACACTAGGCAGATCTCCCCATTCTTACCACTTGTTA AAEA/EEEEEEEEE<E6EEEEEE/EEEEE/EAEEEEEEEE/EEEEAEAEEEEEEEEEEEEEEEEEE/EEE/EE/EEAEA/ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
3 16 chr21 25837445 37 80M * 0 0 TTCAACCATCTCATCCTCTGTCACTGTCTTCTGCCCTCAATCTCTCACAGCATCAGGGTCTTTTCCAATGAGTTGGCTCT AEAAAEEA<6/EEEEEEEEEEE/EEAEEEEEEAE/EEEEAEEEEEE/EEEEEEAEEEEEEEEEEEE<EEEEEEEEAAAEA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
4 16 chr9 35227839 37 80M * 0 0 TTGTAAATCCCTTGACCTGGTCAAGCTTATTTCCTACATTCTAAATTTTATGCTCAAGAATTACTAACTTCCACAAATCT AAAAAE/EEEEEEEEE/E/EEEEEEEEAEAE/EAEEEEEE/EEEEEEEE/EAEAAEEEEEEEEEE/AE/E<EE/EA/AEA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
5 0 chr11 21856404 37 80M * 0 0 TAGGAGCTCAAAGCATGTATCCCATTCCCCTTTCCCTGGGTCTGTTACACATGATAAGTTCTAAGAACAATCTAAGACTA AAEAE6E/EEEEAEEAEEEE//EE/EEEEEEEEEEEEEEEEE/AE6A/E/6AEEEEAE/EEEEA6EEEEE/EEEEAEA<A XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:28T51
6 16 chr11 21855034 37 80M * 0 0 AGCTGTGTGGATCTATTTTACATCATGGATCTCCTTCAGAGAGTAGAGTTCTACCAGAGAATGTTGCAAAGAGTAACCAG AEAAAEEEAAEE/EEEAEE/E/EEEEEEEE/A<EEEEEEEEEEEEEEEEEAEEEEEEEAEEEEAEAE6/EEEEE/EEEAA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
7 0 chr29 5819250 37 80M * 0 0 TAAACAGTGCTATGATGAACATTGGGGTACAGATGTCTCTTTCAATTCTGGTTTCTTCGGTGTGTATGCCCAGCAGTGGG /EEEEEEEEE/EEEEEEEE/EEEEEEEAEEEEEEEEAE/EEEAE<EE/EEEEE/EEEEEEEAEEEEEEEEEEEEEAEAE/ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
>less cowData_demo_2.sam
@HD VN:1.6 SO:queryname
@SQ SN:chr1 LN:158534110
@SQ SN:chr2 LN:136231102
@SQ SN:chr3 LN:121005158
@SQ SN:chr4 LN:120000601
@SQ SN:chr5 LN:120089316
@SQ SN:chr6 LN:117806340
@SQ SN:chr7 LN:110682743
@SQ SN:chr8 LN:113319770
@SQ SN:chr9 LN:105454467
@SQ SN:chr10 LN:103308737
@SQ SN:chr11 LN:106982474
@SQ SN:chr12 LN:87216183
@SQ SN:chr13 LN:83472345
@SQ SN:chr14 LN:82403003
@SQ SN:chr15 LN:85007780
@SQ SN:chr16 LN:81013979
@SQ SN:chr17 LN:73167244
@SQ SN:chr18 LN:65820629
@SQ SN:chr19 LN:63449741
@SQ SN:chr20 LN:71974595
@SQ SN:chr21 LN:69862954
@SQ SN:chr22 LN:60773035
@SQ SN:chr23 LN:52498615
@SQ SN:chr24 LN:62317253
@SQ SN:chr25 LN:42350435
@SQ SN:chr26 LN:51992305
@SQ SN:chr27 LN:45612108
@SQ SN:chr28 LN:45940150
@SQ SN:chr29 LN:51098607
@SQ SN:chr999 LN:139009144
@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa-0.7.15/bwa samse cowData/data/bosTau9.soft-masked.rename.fa cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sai cowData/results/bosTau9_test/simuSequence/cowData_demo_2.fastq
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sam
0 16 chr2 103068449 37 80M * 0 0 ACTTGACATTCTAAAAGCAGGATCTTTACATTTTTTGTTCATTTAGAAAAAACATTTTAGAGAAAGATTCCTACCAAATA EAAA/EEEAEAE/EEEEEEE/EEEEEEEEEEEEE/E<EAEEEEEEEAEEEEEEEEAE/AEE/EEA/EEE/6E/EEA/EEA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
1 0 chr9 31688430 37 80M * 0 0 GGAGATGGAAGATCTTTGCTTGGTGAGGCATTCCCTGAGAAAGCAATCCTGGCTAGTTCCTTTGCTCTGCAGAATGCGGA A6E/<EEEEEAEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAAEEEEAEEEEEEEEEEEEEEAAEA< XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:77A2
2 0 chr3 30227271 37 80M * 0 0 GAAAAAATATGAGTCTTTTAAAAATTCATTCTTCCATTCTCAGAAGGTACATCTGCACTGCACCTTGTCCTCATGGAAGA EAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEAEEAE/EEEEEEEEEEEEEEAEAEA XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:51A28
3 16 chr29 22485763 37 80M * 0 0 CAAAAATATGTTCAACTGCATATATAGCTCACCATTTAGGGTATTCATGAATCCAAAACCCAACCATAGCTTTTAAACCA AEEAAAAEEEEEEEEEAEE//EEEE/EEEEEAEEEEAE/EEEAAEEEEEAE/EEEEEEEEEEEAEAEEEEEEEE<EA/EA XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:67G12
4 16 chr9 37358606 37 80M * 0 0 CCTTCACTATCTCCCGGAAGTTGTTCAAATTCACATTCACTGAGTTGGTGATGCCATCCAACTATTTCATCCTCTATTGT EAE6/EEEEEE/EEEEE/EEEEE/EEEEAAEEEEAEEEE<EEEAEEEEEEAEEEEEAEEEEAEEEEEEE/EAEEEEAAAE XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:23A56
5 16 chr999 11396651 37 80M * 0 0 TACAGGGTCAGTTCATTCTATCACTCTGCAGCTTACCAGGTATCTAGGATTTGGGACAGAGTCATAGAAGACAACATGAA AAAAAEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEAAEE/EEEEEEEEEEEEEEAEEE6EEA<EEE/EE/EEA6A/ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
6 0 chr999 11396545 37 80M * 0 0 ATCTCTGTTTTACACATGAGGAAACTGAGGCTTATGGAGGTTAATAAGTTGTCCAAAGTCTTTCATCCAGAAAGCAACAG /AAAE/EEEEEEEEAE/EAEE6EEEEEAAE/EEEEEEEEEE/EAEE/EEEEEEEEEEEEEEEAEE/EE/EEEAE/AEAAA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
7 0 chr29 7194098 37 80M * 0 0 ATGCACAGTGCCCTAAGTGAGCTCAGCTGTTACATAACGATTATTTACAAAGTGTGGTGAGAGGGTGGGGTGAGAGAGCC A66E/EEEAEEEEAEAEEEEAEEAEEE/EEEEEAEEEEA6EE/EEEEEEEEEAEEEEEE/EEEEEEEEEEEEEEEAAAAA XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
Therefore, the issue you came across is still not quite clear to me.....may be related to your computing resources? Maybe you can try using 1 core, larger memory resource, clear the /tmp folder where sorting temporary files are stored, and simulate 10000 reads or smaller to start.
Otherwise, if this read id mismatches issue continues, maybe the following filtering may help. The rationale is to match read query name from both ends in the sam file. I added right after the alignment step in the simulation session. You can replace the original freeHiC.sh
by the newly added freeHiC_v2.sh
and see if it will help. Please let me know how it goes and I can merge it with the original pipeline! Thanks!
## match sam queryname from both ends %%%%
$samtools view ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam | awk '{print $1}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1
$samtools view ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam | awk '{print $1}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2
cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1 ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2 | sort | uniq -c | awk '$1 == 2 {print $2}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryNam\
e.common
queryLen1=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1 | wc -l)
queryLen2=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2 | wc -l)
queryLenC=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common | wc -l)
if [[ "$queryLen1" != "$queryLenC" ]] || [[ "$queryLen2" != "$queryLenC" ]] || [[ "$queryLen1" != "$queryLen2" ]]; then
echo "Matching simulation query names between end 1 and end2......"
${samtools} view -H ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam >${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam.header
${samtools} view -H ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam >${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam.header
awk 'NR==FNR{a[$0]=1;next}a[$1]' ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam >>${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.\
sam.header
awk 'NR==FNR{a[$0]=1;next}a[$1]' ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam >>${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.\
sam.header
mv ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam.header ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam
mv ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam.header ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam
fi
rm -rf ${outDir}/${simuName}/simuProcess/s1_alignment/queryName*
Best, Ye
Hi, I'm using FreeHiC for simulating Cow Hi-C data.
For fastqFile data using during training steps, SRR6691720 data (https://www.ncbi.nlm.nih.gov/sra/?term=SRR6691720) was used and other parameters are below.
train=1 simulate=1 postProcess=1 coreN=20 mismatchN=3 gapN=1 mismatchP="" gapP="" chimericP="" simuN=157702285 readLen=80 resolution=40000 lowerBound=80000 refragU=800 ligateSite="AAGCTAGCTT"
After simulating Hi-C data, I checked the output of the simulation and found that read IDs in each read file did not match. Also, the numbers of reads in read 1 and read 2 files are different. There was a warning message in log but I already checked that my input FASTQ files do not have any problems. (All reads are sorted equally in both input read files and all IDs are matched)
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
Could you tell me how to fix this problem?