yezhengSTAT / FreeHiC

FreeHi-C pipeline for high fidelity Hi-C data simulation.
MIT License
8 stars 9 forks source link

Different numbers of final read counts in _1 and _2 FASTQ files #4

Open suyeonwy opened 4 years ago

suyeonwy commented 4 years ago

Hi, I'm using FreeHiC for simulating Cow Hi-C data.

For fastqFile data using during training steps, SRR6691720 data (https://www.ncbi.nlm.nih.gov/sra/?term=SRR6691720) was used and other parameters are below.

train=1 simulate=1 postProcess=1 coreN=20 mismatchN=3 gapN=1 mismatchP="" gapP="" chimericP="" simuN=157702285 readLen=80 resolution=40000 lowerBound=80000 refragU=800 ligateSite="AAGCTAGCTT"

After simulating Hi-C data, I checked the output of the simulation and found that read IDs in each read file did not match. Also, the numbers of reads in read 1 and read 2 files are different. There was a warning message in log but I already checked that my input FASTQ files do not have any problems. (All reads are sorted equally in both input read files and all IDs are matched) Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.

Could you tell me how to fix this problem?

yezhengSTAT commented 4 years ago

Hello! May I know how did you download the data from GEO? Sometimes when you download a paired-ended sequencing data and split into two end fastq files, a suffix such as "-1" or "_1" may be appended to the read name to indicate this is end 1. In our case, we don't want this to happen and the read name should be exactly the same and in the same order.

The corresponding python codes to check the read name is as below, if they are not exactly the same, you will see the error message.

if r1.qname == r2.qname:
                if r1.is_unmapped == True and r2.is_unmapped == True:
                    unmapped_count += 1
                    continue
                elif r1.is_unmapped == False and r2.is_unmapped == False:
                    mapped_count += 1
                    if is_uniread(r1) == True and is_uniread(r2) == True:
                        uni_count += 1
                        (r1, r2) = sam_flag(r1, r2, readEnd1, readEnd2)
                        outfile.write(r1)
                        outfile.write(r2)

                else:
                    singleton_count += 1
            else:
                print("Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.")
                sys.exit()

Otherwise......may I see a fraction of your input fastq files?

Thanks, Ye

suyeonwy commented 4 years ago

I downloaded the data from SRA using fastq-dump and split 2 paired-end files using --split-files option fastq-dump --split-files SRR6691720

These are some parts of my input FASTQ files

``` [sy0819@europa00 Cow]$ head SRR6691720_1.fastq @SRR6691720.1 1 length=80 GTTCTNTTTAGTTGAGAATTCTATTTCAGTCTCTTTTATTCATCTGGTCTCCTTCTNCANTANNNNNNNCANNNNNNNNN +SRR6691720.1 1 length=80 AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEAEEEEEEEEEEEE#AE#EE####### ```[sy0819@europa00 Cow]$ head SRR6691720_2.fastq @SRR6691720.1 1 length=80 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +SRR6691720.1 1 length=80 ################################################################################ @SRR6691720.2 2 length=80 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +SRR6691720.2 2 length=80 ################################################################################ @SRR6691720.3 3 length=80 NANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ``` I also checked SAM files, which is the output of FreeHi-C rawDataTraining steps. These files are in the rawDataTraining/s1_alignment/ directory and part of SAM files are below. `SRR6691720_1.sam` ``` SRR6691720.1 4 * 0 0 * * 0 0 GTTCTNTTTAGTTGAGAATTCTATTTCAGTCTCTTTTATTCATCTGGTCTCCTTCTNCANTANNNNNNNCANNNNNNNNN AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEAEEEEEEEEEEEE#AE#EE#######
yezhengSTAT commented 4 years ago

yea.....looks fine to me as well. Can you also share the original complete log or screen output information with the error messages? Looks to me that this error may not appear in the training section but in the simulation sequence data processing section. In that case, there must be something wrong with the simulation.

suyeonwy commented 4 years ago

Here is the original complete log information! I had to simulate multiple Hi-C data sets, so I completed the training step first and simulate multiple sets with training=0 option

[ Training step log ]

projDir=/mss5/RACA2/programs/FreeHiC
fastqFile=/mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720
ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
outDir=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x
simuName=bosTau9_10x
summaryFile=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
bwa=bwa
samtools=samtools
bedtools=bedtools
train=1
simulate=1
postProcess=1
coreN=20
mismatchN=3
gapN=1
mismatchP=
gapP=
chimericP=
simuN=394259238
readLen=80
resolution=40000
lowerBound=80000
refragU=800
ligateSite=AAGCTAGCTT
Pre-checking the input parameters......
  1. FreeHi-C package directory exist.
  2. /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720 will be downloaded.
  3. Reference genome file exists and is not empty.
  4. Restriction enzyme fragment file exists and is not empty.
  5. Simulation run name is provided.
  6. Output directory is provided.
  7. Summary file path is provided.
  8. Path to BWA executable file is provided. Checking bwa help manual......

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

  9. Path to samtools executable file is provided. Checking samtools help manual......

Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

  10. Path to bedtools executable file is provided. Checking bedtools manual......
bedtools is a powerful toolset for genome arithmetic.

Version:   v2.28.0
About:     developed in the quinlanlab.org and by many contributors worldwide.
Docs:      http://bedtools.readthedocs.io/
Code:      https://github.com/arq5x/bedtools2
Mail:      https://groups.google.com/forum/#!forum/bedtools-discuss

Usage:     bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    shift         Adjust the position of intervals.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistribute intervals in a genome.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

  11. Action is defined for training raw input Hi-C sequencing data.
  12. Action is defined for simulate Hi-C sequencing data.
  13. Action is defined for whether to process simulated Hi-C sequencing data.
  14. Number of processing core is specified.
  15. Number of maximum alignment mismatches allowed is specified.
  16. Number of maximum alignment gaps allowed is specified.
  17. No number for the percentage of simulated reads that have mismatches is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  18. No number for the percentage of simulated reads that have gap is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  19. No number for the percentage of simulated reads that are chimeric reads is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  20. Sequencing depth of simulation is specified.
  21. Read length of simulation is specified. Please make sure it does not exceed the read length of input raw Hi-C reads.
  22. Resolution for processing Hi-C reads is specified.
  23. Lower bound for valid long-range interactions is specified.
  24. Upper bound for the distance to the assigned restriction enyzme cutting site is specified.
  25. Sequences at the ligation site is provided.
Input parameter checking is successfull!!
Start FreeHi-C......
I. Training: 1 - Alignment!
(We assume the reads length are the same within the same fastq file.)...... 
Start alignment of read end 1
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 271.55 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 187.60 sec
[bwa_aln_core] write to the disk... 0.04 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_1.fastq
[main] Real time: 12026.644 sec; CPU: 208603.968 sec
[bwa_aln_core] convert to sequence coordinate... 5.38 sec
[bwa_aln_core] refine gapped alignments... 1.06 sec
[bwa_aln_core] print alignments... 0.40 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 6.54 sec
[bwa_aln_core] refine gapped alignments... 1.55 sec
[bwa_aln_core] print alignments... 0.51 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_1.sai /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_1.fastq
[main] Real time: 8343.945 sec; CPU: 6698.826 sec
Step1.2 - Filter and get unmapped alignment sam file.
Step1.3 - Trim unmapped reads until the restriction enzyme cutting site.
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 73810881 reads
##Fastq file: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_1.fastq
##Restriction sites:
AAGCTAGCTT
##Output File: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_1.fastq
##Trimmed reads: 3676

Step1.4 - Rescue chimeric reads by re-aligning trimmed unmapped reads.
[bwa_aln_core] calculate SA coordinate... 6.69 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 2551 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.fastq
[main] Real time: 6.780 sec; CPU: 10.564 sec
[bwa_aln_core] convert to sequence coordinate... 4.56 sec
[bwa_aln_core] refine gapped alignments... 0.84 sec
[bwa_aln_core] print alignments... 0.00 sec
[bwa_aln_core] 2551 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_1.fastq
[main] Real time: 7.280 sec; CPU: 5.413 sec
step1.5 - Merge chimeric reads with mapped full-length reads.
Start alignment of read end 2
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 328.69 sec
[bwa_aln_core] write to the disk... 0.10 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.88 sec
[bwa_aln_core] refine gapped alignments... 0.95 sec
[bwa_aln_core] print alignments... 0.46 sec
[bwa_aln_core] 203342006 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_2.sai /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720_2.fastq
[main] Real time: 8567.442 sec; CPU: 6501.996 sec
Step1.2 - Filter and get unmapped alignment sam file.
Step1.3 - Trim unmapped reads until the restriction enzyme cutting site.
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 89078768 reads
##Fastq file: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_2.fastq
##Restriction sites:
AAGCTAGCTT
##Output File: /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_2.fastq
##Trimmed reads: 4339

Step1.4 - Rescue chimeric reads by re-aligning trimmed unmapped reads.
[bwa_aln_core] calculate SA coordinate... 6.39 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 3067 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.fastq
[main] Real time: 11.036 sec; CPU: 9.074 sec
[bwa_aln_core] convert to sequence coordinate... 3.82 sec
[bwa_aln_core] refine gapped alignments... 0.67 sec
[bwa_aln_core] print alignments... 0.01 sec
[bwa_aln_core] 3067 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_unmapped_trim_filter_2.fastq
[main] Real time: 12.245 sec; CPU: 4.506 sec
step1.5 - Merge chimeric reads with mapped full-length reads.
I. Training: 2 - Joining read-ends!
Read 1 file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_1.sam
Read 2 file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720_2.sam
Output file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720.sam
Summary of alignment results =  True
Summary file path =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
Training =  1
Verbose =  True
Trainning!
------------Begin joining ends from read 1 file and read 2 file------------
# of reads processed:  1000000
.
.
.
# of reads processed:  203000000
I. Training: 3 - Categorize read-pairs!
RE fragment =  /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Paired-ended read alignment file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s1_alignment/SRR6691720.sam
Output directory =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/rawDataTraining/s2_validPairs
RE cutting site distance lower bound =  50
RE cutting site distance upper bound =  800
Minimum intrachromosomal contact distance =  80000
Binning method =  window
Binning resolution =  40000
Summary =  True
SummaryFile =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_10x/summary/bosTau9_FreeHiC.summary
Verbose =  True
## Loading Restriction File Intervals '/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed'...
dict_keys(['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr23', 'chr24', 'chr25', 'chr26', 'chr27', 'chr28', 'chr29', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr999'])
------------Begin reading paired-ended file------------
------------Categorize aligned read pairs------------
I. Training: 4 - Remove duplicates!
I. Training: 5 - Binning!

[Simulation & Post-process step log]

projDir=/mss5/RACA2/programs/FreeHiC
fastqFile=/mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720
ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
outDir=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x
simuName=bosTau9_05x
summaryFile=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
bwa=bwa
samtools=samtools
bedtools=bedtools
train=0
simulate=1
postProcess=1
coreN=20
mismatchN=3
gapN=1
mismatchP=
gapP=
chimericP=
simuN=157702285
readLen=80
resolution=40000
lowerBound=80000
refragU=800
ligateSite=AAGCTAGCTT
Pre-checking the input parameters......
  1. FreeHi-C package directory exist.
  2. /mss5/RACA2/read_simulation/hic/realData/Cow/SRR6691720 will be downloaded.
  3. Reference genome file exists and is not empty.
  4. Restriction enzyme fragment file exists and is not empty.
  5. Simulation run name is provided.
  6. Output directory is provided.
  7. Summary file path is provided.
  8. Path to BWA executable file is provided. Checking bwa help manual......

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

  9. Path to samtools executable file is provided. Checking samtools help manual......

Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

  10. Path to bedtools executable file is provided. Checking bedtools manual......
bedtools is a powerful toolset for genome arithmetic.

Version:   v2.28.0
About:     developed in the quinlanlab.org and by many contributors worldwide.
Docs:      http://bedtools.readthedocs.io/
Code:      https://github.com/arq5x/bedtools2
Mail:      https://groups.google.com/forum/#!forum/bedtools-discuss

Usage:     bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    shift         Adjust the position of intervals.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistribute intervals in a genome.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

  11. Action is defined for training raw input Hi-C sequencing data.
  12. Action is defined for simulate Hi-C sequencing data.
  13. Action is defined for whether to process simulated Hi-C sequencing data.
  14. Number of processing core is specified.
  15. Number of maximum alignment mismatches allowed is specified.
  16. Number of maximum alignment gaps allowed is specified.
  17. No number for the percentage of simulated reads that have mismatches is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  18. No number for the percentage of simulated reads that have gap is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  19. No number for the percentage of simulated reads that are chimeric reads is specified. FreeHi-C will use the default percentage that is equal to that of the input raw Hi-C sequencing data.
  20. Sequencing depth of simulation is specified.
  21. Read length of simulation is specified. Please make sure it does not exceed the read length of input raw Hi-C reads.
  22. Resolution for processing Hi-C reads is specified.
  23. Lower bound for valid long-range interactions is specified.
  24. Upper bound for the distance to the assigned restriction enyzme cutting site is specified.
  25. Sequences at the ligation site is provided.
Input parameter checking is successfull!!
Start FreeHi-C......
## *****************************************
## II. Simulating interaction read sequences
## *****************************************
II. Simulating!
Feature (chr23:52498684-52498764) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498577-52498657) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498553-52498633) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498591-52498671) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498556-52498636) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498609-52498689) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498652-52498732) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498611-52498691) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498651-52498731) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498551-52498631) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498666-52498746) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498547-52498627) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498640-52498720) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498584-52498664) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498544-52498624) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498663-52498743) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498648-52498728) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498580-52498660) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498570-52498650) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498594-52498674) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498647-52498727) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498599-52498679) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498578-52498658) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498553-52498633) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498598-52498678) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498629-52498709) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498651-52498731) beyond the length of chr23 size (52498615 bp).  Skipping.
Feature (chr23:52498538-52498618) beyond the length of chr23 size (52498615 bp).  Skipping.
check1
check2
check3
check4
RE fragment =  /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Fragment interaction file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/rawDataTraining/s2_validPairs/SRR6691720.validPairs.fragFreq
Output directory =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence
Output file name =  SRR6691720
Total number of interactions to simulate =  157702285
Read length to simulate =  80
Maximum distance away from restriction site =  800
Path to bedtools =  bedtools
Path to reference genome fasta file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa
Path to summary file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Verbose =  True
Read in restriction fragment dictionary!
Read in interaction frequency!
Convert dictionary and list into C structure!
Sample fragment interaction!
Read in trained parameters!
243797050
101747200
2285627
2097474
Simulate end 1!
0
.
.
.
157500000
Get fasta from the reference genome!
Add mutation!
500000
.
.
.
65500000
Add insertion!
Add deletion
Simulate end 2!
0
.
.
.
157500000
Get fasta from reference genome!
Add mutation!
500000
.
.
.
65500000
Add insertion!
Add deletion!
II. Simulation Post-processing: 1 - Alignment!
(We assume the reads length are the same within the same fastq file.)...... 
Start alignment of read end 1
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 53.10 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 40.82 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 41.735 sec; CPU: 383.532 sec
[bwa_aln_core] convert to sequence coordinate... 5.36 sec
[bwa_aln_core] refine gapped alignments... 0.93 sec
[bwa_aln_core] print alignments... 0.46 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.28 sec
[bwa_aln_core] refine gapped alignments... 0.62 sec
[bwa_aln_core] print alignments... 0.33 sec
[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_1.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 56.076 sec; CPU: 42.642 sec
Choose not to rescue chimeric reads or the chimeric reads have been trimmed before alignment!
Start alignment of read end 2
Step1.1 - BWA alignment
[bwa_aln_core] calculate SA coordinate... 53.93 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] calculate SA coordinate... 48.07 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 11505705 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq
[main] Real time: 161.632 sec; CPU: 2509.953 sec
[bwa_aln_core] convert to sequence coordinate... 8.88 sec
[bwa_aln_core] refine gapped alignments... 0.69 sec
[bwa_aln_core] print alignments... 0.45 sec
[bwa_aln_core] 262144 sequences have been processed.
.
.
.
[bwa_aln_core] convert to sequence coordinate... 4.38 sec
[bwa_aln_core] refine gapped alignments... 0.59 sec
[bwa_aln_core] print alignments... 0.40 sec
[bwa_aln_core] 11505705 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa samse /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_2.sai /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq
[main] Real time: 263.851 sec; CPU: 262.748 sec
Choose not to rescue chimeric reads or the chimeric reads have been trimmed before alignment!
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
II. Simulation Post-processing: 2 - Joining read-ends!
Read 1 file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_1.sam
Read 2 file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720_2.sam
Output file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720.sam
Summary of alignment results =  True
Summary file path =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Training =  0
Verbose =  True
------------Begin joining ends from read 1 file and read 2 file------------
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
II. Simulation Post-processing: 3 - Categorize read-pairs!
RE fragment =  /mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed
Paired-ended read alignment file =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s1_alignment/SRR6691720.sam
Output directory =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuProcess/s2_validPairs
RE cutting site distance lower bound =  50
RE cutting site distance upper bound =  800
Minimum intrachromosomal contact distance =  80000
Binning method =  window
Binning resolution =  40000
Summary =  True
SummaryFile =  /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/summary/bosTau9_FreeHiC.summary
Verbose =  True
## Loading Restriction File Intervals '/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed'...
dict_keys(['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr23', 'chr24', 'chr25', 'chr26', 'chr27', 'chr28', 'chr29', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr999'])
------------Begin reading paired-ended file------------
------------Categorize aligned read pairs------------
II. Simulation Post-processing: 4 - Binning!

If you need any other information, please ask me! Thanks, Sue

yezhengSTAT commented 4 years ago

Can you check the number of reads in /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq and /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq ? It seems that more than 157500000 reads are simulated for each end by the following log.

Simulate end 1!
0
.
.
.
157500000

However, only 1763658 reads of end 1 are aligned by bwa aln. Similarly for end 2. Much fewer reads are processed by bwa.

[bwa_aln_core] 1763658 sequences have been processed.
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa aln -n 3 -o 1 -t 20 /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9.soft-masked.rename.fa /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq
[main] Real time: 41.735 sec; CPU: 383.532 sec

If the fastq files do not contain 157702285 reads, then there should be something wrong with the simulation section. BTW, have you tried simulating using the demo data and demo runs?

Ye

suyeonwy commented 4 years ago

There are 157,702,265 reads in simuSequence/SRR6691720_1.fastq file and 157,702,277 reads in simuSequence/SRR6691720_2.fastq file.

Yes, I have already run the demo data and got simulation data. It seems that the demo run did not have any problems in outputs.

yezhengSTAT commented 4 years ago

Emm......that is indeed very weird....looks like the simulation is almost complete. Do you mind running the simulation section again with 10 times fewer number of simuN, say, 15,770,228, or even smaller number to see if it works successfully? One of the concerns is that the memory on your server may not be enough to convert this many reads into sequences at the same time. Sometimes, to accelerate the simulation speed, I may also split the simulation into 10 runs and merge the final bin-pairs instead of one big simulation.

Let me know if this helps.

Ye

suyeonwy commented 4 years ago

Hi,

I ran the simulation section with 10 times fewer numbers (15,000,000) as you recommended There are 14,999,999 reads in both _1.fastq and _2.fastq files, so the numbers of simulated reads in both files are equal. However, the read ids were still not matched, and also the same error messages emerged. Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.

These are the examples of final simulation data [SRR6691720_1.fastq]

[sy0819@europa00 simuSequence]$ head -n 12 SRR6691720_1.fastq 
@0(-)
ggggtagaggggtgggatgaagagattgggattgacatatatatacatgattgatactattcataaaatagataagtaat
+
EAEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE/AEEEEEEEE//AEEEEAEEEEEEEE/EEEEEE/EE/
@2(+)
CTGTTGAACCATATGATGATTGTTGGCCTCCTAACCttcccccttaacagttcagcacttgtctccttagaacaagggct
+
AAAEEEEEE/EE6EEE/EEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEE/EEEE/EEEEEEEEEEEEEEEAAE/E
@3(-)
acatattaatcaaattaacaaagatcaaacacaaagaacaaatattaaaagcagcaagggaaaaacaacaaataacacac
+
AAEEAEEEEEE/EEEEEEEAEEEE//AEAEEEEEEAEEEEEEEAE6/EEEEEEEEE6EEEE/EEEEEEEEEEE<EAAAAE

[SRR6691720_2.fastq]

[sy0819@europa00 simuSequence]$ head -n 12 SRR6691720_2.fastq 
@0(+)
agaaaatcaatgaaactaaaagttgcttctttgaaaagattaataaaattgataaatgtctagccactactaaaaaaaat
+
EA/AAAEEE6EEEEAEEAEEAEE6EEEEEEEEEE<EE/EEEEEEEEEEEE<AEEEEEEEEEEAEEEEEEAEEE<EAAAEA
@1(-)
ttgctggcatattgagtgcagcactttcacagcatcatctttcaggatttggaatagctcaactggaattccatcacctc
+
AEEEAEAEEEEEEEEEEEEEEAEEEEEEEEE/E/EEEEE/EEEEEEEEAEEEE//EEEEEE/EE/EE6EEEEE/EAEAA<
@3(+)
gcttccctcctggccctctcccctcactcctccacctgctcccattcctcaaacacaccatgctcccttccgcttcgaga
+
AE/EAEEEEEEEEAAEEEE6EEEEAAEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEE/AAEEEEE/AAEA

Do you think that I should run the simulation with much smaller numbers? Please let me know if you need other information

Thanks, Sue

yezhengSTAT commented 4 years ago

I see!! I just realized that bedtools has updated its getfasta function in recent released versions!! A quick solution is that if you can download and install 2.25.0 version of bedtools: https://github.com/arq5x/bedtools2/releases/tag/v2.25.0 and still use current freehic software, it should solve the problem. Otherwise, I will update the freehic repository later this week to accommodate the new output from the latest version of bedtools.

Let me know how it goes! Thanks for bringing it up! Appreciate your time and tries!

Ye

suyeonwy commented 4 years ago

I was using bedtools 2.28.0 version before and I downloaded bedtools 2.25.0 version as you advised I corrected the bedtools path in the parameter files and re-ran the simulation bedtools="/mss5/RACA2/programs/bedtools2/bin/bedtools"

[sy0819@europa00 bosTau9]$ /mss5/RACA2/programs/bedtools2/bin/bedtools -version
bedtools v2.25.0

But I still get the same error messages before and the numbers of reads in both files are still different Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly. Total line counts in both reads and examples of FASTQ files are below

[sy0819@europa00 simuSequence]$ wc -l *
  59999996 SRR6691720_1.fastq
  60000000 SRR6691720_2.fastq
[sy0819@europa00 simuSequence]$ head SRR6691720_1.fastq 
@0
TGAGCCTGTGTTATCCCCCTCTTCTCATGTGTCCCCCTGCTAGAGGTGCAGCTCTAACTAGATCACTTCTCCTTTACTGG
+
EAAEAA/EEE6EAA/EEEEEEEA6EEEEEEEAEAAEEEE//EE/EEEEEEEEEEEEEEEEEEEEEEAEE<E/E/EAA6AA
@2
acttctgcaaagaagacatatagatggccaataaacacatgaaaagattctcgacatcactaactattCAGTTCAGTTCa
+
AE6EAAEEEEEEE/EEEEEEEEEEEEEEEEEEE//EE/EEAEEEEE/EEEEEEEEEEEE/EEEEEAE/EEEE<EEAEAAA
@4
AATAATAATAACAGTGAAAGAATGATTCTTCACCATTGGTGTGGTGTTTTTTCCCTTTCATGccccaaacacatgtgtat
[sy0819@europa00 simuSequence]$ head SRR6691720_2.fastq 
@1
gagaatcccatggaccgaggaacctggtaggctgcagtccatgaggtcgctaagagtcggacacgactgagagacttcac
+
AEA/AEEEEAEEAAEEEEAEEEEEEEEEEEEE/E<EEEEAEE/EEEEEAEE/EEE/E<EE/EEEEEAEEEEEEAEA/AAA
@2
AATAGAAAATGGAAAGTAATTGAGGAGACATGTCCATATATTGTCAAACTAAAAATAACTAACCCATGGATGCTTTGAAA
+
AA//AEEEEAEEEEEEEEEEEAE/E<EEEEEEE/E/E/E/EEEEEEEEAE6EEE/EAEEEEE/EEEE<EEEAEEEEEEAE
@3
AAAAATATACCTTTTAAATAAACATCTATAAAAGTCTAAAATAATTATAAGTATTGAAAAAAATCAGGAATTTTGCTTTC

I think bedtools could affect the final results but read unmatching problem is not that case. Could you check it again please?

Sue

yezhengSTAT commented 4 years ago

Hello Sue, I tried many ways but I still cannot reproduce the problem you have using demo data and other data on my side. Do you mind sharing the following two datasets or tell me where you downloaded online so that I can re-run your data and see what is happening?

ref=/mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa
refrag=/mss5/RACA2/read_simulation/hic/FreeHiC/cow_hindiii.bed

BTW, I noticed that you claim to use 20 cores for parallel alignment. Can you confirm that the computing resources are available on your side?

Apologize for the trouble that delays your studies!

Ye

suyeonwy commented 4 years ago

Hello Ye,

I used the bosTau9 chromosome level assembly which is soft-masked and changed the name of chrX to chr999.

[sy0819@europa00 FreeHiC]$ grep ">" bosTau9.soft-masked.rename.fa
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr20
>chr21
>chr22
>chr23
>chr24
>chr25
>chr26
>chr27
>chr28
>chr29
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr999

You could just download the data from the link below and filtering regular chromosome sequences only (chr1~29 + chrX). After that, just rename chrX into chr999. http://hgdownload.soe.ucsc.edu/goldenPath/bosTau9/bigZips/bosTau9.fa.gz

BTW, I renamed the chromosome X as I got some error messages while using FreeHiC because of chrX. It would be great if FreeHiC supports non-integer chromosome names including chromosome X or Y also :)

cow_hindiii.bed was made by HiC-Pro's digest_genome.py script.

$PATH/HiC-Pro_2.11.1/bin/utils/digest_genome.py -r hindiii -o cow_hindiii.bed /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9.soft-masked.rename.fa

Or I could just share these data by email. So if you want me to send these data directly to you, please ask me!

Thanks, Sue

yezhengSTAT commented 4 years ago

While I am running, if you are interested, you can also check out the lite version of FreeHiC that we recently developed. It is an approximation of the original FreeHi-C simulation pipeline but much faster. ;-)

suyeonwy commented 4 years ago

Sounds great! I'll try the lite version also!

If you have any progress, please let me know! :)

Thanks, Sue

suyeonwy commented 4 years ago

Hello, ye!

Did you reproduce the same results from mine with the data I told you before? It'll be great if I could know any progress about it!

Thanks, Sue

yezhengSTAT commented 4 years ago

Apologize for the delay! I should be able to finish the runs later this week.

Ye

yezhengSTAT commented 4 years ago

Hello Sue, I have tried on your data and simulated at 15000 and 15000000. Both works successfully on my side: I got the same number of reads in fastq files and matching read query name after alignment and sorting (instead of checking the simulated fastq file, it is more reliable to check the read id in sam file as reads in fastq file are not sorted by read id yet.)

>wc -l *fastq
  60000000 cowData_demo_1.fastq
  60000000 cowData_demo_2.fastq
 120000000 total
>less cowData_demo_1.sam
@HD     VN:1.6  SO:queryname
@SQ     SN:chr1 LN:158534110
@SQ     SN:chr2 LN:136231102
@SQ     SN:chr3 LN:121005158
@SQ     SN:chr4 LN:120000601
@SQ     SN:chr5 LN:120089316
@SQ     SN:chr6 LN:117806340
@SQ     SN:chr7 LN:110682743
@SQ     SN:chr8 LN:113319770
@SQ     SN:chr9 LN:105454467
@SQ     SN:chr10        LN:103308737
@SQ     SN:chr11        LN:106982474
@SQ     SN:chr12        LN:87216183
@SQ     SN:chr13        LN:83472345
@SQ     SN:chr14        LN:82403003
@SQ     SN:chr15        LN:85007780
@SQ     SN:chr16        LN:81013979
@SQ     SN:chr17        LN:73167244
@SQ     SN:chr18        LN:65820629
@SQ     SN:chr19        LN:63449741
@SQ     SN:chr20        LN:71974595
@SQ     SN:chr21        LN:69862954
@SQ     SN:chr22        LN:60773035
@SQ     SN:chr23        LN:52498615
@SQ     SN:chr24        LN:62317253
@SQ     SN:chr25        LN:42350435
@SQ     SN:chr26        LN:51992305
@SQ     SN:chr27        LN:45612108
@SQ     SN:chr28        LN:45940150
@SQ     SN:chr29        LN:51098607
@SQ     SN:chr999       LN:139009144
@PG     ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa-0.7.15/bwa samse cowData/data/bosTau9.soft-masked.rename.fa cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sai cowData/results/bosTau9_test/simuSequence/cowData_demo_1.fastq
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_1.sam
0       0       chr2    101240873       37      80M     *       0       0       GGAAACCCAAAGAGGTGAAGAAGGAAATAGCAACCATAAGGCAACAAATTAAAAAAAGAGGAGTCAGCATAAAGCCTGGA        EAA<EEE/E/EEEEE/EEE/EAEEEEEEEEEEE/EEE6EEEEEEEEEEE6EEEEEEAEEEEEEAEEEEEEEEEEEAA/AA        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:55T24
1       16      chr9    31279648        37      81M     *       0       0       GCTCTTCCCAGAATCAGCTTTAGCCTTGGCTACATCTTCCTGAGCGTTTCTCTTCTGGCTTTGGACCGTCTGTCAAATGAA       AEAEEA/EEEE6EEEAEEE6EEEA/EEEEEE/EEEEEEEEEEE6/EEEEEEEEEEEEAE<EEEEAAEAEE/EEEEEEEAAA       XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:28A51C0
2       0       chr3    29509100        37      80M     *       0       0       TCTACATCTGTTTCACTTTGGTGGTACATTTCACAGGGTCTTTAACACTAGGCAGATCTCCCCATTCTTACCACTTGTTA        AAEA/EEEEEEEEE<E6EEEEEE/EEEEE/EAEEEEEEEE/EEEEAEAEEEEEEEEEEEEEEEEEE/EEE/EE/EEAEA/        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
3       16      chr21   25837445        37      80M     *       0       0       TTCAACCATCTCATCCTCTGTCACTGTCTTCTGCCCTCAATCTCTCACAGCATCAGGGTCTTTTCCAATGAGTTGGCTCT        AEAAAEEA<6/EEEEEEEEEEE/EEAEEEEEEAE/EEEEAEEEEEE/EEEEEEAEEEEEEEEEEEE<EEEEEEEEAAAEA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
4       16      chr9    35227839        37      80M     *       0       0       TTGTAAATCCCTTGACCTGGTCAAGCTTATTTCCTACATTCTAAATTTTATGCTCAAGAATTACTAACTTCCACAAATCT        AAAAAE/EEEEEEEEE/E/EEEEEEEEAEAE/EAEEEEEE/EEEEEEEE/EAEAAEEEEEEEEEE/AE/E<EE/EA/AEA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
5       0       chr11   21856404        37      80M     *       0       0       TAGGAGCTCAAAGCATGTATCCCATTCCCCTTTCCCTGGGTCTGTTACACATGATAAGTTCTAAGAACAATCTAAGACTA        AAEAE6E/EEEEAEEAEEEE//EE/EEEEEEEEEEEEEEEEE/AE6A/E/6AEEEEAE/EEEEA6EEEEE/EEEEAEA<A        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:28T51
6       16      chr11   21855034        37      80M     *       0       0       AGCTGTGTGGATCTATTTTACATCATGGATCTCCTTCAGAGAGTAGAGTTCTACCAGAGAATGTTGCAAAGAGTAACCAG        AEAAAEEEAAEE/EEEAEE/E/EEEEEEEE/A<EEEEEEEEEEEEEEEEEAEEEEEEEAEEEEAEAE6/EEEEE/EEEAA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
7       0       chr29   5819250 37      80M     *       0       0       TAAACAGTGCTATGATGAACATTGGGGTACAGATGTCTCTTTCAATTCTGGTTTCTTCGGTGTGTATGCCCAGCAGTGGG        /EEEEEEEEE/EEEEEEEE/EEEEEEEAEEEEEEEEAE/EEEAE<EE/EEEEE/EEEEEEEAEEEEEEEEEEEEEAEAE/        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80

>less cowData_demo_2.sam
@HD     VN:1.6  SO:queryname
@SQ     SN:chr1 LN:158534110
@SQ     SN:chr2 LN:136231102
@SQ     SN:chr3 LN:121005158
@SQ     SN:chr4 LN:120000601
@SQ     SN:chr5 LN:120089316
@SQ     SN:chr6 LN:117806340
@SQ     SN:chr7 LN:110682743
@SQ     SN:chr8 LN:113319770
@SQ     SN:chr9 LN:105454467
@SQ     SN:chr10        LN:103308737
@SQ     SN:chr11        LN:106982474
@SQ     SN:chr12        LN:87216183
@SQ     SN:chr13        LN:83472345
@SQ     SN:chr14        LN:82403003
@SQ     SN:chr15        LN:85007780
@SQ     SN:chr16        LN:81013979
@SQ     SN:chr17        LN:73167244
@SQ     SN:chr18        LN:65820629
@SQ     SN:chr19        LN:63449741
@SQ     SN:chr20        LN:71974595
@SQ     SN:chr21        LN:69862954
@SQ     SN:chr22        LN:60773035
@SQ     SN:chr23        LN:52498615
@SQ     SN:chr24        LN:62317253
@SQ     SN:chr25        LN:42350435
@SQ     SN:chr26        LN:51992305
@SQ     SN:chr27        LN:45612108
@SQ     SN:chr28        LN:45940150
@SQ     SN:chr29        LN:51098607
@SQ     SN:chr999       LN:139009144
@PG     ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa-0.7.15/bwa samse cowData/data/bosTau9.soft-masked.rename.fa cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sai cowData/results/bosTau9_test/simuSequence/cowData_demo_2.fastq
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.10 CL:samtools_v1.10/bin/samtools sort -n -@ 4 -o cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sorted.sam cowData/results/bosTau9_test/simuProcess/s1_alignment/cowData_demo_2.sam
0       16      chr2    103068449       37      80M     *       0       0       ACTTGACATTCTAAAAGCAGGATCTTTACATTTTTTGTTCATTTAGAAAAAACATTTTAGAGAAAGATTCCTACCAAATA        EAAA/EEEAEAE/EEEEEEE/EEEEEEEEEEEEE/E<EAEEEEEEEAEEEEEEEEAE/AEE/EEA/EEE/6E/EEA/EEA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
1       0       chr9    31688430        37      80M     *       0       0       GGAGATGGAAGATCTTTGCTTGGTGAGGCATTCCCTGAGAAAGCAATCCTGGCTAGTTCCTTTGCTCTGCAGAATGCGGA        A6E/<EEEEEAEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAAEEEEAEEEEEEEEEEEEEEAAEA<        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:77A2
2       0       chr3    30227271        37      80M     *       0       0       GAAAAAATATGAGTCTTTTAAAAATTCATTCTTCCATTCTCAGAAGGTACATCTGCACTGCACCTTGTCCTCATGGAAGA        EAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEAEEAE/EEEEEEEEEEEEEEAEAEA        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:51A28
3       16      chr29   22485763        37      80M     *       0       0       CAAAAATATGTTCAACTGCATATATAGCTCACCATTTAGGGTATTCATGAATCCAAAACCCAACCATAGCTTTTAAACCA        AEEAAAAEEEEEEEEEAEE//EEEE/EEEEEAEEEEAE/EEEAAEEEEEAE/EEEEEEEEEEEAEAEEEEEEEE<EA/EA        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:67G12
4       16      chr9    37358606        37      80M     *       0       0       CCTTCACTATCTCCCGGAAGTTGTTCAAATTCACATTCACTGAGTTGGTGATGCCATCCAACTATTTCATCCTCTATTGT        EAE6/EEEEEE/EEEEE/EEEEE/EEEEAAEEEEAEEEE<EEEAEEEEEEAEEEEEAEEEEAEEEEEEE/EAEEEEAAAE        XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:23A56
5       16      chr999  11396651        37      80M     *       0       0       TACAGGGTCAGTTCATTCTATCACTCTGCAGCTTACCAGGTATCTAGGATTTGGGACAGAGTCATAGAAGACAACATGAA        AAAAAEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEAAEE/EEEEEEEEEEEEEEAEEE6EEA<EEE/EE/EEA6A/        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
6       0       chr999  11396545        37      80M     *       0       0       ATCTCTGTTTTACACATGAGGAAACTGAGGCTTATGGAGGTTAATAAGTTGTCCAAAGTCTTTCATCCAGAAAGCAACAG        /AAAE/EEEEEEEEAE/EAEE6EEEEEAAE/EEEEEEEEEE/EAEE/EEEEEEEEEEEEEEEAEE/EE/EEEAE/AEAAA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80
7       0       chr29   7194098 37      80M     *       0       0       ATGCACAGTGCCCTAAGTGAGCTCAGCTGTTACATAACGATTATTTACAAAGTGTGGTGAGAGGGTGGGGTGAGAGAGCC        A66E/EEEAEEEEAEAEEEEAEEAEEE/EEEEEAEEEEA6EE/EEEEEEEEEAEEEEEE/EEEEEEEEEEEEEEEAAAAA        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:80

Therefore, the issue you came across is still not quite clear to me.....may be related to your computing resources? Maybe you can try using 1 core, larger memory resource, clear the /tmp folder where sorting temporary files are stored, and simulate 10000 reads or smaller to start.

Otherwise, if this read id mismatches issue continues, maybe the following filtering may help. The rationale is to match read query name from both ends in the sam file. I added right after the alignment step in the simulation session. You can replace the original freeHiC.sh by the newly added freeHiC_v2.sh and see if it will help. Please let me know how it goes and I can merge it with the original pipeline! Thanks!

## match sam queryname from both ends %%%%                                                                                                                                                                                     
        $samtools view ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam | awk '{print $1}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1                                                                     
        $samtools view ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam | awk '{print $1}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2                                                                     
        cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1 ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2 | sort | uniq -c | awk '$1 == 2 {print $2}' >${outDir}/${simuName}/simuProcess/s1_alignment/queryNam\
e.common                                                                                                                                                                                                                               
        queryLen1=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_1 | wc -l)                                                                                                                                            
        queryLen2=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName_2 | wc -l)                                                                                                                                            
        queryLenC=$(cat ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common | wc -l)                                                                                                                                       

        if [[ "$queryLen1" != "$queryLenC" ]] || [[ "$queryLen2" != "$queryLenC" ]] || [[ "$queryLen1" != "$queryLen2" ]]; then                                                                                                        
            echo "Matching simulation query names between end 1 and end2......"                                                                                                                                                        
            ${samtools} view -H ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam >${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam.header                                                                      
            ${samtools} view -H ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam >${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam.header                                                                      

            awk 'NR==FNR{a[$0]=1;next}a[$1]' ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam  >>${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.\
sam.header                                                                                                                                                                                                                             
            awk 'NR==FNR{a[$0]=1;next}a[$1]' ${outDir}/${simuName}/simuProcess/s1_alignment/queryName.common ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam  >>${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.\
sam.header                                                                                                                                                                                                                             
            mv  ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam.header ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_1.sam                                                                                       
            mv  ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam.header ${outDir}/${simuName}/simuProcess/s1_alignment/${name}_2.sam                                                                                       

        fi  
       rm -rf ${outDir}/${simuName}/simuProcess/s1_alignment/queryName*  

Best, Ye