We have developed a long reads simulater (called PBSIM) in which sampling-based and model-based simulations are implemented.
PBSIM simulates Continuous Long Reads (CLRs) of PacBio, and Nanopore reads.
To run model-based simulation:
pbsim --depth 20
--hmm_model data/P6C4.model
sample/sample.fasta
In the example above, simulated reads are randomly sampled from a reference sequence ("sample/sample.fasta") and differences (errors) are introduced into simulated reads. Coverage depth is 20, and quality scores are generated by FIC-HMM of P6C4 chemistry. The read length and accuracy reads are determined based on our model.
If the reference sequence is a multi-FASTA file, Output files for simulated reads are created for each FASTA. three output files are created for each FASTA. "sd_0001.ref": a single-FASTA file which is copied from the reference sequence. "sd_0001.fastq": a simulated read dataset in the FASTQ format. "sd_0001.maf": a list of alignments between the reference sequence and simulated reads in the MAF format.
To run sampling-based simulation:
pbsim --depth 20
--sample-fastq sample/sample.fastq
sample/sample.fasta
In the sampling-based simulation, read length and quality score are the same as those of a read sampled randomly from the real read dataset ("sample/sample.fastq").
If you want to create several simulated data with different coverage depths using the same real read dataset, you would be better to use --sample-profile-id option as below. You can skip parsing the same dataset.
(1) storing profile
pbsim --depth 20
--prefix depth20
--sample-fastq sample/sample.fastq
--sample-profile-id pf1
sample/sample.fasta
(2) reusing profile
pbsim --depth 30
--prefix depth30
--sample-profile-id pf1
sample/sample.fasta
pbsim --depth 40
--prefix depth40
--sample-profile-id pf1
sample/sample.fasta
For each read, the length is randomly drawn from the gamma distribution with given mean and standard deviation.
The exponential function which is fit to read accuracy distribution is utilized to simulate with given mean. For each read, the accuracy is randomly drawn from the simulated distribution.
Errors from single molecule sequencing which generates long reads are considered to be stochastical. However, it has been reported that basecalling qualities of PacBio sequencers are not uniform, and low quality regions are sometimes observed in reads. To simulate these low quality regions, the non-uniformity is generated by FIC-HMM trained using real long reads. "data/*.model" are FIC-HMM for each chemistry of PacBio and Nanopore.
Simulated reads are randomly sampled from a reference sequence. The percentage of both directions of reads is same. Differences (errors) are introduced into the simulated reads as follows: For each position of the read, all error types (substitution, insertion and deletion) are introduced according to quality score at that position. All error rates are calculated from quality score and the ratio of error types given by the user. With regards to a deletion, there is no quality score for the deletion itself; thus the quality score of the 5’neighbor is used. We observed that inserted nucleotides are often the same as their following nucleotides. According to this bias, half of the inserted nucleotides are chosen to be the same as their following nucleotides, and the other half are randomly chosen.
By setting minimum and maximum of read length, and range of that chosen from the distribution model can be restricted. Note that mean and standard deviation of the chosen read length are influenced by this restriction. Minimum and maximum of read accuracy are determined by mean of accuracy.
The lengths and quality scores of reads are simulated by randomly sampling them from real reads provided by the user. Subsequently, their nucleotide sequences are simulated by the same way as the model-based simulation. The restriction of read length and accuracy can be set by the options.
PBSIM requires reference sequences in the single- or multi-FASTA Format.
A real reads in FASTQ format is required for sampling-based simulation, specified with the --sample-fastq option. FASTQ format must be Sanger standard (fastq-sanger).
Input file must be a text file.
If a reference sequence file is multi-FASTA format, simulated datasets are generated for each reference sequence numbered sequentially. Three output files are created for each reference sequence.
"sd_
"sd" is prefix which can be specified with the --prefix option.
"data/*.model" are FIC-HMM for each chemistry of PacBio and Nanopore. We utilized HMM with the latest model selection criteria called factorised information criteria (FIC-HMM)(Hamada et al., 2015). The model were trained for each read accuracy of each chemistry. For read accuracy with insufficient training data, constant quality scores that match the accuracy were used.
When a coverage depth is 100x and a length of reference sequence is about 10M, PBSIM generates simulated dataset in several minutes. The runtime is roughly proportional to the coverage depth and the length of reference sequence. PBSIM requires memory of the length of reference sequence plus several mega bytes.
Simulation of sequencing error is performed using nucleotide sequences input by the user as templates.
pbsim --template-fasta templates(FASTA format)
--hmm_model data/P6C4.model
PBSIM2 uses model-based simulation to introduce errors into the templates. You can use the error rate options, but not the length options. Also, the --depth option is disabled and the template is sequenced once.