yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

PacBio and ONT new models #12

Closed farchaab closed 9 months ago

farchaab commented 10 months ago

Hello PBSIM3 team, I love your tool and frequently use it for Nanopore and PacBio read simulations!

I have some questions regarding the models that you provide:

Thank you for your help!

yukiteruono commented 10 months ago

Thank you for using PBSIM3. QSHMM-ONT-HQ was created by training on ONT R10.4 data (https://www.ncbi.nlm.nih.gov/sra/SRX17402533). In PacBio Sequel data, the quality score is a fixed value and cannot be learned, so there are no plans to create a Sequel quality score model.

farchaab commented 10 months ago

Thanks for your quick response!

A couple of additional questions:

Thank you in advance for your response.

yukiteruono commented 10 months ago

We plan to try training on R10.4.1 soon. Please let me know if you have any recommended training data. Building an error model or quality score model is not yet fully automated, so it is not yet possible to provide model-building tools to users.

farchaab commented 10 months ago

Excellent news!

For training data, I would suggest using Genome in a Bottle (GIAB) reference data. ONT sequenced GIAB samples with the latest R10.4.1 (described here) and made the data publicly available on their AWS S3 bucket.

To download the data you need aws-cli.

You can list the analysis directory:

aws s3 ls --no-sign-request s3://ont-open-data/giab_2023.05/analysis/
                           PRE benchmarking/
                           PRE hg001/
                           PRE hg002/
                           PRE hg003/
                           PRE hg004/
                           PRE small_variants_happy/
                           PRE stats/
                           PRE variant_calling/
2023-05-25 23:56:56       6148 .DS_Store

It contains reads in CRAM format for each sample (hac and sup basecalls):

aws s3 ls --no-sign-request s3://ont-open-data/giab_2023.05/analysis/hg002/hac/
2023-05-26 00:36:40 9432359388 PAO83395.fail.cram
2023-05-26 00:44:47      58913 PAO83395.fail.cram.crai
2023-05-26 00:44:47 110409153743 PAO83395.pass.cram
2023-05-26 00:45:18     714776 PAO83395.pass.cram.crai
2023-05-26 00:45:19 21557845303 PAO89685.fail.cram
2023-05-26 00:55:23     131347 PAO89685.fail.cram.crai
2023-05-26 00:55:24 81761558471 PAO89685.pass.cram
2023-05-26 01:03:22     545627 PAO89685.pass.cram.crai

I guess they used GRCh38 as the reference genome:

aws s3 ls --no-sign-request s3://ont-open-data/giab_2023.05/analysis/benchmarking/
                           PRE GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.sdf/
                           PRE truthset/
2023-05-25 23:56:56 3144230986 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
2023-05-25 23:56:56       7804 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
2023-05-25 23:57:39 3153506519 human_g1k_v37.fasta
2023-05-25 23:57:41       2746 human_g1k_v37.fasta.fai

Command to download the dataset:

aws s3 sync --no-sign-request s3://ont-open-data/giab_2023.05 giab_2023.05
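The full sync is large: in the listing above, the two hg002 hac pass CRAMs alone are about 110 GB and 82 GB. If only those reads are needed, a narrower download could be sketched as below. This is a sketch, not a tested recipe: the choice of files (hg002 hac pass CRAMs plus the GRCh38 FASTA and its index, which I assume is the reference needed to decode the CRAMs) is my assumption, and `RUN` is a hypothetical switch so that the script only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run sketch: build a filtered "aws s3 sync" command and print it.
# Set RUN=1 to actually execute it (requires aws-cli and network access).
set -eu

BUCKET="s3://ont-open-data/giab_2023.05"
DEST="giab_2023.05"
RUN="${RUN:-0}"   # hypothetical switch, not an aws-cli option

# Filters are evaluated in order and later ones take precedence:
# exclude everything, then re-include only the files of interest.
# The selection below is an assumption about what training would need.
set -- aws s3 sync --no-sign-request "$BUCKET" "$DEST" \
    --exclude "*" \
    --include "analysis/hg002/hac/*.pass.cram" \
    --include "analysis/hg002/hac/*.pass.cram.crai" \
    --include "analysis/benchmarking/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" \
    --include "analysis/benchmarking/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai"

echo "Command: $*"

if [ "$RUN" = "1" ]; then
    "$@"
fi
```

The include patterns are relative to the source prefix, which is why they start with `analysis/` rather than with the bucket name.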

Let me know if this contains all the files you need to train the model.

Looking forward to the new R10.4.1 model!

yukiteruono commented 9 months ago

Thanks for suggesting the training data.

We created new R10.4.1 models using the suggested data (giab_2023.05/analysis/hg002/hac/) and evaluated them in a read simulation at 98% accuracy. The new R10.4.1 models showed almost the same performance as the QSHMM-ONT-HQ and ERRHMM-ONT-HQ models in terms of error ratio (substitution:insertion:deletion) and non-uniformity of quality scores (or errors). Therefore, we currently recommend using the QSHMM-ONT-HQ and ERRHMM-ONT-HQ models for R10.4.1 read simulations.

farchaab commented 9 months ago

Awesome!

Thank you for your response, I will use the ONT-HQ models to simulate R10.4.1 reads.
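For anyone landing here later, such a simulation could be sketched roughly as follows. Everything specific is an assumption on my side: the genome path, depth, and output prefix are placeholders, the model path assumes the `data/` directory of a PBSIM3 checkout, and setting `--accuracy-mean` to 0.98 is my reading of the 98% figure mentioned above. As a precaution, the script only prints the command unless the hypothetical `RUN=1` switch is set.

```shell
#!/bin/sh
# Dry-run sketch: build a PBSIM3 whole-genome simulation command that
# uses the QSHMM-ONT-HQ model, and print it. Set RUN=1 to execute.
set -eu

GENOME="${GENOME:-reference.fasta}"          # placeholder path
MODEL="${MODEL:-data/QSHMM-ONT-HQ.model}"    # assumed location in a PBSIM3 checkout
DEPTH="${DEPTH:-20}"                         # placeholder depth
ACCURACY="${ACCURACY:-0.98}"                 # assumption: mean accuracy of 98%
PREFIX="${PREFIX:-r1041_sim}"                # placeholder output prefix
RUN="${RUN:-0}"                              # hypothetical switch, not a pbsim option

set -- pbsim \
    --strategy wgs \
    --method qshmm \
    --qshmm "$MODEL" \
    --depth "$DEPTH" \
    --genome "$GENOME" \
    --accuracy-mean "$ACCURACY" \
    --prefix "$PREFIX"

echo "Command: $*"

if [ "$RUN" = "1" ]; then
    "$@"
fi
```

To use the error model instead, swapping in `--method errhmm --errhmm` with the ERRHMM-ONT-HQ model file should be the equivalent change.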

Many thanks for taking the time to train and evaluate the models!