yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

Improvement for HiFi: relate number of passes to DNA fragment size #25

Open Sebastien-Raguideau opened 2 months ago

Sebastien-Raguideau commented 2 months ago

Hello,

Thanks for your software, it is extremely useful.

I suppose this is just a shameless feature request.

I have a slight misgiving about the fact that the number of passes and the fragment length are not related to each other in data generation.

It is my understanding that polymerase read length follows a distribution that is unrelated to DNA fragment length. In effect, this means that longer DNA fragments can't go through as many passes as shorter ones, leading to quite different quality as a function of read length (high quality for short reads, low quality for long ones). That is why HiFi library preparation tends to limit the maximum DNA fragment size, so that a minimum number of passes can be ensured.

I am unsure how to use your tool to reproduce a similar pattern. I suppose I could do something silly: when I want a coverage of 30, run pbsim 30 times, each time asking for a coverage of 1 with different parameters for number of passes and fragment length. That seems hacky and not quite correct.

Would you be able to add this feature?

Best

yukiteruono commented 2 months ago

Thank you for your very interesting suggestion.

As you say, the longer the DNA fragment, the fewer the number of passes, so the longer the read, the lower the quality. Even in PBSIM3 simulations, changing the number of passes changes the quality of HiFi reads, as shown in Table S8 of the PBSIM3 paper. However, we do not have an accurate understanding of the relationship between read length and number of passes, and it is currently not possible to implement this relationship in PBSIM3.

If you understand the relationship between read length and number of passes, your method (repeat the simulation 30 times with different parameters) is a simple and good method.

Sebastien-Raguideau commented 2 months ago

Thanks for your quick answer!

I would just give a distribution for the polymerase read length, let's say centered around 200k with some standard deviation (this can be learned or left as a user parameter). Then, for each read, sample a DNA fragment length, sample a polymerase read length, and deduce the expected number of passes by taking the ratio.
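A minimal sketch of that sampling step, assuming a normal distribution for the polymerase read length. The 200k mean comes from the comment above; the 50k standard deviation, the function name `sample_passes`, and the choice to round the ratio down to whole passes are illustrative assumptions, not anything PBSIM3 provides.

```python
import random


def sample_passes(fragment_len, polymerase_mean=200_000, polymerase_sd=50_000,
                  rng=random):
    """Sample a polymerase read length and derive the pass count for one fragment.

    All defaults except the 200k mean are assumed values for illustration.
    Returns (polymerase_len, passes).
    """
    # Redraw until the sampled polymerase read length is positive,
    # since a normal distribution can produce negative values.
    polymerase_len = 0.0
    while polymerase_len <= 0:
        polymerase_len = rng.gauss(polymerase_mean, polymerase_sd)
    # Number of complete passes = ratio of polymerase read length
    # to DNA fragment length, rounded down (an assumed convention).
    passes = int(polymerase_len // fragment_len)
    return polymerase_len, passes


if __name__ == "__main__":
    rng = random.Random(42)
    for fragment_len in (5_000, 15_000, 30_000):
        poly, passes = sample_passes(fragment_len, rng=rng)
        print(fragment_len, round(poly), passes)
```

With this scheme, longer fragments get fewer passes on average, which is the length-dependent quality pattern described earlier in the thread.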

I can easily generate a file that lists all pairs (DNA fragment length, number of passes) for all reads, so as to obtain a set coverage. However, pbsim3 would not be able to take that as input at the moment.
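Such a file could be produced roughly as follows. This is only a sketch: the normal distributions, their default parameters (apart from the 200k polymerase mean), the `min_passes` filter, the tab-separated layout, and the names `plan_reads` and `write_pairs` are all assumptions made for illustration, and the output is not a format pbsim3 reads.

```python
import csv
import random


def plan_reads(genome_size, coverage, frag_mean=15_000, frag_sd=3_000,
               poly_mean=200_000, poly_sd=50_000, min_passes=1, seed=None):
    """Sample (fragment_length, passes) pairs until the target coverage is met.

    Coverage may be fractional (e.g. 0.5), since reads are planned one by one.
    """
    rng = random.Random(seed)
    target_bases = genome_size * coverage
    pairs, total_bases = [], 0
    while total_bases < target_bases:
        frag = int(rng.gauss(frag_mean, frag_sd))
        poly = rng.gauss(poly_mean, poly_sd)
        if frag <= 0 or poly <= 0:
            continue  # discard non-physical draws from the normal tails
        passes = int(poly // frag)
        if passes < min_passes:
            continue  # fragment too long to be read through often enough
        pairs.append((frag, passes))
        total_bases += frag
    return pairs


def write_pairs(pairs, handle):
    """Write the pairs as a tab-separated table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["fragment_length", "passes"])
    writer.writerows(pairs)


if __name__ == "__main__":
    # Hypothetical 1 Mb genome at 0.5x coverage.
    pairs = plan_reads(genome_size=1_000_000, coverage=0.5, seed=1)
    with open("read_plan.tsv", "w", newline="") as out:
        write_pairs(pairs, out)
```

Because each read is drawn individually, the total yield lands just above the requested coverage without any per-run discretization.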

I am not too keen on the 30-runs method: it implies a weird discretization, and I do intend to simulate coverage as low as 0.5 (metagenomic mix).