yukiteruono / pbsim2

PBSIM2: a simulator for long read sequencers with a novel generative model of quality scores
GNU General Public License v2.0
69 stars 15 forks source link

Feature Request #12

Open bioinfoMMS opened 2 years ago

bioinfoMMS commented 2 years ago

Hello pbsim2 developer(s),

Thank you for this tool, it works well!

I do have one request and that would be to add an option for the simulated read names to be based off the reference FASTA headers. I want to simulate reads from a FASTA file that has many different species in it and I would love to be able to retain header info in the read names. For instance, if the FASTA header is >NC_002200.1, the read name could be NC_002200.1_1, NC_002200.1_2, NC_002200.1_3, etc. Would this be possible to add?

Thank you!

yukiteruono commented 2 years ago

Thank you for your using PBSIM2 and valuable request. We would like to improve the usability of PBSIM2, including your request, but we cannot respond to your request immediately. We recommend the following steps.

  1. If your reference FASTA is a multi-FASTA file, separate it into single-FASTA files (NC_00200.1.fasta, ...).

  2. Using each single-FASTA file, run pbsim as follows:

    pbsim --depth 20 --hmm_model data/P6C4.model --prefix NC_002200.1. --id-prefix NC_002200.1. NC_00200.1.fasta

bioinfoMMS commented 2 years ago

That might work but would be tricky to implement as my FASTA file has over 9000 species. Also, I would like the error model to be applied to the entire FASTA file rather than on a per sequence basis. It looks like PBSIM2 splits the FASTA file apart and gives the files number names: sd_0001.fastq, sd_0002.fastq, sd_0003.fastq... are these files the reads generated per FASTA record? If so, is the mapping between the header and newly assigned number save in the code (as a table or printed to a file)? Is there a way I could access this mapping information?

yukiteruono commented 2 years ago

As you presumed, PBSIM2 splits the FASTA file apart and gives the files number names, and these files are the reads generated per FASTA record. The header of sd_ *. ref and reference FASTA are the same. So you can get the number assigned to the header.

bioinfoMMS commented 2 years ago

Great, thank you! That should be easy enough to do