downstream hifi read stimulation

yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads

GNU General Public License v2.0

46 stars 5 forks source link

downstream hifi read stimulation #10

Closed zhuyixin43 closed 8 months ago

zhuyixin43 commented 8 months ago

Hi there, I am also trying to stimulate hifi reads from a genome. and I have a followup question to this problem. After I run pbsim3, I get multiple sam files. Do I merge them into one big sam file and then feed it as input to ccs software?

I'm a little confused about how the next steps works. It will also be great if you could provide an example downstream ccs command. Thank you so much

Originally posted by @zhuyixin43 in https://github.com/yukiteruono/pbsim3/issues/8#issuecomment-1767009067

yukiteruono commented 8 months ago

Thank you for your using PBSIM. Sam files are handled separately without merging. First, convert the sam files to bam files using something like samtools (https://www.htslib.org/) Next, generate consensus sequences from the sam files using ccs (https://github.com/PacificBiosciences/ccs).

samtools view -bS sd_0001.sam > sd_0001.bam
ccs sd_0001.bam sd_0001.fastq.gz

samtools view -bS sd_0002.sam > sd_0002.bam
ccs sd_0002.bam sd_0002.fastq.gz

  :
  :

zhuyixin43 commented 8 months ago

Hi,

Thank you so much for the clear instructions!

I came across another question while using pbsim3. I am trying to stimulate hifi reads from a genome that is about 5GB with 500 sequences in the fasta file. The parameters I put in is 30x coverage and multipass = 10. As I understood, a sam file is generated for each sequence and I then manually convert it to bam file. Can I think of these bam files as .subreads.bam? Also, my first sequence in the genome has length 248956422. Is it normal for me to get a sam file that is >150G? This seems incredibly big to me since there will be 500 of them. Am I missing something here? what is making the file so big?

Thank you so much for your help and developing this awesome tool!

yukiteruono commented 8 months ago

Can I think of these bam files as .subreads.bam?

Yes. The amount of data in SAM format for subreads is about three times larger than that in FASTQ format. Since the number of passes is 10, the amount of subreads is about 30 times that of the consensus sequence (FASTQ) generated by ccs. Since your genome is about 20 times the size of the first sequence (249Mb), the total amount of your SAM file will also be about 20 times larger.

zhuyixin43 commented 8 months ago

Thank you. I haven’t run ccs yet. I was talking about the Sam files from pbsim3 alone. Are you suggesting that for my first sequence it will have a Sam file with size 249Mb x 20 x 30 ?

Would you suggest me to merge the sequences in my fasta file together to reduce the total amount of space I need? Tens of terabytes seems impossible to work with for me.

Thank you!

zhuyixin43 commented 8 months ago

Since the number of passes is 10, the amount of subreads is about 30 times that of the consensus sequence (FASTQ) generated by ccs.

Sorry did you mean 300 here instead? my coverage is 30 and number of passes is 10.

yukiteruono commented 8 months ago

If the genome size is 5G, the number of passes is 10, and the coverage is 30, the total data amount of the sam file will be about 10T, and the HiFi read will be about 300G. How about dividing the genome file into files for each chromosome (about 500 files) and running simulations (PBSIM+samtools+ccs) separately? Once HiFi reads have been generated by ccs, the sam and bam files for that chromosome can be deleted. If you handle it well, a 1T HDD is sufficient

zhuyixin43 commented 8 months ago

Thank you so much! Your suggestion is very helpful, I will try it next!