yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

Storage space issue #36

Open sagarc88 opened 1 week ago

sagarc88 commented 1 week ago

Hi,

I am trying to use PBSIM3 to simulate HiFi Revio reads from a simulated reference genome with some insertions of interest. I use the following command:

pbsim \
  --strategy wgs \
  --genome refGenome.fasta \
  --method errhmm \
  --errhmm /usr/local/data/ERRHMM-SEQUEL.model \
  --depth 30 \
  --pass-num 10 \
  --prefix /home/Sim_Reads \
  --length-min 100 \
  --length-max 1000000 \
  --length-mean 9000 \
  --length-sd 7000 > stdout.log

I am running this on a porcine genome (SusScrofa11.1), which also contains ~700 scaffolds. Even with 4TB of storage space available, the run still runs out of disk space. Is this expected? Is there any way to reduce the disk usage, or am I running something incorrectly?

Thank you for your help.

yukiteruono commented 1 week ago

Thank you for using PBSIM3. Your command will generate about 4.5TB of SAM files. Other users have suggested that PBSIM3 should output BAM files instead of SAM files, and we are reconsidering whether to do so, but we cannot implement it immediately. As a workaround, run PBSIM3 with --depth=10, convert the SAM files to BAM files, and delete the SAM files. We recommend repeating this three times to reach the full 30x depth. Change --prefix and --id-prefix for each run to avoid duplicate IDs when merging later.
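
The workaround described in this reply could be scripted roughly as follows. This is only a sketch under assumptions, not an official PBSIM3 recipe: the per-run prefix `Sim_Reads_run<i>`, the ID prefix `R<i>_`, the log file names, and the two function names are made up for illustration, and the `<prefix>_*.sam` glob assumes that PBSIM3 names its SAM output after `--prefix` (check the files your version actually writes). The remaining options are copied from the command in the question, and the conversion uses `samtools view -b`.

```shell
#!/usr/bin/env bash
# Sketch of the batching workaround: several low-depth runs instead of one
# 30x run, converting SAM to BAM after each run to limit peak disk usage.
set -euo pipefail

# Convert every SAM file whose name starts with the given prefix to BAM and
# delete the SAM only after a successful conversion.
# ASSUMPTION: output files are named "<prefix>_*.sam".
sam_to_bam() {
  local prefix="$1" sam
  for sam in "${prefix}"_*.sam; do
    [ -e "$sam" ] || continue                      # glob matched nothing
    samtools view -b -o "${sam%.sam}.bam" "$sam" || return 1
    rm -- "$sam"
  done
}

# Run PBSIM3 `runs` times at `depth` each (for example 3 x 10 = 30x total),
# with a distinct --prefix and --id-prefix per run.
simulate_in_batches() {
  local runs="$1" depth="$2" i prefix
  for i in $(seq 1 "$runs"); do
    prefix="Sim_Reads_run${i}"                     # hypothetical per-run prefix
    pbsim \
      --strategy wgs \
      --genome refGenome.fasta \
      --method errhmm \
      --errhmm /usr/local/data/ERRHMM-SEQUEL.model \
      --depth "$depth" \
      --pass-num 10 \
      --prefix "$prefix" \
      --id-prefix "R${i}_" \
      --length-min 100 \
      --length-max 1000000 \
      --length-mean 9000 \
      --length-sd 7000 > "stdout_run${i}.log"
    sam_to_bam "$prefix"
  done
}
```

Calling `simulate_in_batches 3 10` from the directory that should hold the output would then perform three 10x runs. Because each run's SAM files are converted and removed before the next run starts, peak disk usage should stay near that of a single 10x run plus the accumulated BAM files.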

sagarc88 commented 1 week ago

Thank you for the prompt reply.

For a depth of 30X, we generally get a HiFi reads file of about 50Gb from PacBio. Are these files so large because they still need to be run through the ccs command?

As a feature request, it would be helpful to multithread each chromosome and output the sam file to stdout. This way, the user can easily convert the file to BAM or write it as SAM.

I will use the approach you have suggested for now. Thank you very much for your help.