PacBio and Illumina FASTQ Files - one coverage from multiple input files

wwood commented 2 years ago

Discussed in https://github.com/wwood/CoverM/discussions/120

Originally posted by **kevinmyers** on June 28, 2022:

I have a set of 217 MAGs, and I would like to use CoverM to determine their overall coverage across a number of FASTQ files. For some of our experiments, we have both Illumina and PacBio sequencing files. Should I run CoverM on the Illumina and PacBio sequencing files separately, or is it okay to run everything together? I'm relatively new to CoverM and this kind of thing. Our goal is to determine the relative abundance of the MAGs across different experiments, if that helps.

rhysnewell commented 1 year ago

Rosella has a method for handling both long and short read inputs at once. Aviary also has a method for combining the output of multiple CoverM runs.

This person should definitely run CoverM on each read type by itself though. Combining the outputs afterwards is simple enough to do.
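As a rough illustration of what "combining the outputs" could look like, here is a minimal Rust sketch that joins two per-genome tables on the genome name. It assumes each table is tab-separated with one header line, the genome name in the first column and a single numeric value in the second; real CoverM output may have more columns, and the names `parse_table` and `combine` as well as the sample data are made up for this example.

```rust
use std::collections::BTreeMap;

/// Parse a two-column, tab-separated table (genome name, value) with one header line.
/// Assumption: this layout is a simplification of a per-genome coverage table.
fn parse_table(text: &str) -> BTreeMap<String, f64> {
    text.lines()
        .skip(1) // skip the header row
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let genome = cols.next()?.trim();
            let value = cols.next()?.trim().parse::<f64>().ok()?;
            Some((genome.to_string(), value))
        })
        .collect()
}

/// Join two tables on genome name. A genome missing from one table gets `None`
/// for that read type instead of being dropped.
fn combine(
    short: &BTreeMap<String, f64>,
    long: &BTreeMap<String, f64>,
) -> BTreeMap<String, (Option<f64>, Option<f64>)> {
    let mut out = BTreeMap::new();
    for (genome, value) in short {
        out.entry(genome.clone()).or_insert((None, None)).0 = Some(*value);
    }
    for (genome, value) in long {
        out.entry(genome.clone()).or_insert((None, None)).1 = Some(*value);
    }
    out
}

fn main() {
    // Invented example data, one table per read type.
    let illumina = "Genome\tillumina Mean\nMAG_001\t12.5\nMAG_002\t0.0\n";
    let pacbio = "Genome\tpacbio Mean\nMAG_001\t3.2\nMAG_003\t7.1\n";

    let merged = combine(&parse_table(illumina), &parse_table(pacbio));

    let fmt = |v: Option<f64>| v.map_or("NA".to_string(), |x| x.to_string());
    println!("Genome\tshort_read\tlong_read");
    for (genome, (short, long)) in &merged {
        println!("{}\t{}\t{}", genome, fmt(*short), fmt(*long));
    }
}
```

Keeping the two values in separate columns, rather than summing them, leaves it up to the user how to weigh the two read types against each other.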

wwood commented 1 year ago

Right, the code to implement this would mostly be about bookkeeping, i.e. which read sets go with each other, which mapping parameters to use for each, etc.

Is there Rust code somewhere that would be worth copying (or learning) from?

rhysnewell commented 1 year ago

I can already think of some much better ways of handling this than what is implemented in Rosella/Lorikeet. What's in Rosella was done ages ago and, as such, is probably not very viable.

It basically involves having entirely separate command line flags for long reads, so long reads get separated from short reads from the start. The problem is that you then end up juggling a bunch of different struct types, which can change depending on whether you are using BAMs, performing read mapping, or using different read types.

I think a better method would be to have all of the BAM generator structs contained within an enum; that way you don't have to worry about types so much when passing the BAM generators into functions. The variants of the enum would then carry information such as the read type.

Users could potentially provide multiple read types via the command line alongside a list of mapping parameters to use. The mapping parameter list would presumably match up with the order of the provided reads:
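To make the enum idea a bit more concrete, here is a minimal sketch. None of these types come from CoverM, Rosella or Lorikeet: `ReadType`, `ExistingBam`, `MappedReads` and `BamGenerator` are hypothetical stand-ins, and the real generator structs would of course stream BAM records rather than just hold strings.

```rust
/// Hypothetical read-type tag; the variants are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReadType {
    Short,
    Nanopore,
    PacBio,
}

/// Stand-in for a generator that streams records from an existing BAM file.
struct ExistingBam {
    path: String,
}

/// Stand-in for a generator that maps reads and streams the resulting records.
struct MappedReads {
    reads: Vec<String>,
    mapper: String,
}

/// One enum wraps every kind of generator, so functions accept a single type.
/// Each variant also records which read type it carries.
enum BamGenerator {
    FromBam { source: ExistingBam, read_type: ReadType },
    FromMapping { source: MappedReads, read_type: ReadType },
}

impl BamGenerator {
    fn read_type(&self) -> ReadType {
        match self {
            BamGenerator::FromBam { read_type, .. }
            | BamGenerator::FromMapping { read_type, .. } => *read_type,
        }
    }

    fn describe(&self) -> String {
        match self {
            BamGenerator::FromBam { source, read_type } => {
                format!("{:?} reads from existing BAM {}", read_type, source.path)
            }
            BamGenerator::FromMapping { source, read_type } => format!(
                "{:?} reads [{}] mapped with {}",
                read_type,
                source.reads.join(", "),
                source.mapper
            ),
        }
    }
}

/// Downstream code takes a slice of the enum and never names the concrete structs.
fn summarise(generators: &[BamGenerator]) -> Vec<String> {
    generators.iter().map(|g| g.describe()).collect()
}

fn main() {
    let generators = vec![
        BamGenerator::FromBam {
            source: ExistingBam { path: "illumina.bam".to_string() },
            read_type: ReadType::Short,
        },
        BamGenerator::FromMapping {
            source: MappedReads {
                reads: vec!["pacbio_1.fq.gz".to_string()],
                mapper: "minimap2-pb".to_string(),
            },
            read_type: ReadType::PacBio,
        },
    ];
    for line in summarise(&generators) {
        println!("{}", line);
    }
}
```

The trade-off compared with a trait object is that every new generator kind needs a new variant and match arm, but the compiler then flags every place that has to handle it.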

-c read1_1.fq.gz read1_2.fq.gz --single nanopore_1.fq.gz --single pacbio_1.fq.gz -p minimap2-sr minimap2-nanopore minimap2-pb
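The bookkeeping behind a command line like that could be sketched roughly as follows. This is only an illustration of matching read sets to mapping parameters by position; `ReadSet` and `pair_with_mappers` are hypothetical names, the actual argument parsing is left out, and the mapper strings are simply copied from the proposal above rather than checked against what CoverM accepts.

```rust
/// Hypothetical description of one read set given on the command line.
#[derive(Debug, Clone, PartialEq)]
enum ReadSet {
    /// Paired reads, e.g. the two files given after `-c`.
    Coupled(String, String),
    /// One unpaired file, e.g. a file given after `--single`.
    Single(String),
}

/// Pair each read set with the mapping parameter at the same position.
/// Returns an error when the two lists differ in length, since the pairing
/// would then be ambiguous.
fn pair_with_mappers(
    read_sets: Vec<ReadSet>,
    mappers: Vec<String>,
) -> Result<Vec<(ReadSet, String)>, String> {
    if read_sets.len() != mappers.len() {
        return Err(format!(
            "{} read set(s) but {} mapping parameter(s)",
            read_sets.len(),
            mappers.len()
        ));
    }
    Ok(read_sets.into_iter().zip(mappers).collect())
}

fn main() {
    // Read sets in the order they appear in the proposed command line.
    let read_sets = vec![
        ReadSet::Coupled("read1_1.fq.gz".to_string(), "read1_2.fq.gz".to_string()),
        ReadSet::Single("nanopore_1.fq.gz".to_string()),
        ReadSet::Single("pacbio_1.fq.gz".to_string()),
    ];
    // Mapping parameters in the order given after `-p` in the proposal.
    let mappers = vec![
        "minimap2-sr".to_string(),
        "minimap2-nanopore".to_string(),
        "minimap2-pb".to_string(),
    ];

    match pair_with_mappers(read_sets, mappers) {
        Ok(pairs) => {
            for (reads, mapper) in pairs {
                println!("{:?} -> {}", reads, mapper);
            }
        }
        Err(e) => eprintln!("error: {}", e),
    }
}
```

Failing on a length mismatch, instead of silently reusing the last mapper, seems the safer default when the pairing depends purely on order.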

This is all jumbled in my head, might be best to talk to you about it

PacBio and Illumina FASTQ Files - one coverage from multiple input files #121