wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0

coverm seems only used the forward reads #194

Open quliping opened 9 months ago

quliping commented 9 months ago

CoverM is useful for calculating the abundance of genomes. However, I am confused about its output. Here are two command lines that I have tried (one with '--coupled', one with '-1' and '-2'):

1.  
coverm genome --coupled ${workdir}/01_READ_QC/${line}_1.fastq ${workdir}/01_READ_QC/${line}_2.fastq -x fa -d ${genome_dir} -p minimap2-sr -m relative_abundance -o ${outdir}/${line}.comverm-out.txt --output-format dense -t $threads --bam-file-cache-directory ${outdir}/temp_dir
2. 
coverm genome -1 ${workdir}/01_READ_QC/${line}_1.fastq -2 ${workdir}/01_READ_QC/${line}_2.fastq -x fa -d ${genome_dir} -p minimap2-sr -m relative_abundance -o ${outdir}/${line}.comverm-out.txt --output-format dense -t $threads --bam-file-cache-directory ${outdir}/temp_dir

However, CoverM seems to have used only the forward reads of each pair. Is this just an error in the descriptive text, or did the software really use only the forward reads?


rhysnewell commented 9 months ago

Sample names in the coverage file are generated from the forward reads when using the -1 and -2 parameters. Both reads of each pair are used during mapping.

quliping commented 9 months ago

Thank you for your reply. Sorry for so many questions, but I have a few more.

  1. How can I make CoverM use the BAM files generated by a previous run of itself? I specified the BAM files with the -b parameter, but there was an error.

  2. I understand the TPM calculation for a single contig, but how does CoverM calculate the TPM for a genome, given that each contig in a genome has its own, different TPM value? Also, is the TPM value of a genome similar to the genome abundance calculated by the quant_bins module of metaWRAP?

  3. If I want to compare the abundance of one or multiple MAGs across different samples, but these MAGs are only a subset of all MAGs retrieved from these samples, or were even obtained from other unrelated samples, which method is best? For example, I have 12 genomes (12 different species) of a genus; some were retrieved from my 80 samples and some are reference genomes. I want to know the abundance of the genus in the 80 samples. Which CoverM method should I choose? TPM seems inappropriate because I will get 80 totals of '1,000,000', so I can only compare the relative abundance of the 12 species within each sample rather than the abundance of the entire genus across the 80 samples. More generally, what are the suitable situations for the different methods? (For example, I would use the 'relative_abundance' method to estimate whether my MAGs (all MAGs retrieved from all samples) cover most members of the community. If the value for my MAGs is 100%, it means the binning worked very well and I retrieved all species of the community. Am I right?)

Thanks a lot!

wwood commented 8 months ago

Hi, apologies for the slow response.

  1. Use the -s '~' flag. When mapping using a bam file cache directory, CoverM uses the ~ separator to associate contig names and genomes.
  2. There's no reason you cannot calculate TPM for a genome - you just consider all contigs at once using the number of reads mapped to all contigs.
  3. There's a lot of sub-questions mixed up in this one. To me it isn't clear what you are trying to assess exactly. The total abundance of your genus? The relative abundance of each species in the genus? etc. Plain old relative_abundance is probably fine for most of these questions.
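
To make point 2 concrete, here is a minimal sketch of a genome-level TPM calculation in which all contigs of a genome are pooled before normalising. It is an illustration under stated assumptions, not CoverM's actual code: the function name `genome_tpm`, the input layout, and the example numbers are all hypothetical, and the formula is the standard TPM definition (reads per base, scaled so that the values sum to one million).

```python
def genome_tpm(genomes):
    """Return a genome-level TPM for each genome.

    genomes: dict mapping genome name -> list of (mapped_reads, contig_length)
    tuples, one tuple per contig. (Hypothetical input layout for illustration.)
    """
    rates = {}
    for name, contigs in genomes.items():
        # Pool all contigs of the genome: total reads over total length.
        total_reads = sum(reads for reads, _ in contigs)
        total_length = sum(length for _, length in contigs)
        rates[name] = total_reads / total_length

    # Standard TPM scaling: each rate as a share of all rates, times one million.
    denominator = sum(rates.values())
    return {name: rate / denominator * 1_000_000 for name, rate in rates.items()}


# Made-up example data: two genomes, one of them split over two contigs.
example = {
    "genome_A": [(300, 1000), (100, 1000)],  # 400 reads over 2000 bp
    "genome_B": [(600, 2000)],               # 600 reads over 2000 bp
}
tpm = genome_tpm(example)
print(tpm)
```

Because the values are rescaled within each sample, they always sum to one million over the genomes supplied, which is why TPM cannot by itself show how the total abundance of a fixed set of genomes changes between samples.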

clementcoclet commented 7 months ago

Hi Ben,

I'm currently facing some challenges while trying to use coverm genome. I have a folder containing (sorted) BAM files and another folder with genomes, each of which consists of multiple member sequences.

Initially, I ran the command: coverm genome -b LOXAHATCHEE_PROJECT/04_READ_MAPPING/*/*_sorted.bam -d LOXAHATCHEE_PROJECT/07_BINNING/07A_vRHYME_OUTPUT/vRhyme_best_bins_fasta/ -x fasta

However, I encountered an error similar to the one reported by another user: "Error: There are no found reference sequences that are a part of a genome." Following your suggestion, I added the -s '~' option: coverm genome -b LOXAHATCHEE_PROJECT/04_READ_MAPPING/*/*_sorted.bam -d LOXAHATCHEE_PROJECT/07_BINNING/07A_vRHYME_OUTPUT/vRhyme_best_bins_fasta -x fasta -s '~'

Unfortunately, this led to another error, stating that --genome-fasta-directory cannot be used when the -s argument is provided. It appears that we need to use the --reference argument in such cases. However, using --reference prevents us from using the bam folder, as --bam-files cannot be used when the --reference argument is provided.

My understanding is that when using the -s argument, we are unable to use existing BAM files. Could you provide any insights or recommendations on how to address this situation?

I appreciate your assistance. Clément

wwood commented 7 months ago

CoverM needs to know which contigs belong to which genomes. You are confusing it by providing the info twice, once with -s and once with genome fasta files.

You shouldn't need to specify -r with -s. Just -b and -s should do it.
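
As a rough illustration of what a separator does, the sketch below groups reference sequence names by the text before the first `~`. It is a hypothetical helper, not CoverM's implementation: the function name `group_contigs_by_genome` and the example names are made up, and it assumes the genome name is the part of the reference name that precedes the separator.

```python
from collections import defaultdict


def group_contigs_by_genome(reference_names, separator="~"):
    """Group reference names of the form '<genome><separator><contig>' by genome.

    Hypothetical helper for illustration; assumes the genome name precedes
    the first occurrence of the separator.
    """
    genomes = defaultdict(list)
    for name in reference_names:
        if separator not in name:
            # Without the separator there is no way to tell which genome
            # this sequence belongs to from its name alone.
            raise ValueError(f"no {separator!r} separator in reference name {name!r}")
        genome, contig = name.split(separator, 1)
        genomes[genome].append(contig)
    return dict(genomes)


# Made-up reference names such as might appear in a BAM header.
names = ["bin1~contig_1", "bin1~contig_2", "bin2~contig_7"]
grouped = group_contigs_by_genome(names)
print(grouped)
```

In this sketch, reference names that do not contain the separator cannot be assigned to any genome, so the names in the BAM file need to carry that information for a separator-based approach to work.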

Hth

-------------- Ben Woodcroft Group leader, Centre for Microbiome Research, QUT

