wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0

[Feature request] Set read group in BAM file? #128

Closed akiledal closed 1 year ago

akiledal commented 1 year ago

Would it be possible for CoverM to produce a .bam file with the read group (or sample or library name) set based on the input reads/read pairs? That is, have the RG set to sample1, sample2, etc. for coverm make -c <sample1_R1.fq.gz> <sample1_R2.fq.gz> <sample2_R1.fq.gz> <sample2_R2.fq.gz> ...

Motivation: I'd like to use the bam file generated by CoverM for additional steps in a pipeline, in this particular case SemiBin. Unlike other binners that use a coverage matrix, SemiBin takes per-sample bam files, e.g. SemiBin single_easy_bin -i contig.fa -b S1.bam S2.bam S3.bam -o output. I thought I might be able to split the .bam file from CoverM (via minimap2) by read group with something like gatk SplitReads, but it seems the .bam from CoverM doesn't have read groups set.

Understand if this is beyond the scope of CoverM. Thanks for a great tool!

wwood commented 1 year ago

Hi Anders,

Doesn't coverm make already produce one BAM file per read set? Thanks for the positive comments.

Ben Woodcroft

akiledal commented 1 year ago

Thanks, Ben--indeed it does. I had sample ids in the directory names but not in the actual read file names, which was leading to the single bam file output. Including sample ids in the read file names seems to have done the trick.
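For anyone with the same layout (sample ids only in the directory names), one way to apply this workaround without renaming the originals is to symlink each read file under a name that includes its parent directory. This is only a sketch: link_with_sample_id is a hypothetical helper, not part of CoverM, and it assumes the parent directory name is the sample id and that realpath is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CoverM): symlink each read file into
# OUT_DIR as <parent_directory>_<file_name>, so that the sample id becomes
# part of the read file name.
# Assumption: the parent directory of each read file is named after the sample.
link_with_sample_id() {
    local out_dir=$1
    shift
    local path sample
    mkdir -p "$out_dir"
    for path in "$@"; do
        sample=$(basename "$(dirname "$path")")
        ln -s "$(realpath "$path")" "$out_dir/${sample}_$(basename "$path")"
    done
}

# Usage with hypothetical paths:
#   link_with_sample_id linked_reads sample1/R1.fq.gz sample1/R2.fq.gz \
#       sample2/R1.fq.gz sample2/R2.fq.gz
# creates linked_reads/sample1_R1.fq.gz, linked_reads/sample1_R2.fq.gz, etc.
```

The linked files can then be passed to coverm make in place of the originals.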

wwood commented 1 year ago

Thanks. Actually, I think what might be happening is that the BAM file is being overwritten for each sample, rather than one BAM file being written with all the mappings. 'make' should foresee this and fail with an error for the user, so thanks for the bug report, I guess.
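Until 'make' checks for this itself, a pre-flight check along these lines could catch the collision before any mapping is run. It is a sketch under an assumption taken from this thread, namely that the output BAM name is derived from the read file's base name, so two read files sharing a base name would target the same BAM. duplicate_basenames is a hypothetical helper, not a CoverM command.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of CoverM): print every read-file
# base name that occurs more than once among the arguments.
# Assumption: output BAM names derive from the read file base name, so any
# name printed here marks a set of inputs that would share one output file.
duplicate_basenames() {
    local path
    for path in "$@"; do
        basename "$path"
    done | sort | uniq -d
}

# Example with hypothetical paths where sample ids are only in the directories.
# Only forward reads are passed, since R1 and R2 of a pair differ by design.
duplicate_basenames sample1/R1.fq.gz sample2/R1.fq.gz
```

An empty result means no two inputs share a base name; any printed name is a file name to make unique before running coverm make.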