zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

Add new BAM example for full PG entry #245

Closed jamestwebber closed 4 months ago

jamestwebber commented 5 months ago

This is an example script that adds a full PG tag to the header of a BAM file (sample.bam is the output of bam_write):

@HD VN:1.6
@PG ID:noodles-bam
@PG ID:bam_add_tag  PN:bam_add_tag  PP:noodles-bam  VN:0.57.0   CL:target/release/examples/bam_add_tag_to_header sample.bam
@PG ID:samtools PN:samtools PP:bam_add_tag  VN:1.19 CL:samtools view --with-header
@CO an example BAM written by noodles-bam
*   4   *   0   255 *   *   0   0   *   *

I spent some time yesterday figuring out how to do this, and I figured the example might be useful. I'm also opening the PR to see if there are any improvements you can see.

The name could be more clear, perhaps bam_add_pg_entry would be better. People might expect "tag" to refer to the reads.

Constructing the PN, PP, VN, and CL tags seems cumbersome. Maybe these should be built-in tags (I believe they're part of the spec). Or if there's a way to construct them more cleanly I'd be happy to use that.

jamestwebber commented 5 months ago

I modified this to use the defined tags in map::program::tag, but it's okay to close this for now until you feel the API is ready. This PR was just a sneaky way for me to ask about those tags. 😂

zaeleus commented 4 months ago

The SAM header programs are now wrapped by sam::header::Programs, which includes Programs::add. This handles either creating a new program chain or adding to existing ones, e.g.,

https://github.com/zaeleus/noodles/blob/f23008df69ee5c1a84b662f502815354d638d6b8/noodles-sam/src/header/programs.rs#L52-L69

Note how pg0 is the root of the chain and pg1 links to pg0 via the previous program ID (PP) tag. It also covers more convoluted cases, e.g.,

https://github.com/zaeleus/noodles/blob/f23008df69ee5c1a84b662f502815354d638d6b8/noodles-sam/src/header/programs.rs#L229-L291

I added creating the self program record in the reheader examples, e.g.,

https://github.com/zaeleus/noodles/blob/f23008df69ee5c1a84b662f502815354d638d6b8/noodles-bam/examples/bam_reheader.rs#L15-L23

$ cargo --quiet run --example bam_write | cargo --quiet run --example bam_reheader /dev/stdin | samtools view --no-PG --with-header
@HD VN:1.6
@PG ID:noodles-bam
@PG ID:noodles  PN:bam_reheader VN:0.59.0   CL:target/debug/examples/bam_reheader /dev/stdin    PP:noodles-bam
@CO an example BAM written by noodles-bam
@CO a comment added by noodles-bam
*   4   *   0   255 *   *   0   0   *   *

Thanks for initial exploration of this issue!