zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Why `Flags::PROPERLY_ALIGNED` instead of `Flags::PROPERLY_PAIRED` as name? #236

Closed ghuls closed 5 months ago

ghuls commented 5 months ago

Why `Flags::PROPERLY_ALIGNED` instead of `Flags::PROPERLY_PAIRED`? As far as I am aware, this flag is never set for single-end reads, so `Flags::PROPERLY_ALIGNED` looks confusing to me.

samtools flags output:

$ samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment
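
To make the concern concrete: the SAM specification says that when bit 0x1 is unset, no assumptions can be made about bit 0x2. Below is a minimal sketch using plain `u16` masks taken from the table above. The constant names and the helper `proper_pair_without_paired` are made up for this example and are not part of noodles or HTSlib.

```rust
// Bit masks copied from the `samtools flags` table.
const PAIRED: u16 = 0x1;
const PROPER_PAIR: u16 = 0x2;

/// True if the read comes from paired-end / multi-segment sequencing.
fn is_paired(flag: u16) -> bool {
    (flag & PAIRED) != 0
}

/// True if the aligner marked each segment as properly aligned.
fn is_proper_pair(flag: u16) -> bool {
    (flag & PROPER_PAIR) != 0
}

/// Hypothetical sanity check: flags that set 0x2 without 0x1.
/// Such a combination has no defined meaning for single-end reads.
fn proper_pair_without_paired(flag: u16) -> bool {
    is_proper_pair(flag) && !is_paired(flag)
}
```

For example, `proper_pair_without_paired(0x63)` is false because 0x63 sets both 0x1 and 0x2, while `proper_pair_without_paired(0x2)` is true.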

Noodles flag names:

    #[test]
    fn test_contains() {
        assert!(Flags::SEGMENTED.is_segmented());
        assert!(Flags::PROPERLY_ALIGNED.is_properly_aligned());
        assert!(Flags::UNMAPPED.is_unmapped());
        assert!(Flags::MATE_UNMAPPED.is_mate_unmapped());
        assert!(Flags::REVERSE_COMPLEMENTED.is_reverse_complemented());
        assert!(Flags::MATE_REVERSE_COMPLEMENTED.is_mate_reverse_complemented());
        assert!(Flags::FIRST_SEGMENT.is_first_segment());
        assert!(Flags::LAST_SEGMENT.is_last_segment());
        assert!(Flags::SECONDARY.is_secondary());
        assert!(Flags::QC_FAIL.is_qc_fail());
        assert!(Flags::DUPLICATE.is_duplicate());
        assert!(Flags::SUPPLEMENTARY.is_supplementary());
    }

The HTSlib documentation lists the mandatory SAM fields and the meanings of the flag values:

[DESCRIPTION](https://www.htslib.org/doc/sam.html#DESCRIPTION)

Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started with the `@` symbol, each alignment line consists of:

Col | Field | Description
-- | -- | --
1 | QNAME | Query template/pair NAME
2 | FLAG | bitwise FLAG
3 | RNAME | Reference sequence NAME
4 | POS | 1-based leftmost POSition/coordinate of clipped sequence
5 | MAPQ | MAPping Quality (Phred-scaled)
6 | CIGAR | extended CIGAR string
7 | MRNM | Mate Reference sequence NaMe (`=` if same as RNAME)
8 | MPOS | 1-based Mate POSition
9 | TLEN | inferred Template LENgth (insert size)
10 | SEQ | query SEQuence on the same strand as the reference
11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality)
12+ | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE

Each bit in the FLAG field is defined as:

Flag | Chr | Description
-- | -- | --
0x0001 | p | the read is paired in sequencing
0x0002 | P | the read is mapped in a proper pair
0x0004 | u | the query sequence itself is unmapped
0x0008 | U | the mate is unmapped
0x0010 | r | strand of the query (1 for reverse)
0x0020 | R | strand of the mate
0x0040 | 1 | the read is the first read in a pair
0x0080 | 2 | the read is the second read in a pair
0x0100 | s | the alignment is not primary
0x0200 | f | the read fails platform/vendor quality checks
0x0400 | d | the read is either a PCR or an optical duplicate
0x0800 | S | the alignment is supplementary

where the second column gives the string representation of the FLAG field.

https://www.htslib.org/doc/sam.html
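
The string representation described in the quoted documentation can be sketched as follows. This is only an illustration of the mapping in the flag table: `FLAG_CHARS` and `flag_string` are hypothetical names, not functions from HTSlib or noodles.

```rust
// (bit, character) pairs copied from the FLAG table in the quoted documentation.
const FLAG_CHARS: [(u16, char); 12] = [
    (0x0001, 'p'), // paired in sequencing
    (0x0002, 'P'), // mapped in a proper pair
    (0x0004, 'u'), // query unmapped
    (0x0008, 'U'), // mate unmapped
    (0x0010, 'r'), // query on reverse strand
    (0x0020, 'R'), // mate on reverse strand
    (0x0040, '1'), // first read in a pair
    (0x0080, '2'), // second read in a pair
    (0x0100, 's'), // not primary
    (0x0200, 'f'), // fails quality checks
    (0x0400, 'd'), // PCR or optical duplicate
    (0x0800, 'S'), // supplementary
];

/// Hypothetical helper: emit one character per set bit, lowest bit first.
fn flag_string(flag: u16) -> String {
    FLAG_CHARS
        .iter()
        .filter(|(bit, _)| (flag & bit) != 0)
        .map(|(_, c)| *c)
        .collect()
}
```

With this mapping, a flag of 0x63 (bits 0x1, 0x2, 0x20, and 0x40) yields `pPR1`, which shows that `P` only makes sense alongside `p`.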

zaeleus commented 5 months ago

This was likely named "properly aligned" in the context of knowing it's typically only set for segmented reads. The specification definition is a bit fuzzy since it's up to the aligner to determine whether the segments are sensibly placed.

Retaining the "proper" terminology, is renaming PROPERLY_ALIGNED to PROPERLY_SEGMENTED sufficient?

ghuls commented 5 months ago

> Retaining the "proper" terminology, is renaming PROPERLY_ALIGNED to PROPERLY_SEGMENTED sufficient?

I think so.
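
For reference, one way such a rename could be staged without breaking existing callers is to keep the old name as a deprecated alias. The sketch below is an assumption about how this might look and uses a stand-in `Flags` type written for this example; it is not the actual noodles definition or the change that was made.

```rust
/// Illustrative stand-in for a SAM flags type (NOT the noodles implementation).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Flags(u16);

impl Flags {
    /// Bit 0x2 under the name proposed in this thread.
    pub const PROPERLY_SEGMENTED: Self = Self(0x2);

    /// Old name kept as an alias so existing code still compiles, with a warning.
    #[deprecated(note = "use `Flags::PROPERLY_SEGMENTED` instead")]
    pub const PROPERLY_ALIGNED: Self = Self::PROPERLY_SEGMENTED;

    /// Returns the raw bit pattern.
    pub fn bits(self) -> u16 {
        self.0
    }

    /// Returns true if every bit set in `other` is also set in `self`.
    pub fn contains(self, other: Self) -> bool {
        (self.0 & other.0) == other.0
    }

    /// Query method under the new name.
    pub fn is_properly_segmented(self) -> bool {
        self.contains(Self::PROPERLY_SEGMENTED)
    }

    /// Old query method, forwarding to the new one.
    #[deprecated(note = "use `Flags::is_properly_segmented` instead")]
    pub fn is_properly_aligned(self) -> bool {
        self.is_properly_segmented()
    }
}
```

Callers using `Flags::PROPERLY_ALIGNED` would then see a deprecation warning pointing at the new name instead of a compile error.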