zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

noodles sam successfully reads the record but fails to write it #228

Closed natir closed 6 months ago

natir commented 6 months ago

Hello,

I was creating random SAM records (for test data sets), and when I tried to convert them to BAM via noodles_util, I hit a strange bug.

How to reproduce:

  1. create test.sam
  2. run cargo run --example sam_count -- ~/tmp.sam -> get 5
  3. run cargo run --example sam_view -- ~/tmp.sam -> Error !

sam_view crashes with Error: Kind(InvalidInput) on record record_@]APF. I haven't been able to find an explanation of what's wrong with this record (maybe the flag, but I'm not sure).

test.sam

```
@HD VN:1.0
@SQ SN:1 LN:2147483647
@SQ SN:2 LN:2147483647
@SQ SN:3 LN:2147483647
@SQ SN:4 LN:2147483647
@SQ SN:5 LN:2147483647
@SQ SN:6 LN:2147483647
@SQ SN:7 LN:2147483647
@SQ SN:8 LN:2147483647
@SQ SN:9 LN:2147483647
@SQ SN:10 LN:2147483647
@SQ SN:11 LN:2147483647
@SQ SN:12 LN:2147483647
@SQ SN:13 LN:2147483647
@SQ SN:14 LN:2147483647
@SQ SN:15 LN:2147483647
@SQ SN:16 LN:2147483647
@SQ SN:17 LN:2147483647
@SQ SN:18 LN:2147483647
@SQ SN:19 LN:2147483647
@SQ SN:22 LN:2147483647
@SQ SN:X LN:2147483647
@SQ SN:Y LN:2147483647
@SQ SN:MT LN:2147483647
@SQ SN:chr1 LN:2147483647
@SQ SN:chr2 LN:2147483647
@SQ SN:chr3 LN:2147483647
@SQ SN:chr4 LN:2147483647
@SQ SN:chr5 LN:2147483647
@SQ SN:chr6 LN:2147483647
@SQ SN:chr7 LN:2147483647
@SQ SN:chr8 LN:2147483647
@SQ SN:chr9 LN:2147483647
@SQ SN:chr10 LN:2147483647
@SQ SN:chr11 LN:2147483647
@SQ SN:chr12 LN:2147483647
@SQ SN:chr13 LN:2147483647
@SQ SN:chr14 LN:2147483647
@SQ SN:chr15 LN:2147483647
@SQ SN:chr16 LN:2147483647
@SQ SN:chr17 LN:2147483647
@SQ SN:chr18 LN:2147483647
@SQ SN:chr19 LN:2147483647
@SQ SN:chr22 LN:2147483647
@SQ SN:chrX LN:2147483647
@SQ SN:chrY LN:2147483647
@SQ SN:chrMT LN:2147483647
record_`IbUX 4025 chrX 3136 74 50M * 0 50 gAAtCGCgtGTTAGTTAagccAcggtAatGcTtgtaCgcAGgAtaTcgAA 2?8C,30C5-D.$.=A@2/&='6A0A$@D&4,1+=!/'@ED:C577DF%"
record_D]MO] 2169 chr18 7114 16 50M * 0 50 cAtgCtGCAAtTacCGtTAAcaGGtatTCaTCctcTGgAActTgCGAcaA FG>!$!3A6+9#(7E7-"=BH3?"6;%13=A-?!2FH
record_@]APF 427 10 13635 47 50M * 0 50 aCGctGagattTGtgCttaAGggTcCTGcGTAGCTGTCCACgTTTGagtG >61-B'!01"'!H":,=$*$6*-95FH5D2?BA,+@58%75BH0D?G0+@
record_dE^c] 115 10 61882 50 50M * 0 50 CTacgtCTaTgTCAGgCtaGTtcCCTcgcTgAgGgAtCAAatTCTATTGT H/6DHFB;'.<<&0A=(@9!DA+-D/,:*B7C+'=07$C&&C9%H;B=!6
record_E]PA` 3624 chr2 17136 111 50M * 0 50 AtaatcaCtGcTAGCCAgaTTgcAaTtaTGgACTTagGgtATACCtcTct .'/!$D()7D,',GB55&(!**$F=@0?3G183F?>6<.C$$6AB2FH4#
```
zaeleus commented 6 months ago

The reader will now decode a superset of the specification, but the writer still requires valid fields. record_@]APF is not a valid read name because it includes the @ symbol. See § 1.4 "The alignment section: mandatory fields" (2023-05-24):

| Col | Field | Type | Regexp/Range | Brief description |
|-----|-------|------|--------------|-------------------|
| 1 | QNAME | String | `[!-?A-~]{1,254}` | Query template NAME |
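
For anyone else puzzled by that pattern: `[!-?A-~]` covers the printable ASCII range `!` through `~` with a single gap, and the only character in the gap is `@`. Here is a minimal sketch of the same check in plain Rust, using only the standard library. The function name `is_valid_qname` is made up for illustration and is not part of noodles.

```rust
/// Returns true if `name` matches the SAM QNAME pattern `[!-?A-~]{1,254}`.
///
/// Hypothetical helper for illustration only; this is not the noodles API.
fn is_valid_qname(name: &str) -> bool {
    let bytes = name.as_bytes();

    // {1,254}: at least one and at most 254 characters.
    (1..=254).contains(&bytes.len())
        // [!-?A-~]: printable ASCII from '!' (0x21) to '~' (0x7E),
        // skipping '@' (0x40), which sits between '?' and 'A'.
        && bytes
            .iter()
            .all(|&b| matches!(b, b'!'..=b'?' | b'A'..=b'~'))
}
```

With this check, `record_@]APF` is rejected because of the `@`, while the other four read names in test.sam pass (the backtick, `]`, and `^` all fall inside `A-~`).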
natir commented 6 months ago

Thanks a lot.

After wasting several long minutes, I finally understood what this regex meant.

Perhaps a more explicit error message (one that indicates the field concerned) would be useful, but I know it's not necessarily easy.
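
To make the suggestion concrete, here is a rough sketch of what such an error could look like. Everything in it (the `InvalidField` type, its fields, and `check_qname`) is hypothetical and written only for this comment; it does not reflect how noodles structures its errors.

```rust
use std::{error::Error, fmt};

/// Hypothetical error that names the offending field; not a noodles type.
#[derive(Debug)]
struct InvalidField {
    field: &'static str,
    value: String,
    reason: String,
}

impl fmt::Display for InvalidField {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid {} {:?}: {}", self.field, self.value, self.reason)
    }
}

impl Error for InvalidField {}

/// Checks a read name against `[!-?A-~]{1,254}` and reports what failed.
fn check_qname(name: &str) -> Result<(), InvalidField> {
    // Small helper so both failure paths carry the field name and value.
    let err = |reason: String| InvalidField {
        field: "QNAME",
        value: name.to_string(),
        reason,
    };

    if name.is_empty() || name.len() > 254 {
        return Err(err(format!("length {} is outside 1..=254", name.len())));
    }

    // Find the first character outside the allowed set, if any.
    match name
        .char_indices()
        .find(|&(_, c)| !matches!(c, '!'..='?' | 'A'..='~'))
    {
        Some((i, c)) => Err(err(format!(
            "character {:?} at byte offset {} is not allowed",
            c, i
        ))),
        None => Ok(()),
    }
}
```

For the record that failed here, the message would read `invalid QNAME "record_@]APF": character '@' at byte offset 7 is not allowed`, which points straight at the problem instead of a bare `Kind(InvalidInput)`.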