zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

`invalid CIGAR op length` when reference sequence length bigger than ((1>>28) -1) #187

Closed wjwei-handsome closed 1 year ago

wjwei-handsome commented 1 year ago

Hi!

when I using noodles-bam to write some bam records, a Err encountered:

invalid CIGAR op length .

I noticed that: the function overflowing_put_cigar_op_count will write Softclip and Skip when record.cigar().len() can't convert to u16: https://github.com/zaeleus/noodles/blob/c5aadc5f659a13da1ba826a98463b748472fd7b7/noodles-bam/src/record/codec/encoder.rs#L187-L204

But when processing function put_cigar to write SoftClip and Skip, if reference_sequence.length() > 268435455, it will return a Err that i encountered:

https://github.com/zaeleus/noodles/blob/c5aadc5f659a13da1ba826a98463b748472fd7b7/noodles-bam/src/record/codec/encoder/cigar.rs#L21-L36

Actually, some specie's chromosome length is more bigger than 268435455, so I want to hear yours advices.

Thanks!

zaeleus commented 1 year ago

BAM has a limitation of only being able to encode up to 65535 CIGAR operations in a CIGAR field. The workaround is to use a soft clip and a skip operation as a flag to tell a decoder that the CIGAR operations are stored in a auxiliary data field. See § 4.2.2 "N_CIGAR_OP field" (2022-08-22) for more details.

Unfortunately, the case when the alignment record has > 65535 CIGAR operations and the reference sequence length is > 228-1 is incompatible with the BAM format. I can only recommend using SAM or CRAM instead, both of which do not have this limitation.

wjwei-handsome commented 1 year ago

I actually learned this point for the first time (after using bam/sam for so long)!!!

Thank you very much for your answer! I learned useful new knowledge!

In response to this problem, I will try other methods, thank you!

As for why my cigar operation length is so long, it is because I have stored the results of the whole genome comparison, and the starting position of query genome is represented by HardClip.

Best wishes, Wei