zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Nondeterministic TBI index creation #213

Closed: holtgrewe closed this issue 10 months ago

holtgrewe commented 10 months ago

I believe the use of HashMap in the following place in noodles-csi/src/index/reference_sequence.rs creates nondeterministic behaviour when creating TBI indices. I observe this when creating indices of the same .vcf.gz file multiple times and comparing the resulting binary index files.

/// A CSI reference sequence.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct ReferenceSequence {
    bins: HashMap<usize, Bin>,
    linear_index: Vec<bgzf::VirtualPosition>,
    metadata: Option<Metadata>,
}

Edit: indexmap::IndexMap would probably be a better choice here?

zaeleus commented 10 months ago

Agreed, this can be changed to an ordered map to preserve insertion order, which will allow indices to be (re)serialized in the same way. Thanks for the suggestion!
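
For readers following along, here is a minimal sketch of why the map type matters for byte-identical output. It is not the noodles code: `serialize`, `build_hashed`, and `build_sorted` are hypothetical helpers, the `u32` value stands in for a real `Bin`, and the byte layout is made up for illustration. It also uses `std::collections::BTreeMap` as a stand-in so the example needs no external crate. Note that `BTreeMap` iterates in key order, whereas `indexmap::IndexMap` (the type suggested above) iterates in insertion order; either gives a reproducible order, which the default `HashMap` does not guarantee.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical serializer: writes each (bin id, value) pair as two
/// little-endian u32s, in whatever order the map yields its entries.
fn serialize<'a, I>(bins: I) -> Vec<u8>
where
    I: IntoIterator<Item = (&'a usize, &'a u32)>,
{
    let mut out = Vec::new();
    for (id, value) in bins {
        out.extend_from_slice(&(*id as u32).to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Builds a HashMap. With the default hasher (randomly seeded), iteration
/// order is unspecified and can differ between maps and between runs.
fn build_hashed(ids: &[usize]) -> HashMap<usize, u32> {
    ids.iter().map(|&id| (id, 1)).collect()
}

/// Builds a BTreeMap. Iteration is always in ascending key order,
/// regardless of the order in which entries were inserted.
fn build_sorted(ids: &[usize]) -> BTreeMap<usize, u32> {
    ids.iter().map(|&id| (id, 1)).collect()
}
```

Calling `serialize(&build_sorted(&ids))` yields the same bytes on every run for the same set of ids, while `serialize(&build_hashed(&ids))` always has the same length but may order the entries differently from one run to the next.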