zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Nondeterministic TBI index creation #213

Status: Closed (holtgrewe closed this issue 1 year ago)

holtgrewe commented 1 year ago

I believe the use of HashMap in the following place in noodles-csi/src/index/reference_sequence.rs creates nondeterministic behaviour when creating TBI indices. I observe this when I index the same .vcf.gz file multiple times and compare the resulting binary index files: they are not byte-identical.

/// A CSI reference sequence.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct ReferenceSequence {
    bins: HashMap<usize, Bin>,
    linear_index: Vec<bgzf::VirtualPosition>,
    metadata: Option<Metadata>,
}

Edit: perhaps indexmap::IndexMap would be a better choice here?
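To illustrate the suspected cause, here is a minimal sketch using only the standard library. It is not noodles code: `serialize`, `serialize_sorted`, `build_hash`, and `build_btree` are hypothetical helpers, and a `u64` payload stands in for a real `Bin`. `std::collections::HashMap` uses a randomly seeded hasher by default, so writing entries in iteration order may produce different bytes for the same contents, whereas an order-defined container (or sorting the keys first) always produces the same bytes.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical stand-in for writing bins: emits (id, payload) pairs
// in whatever order the iterator yields them.
fn serialize<'a, I>(bins: I) -> Vec<u8>
where
    I: IntoIterator<Item = (&'a usize, &'a u64)>,
{
    let mut out = Vec::new();
    for (id, payload) in bins {
        out.extend_from_slice(&(*id as u32).to_le_bytes());
        out.extend_from_slice(&payload.to_le_bytes());
    }
    out
}

// Sorts the entries by key before writing, so the output no longer
// depends on the HashMap's iteration order.
fn serialize_sorted(map: &HashMap<usize, u64>) -> Vec<u8> {
    let mut entries: Vec<(&usize, &u64)> = map.iter().collect();
    entries.sort();
    serialize(entries)
}

// 64 distinct, made-up bin ids with placeholder payloads.
fn build_hash() -> HashMap<usize, u64> {
    (0..64).map(|i| (i * 37 % 4681, i as u64)).collect()
}

fn build_btree() -> BTreeMap<usize, u64> {
    (0..64).map(|i| (i * 37 % 4681, i as u64)).collect()
}

fn main() {
    // Same contents, two HashMap instances: the bytes may or may not match.
    let a = serialize(&build_hash());
    let b = serialize(&build_hash());
    println!("HashMap outputs identical: {}", a == b);

    // BTreeMap iterates in key order, so the bytes always match.
    let c = serialize(&build_btree());
    let d = serialize(&build_btree());
    assert_eq!(c, d);

    // Sorting the HashMap's entries yields the same bytes as the BTreeMap.
    assert_eq!(serialize_sorted(&build_hash()), c);
}
```

Note that sorting by key gives a stable order but not necessarily the insertion order; an insertion-ordered map such as indexmap::IndexMap would keep the order in which bins were added.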

zaeleus commented 1 year ago

Agreed, this can be changed to an ordered map to preserve insertion order, which will allow indices to be (re)serialized in the same way. Thanks for the suggestion!
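As a rough illustration of what "an ordered map to preserve insertion order" means, the sketch below builds a simplified insertion-ordered map from standard-library parts. `OrderedMap` is a hypothetical type written for this example; it is neither the indexmap crate's API nor the change made in noodles. Entries live in a `Vec` in insertion order, and a `HashMap` is used only to look up positions, so iteration never depends on hash order.

```rust
use std::collections::HashMap;

/// Simplified insertion-ordered map (hypothetical, for illustration only).
struct OrderedMap<V> {
    entries: Vec<(usize, V)>,          // entries in insertion order
    positions: HashMap<usize, usize>,  // key -> index into `entries`
}

impl<V> OrderedMap<V> {
    fn new() -> Self {
        Self { entries: Vec::new(), positions: HashMap::new() }
    }

    /// Inserts a value; replacing an existing key keeps its original position.
    fn insert(&mut self, key: usize, value: V) -> Option<V> {
        if let Some(&i) = self.positions.get(&key) {
            return Some(std::mem::replace(&mut self.entries[i].1, value));
        }
        self.positions.insert(key, self.entries.len());
        self.entries.push((key, value));
        None
    }

    fn get(&self, key: usize) -> Option<&V> {
        self.positions.get(&key).map(|&i| &self.entries[i].1)
    }

    /// Iterates in insertion order, independent of the hasher's seed.
    fn iter(&self) -> impl Iterator<Item = (usize, &V)> + '_ {
        self.entries.iter().map(|(k, v)| (*k, v))
    }
}

fn main() {
    let mut bins = OrderedMap::new();
    for id in [4681, 585, 73, 9, 1, 0] {
        bins.insert(id, format!("bin-{id}"));
    }
    // Re-inserting an existing key does not move it.
    bins.insert(73, String::from("bin-73-updated"));

    let ids: Vec<usize> = bins.iter().map(|(id, _)| id).collect();
    assert_eq!(ids, vec![4681, 585, 73, 9, 1, 0]);
    assert_eq!(bins.get(73).map(String::as_str), Some("bin-73-updated"));
}
```

Because iteration follows the `Vec`, writing the entries out in `iter()` order gives the same byte sequence on every run for the same sequence of insertions.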