zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Question: different coalescing of bytes for index queries? #222

Closed (tshauck closed 7 months ago)

tshauck commented 8 months ago

Hi,

I'm not sure I'm conceptualizing this correctly, but I was wondering about the feasibility of "coalescing" the byte ranges returned by a query in ways other than into only contiguous chunks, which is how I think it currently works. Put another way, is it possible to return adjacent, smaller chunks that together constitute the entire chunk?

As an oversimplification, say a query of an indexed VCF file for a given range would return a single chunk of 1-1000. Would it be possible to instead return, say, 10 chunks of size 100? The reason I ask is that I'd like to request the ranges in parallel (e.g., on S3), work on them separately, then reconstitute the results.

Thanks!

zaeleus commented 7 months ago

This is not feasible with how chunks are (typically) stored in binning indices. Indexers (not just noodles) merge overlapping chunks to reduce the index size, i.e., chunks do not tend to represent individual records. Chunks must start and end at record boundaries, so you can't arbitrarily split a chunk into smaller segments. Information to recover the original set of chunks is effectively lost.

While optimize_chunks does merge overlapping chunks, it's more useful for completely overlapping intervals (bin hierarchies). E.g., take the two chunks 8-13 and 5-21. These are merged into a single chunk 5-21 because it fully encloses 8-13, which prevents both rereading a chunk and duplicating records.
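
To make the 8-13 and 5-21 example concrete, here is a minimal sketch of interval merging in plain Rust. It is not the noodles implementation of optimize_chunks: the Chunk struct and merge_chunks function are hypothetical names for illustration, and positions are treated as plain integers in half-open ranges, which is an assumption.

```rust
/// A chunk as a half-open range of positions, `start..end`.
/// Hypothetical type for illustration; this is not the noodles `Chunk`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Chunk {
    start: u64,
    end: u64,
}

/// Merges chunks that overlap or touch into the smallest covering set.
fn merge_chunks(mut chunks: Vec<Chunk>) -> Vec<Chunk> {
    // Sort by start so that overlapping chunks become neighbors.
    chunks.sort_by_key(|c| (c.start, c.end));

    let mut merged: Vec<Chunk> = Vec::with_capacity(chunks.len());

    for chunk in chunks {
        if let Some(last) = merged.last_mut() {
            // The chunk starts inside (or exactly at the end of) the previous
            // merged chunk, so extend that chunk instead of adding a new one.
            if chunk.start <= last.end {
                last.end = last.end.max(chunk.end);
                continue;
            }
        }

        // There is a gap before this chunk, so it starts a new merged chunk.
        merged.push(chunk);
    }

    merged
}
```

Calling merge_chunks on the chunks 8-13 and 5-21 yields the single chunk 5-21. Note that the output no longer says where 8-13 began or ended, which is why the original set of chunks cannot be recovered from the merged list.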

I would recommend starting with the chunk list given by BinningIndex::query, though the size of each chunk will be unpredictable. You can also get a list of raw bins if you want all of the chunks pre-merge by manually calling ReferenceSequence::query.
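
For the parallel-fetch use case, each chunk in the list can be turned into a byte range of the compressed file and requested independently. The sketch below assumes the index is over a BGZF-compressed file (as with tabix and CSI indices), where chunk boundaries are 64-bit virtual positions: the upper 48 bits are the offset of a BGZF block in the compressed file and the lower 16 bits are an offset into that block's uncompressed data. The function names are hypothetical and not part of noodles; the chunk start and end are passed as raw u64 values rather than library types.

```rust
/// Largest possible size of a single BGZF block, in bytes (64 KiB).
const MAX_BGZF_BLOCK_SIZE: u64 = 1 << 16;

/// Splits a BGZF virtual position into
/// (offset of the block in the compressed file, offset within the uncompressed block).
fn split_virtual_position(vpos: u64) -> (u64, u16) {
    (vpos >> 16, (vpos & 0xffff) as u16)
}

/// Returns a half-open range of compressed-file bytes that is large enough to
/// contain every BGZF block the chunk touches.
///
/// The end position only gives the *start* of the last block, so the range is
/// padded by the maximum block size. This is a conservative upper bound and
/// may extend past the last block or past the end of the file.
fn compressed_byte_range(chunk_start: u64, chunk_end: u64) -> std::ops::Range<u64> {
    let (first_block, _) = split_virtual_position(chunk_start);
    let (last_block, _) = split_virtual_position(chunk_end);
    first_block..last_block + MAX_BGZF_BLOCK_SIZE
}
```

Each resulting range can be fetched by a separate worker. After decompression, the worker still has to skip to the in-block offset of the chunk start and stop at the in-block offset of the chunk end, and records outside the queried interval still need to be filtered out.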

tshauck commented 7 months ago

Thanks for the info! Will play around with the methods you linked to.