zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.5k stars 278 forks source link

Make the `ByteRangeRequest` more literate #2437

Open d-v-b opened 2 hours ago

d-v-b commented 2 hours ago

Continuing a conversation from #1661

We have IO operations that can fetch ranges of bytes. That byte range parameter is parameterized by an (optional) 2-tuple of nullable ints, i.e. tuple[int | None, int | None] | None. The tuple part is ambiguous between two similar specifications of a range of integers -- (start, step) and (start, stop); this ambiguity is bad.

We should rework this type so that it's harder to be confused about these two possibilities. One option suggested by @kylebarron would be a dataclass, e.g. :

@dataclasses.dataclass
class ByteRangeRequest:
  start: int | None = 0
  step: int | None = None

It would also be nice if there was an instance of this class (e.g., the default) that conveyed "fetch the whole thing", and then byte_range wouldn't need to be optional in all the store methods. I'm also not sure we get much from None in this type, since there are integers that can express "the end" of a range, e.g. -1. Curious to hear other ideas here.

paraseba commented 2 hours ago

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

d-v-b commented 2 hours ago

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes". Maybe @normanrz can correct me if this is wrong.

kylebarron commented 2 hours ago

@paraseba Suffix requests are already used (https://github.com/zarr-developers/zarr-python/pull/1661#issuecomment-2435452928), we just want to ensure the type is well defined so that semantics are clear.

Most (all?) object stores, for example, allow that type of query.

All except Azure.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes".

From here it's (-100, None) that would express the last 100 bytes: https://github.com/zarr-developers/zarr-python/blob/bc588a760a804f783c4242d4435863a43a5f3f9f/src/zarr/codecs/sharding.py#L700-L702

normanrz commented 2 hours ago

Yes, it is (-100, None)

paraseba commented 2 hours ago

That is great, then making this type more clear is even more important.

normanrz commented 2 hours ago

So, the dataclass would be like this?

@dataclass
class ByteRangeRequest:
  start: int # can be negative for indexing from the back
  length: int | None # None = read to the end