TomNicholas opened this issue 4 months ago
But what if you did want to estimate the total size of the dataset?
Yes, fair question; that doubt is why I haven't merged #227 yet.
Perhaps we should have two ways to display the size: the normal `ds.nbytes`, and one using the accessor, e.g. `ds.virtualize.nbytes`. I'm not sure whether it would be more intuitive for the normal `ds.nbytes` to be the actual in-memory size used or the memory that would be taken up by the whole dataset, though.
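For illustration, here's a minimal sketch of what such an accessor property could look like, using xarray's accessor registration; the class name and the `manifest.nbytes` attribute are assumptions for the sketch, not the actual VirtualiZarr implementation:

```python
import xarray as xr


@xr.register_dataset_accessor("virtualize")
class VirtualizeAccessor:
    """Hypothetical accessor exposing the size of the references themselves."""

    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    @property
    def nbytes(self) -> int:
        # Sum the in-memory size of each variable's references, falling back
        # to the ordinary nbytes for variables that aren't virtual.
        return sum(
            var.data.manifest.nbytes  # assumed attribute, for illustration
            if hasattr(var.data, "manifest")
            else var.nbytes
            for var in self._ds.variables.values()
        )
```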
+1 on the accessor idea to show the virtual dataset size while maintaining the behavior of the current `.nbytes` attribute.
Xarray uses the optional property `.nbytes` to indicate the size of wrapped arrays. Currently we don't implement `.nbytes` on `ManifestArray`s, so xarray defaults to estimating the size as basically `arr.size * arr.dtype.itemsize`. I.e. currently it returns what the full size of the dataset would be if you loaded every referenced chunk into memory at once. But does this make sense for an array that can never be loaded into memory?

There is another, completely different size to consider: that of the in-memory representation of the references themselves; see the discussion in #104. This is a known fixed number (the references are actually held in memory, not lazily loaded), but it's much smaller than the current `nbytes`.
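To make the two numbers concrete, here's a rough sketch of how each would be computed; the manifest layout (numpy arrays of paths, offsets, and lengths) is an assumption for illustration, not necessarily the real `ChunkManifest` internals:

```python
import numpy as np


def loaded_nbytes(arr) -> int:
    # xarray's default estimate: the size the data *would* occupy if every
    # referenced chunk were loaded into memory at once.
    return arr.size * arr.dtype.itemsize


def reference_nbytes(manifest) -> int:
    # The other number: the in-memory footprint of the references themselves,
    # assuming they're stored as numpy arrays of paths, offsets, and lengths.
    return manifest.paths.nbytes + manifest.offsets.nbytes + manifest.lengths.nbytes
```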
This latter number is what's relevant if you're trying to estimate RAM usage whilst manipulating references, so it's possibly related to the `__sizeof__` discussion in https://github.com/pydata/xarray/issues/5764.
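If we went that route, `__sizeof__` is the hook that `sys.getsizeof` consults, so a `ManifestArray` could report the reference footprint there while keeping `.nbytes` (whichever meaning we pick) separate. A toy sketch, with all internals assumed:

```python
import sys

import numpy as np


class ManifestArraySketch:
    """Toy stand-in for ManifestArray, purely for illustration."""

    def __init__(self, shape, dtype, paths, offsets, lengths):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)
        # references assumed to be stored as numpy arrays
        self._paths, self._offsets, self._lengths = paths, offsets, lengths

    @property
    def nbytes(self) -> int:
        # what xarray currently estimates: the full size if loaded
        return int(np.prod(self.shape)) * self.dtype.itemsize

    def __sizeof__(self) -> int:
        # actual RAM held by the references, surfaced via sys.getsizeof()
        return (
            object.__sizeof__(self)
            + self._paths.nbytes
            + self._offsets.nbytes
            + self._lengths.nbytes
        )
```

With something like this, `sys.getsizeof(ma)` would return the small, known footprint of the references, while `ma.nbytes` keeps reporting the would-be size of the data.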