xnd-project / libndtypes

Subsumed into xnd
https://xnd.io/
BSD 3-Clause "New" or "Revised" License

Enhancement request: Support nested `var` dimensions #55

Closed acu192 closed 5 years ago

acu192 commented 5 years ago

Diving into ndtypes, I quickly ran into an unsupported case which I feel might be common. I hope to get some info about why this case isn't supported, and also offer to help develop it if everyone agrees it would be valuable.

First some context: I'm interested in using XND for NLP. Such a use-case would leverage some of XND's unique features.

Once this use-case is working on a single machine, it would be a good exercise to scale it out using Dask (i.e. create a dask-xnd package). But that's a future goal, not pertaining to the issue described below.

I can easily create a type to hold the raw corpus:

>>> from ndtypes import ndt
>>> corpus = ndt("var * string")
>>> corpus
ndt("var * string")

I can also easily create a type to hold the tokenized corpus:

>>> corpus_tokenized = ndt("var * var * string")
>>> corpus_tokenized
ndt("var * var * string")

Now, it would be nice to show some design intent here -- I would like to name the outer dimension's size N_DOCS so that readers know what it represents (the number of documents I have in my corpus). Notice below I can do this for my corpus type but not for my corpus_tokenized type.

>>> corpus = ndt("N_DOCS * string")
>>> corpus
ndt("N_DOCS * string")
>>> corpus_tokenized = ndt("N_DOCS * var * string")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: nested or non-uniform var dimensions are not supported

Okay, well, that's fine, I will just stick with corpus_tokenized = ndt("var * var * string"), but I'm going to run into this same error again below.

I need to retain some extra information about each of the documents in my corpus, so I'll use the ndtypes record feature.

>>> corpus = ndt("""
... var * {
...     isbn: string,
...     author: string,
...     date_published: int,
...     genre: categorical('fiction', 'nonfiction'),
...     full_text: string
... }""")
>>> corpus
ndt("var * {isbn : string, author : string, date_published : int, genre : categorical('fiction', 'nonfiction'), full_text : string}")
>>> corpus_tokenized = ndt("""
... var * {
...     isbn: string,
...     author: string,
...     date_published: int,
...     genre: categorical('fiction', 'nonfiction'),
...     tokens: var * string
... }""")
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
TypeError: nested or non-uniform var dimensions are not supported

The workaround here would be to keep the tokens outside of this structure... :(

Looking through the source a bit, I feel it's possible to support nested var dimensions. I haven't fully pieced it together yet, though. One (possibly) simple solution might be to have users wrap inner var dimensions in a ref so that you don't need an offset array for the outer dimension?

Can anyone (@skrah, @saulshanabrook, @pearu) provide some background for me about why this isn't currently supported? Or let me know if I'm thinking about this incorrectly...

Thank you!

acu192 commented 5 years ago

Here's another use-case I'd like to be able to represent (health care patient records):

from ndtypes import ndt

pats = """
N_PATS * {
    first_name: string,
    last_name: string,
    lab_tests: var * {
        test_name: string,
        result: float32,
        abnormal: categorical('normal', 'high', 'low')
    },
    medications: var * {
        med_name: string,
        start_date: Date(uint64),
        end_date: Date(uint64),
        frequency: categorical('q1', 'q2', 'q4', 'q12', 'q24')
    }
}
"""

t = ndt(pats)
print(t)

Output:

Traceback (most recent call last):
  File "bla.py", line 45, in <module>
    t = ndt(pats)
TypeError: nested or non-uniform var dimensions are not supported

pearu commented 5 years ago

Btw, when I started working with xnd/ndtypes, I had exactly the same concerns that you raised in this issue.

The short answer is that mixing variable and fixed dimensions is not supported mainly due to the complexity of the required implementation, IIRC. @skrah can explain all the technical details of what makes it complex and perhaps provide workarounds for your examples. But as I understand it, supporting "N * var * T" would require extending the set of fixed-dimension attributes with additional attributes, which essentially turns such a dimension into a variable one. And even though "var * N * T" does not seem complex to support (as a hack, for instance), it is cleaner not to support it if arbitrary combinations of variable and fixed dimensions, such as "N * var * var * M", cannot be supported.

Btw, one can use "var * (N * T)", but anything in parentheses needs to have fixed dimensions, since it represents a type (which has to have a fixed itemsize).
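To illustrate why that workaround is cheap, here is a plain-Python sketch (illustrative only; the names and layout are assumptions, not the actual ndtypes memory layout). With a fixed inner type like "N * T", every inner item has a constant size, so a single offsets array over the outer var dimension suffices:

```python
# Sketch of "var * (3 * int64)" storage: one flat buffer plus a single
# offsets array, because every inner item is exactly N scalars wide.
N = 3                                             # fixed inner dimension
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # flat scalar buffer
offsets = [0, 1, 3, 4]                            # outer element i spans items offsets[i]..offsets[i+1]

def outer_element(i):
    """Return outer element i as a list of fixed-size rows."""
    start, stop = offsets[i] * N, offsets[i + 1] * N
    chunk = flat[start:stop]
    return [chunk[j:j + N] for j in range(0, len(chunk), N)]

print(outer_element(1))   # middle outer element holds two fixed-size rows
```

Note that an inner var dimension could not be handled this way: its items would no longer have a constant size, so a second offsets array would be required.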

hameerabbasi commented 5 years ago

I guess one could "fix" a variable dimension. Just display it as a number and possibly add verification that the length is always the same.

hameerabbasi commented 5 years ago

The bigger issue I see is something like this being unsupported:

ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: var * string
}""")

Here, there are no fixed dimensions anywhere, but from what I can see a mix of scalars and variable dimensions inside another dimension isn't supported.

pearu commented 5 years ago

I guess "fixing" a variable dimension is not really an option for large data. Consider the example with patients, where the medications record may be highly variable between patients.

hameerabbasi commented 5 years ago

It might be nice to have the option. Consider the case where a 3x3 matrix is associated with some data.

hameerabbasi commented 5 years ago

Ah, I just realised we're talking about different things. By "fixing" I don't mean allocating the maximum possible space; I mean just adding a flag that it's fixed, plus its length and some verification.

skrah commented 5 years ago

There are multiple problems:

1) We have Arrow compatible var dimensions that have the advantage that all the data is in a single, pointer-free memory block. Accessing elements requires offsets, and offset management is already very tricky for leading var dimensions. Last year I proposed a scheme where offsets would be interleaved with the data, but then the data would not be Arrow compatible.

2) Var dimensions are not like e.g. string, which is really variable length. If you have 3 * string, you can of course have three strings with different lengths.

If you have 3 * var * char('utf8'), you can choose the length of the "variable" dimension once, but you cannot have three different lengths, which defeats the purpose. The same applies to nested var dimensions in a record as soon as the records are dtypes of an array.

3) Slicing and especially dropping dimensions is amazingly complex for variable dimensions. One goal is not to make a copy of the immutable offsets. Mixing fixed/var makes slicing even harder.

Point 2) could be addressed by skipping Arrow compatibility and making var dimensions pointer trees. Point 3) is still valid even then.
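As a rough sketch of the offset scheme described in point 1) (plain Python; the names and layout here are assumptions for illustration, not the actual libndtypes/Arrow implementation), nested var dimensions need one offsets array per var level over a single flat, pointer-free buffer:

```python
# "var * var * string" in an Arrow-like layout: a flat data buffer plus
# one offsets array per var level, all pointer-free.
data = ["a", "b", "c", "d", "e"]   # flat string values
inner_offsets = [0, 2, 3, 5]       # token boundaries within each document
outer_offsets = [0, 2, 3]          # document boundaries into inner_offsets

def document(i):
    """Materialize outer element i as a list of token lists."""
    lo, hi = outer_offsets[i], outer_offsets[i + 1]
    return [data[inner_offsets[j]:inner_offsets[j + 1]] for j in range(lo, hi)]

print(document(0))   # first document: two tokens, then one
```

Slicing the outer dimension without copying means reusing these immutable offset arrays with adjusted start indices at every level, which hints at where the complexity in point 3) comes from.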

hameerabbasi commented 5 years ago

My proposal is simple, in a sense. Add a "flag" to var dimensions saying it is fixed in a case like "3 * var * int64", record the length, and verify that every entry has the same length when the flag is set. Slicing will be exactly the same in this case. Import from Arrow will set this flag to false by default.

And this is the secondary (in a sense, more important) objective: mixed scalars and var dims should be supported, like my example in https://github.com/plures/ndtypes/issues/55#issuecomment-431264301, which contains no fixed dimensions, only var dims and scalars.

Later we can extend this by saying that if all the dims below a given dimension are fixed, i.e. if the type is "var * 3 * 3 * int64", we can say that the array x[0] is a fixed-length array. Since the data is already in a contiguous buffer, we only need to calculate the start pointer; the rest is C-contiguous and can be sliced as usual.
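A small sketch of that last idea (plain Python standing in for the pointer arithmetic; the names are assumptions, not proposed API): when everything below the var dimension is fixed, x[i] reduces to a start offset plus a length into a C-contiguous block.

```python
# "var * 3 * 3 * int64": the item below the var dimension has a fixed
# size of 3*3 scalars, so each outer element is one contiguous span.
ITEM = 3 * 3                # scalars per 3x3 matrix
flat = list(range(27))      # contiguous buffer holding 3 matrices
offsets = [0, 1, 3]         # var dim: element 0 has 1 matrix, element 1 has 2

def element_span(i):
    """Start index and length (in scalars) of the contiguous block for x[i]."""
    return offsets[i] * ITEM, (offsets[i + 1] - offsets[i]) * ITEM

start, length = element_span(1)
block = flat[start:start + length]   # sliceable like an ordinary fixed array
print(start, length)                 # 9 18
```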

teoliphant commented 5 years ago

Having a flag for arrow compatibility that changes to allow more flexibility sounds like a good idea.

Have you all looked at TileDB? Using TileDB for default storage of xnd should be investigated.

These examples from Ryan are good test cases.

Travis


acu192 commented 5 years ago

I agree with @hameerabbasi that "var * 3 * 3 * int64" should be supported as well. I hadn't come across that yet, but I feel it would not be hard to support since it only requires one offset array.

Regarding my example:

ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: var * string
}""")

I would propose we require the user wrap the inner var dimension in a ref so that the outer record size is constant -- no offset array needed for the outer records. E.g.

ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: ref(var * string)
}""")

Does that make sense? Am I correct about the ref simplifying things?
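To spell out what I mean by the ref simplifying things (a plain-Python sketch under my assumptions, not how ndtypes would actually implement ref): the record stays fixed-size because the variable-length part lives behind an indirection, so the outer dimension needs no per-record sizes.

```python
# Each record is constant-size: it stores a reference (here, an index)
# into a side store of variable-length token lists, rather than the
# tokens inline.
token_store = [["the", "cat"], ["a", "dog", "ran"]]   # var * string values
records = [
    {"isbn": "111", "tokens_ref": 0},   # fixed-size: index, not data
    {"isbn": "222", "tokens_ref": 1},
]

def tokens(i):
    """Follow record i's reference to its variable-length token list."""
    return token_store[records[i]["tokens_ref"]]
```

The trade-off is that following the reference is a pointer chase, so the data is no longer one flat Arrow-style block.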

skrah commented 5 years ago

This is now supported:

>>> t = ndt("""
... var * {
...   isbn: string,
...   author: string,
...   date_published: int64,
...   genre: categorical('fiction', 'nonfiction'),
...   tokens: var * string
... }""")
>>> 
>>> t
ndt("var * {isbn : string, author : string, date_published : int64, genre : categorical('fiction', 'nonfiction'), tokens : var * string}")

Making types and offsets reference-counted is the reason why this works now.