Closed acu192 closed 6 years ago
Here's another use-case I'd like to be able to represent (health care patient records):
from ndtypes import ndt
pats = """
N_PATS * {
    first_name: string,
    last_name: string,
    lab_tests: var * {
        test_name: string,
        result: float32,
        abnormal: categorical('normal', 'high', 'low'),
    },
    medications: var * {
        med_name: string,
        start_date: Date(uint64),
        end_date: Date(uint64),
        frequency: categorical('q1', 'q2', 'q4', 'q12', 'q24'),
    },
}
"""
t = ndt(pats)
print(t)
Output:
Traceback (most recent call last):
File "bla.py", line 45, in <module>
t = ndt(pats)
TypeError: nested or non-uniform var dimensions are not supported
Btw, when I started working with xnd/ndtypes, I had exactly the same concerns that you raised in this issue.
The short answer is that mixing variable and fixed dimensions is not supported, mainly due to the complexity of the required implementation, IIRC. @skrah can explain all the technical details of what makes it complex and perhaps provide workarounds for your examples. But as I understand it, supporting "N * var * T" requires that the set of fixed-dimension attributes be extended with additional attributes, making such an extension essentially a variable dimension. Even though "var * N * T" does not seem complex to support (as a hack, for instance), it is cleaner not to support it if an arbitrary combination of variable and fixed dimensions cannot be supported, for instance "N * var * var * M", etc.
Btw, one can use "var * (N * T)", but anything in parentheses needs to have fixed dimensions, as it represents a type (which has to have a fixed itemsize).
I guess one could "fix" a variable dimension. Just display it as a number and possibly add verification that the length is always the same.
The bigger issue I see is something like this being unsupported:
ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: var * string
}""")
Here, there are no fixed dimensions anywhere, but from what I can see a mix of scalars and variable dimensions inside another dimension isn't supported.
I guess "fixing" a variable dimension is not really an option for large data. Consider the example with patients, where the medications record may be highly variable between patients.
It might be nice to have the option. Consider the case where a 3x3 matrix is associated with some data.
Ah, I just realised we're talking about different things: by "fixing" I don't mean allocating the maximum possible space, I mean just adding a flag that it's fixed, recording its length, and adding some verification.
There are multiple problems:
1) We have Arrow compatible var dimensions that have the advantage that all the data is in a single, pointer-free memory block. Accessing elements requires offsets, and offset management is already very tricky for leading var dimensions. Last year I proposed a scheme where offsets would be interleaved with the data, but then the data would not be Arrow compatible.
2) Var dimensions are not like e.g. string
, which is really variable length. If you have 3 * string
, you can of course have three strings with different lengths.
If you have 3 * var * char('utf8')
, you can choose the length of the "variable" dimension once but not have three different lengths, which defeats the purpose. The same applies to nested var dimensions in a record, as soon as the records are dtypes of an array.
3) Slicing and especially dropping dimensions is amazingly complex for variable dimensions. One goal is not to make a copy of the immutable offsets. Mixing fixed/var makes slicing even harder.
Point 2) could be addressed by skipping Arrow compatibility and making var dimensions pointer trees. Point 3) is still valid even then.
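To make the offset-management issue in point 1) concrete, here is a pure-Python sketch (an illustration, not the actual ndtypes implementation) of nested Arrow-style var dimensions such as "var * var * int64": all values live in one flat, pointer-free block, and each var level carries its own offsets array.

```python
# Sketch of Arrow-style nested var dimensions: one contiguous value
# block plus one offsets array per var level.

values        = [1, 2, 3, 4, 5, 6]
inner_offsets = [0, 2, 3, 3, 6]   # 4 inner lists: [1,2], [3], [], [4,5,6]
outer_offsets = [0, 1, 4]         # 2 outer rows: 1 inner list, then 3

def inner(i):
    """The i-th innermost list, read via the inner offsets."""
    return values[inner_offsets[i]:inner_offsets[i + 1]]

def outer(k):
    """The k-th outer row: a slice over the inner lists."""
    return [inner(i) for i in range(outer_offsets[k], outer_offsets[k + 1])]

print(outer(0))  # -> [[1, 2]]
print(outer(1))  # -> [[3], [], [4, 5, 6]]
```

Every lookup goes through two offset indirections, and slicing either dimension means reinterpreting (or rewriting) one of the offsets arrays, which is the complexity the comment above describes.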
My proposal is simple, in a sense. Add a "flag" to var dimensions saying it is fixed in a case like "3 * var * int64", make note of the length, and verify that everything is the same length if it is fixed. Slicing will be exactly the same in this case. Import from Arrow will set this flag to false by default.
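The verification step of such a proposal could be sketched like this (a hypothetical helper, not ndtypes code): a var dimension qualifies for the "fixed" flag only if its offsets array describes rows of one common length.

```python
# Hypothetical check for the proposed "fixed" flag on a var dimension:
# every row described by the offsets array must have the same length.

def check_fixed(offsets):
    """Return the common row length, or None if row lengths differ."""
    lengths = {offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)}
    return lengths.pop() if len(lengths) == 1 else None

print(check_fixed([0, 3, 6, 9]))  # -> 3     (could be flagged as "3 * ...")
print(check_fixed([0, 2, 3, 6]))  # -> None  (genuinely variable)
```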
And this is the secondary (in a sense, more important) objective: mixed scalars and var dims should be supported, like my example in https://github.com/plures/ndtypes/issues/55#issuecomment-431264301, which contains no fixed dimensions, only var dims and scalars.
Later we can extend this by saying that if all the dims below a given dimension are fixed, i.e. if the type is "var * 3 * 3 * int64", then the array x[0] is a fixed-length array. Since the data is already in a contiguous buffer, we only need to calculate the start pointer; the rest is C-contiguous and can be sliced as usual.
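A pure-Python sketch (not the ndtypes internals) of the "var * 3 * 3 * int64" idea: because every dimension below the leading var is fixed, x[k] starts at a computed offset and is C-contiguous from there, so no further offset arrays are needed.

```python
# Sketch of "var * 3 * 3 * int64": a variable number of fixed-size
# 3x3 blocks stored contiguously; indexing is pure arithmetic.

BLOCK = 3 * 3
data = list(range(2 * BLOCK))   # two 3x3 matrices, flattened
offsets = [0, 2]                # the leading var dim holds 2 matrices

def matrix(k):
    """The k-th 3x3 matrix: compute the start, then read contiguously."""
    assert offsets[0] <= k < offsets[-1]
    start = k * BLOCK           # C-contiguous from this point on
    flat = data[start:start + BLOCK]
    return [flat[r * 3:(r + 1) * 3] for r in range(3)]

print(matrix(1))  # -> [[9, 10, 11], [12, 13, 14], [15, 16, 17]]
```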
Having a flag for arrow compatibility that changes to allow more flexibility sounds like a good idea.
Have you all looked at TileDB? Using TileDB for default storage of xnd should be investigated.
These examples from Ryan are good test cases.
Travis
I agree with @hameerabbasi that "var * 3 * 3 * int64"
should be supported as well. I hadn't come across that yet, but I feel it would not be hard to support since it only requires one offset array.
Regarding my example:
ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: var * string
}""")
I would propose we require the user to wrap the inner var dimension in a ref so that the outer record size is constant -- no offset array needed for the outer records. E.g.
ndt("""var * {
    isbn: string,
    author: string,
    date_published: int,
    genre: categorical('fiction', 'nonfiction'),
    tokens: ref(var * string)
}""")
Does that make sense? Am I correct about the ref
simplifying things?
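To illustrate the intuition behind the ref proposal (a plain-Python analogy, not the ndtypes memory layout): if the variable-length field is stored out of line and each record holds only a fixed-size reference to it, every record has the same size and the outer dimension needs no per-record offset bookkeeping. The sketch ignores how the string fields themselves are stored.

```python
# Analogy for ref(var * string): the variable-length tokens live in an
# out-of-line pool, and each record stores a fixed-size "pointer"
# (here: a pool index), keeping the record itself fixed-size.

token_pool = []   # out-of-line storage for the variable-length field

def make_record(isbn, author, tokens):
    token_pool.append(tokens)      # store tokens out of line
    ref = len(token_pool) - 1      # fixed-size reference to them
    return (isbn, author, ref)     # the record's shape never varies

books = [
    make_record("0-306-40615-2", "A. Author", ["some", "tokens"]),
    make_record("0-140-44926-2", "B. Writer", ["a", "longer", "token", "list"]),
]

# Dereference: follow the ref to reach the variable-length part.
print(token_pool[books[1][2]])  # -> ['a', 'longer', 'token', 'list']
```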
This is now supported:
>>> t = ndt("""
... var * {
... isbn: string,
... author: string,
... date_published: int64,
... genre: categorical('fiction', 'nonfiction'),
... tokens: var * string
... }""")
>>>
>>> t
ndt("var * {isbn : string, author : string, date_published : int64, genre : categorical('fiction', 'nonfiction'), tokens : var * string}")
Making types and offsets reference-counted is the reason why this can work now.
Diving into ndtypes, I quickly ran into an unsupported case which I feel might be common. I hope to get some info about why this case isn't supported, and also offer to help develop it if everyone agrees it would be valuable.
First, some context: I'm interested in using XND for NLP. Such a use-case will leverage some of XND's unique features:
Once this use-case is working on a single machine, it would be a good exercise to scale it out using Dask (i.e. create a dask-xnd package). But that's a future goal, not pertaining to the issue described below.
I can easily create a type to hold the raw corpus:
I can also easily create a type to hold the tokenized corpus:
Now, it would be nice to show some design intent here -- I would like to name the outer dimension's size N_DOCS so that readers know what it represents (the number of documents I have in my corpus). Notice below that I can do this for my corpus type but not for my corpus_tokenized type.
Okay, well, that's fine, I will just stick with corpus_tokenized = ndt("var * var * string"), but I'm going to run into this same error again below.
I need to retain some extra information about each of the documents in my corpus, so I'll use the records feature of ndtypes.
The workaround here would be to keep the tokens outside of this structure... :(
Looking through the source a bit, I feel it's possible to support nested var dimensions. I haven't fully pieced it together yet, though. One (possibly) simple solution might be to have users wrap inner var dimensions in a ref so that you don't need an offset array for the outer dimension.
Can anyone (@skrah, @saulshanabrook, @pearu) provide some background for me about why this isn't currently supported? Or let me know if I'm thinking about this incorrectly...
Thank you!