Open wenleix opened 2 years ago
- `HomoNumericDataFrame`: Useful to model a group of dense features (backed by a 2D Tensor with shape `(batch, num_features)`).
- `GenericDataFrame`: Backed by `Dict[str, Union[ColumnBase, DataFrameBase]]`.
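To make the dict-backed idea concrete, here is a minimal sketch. The class names `NumericColumn` and `GenericDataFrame` below, and the `flat_names` helper, are hypothetical stand-ins for illustration and not the actual Axolotls API; a plain Python list stands in for a 1D tensor so the sketch has no dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class NumericColumn:
    """Hypothetical leaf column; a plain list stands in for a 1D tensor."""
    values: List[float]

    def __len__(self) -> int:
        return len(self.values)


@dataclass
class GenericDataFrame:
    """Hypothetical dict-backed frame; a child is a column or another frame."""
    columns: Dict[str, Union[NumericColumn, "GenericDataFrame"]] = field(default_factory=dict)

    def __len__(self) -> int:
        # All children are expected to share the same number of rows.
        lengths = {len(child) for child in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError(f"children disagree on row count: {sorted(lengths)}")
        return lengths.pop() if lengths else 0

    def flat_names(self, prefix: str = "") -> List[str]:
        """Dotted paths of all leaf columns, depth first."""
        names: List[str] = []
        for name, child in self.columns.items():
            path = f"{prefix}{name}"
            if isinstance(child, GenericDataFrame):
                names.extend(child.flat_names(prefix=path + "."))
            else:
                names.append(path)
        return names


user = GenericDataFrame({"age": NumericColumn([31.0, 45.0]), "score": NumericColumn([0.2, 0.9])})
frame = GenericDataFrame({"label": NumericColumn([1.0, 0.0]), "user": user})
print(len(frame), frame.flat_names())
```

Because a child can itself be a frame, nesting falls out of the same dict representation without a separate concept.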
- `GenericListColumn`: The classic "flattened values" + offsets encoding:

      class GenericListColumn:
          values: Union[ColumnBase, DataFrameBase]
          lengths: Optional[torch.Tensor]
          offsets: Optional[torch.Tensor]

  Note that the standard columnar database / Arrow encoding provides `offsets`, but in ML `lengths` seems to be preferred more often :). (Velox also uses both `offsets` and `lengths`.)
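Since the column may carry either `lengths` or `offsets`, here is a rough sketch of how the two relate. The helper names (`lengths_to_offsets`, `offsets_to_lengths`, `get_row`) are made up for this example, and plain Python lists stand in for `torch.Tensor`; it assumes Arrow-style offsets with `n + 1` entries starting at 0.

```python
from itertools import accumulate
from typing import List, Sequence


def lengths_to_offsets(lengths: Sequence[int]) -> List[int]:
    """Arrow-style offsets: n + 1 entries, starting at 0 (running sum of lengths)."""
    return [0, *accumulate(lengths)]


def offsets_to_lengths(offsets: Sequence[int]) -> List[int]:
    """Each length is the difference between two consecutive offsets."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


def get_row(values: Sequence[float], offsets: Sequence[int], i: int) -> List[float]:
    """Row i is the slice of the flattened values between offsets[i] and offsets[i + 1]."""
    return list(values[offsets[i]:offsets[i + 1]])


# Three rows: [1, 2], [], [3, 4, 5]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
lengths = [2, 0, 3]
offsets = lengths_to_offsets(lengths)

print(offsets)                      # [0, 2, 2, 5]
print(get_row(values, offsets, 2))  # [3.0, 4.0, 5.0]
```

Either field can be derived from the other in one pass, so storing both as `Optional` mostly trades memory for skipping that pass.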
- `NumericListColumn`: To represent a `List[Numeric]` column. It seems no different from `GenericListColumn`, except that `values` can be a 1D Tensor directly (e.g. to avoid one level of wrapper, though I'm not sure whether that's a big deal :) ).
- For `List[List[Numeric]]`, can we leverage `NestedTensor` as the internal representation?
- `Struct[NumericList]`: similar to `KeyedJaggedTensor`.
Will this be useful at preproc time?
- Should `DataFrameBase` be a subclass of `ColumnBase`? Or can it just be a type alias to `StructColumnBase`? :)

Revised design based on initial iterations:
- `NumericListColumn` specialization (which can support not only `List[Numeric]` but also `List[List*[Numeric]]`, so the underlying storage can potentially leverage `NestedTensor`).
- `FixedSizeListCol` seems to be quite a good fit for a feature-first Tensor layout.
- `FixedTypeStructColumn` seems to capture the use case of `KeyedJaggedTensor` quite well :)
- `StructColumn`, while it provides some DataFrame APIs, to reduce the concepts introduced to users, as Axolotls is designed to be a library that model developers interface with directly.
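To illustrate why a struct of same-typed list columns maps onto a `KeyedJaggedTensor`-like layout, here is a rough sketch. It assumes the layout is a list of keys, one shared flattened values buffer, and lengths stored key-major (all rows of the first key, then all rows of the second, and so on). The functions `pack_struct_of_lists` and `unpack_key` are hypothetical helpers, not TorchRec or Axolotls APIs, and plain lists stand in for tensors.

```python
from typing import Dict, List, Sequence, Tuple


def pack_struct_of_lists(
    columns: Dict[str, List[List[int]]],
) -> Tuple[List[str], List[int], List[int]]:
    """Pack same-typed list columns into (keys, values, lengths), key-major."""
    keys = list(columns)
    values: List[int] = []
    lengths: List[int] = []
    for key in keys:
        for row in columns[key]:
            values.extend(row)
            lengths.append(len(row))
    return keys, values, lengths


def unpack_key(
    keys: Sequence[str],
    values: Sequence[int],
    lengths: Sequence[int],
    key: str,
    batch_size: int,
) -> List[List[int]]:
    """Recover the rows of one key from the shared buffers."""
    k = keys.index(key)
    # Skip the values that belong to all earlier keys.
    start = sum(lengths[: k * batch_size])
    rows: List[List[int]] = []
    for n in lengths[k * batch_size : (k + 1) * batch_size]:
        rows.append(list(values[start : start + n]))
        start += n
    return rows


columns = {
    "f0": [[10, 11], [], [12]],
    "f1": [[20], [21], [22, 23, 24]],
}
keys, values, lengths = pack_struct_of_lists(columns)
print(keys)     # ['f0', 'f1']
print(values)   # [10, 11, 12, 20, 21, 22, 23, 24]
print(lengths)  # [2, 0, 1, 1, 1, 3]
print(unpack_key(keys, values, lengths, "f1", batch_size=3))  # [[20], [21], [22, 23, 24]]
```

The packing only works because every field has the same element type, which is the constraint `FixedTypeStructColumn` would encode.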