worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars 23 forks

missing data #70

Open limhasic opened 4 months ago

limhasic commented 4 months ago

The paper says the following about missing values:

"No transformation is done for missing values present in the data. We let the model learn the distribution of the missing values. This strategy gives us the flexibility to let the model impute or generate missing values during the sampling process."

But I get an error when my data contains missing values.

What should I do?

avsolatorio commented 4 months ago

Hello @limhasic, could you please share more details about the error you are getting?

The model should be able to handle NaN values in the raw dataset, and you can choose whether NaN values are imputed or generated in the synthetic data.

To impute, pass the token id of NUMERIC_NA_TOKEN to the sample() method through its suppress_tokens argument, so that the model cannot generate that token:

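```python
from realtabformer.data_utils import NUMERIC_NA_TOKEN

# Placeholder: construct and fit the model as usual.
model = <REaLTabFormer Model>
model.fit(...)

# Suppress the numeric NA token so that sampled values are imputed instead of left missing.
data = model.sample(..., suppress_tokens=[model.vocab["decoder"]["token2id"][NUMERIC_NA_TOKEN]])
```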
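As a rough illustration of what suppressing a token does during sampling, the sketch below masks a token's logit before drawing from the distribution. It is not REaLTabFormer's implementation: the toy vocabulary, the `"<NA>"` token string, and the helper name `suppress_and_sample` are assumptions made up for this example, and it uses only the Python standard library.

```python
import math
import random


def suppress_and_sample(logits, suppress_ids, rng):
    """Sample one token id after masking out the suppressed ids."""
    # A logit of -inf gives the token zero probability after the softmax.
    masked = [(-math.inf if i in suppress_ids else x) for i, x in enumerate(logits)]
    top = max(masked)
    # Unnormalised softmax weights; math.exp(-inf) evaluates to 0.0.
    weights = [math.exp(x - top) for x in masked]
    return rng.choices(range(len(masked)), weights=weights, k=1)[0]


# Toy vocabulary (hypothetical): id 2 plays the role of the missing-value token.
token2id = {"1": 0, "7": 1, "<NA>": 2}
logits = [0.5, 0.1, 3.0]  # without suppression, "<NA>" would be the most likely token

rng = random.Random(0)
samples = [suppress_and_sample(logits, {token2id["<NA>"]}, rng) for _ in range(1000)]
na_count = samples.count(token2id["<NA>"])
```

With the missing-value token masked, every draw has to come from the remaining tokens, which is why suppression amounts to imputation: the model is forced to emit a concrete value where it might otherwise have emitted a missing one.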
limhasic commented 4 months ago

I understand that the model is supposed to learn NaN values as well.

The option you describe seems to contradict that.

Also, the error already occurs at the "model.fit" stage because of the NaN values in the data.
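When reporting an error like this, it can help to include a summary of which columns contain NaN values and what dtype each has, next to the full traceback. The following is a minimal sketch assuming the training data is a pandas DataFrame. The column names, the example values, and the helper name `summarize_missing` are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_missing(df):
    """Return dtype, NaN count, and NaN share for every column that has NaNs."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean(),
    })
    # Keep only the columns that actually contain missing values.
    return summary[summary["n_missing"] > 0]


# Hypothetical example data; replace with the real training frame.
df = pd.DataFrame({
    "age": [31.0, np.nan, 45.0, 52.0],
    "city": ["Seoul", None, "Busan", None],
    "income": [100, 200, 300, 400],
})

report = summarize_missing(df)
print(report)
```

Sharing this table shows whether the missing values sit in numeric or non-numeric columns, which may narrow down where the failure in fit comes from.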