worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars 23 forks

How can we create more synthetic data? #77

Closed: limhasic closed this issue 2 months ago

limhasic commented 4 months ago

I tried:

child_samples = child_model.sample(
    n_samples = len(child_df),
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
    gen_batch=64,
)

but passing n_samples=len(child_df) did not work.

limhasic commented 4 months ago

Is this the only alternative?

child_samples = []
for n_child, df in parent_samples.sort_values("n_child").groupby("n_child"):
    print(n_child)
    _child_samples = child_rtf.sample(
        input_unique_ids=df[join_on],
        input_df=df.drop(join_on, axis=1),
        gen_batch=64,
    )
    child_samples.append(_child_samples)

avsolatorio commented 2 months ago

> Is this the only alternative?
>
> child_samples = []
> for n_child, df in parent_samples.sort_values("n_child").groupby("n_child"):
>     print(n_child)
>     _child_samples = child_rtf.sample(
>         input_unique_ids=df[join_on],
>         input_df=df.drop(join_on, axis=1),
>         gen_batch=64,
>     )
>     child_samples.append(_child_samples)

@limhasic, yes, that looks like a reasonable workaround, but whether it is appropriate depends on the application.

In theory, the number of children given a row from the parent table should be learned by the model. But if your use case does not require constraints over the number of children conditional on the parent, your alternative could work.
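One rough way to check whether the number of children per parent matters for a given use case is to compare the children-per-parent distribution in the real and synthetic tables. The sketch below is dependency-free and uses made-up toy keys; the helper name `children_per_parent_hist` and all the data are illustrative assumptions, not part of REaLTabFormer.

```python
from collections import Counter


def children_per_parent_hist(parent_ids, child_parent_ids):
    """Map 'number of children' -> 'number of parents with that many children'.

    parent_ids: every parent key (parents with zero children are counted too).
    child_parent_ids: the join key of every child row.
    """
    per_parent = Counter(child_parent_ids)
    return Counter(per_parent.get(pid, 0) for pid in parent_ids)


# Toy join keys, invented for illustration only.
real_parent_ids = [0, 1, 2, 3]
real_child_keys = [0, 0, 1, 3, 3, 3]      # parent 2 has no children

synth_parent_ids = [0, 1, 2, 3]
synth_child_keys = [0, 1, 1, 2, 3, 3]     # every parent has at least one child

real_hist = children_per_parent_hist(real_parent_ids, real_child_keys)
synth_hist = children_per_parent_hist(synth_parent_ids, synth_child_keys)

# Side-by-side view: n_children, parents in real data, parents in synthetic data.
for n in sorted(set(real_hist) | set(synth_hist)):
    print(n, real_hist[n], synth_hist[n])
```

With actual DataFrames, the per-parent counts could instead come from something like `child_df[join_on].value_counts()`, and the per-group outputs of the loop above could be combined with `pd.concat(child_samples)`. If the two histograms differ substantially and the application depends on those counts, the grouped workaround may not be sufficient.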