How can we create more synthetic data?

limhasic commented 7 months ago

I tried:

child_samples = child_model.sample(
    n_samples=len(child_df),
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
    gen_batch=64,
)

but passing n_samples=len(child_df) didn't work.

limhasic commented 7 months ago

Is this the only alternative?

child_samples = []
for n_child, df in parent_samples.sort_values("n_child").groupby("n_child"):
    print(n_child)
    _child_samples = child_rtf.sample(
        input_unique_ids=df[join_on],
        input_df=df.drop(join_on, axis=1),
        gen_batch=64,
    )
    child_samples.append(_child_samples)

avsolatorio commented 5 months ago

@limhasic, yes, that looks like a reasonable workaround, but whether it fits depends on the application.

In theory, the model should learn the number of children for a given row of the parent table. But if your use case does not require constraints on the number of children conditional on the parent, your alternative could work.
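
For anyone trying the workaround, here is a minimal sketch of the group-then-concatenate pattern in plain pandas. It is not REaLTabFormer code: `sample_children_by_group` and `fake_child_sample` are hypothetical names, and `fake_child_sample` is only a stand-in for the real `child_model.sample(...)` call from the snippets above, so the example runs without a trained model. It assumes the parent samples carry an `n_child` column, as in the workaround.

```python
import pandas as pd


def sample_children_by_group(parent_samples, join_on, sample_fn, group_col="n_child"):
    """Call sample_fn once per group of parents sharing group_col, then stack the results."""
    parts = []
    for _, df in parent_samples.sort_values(group_col).groupby(group_col):
        parts.append(
            sample_fn(
                input_unique_ids=df[join_on],
                input_df=df.drop(join_on, axis=1),
            )
        )
    # One DataFrame instead of a list of per-group DataFrames.
    return pd.concat(parts, ignore_index=True)


def fake_child_sample(input_unique_ids, input_df):
    """Hypothetical stand-in for the model: emits n_child rows per parent id."""
    counts = input_df["n_child"].to_numpy()
    ids = input_unique_ids.to_numpy().repeat(counts)
    return pd.DataFrame({"parent_id": ids})


parent_samples = pd.DataFrame(
    {
        "parent_id": [10, 11, 12],
        "n_child": [2, 1, 2],
        "region": ["a", "b", "a"],
    }
)

child_samples = sample_children_by_group(parent_samples, "parent_id", fake_child_sample)
print(child_samples)
```

In real use you would pass a small wrapper around your trained child model in place of `fake_child_sample`. Note that the stand-in forces exactly `n_child` rows per parent, which the real model does not guarantee; as mentioned above, the number of children per parent is something the model is expected to learn.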

worldbank / REaLTabFormer, issue #77