worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars 23 forks

Conditional generation? #48

Closed gminorcoles closed 7 months ago

gminorcoles commented 11 months ago

Hi, I found your work today, I think while googling about overfitting and data copying in these kinds of models. The ideas around DCR and the Q metric are very interesting. I need to generate data conditionally: I have some labels for medical imaging data, and I need to sample using the timestamp and a label as the conditioning information.

Could I add the conditional data to the text of the input? Is this a use case which anyone has explored?

thanks

Contributorrandom commented 11 months ago

This is an interesting use case. Did you find any leads?

gminorcoles commented 11 months ago

I have found some tutorials and examples showing how to add conditional fine-tuning to pretrained GPT-2, so I think I have a handle on how to do it. There are a lot of layers of code in this project, however, so it's a bit of work to do. I am also looking at breaking out the Trainer so that it is persistent across fit() calls, so that I can train in batches on lots of data, but the Trainer is also embedded in a lot of nested code. I was working on a different approach to generating data, using diffusion models, before I found this project, and I need to test that more before moving to this.

Contributorrandom commented 11 months ago

Okay, cool. I was thinking of a different use case.

avsolatorio commented 9 months ago

Hello @gminorcoles, thanks for looking into this project! Conditional generation is possible, provided that your conditioning variables are the first columns of your table.

There is a seed_input argument that can then be used to sample conditionally, given the values provided for the first N columns.

See: https://github.com/worldbank/REaLTabFormer/blob/4f3fa54a52085985b6e03e29dd71850b2a9ba324/src/realtabformer/realtabformer.py#L1159C4-L1159C4
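As a rough sketch of what this could look like in practice: the two small helpers below are hypothetical (they are not part of REaLTabFormer) and only move the conditioning columns to the front and pair them with seed values. The `sample_conditionally` wrapper is also hypothetical, and it assumes that `seed_input` accepts a `{column: value}` dictionary for the leading columns, which is worth checking against the linked source before relying on it.

```python
from typing import Any, Dict, List, Sequence


def order_for_conditioning(columns: Sequence[str], condition_cols: Sequence[str]) -> List[str]:
    """Return the column names reordered so the conditioning columns come first."""
    missing = [c for c in condition_cols if c not in columns]
    if missing:
        raise KeyError(f"conditioning columns not in table: {missing}")
    rest = [c for c in columns if c not in condition_cols]
    return list(condition_cols) + rest


def build_seed_input(condition_cols: Sequence[str], values: Sequence[Any]) -> Dict[str, Any]:
    """Pair each conditioning column with the value to condition on."""
    if len(condition_cols) != len(values):
        raise ValueError("need exactly one value per conditioning column")
    return dict(zip(condition_cols, values))


def sample_conditionally(df, condition_cols, values, n_samples=10):
    """Hypothetical wrapper: fit on the reordered table, then sample with a seed.

    Needs pandas and realtabformer installed; nothing here runs at import time.
    """
    from realtabformer import REaLTabFormer  # imported lazily on purpose

    # Conditioning columns must be the first columns of the training table.
    ordered = df[order_for_conditioning(list(df.columns), condition_cols)]
    model = REaLTabFormer(model_type="tabular")
    model.fit(ordered)
    # Assumption: seed_input accepts a {column: value} dict for the leading columns.
    seed = build_seed_input(condition_cols, values)
    return model.sample(n_samples=n_samples, seed_input=seed)
```

For the use case in this thread, `order_for_conditioning(["image_id", "timestamp", "value", "label"], ["timestamp", "label"])` would put `timestamp` and `label` first, so a model trained on the reordered table can be seeded with those two values. The column names here are made up for illustration.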

Indeed, the code is a bit convoluted currently. I need to find more time to refactor this and make things simpler! 😅

But any contributions are welcome! And let me know if you have any questions. 😊