worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Parallelization of inference/generation in both tabular and child models. #63

Open efstathios-chatzikyriakidis opened 6 months ago

efstathios-chatzikyriakidis commented 6 months ago

Hi @avsolatorio,

I want to generate hundreds of thousands of rows using both the tabular and relational models. However, generation is rather slow because of the auto-regressive nature of transformers. Currently I generate, e.g., 100k rows with the tabular model and then let the child model generate the child rows.

1) Is there anything I can do to reduce generation time for either the tabular or the child model? We touched on this briefly in the past in https://github.com/worldbank/REaLTabFormer/issues/15.

2) From my understanding, multi-GPU cannot be used for inference the way it can for training. However, instead of running a single parent_model.sample(n_samples=100_000), can I run multiple parent_model.sample() calls in parallel, each on a different CUDA device? For example, split the job into 10 x 10_000 and run 10 rtf_model.sample() calls, each on a different GPU card, or at least draw from a pool of GPU cards.

3) Can I use a single GPU card and run multiple parent_model.sample() calls in parallel from multiple threads? This will probably fail because the GPU will run out of memory, right?

4) Is there a special argument I need to pass to parent_model.sample() to support that? Is it the device argument, which would specify the CUDA device for each batch of generated rows?

5) How can I partition and batch the rows of a child model? Currently the relationship cardinality is learned by the model, so I can't specify it. I can estimate it outside the library, e.g. the number of orders a customer could have or the number of products an order could have. Maybe I can build such a model. However, is it possible to tell the child model how many child rows to create for each parent? If so, I could pass in that number from my external estimate.
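
To make questions 2 and 4 concrete, here is a minimal sketch of the partitioning I have in mind. It is not library code: `plan_chunks`, `sample_in_parallel`, and the `sample_fn` callback are hypothetical names of my own. I am assuming `sample_fn` would wrap something like `parent_model.sample(n_samples=n, device=device)`, with a separately loaded copy of the model per device. Whether `sample()` can safely be used this way is exactly what I am asking.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence, Tuple


def plan_chunks(
    n_samples: int, devices: Sequence[str], chunk_size: int
) -> List[Tuple[str, int]]:
    """Split n_samples into chunks of at most chunk_size rows,
    assigning the chunks to devices in round-robin order."""
    if n_samples < 0 or chunk_size <= 0 or not devices:
        raise ValueError("need n_samples >= 0, chunk_size > 0, and a device")
    plan = []
    remaining = n_samples
    i = 0
    while remaining > 0:
        size = min(chunk_size, remaining)
        plan.append((devices[i % len(devices)], size))
        remaining -= size
        i += 1
    return plan


def sample_in_parallel(
    sample_fn: Callable[[int, str], Any],
    n_samples: int,
    devices: Sequence[str],
    chunk_size: int,
) -> List[Any]:
    """Run sample_fn(size, device) for every planned chunk.

    One worker thread is used per device, and the chunks assigned to a
    device run one after another, so at most one generation call is in
    flight per device at any time.
    """
    # Group the planned chunk sizes by the device they were assigned to.
    by_device = {device: [] for device in devices}
    for device, size in plan_chunks(n_samples, devices, chunk_size):
        by_device[device].append(size)

    def run_device(device: str) -> List[Any]:
        # Hypothetical: sample_fn would call the model copy that lives
        # on `device`; here it is just an injected callback.
        return [sample_fn(size, device) for size in by_device[device]]

    with ThreadPoolExecutor(max_workers=len(by_device)) as pool:
        per_device = list(pool.map(run_device, by_device))

    # Flatten to one result per chunk, grouped by device.
    return [chunk for chunks in per_device for chunk in chunks]
```

With ten devices and `chunk_size=10_000`, a 100k request would become one chunk per device. The per-chunk results would then be concatenated into a single table (for DataFrames, with `pandas.concat`).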

Thanks!