worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Is there a rule of thumb for NUM_BOOTSTRAP? #13

Closed by echatzikyriakidis 1 year ago

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

In my experiments I use the default value (500) for the number of bootstrap rounds when estimating the sensitivity threshold. From the implementation, this step looks very CPU-bound and uses multiple cores when available.

My environment has 8 CPU cores, and on large tables the estimation usually takes 1-2 hours to complete before training starts. During all that time the GPU in my runtime sits idle, waiting for the sensitivity threshold estimation to finish. (Colab also sometimes disconnects the runtime because it notices that the runtime is mainly using the CPU.)

I know that setting this to a smaller value will make it run faster, but I wonder whether there is a rule of thumb or whether it is just a matter of trial and error. I understand that estimating this threshold correctly is important, since it is used for early stopping during training.

Thanks!

avsolatorio commented 1 year ago

Hello @echatzikyriakidis, 100 can be a reasonable trade-off. A higher number of bootstrap rounds helps produce a more stable threshold, so keep in mind that lowering it makes the estimate noisier.
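To get a feel for that trade-off, here is a minimal sketch using a generic bootstrap quantile estimate on synthetic data. It is not REaLTabFormer's sensitivity computation; the functions `bootstrap_threshold` and `threshold_spread`, the 0.95 quantile, and the Gaussian sample are all assumptions made for illustration. It repeats the estimate with different seeds and reports how much the resulting threshold varies for 20, 100, and 500 rounds.

```python
import random
import statistics


def bootstrap_threshold(values, num_bootstrap, quantile=0.95, seed=0):
    """Generic bootstrap estimate of a quantile threshold (illustrative only)."""
    rng = random.Random(seed)
    n = len(values)
    # Each round resamples with replacement and records the resample mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(num_bootstrap)
    )
    # Use the empirical quantile of the bootstrap means as the threshold.
    index = min(int(quantile * num_bootstrap), num_bootstrap - 1)
    return means[index]


def threshold_spread(values, num_bootstrap, repeats=20):
    """Standard deviation of the threshold across independently seeded repeats."""
    estimates = [
        bootstrap_threshold(values, num_bootstrap, seed=s) for s in range(repeats)
    ]
    return statistics.stdev(estimates)


# Synthetic stand-in data; a real run would use statistics from the table.
data_rng = random.Random(42)
values = [data_rng.gauss(0.0, 1.0) for _ in range(100)]

for rounds in (20, 100, 500):
    print(rounds, round(threshold_spread(values, rounds), 4))
```

Running something like this on a small subsample of your own data may help you judge whether 100 rounds is stable enough before committing to a long run.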

One potential solution is to allow the sensitivity threshold to be pre-computed outside the fit function. When fitting with the data, one could then specify a file containing the pre-computed value. The fit function must, however, first check that the parameters used in the pre-computation are consistent with the parameters passed to it.

With this implemented, you could run the pre-computation on an instance without an accelerator, save the result, and then switch to a Colab instance with a GPU.
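A rough sketch of what such a cache could look like, using only the standard library. This is not an existing REaLTabFormer feature; `save_threshold`, `load_threshold`, the JSON layout, the parameter names, and the threshold value are all hypothetical. The point is only that the saved file carries the parameters alongside the value, so a mismatch can be rejected at load time.

```python
import json
import tempfile
from pathlib import Path


def save_threshold(path, threshold, params):
    """Store the threshold together with the parameters that produced it."""
    payload = {"threshold": threshold, "params": params}
    Path(path).write_text(json.dumps(payload, sort_keys=True))


def load_threshold(path, params):
    """Return the cached threshold, refusing it if the parameters differ."""
    payload = json.loads(Path(path).read_text())
    if payload["params"] != params:
        raise ValueError(
            f"cached parameters {payload['params']} do not match {params}"
        )
    return payload["threshold"]


# Hypothetical parameter names and threshold value, for illustration only.
params = {"num_bootstrap": 100, "quantile": 0.95}
cache_file = Path(tempfile.mkdtemp()) / "sensitivity_threshold.json"

# On the CPU-only instance: compute once (value made up here) and save.
save_threshold(cache_file, 0.0123, params)

# On the GPU instance: reuse the value only if the parameters still match.
print(load_threshold(cache_file, params))
```

A fuller version would probably also record a fingerprint of the training data, since a threshold computed on one table should not be reused for another.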

If you're open to contributing to this feature, that would be very welcome! See: https://github.com/avsolatorio/REaLTabFormer/issues/16

echatzikyriakidis commented 1 year ago

Hi @avsolatorio!

I have managed to work around the Colab disconnects by buying Colab Pro+, which never disconnects.