worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars 23 forks source link

Bug in REaLTabFormer.sample() when relational model generates no data #51

Closed efstathios-chatzikyriakidis closed 9 months ago

efstathios-chatzikyriakidis commented 9 months ago

Hi @avsolatorio!

I have a super urgent bug ticket that I need to solve asap and I have solve it here. I wonder if you could just push it and create a new PyPI release.

It is related to the library in the def _processes_sample() method.

Bug:

I have trained a relational model and some times the data are not so many to learn the relationship and the sampling of the relational model generates an empty synth_sample and as a consequence the following line if synth_sample[col].iloc[0].startswith(col): fails with an exception because it assumes that at least one line at iloc 0 will always exist: https://github.com/worldbank/REaLTabFormer/blob/main/src/realtabformer/rtf_sampler.py#L449

The function def _processes_sample() is used also when sampling a tabular model and probably this can happen also in this tabular model sampling. Regardless the case (tabular/relational) 2-3 lines below the bug line (L449) the function already (and wisely) has some logic to check if the sampled data synth_df is empty and if that's the case it throws the SampleEmptyError exception.

Resolution:

So, I think we can add also a similar check above the line https://github.com/worldbank/REaLTabFormer/blob/main/src/realtabformer/rtf_sampler.py#L443 checking synth_sample dataframe instead, something like:

if synth_sample.empty:
    # Handle this exception in the sampling function.
    raise SampleEmptyError(in_size=len(sample_outputs))

Then in my application I can catch the specific exception:

from realtabformer.rtf_exceptions import SampleEmptyError

try:
   // run some REalTabFormer sampling code
except SampleEmptyError as e:
  // handle the specific exception

What do you think?

Can you do the fix and build a new PyPI package?

Thanks!

avsolatorio commented 9 months ago

Hello @efstathios-chatzikyriakidis, thank you for sharing this issue. I have made the changes and also published a release with that patch. Please try installing the new version.

In the future, I recommend that you submit a pull request. This is entirely your solution, and I want people like you who find issues and implement solutions to be part of the official contributors! 🤗

efstathios-chatzikyriakidis commented 9 months ago

Thank you @avsolatorio!

I appreciate your effort and the work you did for resolving this. Yes, in the future for any issue if I have a resolution I will send a PR.

Regarding the commit changes I see that also FrozenSeq2SeqTrainer is replaced with Seq2SeqTrainer. Is this something that could cause any problem or what was the intuition behind it, does it optimize anything?

Thanks!

avsolatorio commented 9 months ago

Regarding the commit changes I see that also FrozenSeq2SeqTrainer is replaced with Seq2SeqTrainer. Is this something that could cause any problem or what was the intuition behind it, does it optimize anything?

No, this should work the same. I needed to implement the FrozenSeq2SeqTrainer before because the Seq2SeqTrainer on the transformers library had a bug. I reported it here: https://github.com/huggingface/transformers/issues/21182.

It was already fixed, so the FrozenSeq2SeqTrainer is no longer needed.