worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Possible Improvements for CPU inference #49

Open australDream opened 10 months ago

australDream commented 10 months ago

Hi, I am currently trying to improve inference time. For a generation batch size of 512 samples, inference on the GPU takes twice as long as on the CPU. Any idea why?

```python
child_samples = model.sample(
    n_samples=512,
    input_unique_ids=query[self.join_on],
    input_df=query.drop(self.join_on, axis=1),
    gen_batch=512,
    device=self.device,
)
```

Note that the model is relational and no frozen encoder is provided. Also, if there are any general tips for CPU inference with REaLTabFormer, I am eager to learn. Thanks for the neat repo. Cheers
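For reference, a few settings that commonly help CPU inference with PyTorch-based generators like REaLTabFormer. This is a hedged sketch, not the library's documented API: `tune_cpu_inference` is a hypothetical helper, and the commented `model.sample(...)` usage assumes a fitted model and the `query`/`join_on` objects from the snippet above. Thread counts and `gen_batch` values are illustrative and should be tuned per machine.

```python
import os
from typing import Optional

import torch


def tune_cpu_inference(num_threads: Optional[int] = None) -> int:
    """Pin PyTorch's intra-op thread pool for CPU inference.

    Oversubscription (more threads than physical cores) often slows
    CPU inference down, so capping the pool at the core count is a
    common first step. Returns the thread count actually in effect.
    """
    n = num_threads or os.cpu_count() or 1
    torch.set_num_threads(n)
    return torch.get_num_threads()


# Illustrative usage (model, query, and join_on as in the snippet above):
#
# tune_cpu_inference()
# with torch.inference_mode():  # drops autograd bookkeeping during generation
#     child_samples = model.sample(
#         n_samples=512,
#         input_unique_ids=query[join_on],
#         input_df=query.drop(join_on, axis=1),
#         gen_batch=64,   # smaller batches can be faster on CPU than one
#         device="cpu",   # large batch; worth benchmarking a few sizes
#     )
```

Wrapping generation in `torch.inference_mode()` and experimenting with smaller `gen_batch` values are the two cheapest things to try; beyond that, CPU-oriented export paths (e.g. ONNX Runtime) are an option but would require changes outside `model.sample`.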