ziatdinovmax / gpax

Gaussian Processes for Experimental Sciences
http://gpax.rtfd.io
MIT License

Feature: allow the user to scale X and y before fitting, predicting, etc. #40

Closed: matthewcarbone closed this issue 1 year ago

matthewcarbone commented 1 year ago

Scaling features and targets to "reasonable" values (usually between -1 and 1) is a pre-processing step that experienced GP users know to do. However, it would be nice to have this functionality built into the GPs themselves.

In other words, if a user provided an X with features outside the range -1 to 1, it could optionally be scaled before fitting; those scaling parameters could then be saved and reapplied during prediction.

Similarly, we could consider scaling the outputs as well. @ziatdinovmax what do you think?
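A minimal sketch of what this could look like. The `MinMaxScaler` class here is a hypothetical illustration of the proposed behavior, not an existing gpax API:

```python
import numpy as np

class MinMaxScaler:
    """Hypothetical helper: map features to [-1, 1], saving the
    parameters so the same transform can be reapplied at prediction time."""

    def fit(self, X):
        self.lo = X.min(axis=0)
        self.hi = X.max(axis=0)
        return self

    def transform(self, X):
        # map [lo, hi] -> [-1, 1] per feature
        return 2.0 * (X - self.lo) / (self.hi - self.lo) - 1.0

    def inverse_transform(self, Xs):
        # undo the transform, e.g. for new acquisition points
        return (Xs + 1.0) / 2.0 * (self.hi - self.lo) + self.lo

X = np.array([[0.0], [5.0], [10.0]])
scaler = MinMaxScaler().fit(X)
Xs = scaler.transform(X)  # now in [-1, 1]
```

The GP would fit on `Xs`, and at prediction time the same saved `scaler` would transform any new query points before evaluating the model.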

ziatdinovmax commented 1 year ago

Thanks for the suggestion! However, I would generally prefer to limit the number of operations "hidden" from users and keep the data processing part separate from the models part. Given that the target audience is PhD scientists, I believe they are generally capable of doing the feature scaling themselves following the tutorials. In a pre-ChatGPT era, I would add a few utility functions for that, but now a user can just ask ChatGPT to write a feature scaling code :-)

matthewcarbone commented 1 year ago

@ziatdinovmax I'm not sure I agree about the ChatGPT part, but I definitely respect that you feel it's out of scope. Keep in mind, though, that I just had an interaction with a PhD-level scientist who was confused by this (they assumed the code would handle scaling for them).

Would you concede that at least adding a warning if the user does not scale their data is appropriate?
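A lightweight version of such a warning could look like the following. This is an illustrative sketch, not an existing gpax function; the name `warn_if_unscaled` and the tolerance are assumptions:

```python
import warnings
import numpy as np

def warn_if_unscaled(X, tol=1.5):
    """Hypothetical check: warn when inputs fall well outside [-1, 1],
    which usually means the user forgot to scale their features."""
    if np.abs(X).max() > tol:
        warnings.warn(
            "Input features fall well outside [-1, 1]; consider scaling X "
            "before fitting, since GP hyperparameter priors often assume "
            "roughly unit-scale inputs."
        )
```

Called at the top of `fit`, this costs nothing for well-scaled data and gives a clear nudge otherwise.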

matthewcarbone commented 1 year ago

@ziatdinovmax I know you already said no, but something else for you to consider on this. Maybe I can change your mind 😁

Given that the target audience is PhD scientists, I believe they are generally capable of doing the feature scaling themselves following the tutorials.

Certainly true. However, how to scale a GP's output back is not a priori obvious to the average scientist. For example, unscaling the mean is obvious, but unscaling the variance is not. It's a simple transformation, but it's perhaps not something the average experimentalist would think of. I feel like a normalizer built in as an option, off by default, couldn't hurt, and it would make the experience more seamless for the user.
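For concreteness, the back-transformation in question: if y was standardized as `y_s = (y - mu_y) / sigma_y` before fitting, the posterior mean shifts back linearly while the posterior variance must be multiplied by `sigma_y` squared. A sketch with illustrative numbers (the scaling parameters and posterior values are made up):

```python
import numpy as np

mu_y, sigma_y = 3.0, 2.0          # saved target-scaling parameters (illustrative)
mean_s = np.array([0.5, -0.25])   # GP posterior mean in the scaled space
var_s = np.array([0.04, 0.09])    # GP posterior variance in the scaled space

mean = mean_s * sigma_y + mu_y    # linear back-transform of the mean
var = var_s * sigma_y**2          # variance scales by sigma_y squared
std = np.sqrt(var)                # standard deviation scales by sigma_y
```

Unscaling only the mean and forgetting the `sigma_y**2` factor on the variance silently produces wrong uncertainty estimates, which is exactly the trap being described.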

I realize some of my suggestions might seem a bit nitpick-y here, but keep in mind that the annoyance of doing all of this over and over in outer loops (e.g., during active-learning campaigns) will eventually compound and become a headache.