yandex-research / rtdl-num-embeddings

(NeurIPS 2022) On Embeddings for Numerical Features in Tabular Deep Learning
https://arxiv.org/abs/2203.05556
MIT License

What if we use "continuous one-hot" or "bi-PLE" as an alternative to PLE? #7

Closed · xumwen closed this issue 1 year ago

xumwen commented 1 year ago

Considering that the motivation of the one-hot-like PLE approach is to improve model capability, there are two questions I don't quite understand:

1. "Continuous one-hot" means replacing the 1 of the hit bin with a continuous value while keeping the other bins at 0. In PLE, the 1s of the left bins seem to simply be added to the hit bin's embedding by the MLP, which looks like Ax + B (where x is the embedding of the hit bin, A is the location of the value within that interval, and B is the sum of the left bins' embeddings).
2. If the 1s of the left bins are there so that low values get trained more often, could we encode the feature with "bi-PLE", i.e. concatenate the left-to-right PLE with a right-to-left PLE, so that high values also get trained more often?

(Both variants are sketched in code at the end of this comment.)

I'd be glad to hear your thoughts. Thanks.
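To make the two proposals concrete, here is a rough NumPy sketch (a toy illustration, not code from this repository: the helper names are made up, and the boundary-bin handling may differ slightly from the exact PLE definition):

```python
import numpy as np

def ple(x, bins):
    """Standard PLE for a scalar x and bin edges b_0 < ... < b_T:
    1 for bins fully below x, 0 for bins above x, and the fractional
    position of x inside the bin it falls into."""
    b = np.asarray(bins, dtype=float)
    left, right = b[:-1], b[1:]
    return np.clip((x - left) / (right - left), 0.0, 1.0)

def continuous_one_hot(x, bins):
    """Proposed variant: keep only the fractional value of the hit bin
    and zero out the 1s of the left bins."""
    e = ple(x, bins)
    e[e >= 1.0] = 0.0
    return e

def bi_ple(x, bins):
    """Proposed "bi-PLE": concatenate the left-to-right PLE with a
    right-to-left PLE (the 1s accumulate from the high end instead)."""
    b = np.asarray(bins, dtype=float)
    backward = ple(-x, -b[::-1])[::-1]
    return np.concatenate([ple(x, bins), backward])

bins = [0.0, 1.0, 2.0, 3.0, 4.0]
print(ple(2.5, bins))                 # [1.  1.  0.5 0. ]
print(continuous_one_hot(2.5, bins))  # [0.  0.  0.5 0. ]
print(bi_ple(2.5, bins))              # [1.  1.  0.5 0.  0.  0.  0.5 1. ]
```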

Yura52 commented 1 year ago
1. I don't understand the question. IIRC, the whole PLE operation can be rewritten using only traditional linear layers and ReLU activations, with the original scalar x as the input (see the sketch after this list).
2. I tested "bi-PLE" while working on the project, with exactly the same motivation :) Interestingly, I did not see any difference in the results, but I did not invest much time in this idea and did not run the full "tune-evaluate" cycle on all datasets, so my observations (and implementation) may be unreliable. If you run any experiments on this, it would be interesting to see the results.
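Regarding point 1, here is a toy sketch of that rewriting (assuming a fully clipped encoding; the boundary bins of the exact PLE definition may be handled slightly differently). Each component is clamp((x - b_{t-1}) / (b_t - b_{t-1}), 0, 1), and clamp(z, 0, 1) = relu(z) - relu(z - 1), so the whole encoding is an affine map of the scalar x followed by ReLUs:

```python
import torch

def ple_via_linear_relu(x, bins):
    """PLE written as an affine map of the scalar x followed by ReLUs:
    clamp(z, 0, 1) == relu(z) - relu(z - 1), applied per bin."""
    b = torch.as_tensor(bins, dtype=torch.float32)
    left, width = b[:-1], b[1:] - b[:-1]
    z = (x.unsqueeze(-1) - left) / width        # one affine unit per bin
    return torch.relu(z) - torch.relu(z - 1.0)  # == torch.clamp(z, 0, 1)

x = torch.tensor([2.5, 0.2])
print(ple_via_linear_relu(x, [0.0, 1.0, 2.0, 3.0, 4.0]))
# tensor([[1.0000, 1.0000, 0.5000, 0.0000],
#         [0.2000, 0.0000, 0.0000, 0.0000]])
```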

xumwen commented 1 year ago

Thanks for your reply. As an example for question 1: what if we replace the encoding [1, 1, (x - b_{t-1}) / (b_t - b_{t-1}), 0] with [0, 0, (x - b_{t-1}) / (b_t - b_{t-1}), 0]?

Yura52 commented 1 year ago

The informal answer is "this encoding does not preserve the notion of order".

Formally, the problem is that without the 1s in the left bins, you can have two embeddings that represent very different values but are very close in terms of L2 distance (eps = 1e-9):

[eps, 0, ..., 0] vs [0, ..., 0, eps]

Similarly, you can have very close values with very different embeddings:

[0, ..., 1 - eps, 0, 0, ..., 0] vs [0, ..., 0, eps, 0, ..., 0]
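A quick numeric check of these two counterexamples (a toy 4-bin instantiation of the vectors above; the variable names are made up):

```python
import numpy as np

eps = 1e-9

# Very different values, nearly identical "continuous one-hot" encodings.
x_near_min = np.array([eps, 0.0, 0.0, 0.0])  # value just above the lowest bin edge
x_near_max = np.array([0.0, 0.0, 0.0, eps])  # value just entering the highest bin
print(np.linalg.norm(x_near_min - x_near_max))  # ~1.4e-9

# Nearly identical values (just below vs. just above a bin edge), very different encodings.
x_below_edge = np.array([0.0, 1.0 - eps, 0.0, 0.0])
x_above_edge = np.array([0.0, 0.0, eps, 0.0])
print(np.linalg.norm(x_below_edge - x_above_edge))  # ~1.0
```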

xumwen commented 1 year ago

I see.

Thanks!