Open MehdiZouitine opened 3 months ago

Hello! Thanks again for this amazing tool.
I'm working with binary data and I'm wondering which kernel I should use.
Have a nice day!

When working with categorical data, a natural kernel to use is the discrete kernel, which is defined as K(x, y) = I(x = y).

For binary data, all shift-invariant kernels might have similar performance, since they all take the form K(x, y) = c1 when x is not equal to y and K(x, x) = c2, where c1 and c2 are constants.
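As a quick illustration of that claim (my own sketch, not code from the KPC package), a Gaussian kernel from kernlab takes exactly two values on binary inputs:

```r
# On binary (0/1) scalars, |x - y|^2 is either 0 or 1, so the Gaussian
# kernel exp(-sigma * |x - y|^2) takes only the two values 1 and
# exp(-sigma) -- i.e., it reduces to a rescaled discrete kernel.
library(kernlab)

x <- matrix(c(0, 1, 0, 1), ncol = 1)
kernelMatrix(rbfdot(sigma = 1), x)  # entries are 1 or exp(-1)
```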
Hello! Thank you very much for the fast answer!
So, for example, which exact kernel do you recommend for X, Z, Y binary? (I'm not very aware of the choice, as a computer scientist, haha.)
Thank you again!
Hello! For categorical variables, including binary variables, I would recommend using the discrete kernel, that is, K(x, y) = 1 if x = y, and K(x, y) = 0 otherwise.
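For concreteness, here is a minimal sketch of such a discrete kernel written as a user-defined kernlab kernel (the name `discrete_kernel` is mine, not from the KPC package; I'm assuming kernels are passed the same way the package's examples pass kernlab kernels such as `rbfdot`):

```r
# Discrete (indicator) kernel: K(x, y) = 1 if x = y, 0 otherwise.
# kernlab supports user-defined kernels: a function of two vectors
# with class "kernel" can be used with kernelMatrix().
library(kernlab)

discrete_kernel <- function(x, y) as.numeric(all(x == y))
class(discrete_kernel) <- "kernel"

# Example: Gram matrix of a few binary observations
Y <- matrix(c(0, 1, 1, 0, 1, 1), ncol = 1)
kernelMatrix(discrete_kernel, Y)
```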
Hello again!
Great, thank you! One last remark: it seems that when X and Z are also binary, the current code does not work, since RANN::nn2 computes the Euclidean distance matrix (see the get_neighbors function, and the README: "This R package provides implementations of two empirical versions of KPC when X, Z are Euclidean"). Are there any other changes needed in the code to make it work for X, Z, and Y all binary (and not just Y binary)?
Thank you very much again.
There are two empirical KPCs described in our paper (https://www.jmlr.org/papers/volume23/21-493/21-493.pdf). One empirical KPC is RKHS-based, requiring the X, Z spaces to be kernel-endowed. The other empirical KPC is graph-based, requiring the X, Z space to be a metric space (so that a k-NN graph can be defined).
For the graph-based empirical KPC, our paper primarily focused on continuous data (X and Z). If X and Z are binary, estimating the population KPC is a much easier task. As we motivated our method at the beginning of Section 3 of the above paper, one may use Equation (10),

$$\frac{1}{n} \sum_{i=1}^n \frac{1}{\#\{j : j \neq i,\ X_j = X_i\}} \sum_{j : j \neq i,\ X_j = X_i} k_Y(Y_i, Y_j),$$

to estimate $\mathbb{E}[\mathbb{E}[k_Y(Y_1,Y_1')]]$ and then to estimate the population KPC. This is more efficient and does not even require computing a k-NN graph.
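A minimal R sketch of this plug-in estimator (function and argument names are mine, not from the KPC package; `ky` stands for the kernel $k_Y$, defaulting to the discrete kernel):

```r
# Plug-in estimate of E[E[k_Y(Y_1, Y_1')]] for discrete/binary X:
# for each i, average k_Y(Y_i, Y_j) over the other j with X_j = X_i.
eq10_estimate <- function(X, Y, ky = function(a, b) as.numeric(a == b)) {
  n <- length(X)
  terms <- sapply(seq_len(n), function(i) {
    js <- setdiff(which(X == X[i]), i)     # {j : j != i, X_j = X_i}
    if (length(js) == 0) return(NA_real_)  # empty denominator, see below
    mean(vapply(js, function(j) ky(Y[i], Y[j]), numeric(1)))
  })
  mean(terms, na.rm = TRUE)
}

# Example with binary X and Y
set.seed(1)
X <- rbinom(50, 1, 0.5)
Y <- ifelse(runif(50) < 0.8, X, 1 - X)  # Y depends on X
eq10_estimate(X, Y)
```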
OK, I got it. But it seems that the denominator in your formula can be equal to 0 (when no other observation shares the value $X_i$).
In such a case, you may reuse the nearest-neighbor idea: for example, if $(X_i, Z_i) = (1, 1)$ is the unique data point with the value $(1, 1)$, then its nearest neighbors would be all $(X_j, Z_j)$ with values $(1, 0)$ or $(0, 1)$.
However, note that $(X_i, Z_i)$ can only take 4 possible values in your binary situation. If we observe a value only once, it is quite likely that our data is insufficient (n is too small), and so we may not be able to obtain an accurate estimate in such a case.
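A hedged sketch of that fallback (again, my own helper, not package code): when no other row matches row $i$ exactly, take all rows at the minimal Hamming distance instead.

```r
# Fallback neighbor set for row i of a binary matrix XZ:
# exact matches if any exist, otherwise all rows at minimal
# Hamming distance from row i.
nn_fallback <- function(i, XZ) {
  d <- apply(XZ, 1, function(row) sum(row != XZ[i, ]))  # Hamming distances
  d[i] <- Inf                                           # exclude i itself
  which(d == min(d))
}

XZ <- rbind(c(1, 1), c(1, 0), c(0, 1), c(0, 0))
nn_fallback(1, XZ)  # rows 2 and 3: both at distance 1 from (1, 1)
```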
I understand that in the simplest case we have pairs $(X_i, Z_i)$ with a single conditioning variable. However, when using the KFOCI test, we may have a scenario involving $(X_i, Z_{i,1}, Z_{i,2}, \ldots)$ and so on. In such cases, it becomes impractical to use Equation (10) directly.
Given this complexity, it seems necessary to revisit the nearest-neighbor (NN) approach. Since we are dealing with binary data, using the Hamming distance would be appropriate, right?
Thank you for your attention.
Yes, I agree that using the Hamming distance seems natural and appropriate for products of binary variables.
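One practical note worth adding (my observation, not from the thread): for 0/1 data, the squared Euclidean distance equals the Hamming distance, so the Euclidean k-NN graph computed by RANN::nn2 already ranks neighbors exactly as Hamming distance would; the main practical difference is the large number of distance ties among binary points.

```r
# For binary vectors a, b: sum((a - b)^2) == sum(a != b), so Euclidean
# and Hamming distances are monotone transforms of each other and
# induce the same k-NN ordering (up to tie-breaking).
library(RANN)

set.seed(1)
XZ <- matrix(rbinom(40, 1, 0.5), ncol = 4)  # 10 binary (X, Z) rows
nn2(XZ, k = 3)$nn.idx                       # Euclidean k-NN = Hamming k-NN here
```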