Open MehdiZouitine opened 3 months ago

Hello! Thanks again for this amazing tool.
I'm working with binary data and I'm wondering which kernel I should use.
Have a nice day!

When working with categorical data, a natural kernel to use is the discrete kernel, which is defined as K(x, y) = I(x = y).

For binary data, all shift-invariant kernels might have similar performance, since they all take the form K(x, y) = c1 when x is not equal to y and K(x, x) = c2, where c1 and c2 are constants.
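As a quick illustration of that claim (my own sketch, not code from the KPC package), a Gaussian kernel from kernlab takes exactly two values on binary inputs:

```r
# On binary (0/1) scalars, |x - y|^2 is either 0 or 1, so the Gaussian
# kernel exp(-sigma * |x - y|^2) takes only the two values 1 and
# exp(-sigma) -- i.e., it reduces to a rescaled discrete kernel.
library(kernlab)

x <- matrix(c(0, 1, 0, 1), ncol = 1)
kernelMatrix(rbfdot(sigma = 1), x)  # entries are 1 or exp(-1)
```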
Hello! Thank you very much for the fast answer!
So, for example, which exact kernel do you recommend for X, Z, Y binary? (I'm not very aware of the choice, as a computer scientist, haha.)
Thank you again!
Hello! For categorical variables, including binary variables, I would recommend using the discrete kernel, that is, K(x, y) = 1 if x = y, and K(x, y) = 0 otherwise.
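For concreteness, here is a minimal sketch of such a discrete kernel written as a user-defined kernlab kernel (the name `discrete_kernel` is mine, not from the KPC package; I'm assuming kernels are passed the same way the package's examples pass kernlab kernels such as `rbfdot`):

```r
# Discrete (indicator) kernel: K(x, y) = 1 if x = y, 0 otherwise.
# kernlab supports user-defined kernels: a function of two vectors
# with class "kernel" can be used with kernelMatrix().
library(kernlab)

discrete_kernel <- function(x, y) as.numeric(all(x == y))
class(discrete_kernel) <- "kernel"

# Example: Gram matrix of a few binary observations
Y <- matrix(c(0, 1, 1, 0, 1, 1), ncol = 1)
kernelMatrix(discrete_kernel, Y)
```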
Hello again!
Great, thank you! One last remark: it seems that when X and Z are also binary, the current code does not work, since RANN::nn2 computes the Euclidean distance matrix (see the get_neighbors function, and the README: "This R package provides implementations of two empirical versions of KPC when X, Z are Euclidean"). Are there any other changes needed in the code to make it work for X, Z, and Y all binary (and not just Y binary)?
Thank you very much again.
There are two empirical KPCs described in our paper (https://www.jmlr.org/papers/volume23/21-493/21-493.pdf). One empirical KPC is RKHS-based, requiring the X, Z spaces to be kernel-endowed. The other empirical KPC is graph-based, requiring the X, Z space to be a metric space (so that a k-NN graph can be defined).
For the graph-based empirical KPC, our paper primarily focused on continuous data (X and Z). If X and Z are binary, estimating the population KPC is a much easier task. As we motivated our method at the beginning of Section 3 of the above paper, one may use Equation (10),

$$\frac{1}{n} \sum_{i=1}^n \frac{1}{\#\{j : j \neq i,\ X_j = X_i\}} \sum_{j : j \neq i,\ X_j = X_i} k_Y(Y_i, Y_j),$$

to estimate $\mathbb{E}[\mathbb{E}[k_Y(Y_1,Y_1')]]$ and then to estimate the population KPC. This is more efficient and does not even require computing a k-NN graph.
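A minimal R sketch of this plug-in estimator (function and argument names are mine, not from the KPC package; `ky` stands for the kernel $k_Y$, defaulting to the discrete kernel):

```r
# Plug-in estimate of E[E[k_Y(Y_1, Y_1')]] for discrete/binary X:
# for each i, average k_Y(Y_i, Y_j) over the other j with X_j = X_i.
eq10_estimate <- function(X, Y, ky = function(a, b) as.numeric(a == b)) {
  n <- length(X)
  terms <- sapply(seq_len(n), function(i) {
    js <- setdiff(which(X == X[i]), i)     # {j : j != i, X_j = X_i}
    if (length(js) == 0) return(NA_real_)  # empty denominator, see below
    mean(vapply(js, function(j) ky(Y[i], Y[j]), numeric(1)))
  })
  mean(terms, na.rm = TRUE)
}

# Example with binary X and Y
set.seed(1)
X <- rbinom(50, 1, 0.5)
Y <- ifelse(runif(50) < 0.8, X, 1 - X)  # Y depends on X
eq10_estimate(X, Y)
```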
OK, I got it. But it seems that the denominator in your formula can be equal to 0 (when no other observation shares the value $X_i$).
In such a case, you may reuse the nearest-neighbor idea: for example, if $(X_i, Z_i) = (1, 1)$ is the unique data point with the value $(1, 1)$, then its nearest neighbors would be all $(X_j, Z_j)$ with values $(1, 0)$ or $(0, 1)$.
However, note that $(X_i, Z_i)$ can only take 4 possible values in your binary situation. If we observe a value only once, it is quite likely that our data is insufficient (n is too small), and so we may not be able to obtain an accurate estimate in such a case.
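A hedged sketch of that fallback (again, my own helper, not package code): when no other row matches row $i$ exactly, take all rows at the minimal Hamming distance instead.

```r
# Fallback neighbor set for row i of a binary matrix XZ:
# exact matches if any exist, otherwise all rows at minimal
# Hamming distance from row i.
nn_fallback <- function(i, XZ) {
  d <- apply(XZ, 1, function(row) sum(row != XZ[i, ]))  # Hamming distances
  d[i] <- Inf                                           # exclude i itself
  which(d == min(d))
}

XZ <- rbind(c(1, 1), c(1, 0), c(0, 1), c(0, 0))
nn_fallback(1, XZ)  # rows 2 and 3: both at distance 1 from (1, 1)
```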
I understand that in the simplest case we have pairs $(X_i, Z_i)$ with a single conditioning variable. However, when using the KFOCI test, we may have a scenario involving $(X_i, Z_{i,1}, Z_{i,2}, \ldots)$ and so on. In such cases, it becomes impractical to use Equation (10) directly.
Given this complexity, it seems necessary to revisit the nearest-neighbor (NN) approach. Since we are dealing with binary data, using the Hamming distance would be appropriate, right?
Thank you for your attention.
Yes, I agree that using the Hamming distance seems natural and appropriate for products of binary variables.
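One practical note worth adding (my observation, not from the thread): for 0/1 data, the squared Euclidean distance equals the Hamming distance, so the Euclidean k-NN graph computed by RANN::nn2 already ranks neighbors exactly as Hamming distance would; the main practical difference is the large number of distance ties among binary points.

```r
# For binary vectors a, b: sum((a - b)^2) == sum(a != b), so Euclidean
# and Hamming distances are monotone transforms of each other and
# induce the same k-NN ordering (up to tie-breaking).
library(RANN)

set.seed(1)
XZ <- matrix(rbinom(40, 1, 0.5), ncol = 4)  # 10 binary (X, Z) rows
nn2(XZ, k = 3)$nn.idx                       # Euclidean k-NN = Hamming k-NN here
```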