yaozhong / SCLSC

Supervised contrastive learning for single-cell annotation

What to use for n_rep_mat? #1

Open KSimi7 opened 8 months ago

KSimi7 commented 8 months ago

Hi,

Thank you for sharing your code; it is very helpful. I am trying to use a similar approach for a different kind of label transfer task. I understand that model training involves creating a representative matrix that holds the mean gene expression for each label. I noticed that n_rep_mat determines how many times the mean gene expression matrix is repeated in the representative matrix. Is that correct? If not, what exactly is n_rep_mat, and can you recommend a good value for this parameter?

Thanks, Harsimran Kaur

yusri-dh commented 8 months ago

Thanks for the question.

Yes, you are right. 'n_rep_mat' refers to the number of matrix replicates, with each replicate corresponding to an epoch. In SCLSC, we need cell instances and cell type instances as input to the MLP. The cell type input is derived from the mean gene expression of the cells assigned to that type. In the older version of our implementation, we provided a sampling ratio option for sampling cells of each cell type on each epoch. The average gene expression of these sampled cells was used as the cell type representation, so the cell type representation differed from epoch to epoch. The 'n_rep_mat' variable was originally used to record the cells sampled for calculating the cell type representations in each epoch for visualization (i.e. 'n_rep_mat' = 'n_epoch'). The schematic figure is as follows:

[Figure: repr_tensor]
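The per-epoch sampling described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the SCLSC implementation: the helper name `sampled_type_means`, its arguments, and the layout of the returned tensor are assumptions made for this example.

```python
import numpy as np

def sampled_type_means(X, labels, sampling_ratio, n_rep_mat, seed=0):
    """Hypothetical helper: one cell-type mean matrix per epoch.

    X      : (n_cells, n_genes) expression matrix
    labels : (n_cells,) cell-type label for each cell
    Returns (types, reps) with reps.shape == (n_rep_mat, n_types, n_genes).
    """
    rng = np.random.default_rng(seed)
    types = np.unique(labels)
    reps = np.empty((n_rep_mat, len(types), X.shape[1]), dtype=float)
    for epoch in range(n_rep_mat):
        for k, t in enumerate(types):
            idx = np.flatnonzero(labels == t)
            # Draw a fraction of this type's cells without replacement.
            n_draw = max(1, int(round(sampling_ratio * idx.size)))
            chosen = rng.choice(idx, size=n_draw, replace=False)
            reps[epoch, k] = X[chosen].mean(axis=0)
    return types, reps

# Toy data (assumption): 3 cell types, 100 cells each, 5 genes.
toy_rng = np.random.default_rng(1)
X = toy_rng.poisson(2.0, size=(300, 5)).astype(float)
labels = np.repeat(np.array(["A", "B", "C"]), 100)

types, reps = sampled_type_means(X, labels, sampling_ratio=0.2, n_rep_mat=4)
print(reps.shape)  # (4, 3, 5): one (n_types, n_genes) matrix per epoch
```

With `sampling_ratio=1.0` every replicate uses all cells of a type, so all `n_rep_mat` matrices coincide up to floating-point rounding.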

In the results reported in our paper, the cell type representation is calculated as the mean gene expression of all cells belonging to the respective cell type (i.e. not from random sampling), so the cell type representation is identical for every epoch. Therefore, 'n_rep_mat' is not needed when the sampling rate is 100% or when no visualization is required. Depending on the data scale and computational resources, a lower sampling rate can also work well; based on our empirical evaluation, we suggest sampling at least 100 cells per cell type.

The current version is correct and can be used to replicate our study. However, because 'n_rep_mat' is redundant and replicating an identical matrix just wastes memory, we will update this in a new version.
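To illustrate why the replication is redundant when all cells are used, here is a small sketch on toy data. The helper `full_type_means` and the array shapes are assumptions for this example, not the repository's code. It computes the cell type means once, then compares a physically repeated copy with a broadcast view that shares memory with the original matrix.

```python
import numpy as np

def full_type_means(X, labels):
    """Hypothetical helper: mean expression per cell type over all cells."""
    types = np.unique(labels)
    means = np.stack([X[labels == t].mean(axis=0) for t in types])
    return types, means  # means.shape == (n_types, n_genes)

# Toy data (assumption): 3 cell types, 100 cells each, 5 genes.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 5)).astype(float)
labels = np.repeat(np.array(["A", "B", "C"]), 100)
types, means = full_type_means(X, labels)

n_rep_mat = 10
# Repeating the identical matrix multiplies memory use by n_rep_mat.
tiled = np.tile(means, (n_rep_mat, 1, 1))
# A read-only broadcast view has the same shape but allocates no new data.
view = np.broadcast_to(means, (n_rep_mat,) + means.shape)

print(tiled.shape, tiled.nbytes // means.nbytes)  # (10, 3, 5) 10
print(np.shares_memory(view, means))              # True
```

If code downstream only reads the replicate for the current epoch, indexing a single stored matrix (or a view like the one above) gives the same values as the repeated copy.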

KSimi7 commented 8 months ago

Thank you for the swift response. I have one more question: did you notice any difference in your results when the representative matrix was created by random sampling compared to taking the average of all cells?

yusri-dh commented 8 months ago

With random sampling, the projection is not stable and is sometimes odd, especially when we take only small samples. The projection is better with a large sample, but taking a large sample on every epoch increases time and memory consumption. So we decided to compute the mean just once. Using the mean gene expression of each cell type produces more stable results.
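The effect of sample size on the stability of a sampled mean can be illustrated with a toy simulation. It uses synthetic Poisson counts and a hypothetical helper `mean_spread`; it is not a reproduction of the SCLSC experiments and says nothing about the projections themselves, only about how much the sampled cell type mean varies from epoch to epoch.

```python
import numpy as np

def mean_spread(cells, n_draw, n_epochs, seed=0):
    """Hypothetical helper: epoch-to-epoch variability of a sampled mean.

    cells : (n_cells, n_genes) expression of a single cell type
    Returns the per-gene standard deviation of the sampled mean across
    epochs, averaged over genes.
    """
    rng = np.random.default_rng(seed)
    means = np.stack([
        cells[rng.choice(cells.shape[0], size=n_draw, replace=False)].mean(axis=0)
        for _ in range(n_epochs)
    ])
    return float(means.std(axis=0).mean())

# Synthetic cell type (assumption): 2000 cells, 50 genes.
rng = np.random.default_rng(42)
cells = rng.poisson(2.0, size=(2000, 50)).astype(float)

small = mean_spread(cells, n_draw=10, n_epochs=200)
large = mean_spread(cells, n_draw=500, n_epochs=200)
print(f"spread with 10 cells:  {small:.3f}")
print(f"spread with 500 cells: {large:.3f}")
```

On data like this, the spread shrinks as more cells are drawn, which is consistent with small samples giving a noisier cell type representation on each epoch.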