zhengrongbin / MEBOCOST

A python-based package and software to predict metabolite mediated cell-cell communications by single-cell RNA-seq data
BSD 3-Clause "New" or "Revised" License

Proportions and Normalisation #4

Closed dbdimitrov closed 2 years ago

dbdimitrov commented 2 years ago

Hi Rongbin,

First of all, congrats on your great work on MEBOCOST! I really like the idea, and it's certainly something that can be very useful for the field.

I do, however, have a few questions.

First, I noticed that metabolites with negative 'presence'/scores would still be counted as present in mebocost - see HMDB0000112 from your example. E.g. you have scores that are negative, but these are still counted as if the metabolite is present in that cell type. My intuition would be that one should count only the ones with positive scores.

Second, I noticed that you mentioned that you normalize the scores by the means of the permutations, and I assumed that this is done using a z-score-like normalization, i.e. (x - mean(perms)) / sd(perms), where x is the score for a metabolite/sensor pair. However, I was not able to reproduce the scores that you return using this approach.

Third, do you calculate the p-values before or after normalization?

Your thoughts on these points would be highly appreciated!

Daniel

zhengrongbin commented 2 years ago

Hi Daniel,

Thank you for trying our tool! Here are answers to your questions:

1) We calculate the communication score for all metabolite-sensor partners in all pairs of cell types (pairwise). For metabolites with negative scores, the communication score will be negative, and we then assign them a non-significant p-value, that is, 1. So you should not find any negative communication scores once you filter by p-value or FDR.

2) We calculate the communication score by taking the product of the metabolite presence score in the sender and the sensor gene expression in the receiver for the real scRNA-seq data. We then shuffle the cell labels a thousand times by default, which gives a thousand communication scores from cell label shuffling. On the one hand, these are used to evaluate the significance of the observed value. On the other hand, they are used to normalize the communication score of the real scRNA-seq data: we divide the real communication score by the mean of the thousand background communication scores.
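For readers trying to reproduce the numbers, the procedure described in this reply can be sketched roughly as follows. This is a minimal illustration, not MEBOCOST's actual code: the function names, the use of per-group means as the "presence" and "expression" values, and the empirical p-value formula are all assumptions made for the example. Only the product score, the label shuffling, the division by the background mean, and the p-value of 1 for negative scores come from the description above.

```python
import numpy as np

def communication_score(met, sensor, labels, sender, receiver):
    # Assumption: group-level values are plain means over the cells in each group.
    return met[labels == sender].mean() * sensor[labels == receiver].mean()

def permutation_normalize(met, sensor, labels, sender, receiver,
                          n_shuffle=1000, seed=0):
    """Hypothetical helper: observed score, background-normalized score, p-value."""
    rng = np.random.default_rng(seed)
    observed = communication_score(met, sensor, labels, sender, receiver)

    # Background: recompute the score after shuffling the cell labels.
    background = np.array([
        communication_score(met, sensor, rng.permutation(labels), sender, receiver)
        for _ in range(n_shuffle)
    ])

    if observed < 0:
        # Negative scores are assigned a non-significant p-value of 1.
        p_value = 1.0
    else:
        # Assumption: one-sided empirical p-value with a +1 correction.
        p_value = (np.sum(background >= observed) + 1) / (n_shuffle + 1)

    # Normalization as described: a ratio to the background mean, not a z-score.
    normalized = observed / background.mean()
    return observed, normalized, p_value
```

Note that this is a ratio to the background mean rather than (x - mean) / sd, which would explain why a z-score does not reproduce the returned values. The sketch does not guard against a background mean at or near zero.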

dbdimitrov commented 2 years ago

Hi @zhengrongbin,

Thanks for the response and for the clarification.

  1. To clarify my point: when you calculate the metabolite scores, you sometimes get negative values, and these are then counted as present when you calculate the 'expression proportions' for the metabolites per cluster. I think it would be more appropriate that, when a metabolite gets a negative score in a given cell, you count the metabolite as absent in that cell when aggregating into cell types. See HMDB0000112 from the example provided :)

It might be easier if you have time to meet over Zoom and discuss this? I also have some questions regarding the reactions network. :)

zhengrongbin commented 2 years ago

Hi Daniel -

I am more than happy to discuss details with you. If you want to meet over ZOOM, please contact me by email. Thanks!

Further clarification for your question: the "metabolite proportion (fraction)" in a cell group is the ratio of the number of cells with a metabolite score greater than a cutoff (0 by default) to the total number of cells in that cell group. Any communication with a metabolite proportion less than 0.25 is assigned as non-significant in my analysis. Likewise, a communication score below 0 means the average metabolite score in that cell group is negative, and that kind of communication is also assigned as non-significant. In this way, significant communication events only occur for metabolites that have enough positive values in the cell group. These cutoffs, such as the metabolite value and metabolite proportion, can be changed by users in the create_obj, _checkaboundance, and _filter_lowlyaboundant functions. Our idea is to calculate the communication score for every pair of cell types and every metabolite-sensor pair, and then filter out lowly abundant metabolites or sensors. This design lets us keep the original data and set cutoffs to focus on high-confidence communications.
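As a rough sketch of the filtering rule described above (again, not the package's own implementation): the helper names below are hypothetical, and the check on the mean metabolite score stands in for the "communication score less than 0" rule under the assumption that sensor expression is non-negative. The default cutoffs of 0 and 0.25 are the ones stated in this reply.

```python
import numpy as np

def metabolite_fraction(scores, cutoff=0.0):
    """Fraction of cells in a group whose metabolite score exceeds the cutoff."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores > cutoff))

def passes_abundance_filter(scores, met_cutoff=0.0, min_fraction=0.25):
    """Hypothetical check: True if the group may yield a significant communication."""
    scores = np.asarray(scores, dtype=float)
    enough_cells = metabolite_fraction(scores, met_cutoff) >= min_fraction
    # Assumption: with non-negative sensor expression, a negative mean metabolite
    # score gives a negative communication score, which is marked non-significant.
    non_negative_mean = scores.mean() >= 0
    return bool(enough_cells and non_negative_mean)
```

For example, a group with scores `[1.0, -0.1, -0.1, -0.1]` has a fraction of 0.25 and a positive mean, so it is kept, while `[-1.0, -0.5, 0.2, -0.3]` reaches the same fraction but has a negative mean and is filtered out.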

dbdimitrov commented 2 years ago

@zhengrongbin Thanks a lot!