zellerlab / siamcat

R package for Statistical Inference of Associations between Microbial Communities And host phenoType
https://siamcat.embl.de/

Can the function "normalize.features" be applied to normalize only one single test sample? #44

Closed liaoherui closed 1 year ago

liaoherui commented 1 year ago

Hi, thanks for the amazing tool!

Recently, I tried to use the "normalize.features" function (default parameters, with the log.std method) in the SIAMCAT package to normalize the species abundance features of a single test sample without a known label. However, the program reported errors saying "label only consists of healthy" or "not enough samples" (this error occurred when I added some additional samples to the input data).

Thus, I'd like to ask whether this function can be applied to a single test sample. If not, how does SIAMCAT normalize the features for the new single test sample?

Thanks!

jakob-wirbel commented 1 year ago

Hi @liaoherui sorry for the delay - I was travelling!

As far as I understand, you tried to normalize a single sample? This will not work with the log.std method, since for each feature this method computes (x - mean(x))/sd(x), and with only one sample you cannot meaningfully compute the mean or the standard deviation of the feature. Alternatively, did you use the frozen normalization (with parameters taken from another dataset)? In that case, normalizing a single sample should be possible. Could you clarify, please?
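To illustrate the point about the standard deviation, here is a minimal sketch in plain Python (not SIAMCAT code). The function name `log_std` and the pseudocount value are assumptions made for this example, and the real log.std method may differ in its details. The sketch only shows that a per-feature z-score is undefined when there is a single sample.

```python
import math
import statistics

def log_std(values, pseudocount=1e-6):
    """Log-transform one feature across samples, then z-score it.

    Simplified stand-in for illustration; the pseudocount is an assumed value.
    """
    logged = [math.log10(v + pseudocount) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample sd; needs at least two values
    return [(x - mean) / sd for x in logged]

# Three samples: the z-scores are well defined.
several = log_std([0.01, 0.001, 0.0001])

# One sample: the standard deviation cannot be estimated.
try:
    log_std([0.01])
    single_ok = True
except statistics.StatisticsError:
    single_ok = False
```

With one value, `statistics.stdev` raises `StatisticsError`, which mirrors why a single sample cannot be normalized on its own.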

Cheers, Jakob

liaoherui commented 1 year ago

Hi, Jakob! Clear now. Thanks for your reply!

I have tried frozen normalization before, but the "siamcat_reference" in the PDF manual example is unclear, so I could not run the function successfully on my data. My current solution is a function I implemented manually (in Python) that applies the parameters from the training set to the test set to do the normalization. Thanks again for your answer and help!
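For readers following along, a manual approach of this kind might look roughly like the sketch below. It is not the code used in this thread and not SIAMCAT's implementation: the function names `fit_params` and `apply_params`, the feature names, and the pseudocount are all hypothetical. The idea is only that the mean and standard deviation are estimated once on the training set and then reused for each test sample.

```python
import math
import statistics

PSEUDOCOUNT = 1e-6  # assumed value, not taken from the thread

def fit_params(train):
    """train: dict mapping feature name -> abundances across training samples."""
    params = {}
    for feat, values in train.items():
        logged = [math.log10(v + PSEUDOCOUNT) for v in values]
        params[feat] = (statistics.mean(logged), statistics.stdev(logged))
    return params

def apply_params(sample, params):
    """sample: dict mapping feature name -> abundance for ONE test sample."""
    missing = [f for f in params if f not in sample]
    if missing:
        raise KeyError(f"test sample lacks training features: {missing}")
    # Reuse the training mean and sd; nothing is estimated from the test sample.
    return {
        feat: (math.log10(sample[feat] + PSEUDOCOUNT) - mean) / sd
        for feat, (mean, sd) in params.items()
    }

train = {"sp_A": [0.01, 0.001, 0.0001], "sp_B": [0.2, 0.02, 0.002]}
params = fit_params(train)
normalized = apply_params({"sp_A": 0.001, "sp_B": 0.02}, params)
```

A feature that is constant in the training set would have a standard deviation of zero and would need separate handling, which this sketch leaves out.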

Best regards, Herui

jakob-wirbel commented 1 year ago

Okay, it's not super clear to me what exactly you want to achieve :D Are you trying to normalize a single test sample with the parameters from another dataset? If yes, you can try the frozen normalization (let's say that you have your single test sample in sc.obj.test and the other dataset (which you normalized first) in sc.obj.ref):

sc.obj.test <- normalize.features(sc.obj.test, norm.param=norm_params(sc.obj.ref))

Is that the answer to your question? :D

liaoherui commented 1 year ago

Hi, Jakob,

Yes. I am trying to normalize a single test sample with the parameters from another dataset. And I have tried the frozen normalization before and got the error below (there is only one test sample in object siamcatx).
[image: error output]

jakob-wirbel commented 1 year ago

ah okay, I see :D The thing about the test label is only a warning, not an error, so you can ignore this :D The error stems from the fact that your two datasets have different sets of features. In order for the frozen normalization to work, you need to have exactly the same features in both tables.

liaoherui commented 1 year ago

Thanks for the guidance!

I checked my training and test input matrices (see below). It seems they have the same dimensions in terms of features.

[image: training and test input matrices]

jakob-wirbel commented 1 year ago

hmmm, but are the feature names the exact same?

liaoherui commented 1 year ago

Yes. I have checked this. The feature names are exactly the same, too. -。-

jakob-wirbel commented 1 year ago

alright, thank you! I tried to reproduce your error and did not manage with my example dataset :/

Can you send me the exact commands that you executed and their output? Or a mock example dataset that I can use to reproduce the error?

Can you also run the following?

all(rownames(train.matrix) %in% rownames(test.matrix))

This seems to be the call that throws the error in your case...

liaoherui commented 1 year ago

Hi, Jakob,

I ran the suggested command, all(rownames(train.matrix) %in% rownames(test.matrix)), and the result was "TRUE".

I have sent the commands and data I used to your email. You can reproduce the problem by running "sh run_norm.sh". Thanks!

jakob-wirbel commented 1 year ago

Thank you for your email; with the data I could reproduce the error :) Luckily, I can see now where the problem is :D You filter the features before you try to apply the frozen normalization. As a result, lots of features get filtered out of the test data, and the error occurs: all(norm.param$retained.feat %in% row.names(feat)) is not TRUE
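The logic of that check can be sketched in plain Python. This is only an illustration of the set comparison, not SIAMCAT code, and the helper `missing_features` and the feature names are made up for the example. It shows why a test table passes the check before filtering but fails after filtering has removed features that the reference parameters still expect.

```python
def missing_features(retained_feat, test_features):
    """Return the reference features that the test table no longer contains."""
    available = set(test_features)
    return [f for f in retained_feat if f not in available]

# Features retained when the reference dataset was normalized (hypothetical names).
retained_feat = ["sp_A", "sp_B", "sp_C"]

original_test = ["sp_A", "sp_B", "sp_C", "sp_D"]  # unfiltered test features
filtered_test = ["sp_A", "sp_D"]                  # filtering dropped sp_B and sp_C

missing_original = missing_features(retained_feat, original_test)
missing_filtered = missing_features(retained_feat, filtered_test)
```

Here `missing_original` is empty, so the check would pass, while `missing_filtered` lists the dropped features and the check would fail.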

You can easily fix this by using the normalize.features function on the original set of features (without filtering):

siamcat_test <- normalize.features(siamcat_test, 
     norm.param=norm_params(siamcat_train), 
     feature.type = 'original')

Please let me know if that helps!

liaoherui commented 1 year ago

Hi, Jakob,

Thanks for your help! The problem is solved now! Thanks again for this useful tool! :)