yiluheihei / microbiomeMarker

R package for microbiome biomarker discovery
https://yiluheihei.github.io/microbiomeMarker
GNU General Public License v3.0
169 stars 40 forks source link

run_lefse, subgroup argument gives Error in if ((sx < sy) != dir_cmp || sx == sy) { : missing value where TRUE/FALSE needed #62

Closed akhst7 closed 2 years ago

akhst7 commented 2 years ago

I run a following phyloseq obj;

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 625 taxa and 12 samples ]
sample_data() Sample Data:       [ 12 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 625 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 625 tips and 624 internal nodes ]
refseq()      DNAStringSet:      [ 625 reference sequences ]

I run a following;

lefse.new<-run_lefse(
ps,
group="Treatment",
subgroup = "Sex",
taxa_rank = "all",
transform = "log10p",
norm = "CPM",
norm_para = list(),
kw_cutoff = 0.1,
lda_cutoff = 2,
bootstrap_n = 100,
bootstrap_fraction = 2/3,
wilcoxon_cutoff = 0.1,
multigrp_strat = FALSE,
strict = "0",
sample_min = 10,
only_same_subgrp = FALSE,
curv = FALSE
)

I get an error as follows;

Error in if ((sx < sy) != dir_cmp || sx == sy) { : 
  missing value where TRUE/FALSE needed

Here is a sample_data of ps;

Sample Data:        [12 samples by 2 sample variables]:
        Sex   Treatment
1      M           HH
2      M           HH
3      F           HH
4      F           HH
5      M           RA
6      M           RA
7      F           RA
8      F           RA
9      M           HH
10      F           HH
11      M           RA
12      F           RA

what am I missing here ?

Thanks.

cpavloud commented 2 years ago

I have the same error.... I am running microbiomeMarker (v 1.0.2) in R 4.1.1

akhst7 commented 2 years ago

@cpavloud

Can you post a result of following ?

str(sample_data(phyloseq.obj))

cpavloud commented 2 years ago

Yes, of course

> str(sample_data(physeq)) 'data.frame': 26 obs. of 2 variables: Formal class 'sample_data' [package "phyloseq"] with 4 slots ..@ .Data :List of 2 .. ..$ : chr "kidneys with calcification" "SG_affected" "SG_affected" "SG_affected" ... .. ..$ : chr "Sick" "Sick" "Sick" "Sick" ... ..@ names : chr "Condition" "Health_state" ..@ row.names: chr "ASB1_LL12_1" "KOK1_LL1_1" "KOK10_LL5_2" "KOK2_LL1_2" ... ..@ .S3Class : chr "data.frame"

The functions runs if I have only group = "Condition" or only group = "Health_state" But I get the error when I try to run with group = "Health_state" and subgroup = "Condition"

akhst7 commented 2 years ago

@cpavloud ,

I did remember now that there is a bug in the code for run-lefse, right in area of test_rep_wilcoxon within lefse-utilities.R, which has not been fixed. Only thing you can do is to make another column of combined two groups in to one ;

Sex   Treatment  Sex_Treatment
1      M           HH        HH_M
2      M           HH       HH_M
3      F           HH        HH_F

Alternatively, you can start using another package called, MicrobiomeProcess. It will let you do lefse with second group as follows;

mp_diff_analysis(
.data = ps,
.abundance=normalize,
.group= Treatment,
.sec.group = Sex,

It is pretty simple to use but powerful. Overall, it is a better package for a 16S microbiome analysis.

cpavloud commented 2 years ago

Thank you! Great package! I didn't know about it. And I also found this very nice website: https://yulab-smu.top/MicrobiotaProcessWorkshop/articles/MicrobiotaProcessWorkshop.html

akhst7 commented 2 years ago

@cpavloud, Yeah, it is an upcoming R package for the Microbiome stuff. The link you had is actually for the older version, though commands listed in this tutorial still work. I like the older version better than the newer one.

I am gonna close this issue, since I don't think the developer will fix this issue any time soon.

yiluheihei commented 2 years ago

@akhst7 @cpavloud Thanks for your interested in microbiomeMarker, and I'm Sorry for the later reply. Could you provide a reproducible example that would help me to fix this issue?

For more details on how to make a great minimal reproducible example, see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and https://www.tidyverse.org/help/#reprex.

cpavloud commented 2 years ago

I cannot use reprex to create a reproducible example as you want it, as it only renders the last line of code in the console.

I am using R version 4.1.1 and tidyverse (1.3.1), phyloseq (1.38.0) and microbiomeMarker (1.0.2)

When I try to run lefse with a group and a subgroup

OTU_lefse <- run_lefse( physeq, group = "Health_state", subgroup = "Condition", taxa_rank = "none", transform = c("identity"), norm = "CPM", kw_cutoff = 0.05, lda_cutoff = 2, bootstrap_n = 30, bootstrap_fraction = 2/3, wilcoxon_cutoff = 0.05, multigrp_strat = FALSE, strict = "0", sample_min = 7, only_same_subgrp = FALSE, curv = FALSE )

I get this error Error in if ((sx < sy) != dir_cmp || sx == sy) { : missing value where TRUE/FALSE needed

But when I run it with group = "Health_state" and no subgroup I get a result

OTU_lefse microbiomeMarker-class inherited from phyloseq-class normalization method: [ CPM ] microbiome marker identity method: [ lefse ] marker_table() Marker Table: [ 8 microbiome markers with 5 variables ] otu_table() OTU Table: [ 460 taxa and 26 samples ] sample_data() Sample Data: [ 26 samples by 2 sample variables ] tax_table() Taxonomy Table: [ 460 taxa by 1 taxonomic ranks ]

Also, when I run it with group = "Condition" and no subgroup again I get a result

OTU_lefse microbiomeMarker-class inherited from phyloseq-class normalization method: [ CPM ] microbiome marker identity method: [ lefse ] marker_table() Marker Table: [ 19 microbiome markers with 5 variables ] otu_table() OTU Table: [ 460 taxa and 26 samples ] sample_data() Sample Data: [ 26 samples by 2 sample variables ] tax_table() Taxonomy Table: [ 460 taxa by 1 taxonomic ranks ]

This is what my sample_data look like

sample_data(OTU_lefse) Sample Data: [26 samples by 2 sample variables]: Condition Health_state ASB1_LL12_1 kidneys with calcification Sick KOK1_LL1_1 SG_affected Sick KOK10_LL5_2 SG_affected Sick KOK2_LL1_2 SG_affected Sick KOK3_LL1_3 SG_affected Sick KOK4_LL5_3 SG_affected Sick KOK5_MM2_1 SG_affected Sick KOK6_LL12_2 SG_affected Sick KOK7_LL12_4 SG_affected Sick KOK8_LL12_3 SG_affected Sick ASB10_OO10_4 kidneys with calcification Sick KOK9_LL5_1 SG_affected Sick O_1_OO7_2 healthy Healthy O_2_OO3_3 healthy Healthy O_3_MM9_1 healthy Healthy O_4_MM9_4 healthy Healthy O_5_NN6_1 healthy Healthy O_6_NN6_2 healthy Healthy O_8_MM9_3 healthy Healthy ASB2_NN8_2 kidneys with calcification Sick ASB3_NN8_3 kidneys with calcification Sick ASB4_NN11_2 kidneys with calcification Sick ASB5_OO3_1 kidneys with calcification Sick ASB6_OO7_1 kidneys with calcification Sick ASB7_OO7_3 kidneys with calcification Sick ASB8_OO10_1 kidneys with calcification Sick

yiluheihei commented 2 years ago

@cpavloud Please install the latest development version (1.3.2) from github in which your issue has been fixed.

devtools::install_github("yiluheihei/microbiomeMarker")
cpavloud commented 2 years ago

Ok! That worked! Thank you!

akhst7 commented 2 years ago

@cpavloud

Could you do me a big favor ? If you have a chance, could you run lefse on MicrobiotaProcess and compared to MicrobiomeMaker and post the result somewhere so that I can see them. Thanks.

yiluheihei commented 2 years ago

@akhst7 lefse in microbiomeMarker is a reimplementation of LEfSe. For details of the method, please see the LEfSe paper

cpavloud commented 2 years ago

This is quite interesting.

The result of mp_diff_analysis function

new_lefse <- mp_diff_analysis( .data = MPSE, .abundance=normalize, .group= Health_state, .sec.group = Condition, relative = TRUE)

is that There are not significantly discriminative features after internal first test !

Also, as probably expected, the result of the diff_analysis function

deres <- diff_analysis(obj = physeq, classgroup = "Health_state", subclass = "Condition", mlfun = "lda", filtermod = "pvalue", firstcomfun = "kruskal_test", firstalpha = 0.05, strictmod = TRUE, secondcomfun = "wilcox_test", subclmin = 3, subclwilc = TRUE, secondalpha = 0.01, lda=3)

is that There are not significantly discriminative features after internalwilcox_test !

However, the run_lefse function identifies 2 microbiome markers

microbiomeMarker-class inherited from phyloseq-class normalization method: [ CPM ] microbiome marker identity method: [ lefse ] marker_table() Marker Table: [ 2 microbiome markers with 5 variables ] otu_table() OTU Table: [ 460 taxa and 26 samples ] sample_data() Sample Data: [ 26 samples by 2 sample variables ] tax_table() Taxonomy Table: [ 460 taxa by 1 taxonomic ranks ]

Object of class "marker_table" feature enrich_group ef_lda pvalue padj marker1 Otu16 Healthy 3.503214 0.04537015 0.04537015 marker2 Otu185 Sick 3.041596 0.02413744 0.02413744

akhst7 commented 2 years ago

@cpavloud

OK that's what I thought you would get. I get the exactly the same result. Initially I thought it was due to different default normalization approaches used by MP and MB. MP uses "hellinger"(I think, according to the vignette) while MB uses "cpm". But then, when I swapped MP's RelRareAbundance with MB's cpm, and did diff_analysis, I got the same result. I have not done the opposite and have not manually calculate lefse (basically annova followed by LDA) Huttonhower's lefse package yet but it appears that different normalization significantly affects lefse results, particularly data with a borderline diff. abundance. Perhaps, yiluheihei could address which normalization method we should be using for lefse.

yiluheihei commented 2 years ago

@akhst7 As you said, the choice of normalization method has large effect on the result of different analysis, and there is no consensus on which method is better than others. For lesfse, I recommend use the "CPM", the default normalization method in lefse paper. And try other methods if the result is not good enough.

Moreever, microbiomeMarker currently provides a function compare_DA() which can be used to perform comparison of different methods and help users to choose a suitable normalization/differential method for their own dataset.

library(microbiomeMarker)
data(kostic_crc)

# Remove taxa not seen in at least 20% of the samples
kostic_crc <- phyloseq::filter_taxa(
    kostic_crc,
    function(x) sum(x > 0) > (0.2*length(x)), TRUE)

compare_res <- compare_DA(
    ps = kostic_crc,
    group = "DIAGNOSIS",
    methods = "lefse",
    args = list(lefse = list(list(norm = "CPM"), list(norm = "TSS"))),
    BPPARAM = BiocParallel::SnowParam(workers = 2, progressbar = TRUE),
    n_rep = 2,
    effect_size = 5)
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
compare_summary <- summary(compare_res)
#> Warning: Best score is <= 0.
#> You might require to preprocessing your data or re-run with a higher effect size.
compare_summary
#>                                                                                  call
#> 2 run_lefse(ps = "kostic_crc", group = "DIAGNOSIS", taxa_rank = "none", norm = "TSS")
#> 1 run_lefse(ps = "kostic_crc", group = "DIAGNOSIS", taxa_rank = "none", norm = "CPM")
#>         auc        fpr     power       fdr      score score_0.05 score_0.95
#> 2 0.5000000 0.00000000 0.0000000 0.0000000  0.0000000       0.00  0.0000000
#> 1 0.8755844 0.03832834 0.3333333 0.6388889 -0.5136941      -0.55 -0.4823232

Created on 2022-06-07 by the reprex package (v2.0.1)

The higher the score in the summary result, the better the method is estimated to be. However, this feature is in beta version, and only taxa_rank = "none" is allowed.

Hope this helps.

akhst7 commented 2 years ago

Thanks. It is much clear now.