zhouhj1994 / LinDA


Cannot allocate vector of size ... #3

Closed g-antonello closed 2 years ago

g-antonello commented 2 years ago

Dear Huijuan, I'm happy with this tool for many reasons, first of all because it is fast compared to other tools (DESeq2, corncob). However, I am having an issue with a differential abundance run on 172 samples and 721 SGBs from MetaPhlAn (each sample sums to 100):

linda(otu.tab = abundances(phyloseq), meta = meta(phyloseq), formula = "~ age + sex + using_drugs + trait_of_interest", type = "proportion", adaptive = TRUE) This outputs: Error: cannot allocate vector of size 9.6 Gb.

Is this really that memory-intensive? Or am I doing something wrong?

The variable values are:

  • age: integer
  • sex: binary
  • using_drugs: binary
  • trait_of_interest: integer taking values 1, 2, or 3

zhouhj1994 commented 2 years ago

Hi,

Thanks for your interest! Judging from the error message, it does seem to be a memory issue, but a dataset with 721 features and 172 samples is nowhere near that large.

The largest intermediate variable produced by the procedure would be roughly a 5000 by 5000 covariance matrix, whose memory footprint is less than 0.5 Gb.

Could you please check that "abundances(phyloseq)" is a 721 by 172 data frame, that "meta(phyloseq)" is a data frame with appropriate dimensions (the number of rows should be 172), and that the "age" variable is not stored as a factor? (Otherwise the covariance matrix could actually become that large.) I will think about what to do next if none of these explains the error.
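To see why the factor check matters, here is a toy illustration (hypothetical data, not the poster's; `linda` itself is not called): a numeric covariate stored as a factor gets one dummy column per distinct level in the design matrix, so 172 mostly distinct ages would massively inflate the matrices the model builds.

```r
# Toy metadata: "age" accidentally stored as a factor instead of numeric.
meta_df <- data.frame(age = factor(c(31, 45, 52, 31)),
                      sex = c("F", "M", "F", "M"))

# One dummy column per distinct age level (plus intercept and sexM).
ncol(model.matrix(~ age + sex, meta_df))  # 4: intercept, age45, age52, sexM

# Fix: convert the factor back to numeric before fitting.
meta_df$age <- as.numeric(as.character(meta_df$age))
ncol(model.matrix(~ age + sex, meta_df))  # 3: intercept, age, sexM
```

With 172 samples and nearly 172 distinct ages, the factor version would produce a design matrix with ~170 columns instead of one, which is exactly the kind of blowup that can trigger an allocation error downstream.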

Thanks, Huijuan


g-antonello commented 2 years ago

Dear Huijuan, thank you for your detailed reply. I guess it was a variable encoding issue; I don't know exactly, because when I started from scratch with cleaner code it worked. Thank you for this! Another point: since MetaPhlAn returns counts summing to 100, these are effectively proportions. I accounted for this by rescaling each sample's proportions to sum to 1 and then using type = "proportion" in linda.

  1. Is LinDA's approach still valid with this setup, even if it can't leverage sequencing depth?
  2. Do you think it is a valid approach to include the number of reads mapped per sample as a covariate, to mimic what LinDA would do internally with counts?

Best, Giacomo

zhouhj1994 commented 2 years ago

Hi Giacomo,

I'm glad that it worked and thanks for your insightful feedback.

  1. Yes. For LinDA, proportion data and count data are pretty much the same, as the data will be CLR (centered log-ratio) transformed anyway. The only difference is that if the data are counts, the sequencing depth information is available, which we have utilized in LinDA to impute zeros. If the data are proportions, we use the "half minimum approach" (half of the minimum nonzero proportion value among samples for a feature) to replace zeros.
  2. In the model that motivates LinDA, the absolute abundance in the ecosystem is the dependent variable, which is not related to the sequencing depth, so the number of total reads is not included as a regressor. But you've made a good point: in some methods (like those employing negative binomial models), the sequencing depth is a model component.

Although the real proportion (the proportion in the ecosystem) is not related to the sequencing depth, a low sequencing depth causes under-sampling, especially of rare features, which means the sequencing depth does influence the observed proportions. I believe a thoughtful procedure is required to address this issue.
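The scale-invariance behind point 2 can be seen in a one-line toy sketch (illustrative only, not LinDA's internal code): multiplying a sample's composition by any constant, such as a depth-related factor, leaves its CLR values unchanged, so the real proportions rather than the total reads drive the transformed data.

```r
# CLR transform of a single sample (sketch, not LinDA's implementation).
clr <- function(x) log(x) - mean(log(x))

p <- c(0.5, 0.3, 0.2)            # proportions for one sample
all.equal(clr(p), clr(100 * p))  # TRUE: CLR is invariant to rescaling
```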

Best, Huijuan


g-antonello commented 2 years ago

Thank you for the insight! Is the model's performance (accuracy/power/...) still good when zeros are not imputed with the default method for count data?

Giacomo

zhouhj1994 commented 2 years ago

The zero treatment is necessary in LinDA since the method involves logarithms. If the data are counts, we provide two zero-handling strategies: add a pseudo-count (e.g., 0.5) to all counts, or impute zeros using ratios of the sequencing depths. The choice between these two strategies does affect performance (accuracy/power/...), so we supply an adaptive approach to choose between them. If the data are proportions, we use the "half minimum approach" to replace zeros.
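A minimal sketch of the two zero-handling ideas on toy data (illustrative only; the values are made up, and LinDA's actual depth-based imputation is more involved than this):

```r
# Count data: pseudo-count strategy, so the subsequent log is defined.
counts <- c(0, 10, 5, 20)
x <- counts + 0.5                 # 0.5 added to every count
clr1 <- log(x) - mean(log(x))     # CLR transform now well-defined

# Proportion data: "half minimum" replacement for zeros.
props <- c(0.00, 0.30, 0.50, 0.20)
props[props == 0] <- min(props[props > 0]) / 2
props[1]  # 0.1: half of the smallest nonzero proportion (0.20)
```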

Huijuan
