wwylab / DeMixT

GNU General Public License v3.0
32 stars 14 forks source link

Normalized data #14

Closed ederisoud closed 3 years ago

ederisoud commented 3 years ago

Hello

I wrote to you about 1 year ago to understand how to run DeMixT (https://github.com/wwylab/DeMixTallmaterials/issues/2).

I performed the analysis on my data but I still have some questions.

So, last year, one of you advice to normalize data before running DeMixT. In a first time, I did not wantto use a normalization because I perform Deseq2 differential analysis (which require a counting table). So I did not use a normalization at all (even depending on the depth of sequencing for exemple). As you advice to peerform a normalization, with a colleague, we decided to use rlog normalization (which is a king of log normalization) from Deseq2 package.

But now, I got this error: Error in if (sum(obj == 0) > 1) { : valeur manquante là où TRUE / FALSE est requis

I have read in another issue that it is due to negative values or very small values but if I use rlog normalization, I will have a lot of values < 1. Therefore which is the best way to use DeMixT? with normalization or not? and if normalization is better, which one can I use? I need a reversible normalization to give a counting table to Deseq2.

Moreover, in the analysis without normalization, I have filtered out all genes with a 0 in at least one sample but with my co-authors we find it too restrictive (because some genes are not expressed at all or not enough in one kind of tissue and are filtered out while they are actually important to differenciate the tissues). Therefore, we wanted to replace the 0 by 1 in genes that are expressed in some samples but not in others. What do you think about it? Is it a problem for the deconvolution?

Thank you for your answer and best wishes for 2021.

Emilie

pengyang0411 commented 3 years ago

Hi Emilie,

DeMixT assumes genes follow a log2-normal distribution across samples for each component. Therefore, I would suggest you not to use the rlog normalization, which will violate the assumption. In our pipeline, we applied scale normalization at the seventy-fifth percentile based on the DSS package (Wu et al., 2013). To do so, we first combine normal samples and mixed tumor samples together, and by DSS package, we calculate a scale factor for each sample. Then, gene counts for each sample are divided by its scale factor to ensure the seventy-fifth percentile for each sample equal.

Quantile.Normalization.scale<-function(Count.matrix){ newt <- Count.matrix colnames(newt)=NULL rownames(newt)=NULL

designs=c(rep("0", dim(Count.matrix)[2])) seqData=newSeqCountSet(as.matrix(newt), designs) seqData=estNormFactors(seqData, "quantile") k3=seqData@normalizationFactor mk3=median(k3) k3=k3/mk3

temp<-newt

for(i in 1:ncol(newt)){ temp[,i] = temp[,i]/k3[i] } Count.matrix.normalized<-temp colnames(Count.matrix.normalized)<-colnames(Count.matrix) rownames(Count.matrix.normalized)<-rownames(Count.matrix)

return(Count.matrix.normalized) }

MixData = Quantile.Normalization.scale(cbind(data_ICM, data_TE)) data_ICM = MixData[ ,1:dim(data_ICM)[2]] data_TE = MixData[,-c(1:dim(data_ICM)[2])]

Note the data_ICM and data_TE here would be matrix format, unlike the SummarizedExperiment format for the input of DeMixT function.

In our simulation experience, if you replace 0 by 1 in the normal reference, which is data_TE in your case, it should be fine. But it might affect the estimation accuracy if you replace 0 by 1 in the tumor component, which is data_ICM in your case. And we are developing zero-inflated DeMixT to account for genes with 0 counts and will be updated in the future.

Hope I answer your question, wish you a happy new year.

Best, Peng

ederisoud commented 3 years ago

Thank you for this answer !

Based on your response, we have decided not to normalize at all because the normalization you recommend is not suitable for Deseq2 analysis. Moreover, I do not have a batch effect as all embryos were collected one by one and then treated all together but I do have an individual effect.

Moreover, we will need genes with 0 because some are very important for our project. Therefore, we choose to add 1 at all values in the matrix (in data_ICM and data_TE). In simulated data, this seems to less alter fold changes than just change 0 in 1 . Did you try this option.

I am very happy to read that this issue will be corrected in the next updates! I look forward to use it!

Thank you

Emilie

ederisoud commented 3 years ago

Update: Actually I can't add 1 to all values in the table because I obtain this error:

Step 1: Estimation of Proportions

Gene selection starts Error in inputdata[id1, ] : (subscript) indice logique trop long

Do you know why I obtain this error?

I did not change anything else and when I remove the lign including the adding, DeMixT is running correctly until deconvolution (error because I have 0 in data).

thank you

Emilie

pengyang0411 commented 3 years ago

Hi Emilie,

This error looks odd to me and I never saw this before, would that be possible there are some NA values in the inputdata variable? I suggest you define the input values for the DeMixT_GS function and run it line by line till Optimum_KernelC function. You should be able to check if id1 is binary outcomes, which is supposed true/false, and inputdata contains NA or not.

Let me know if you have more questions. :)

Best, Peng

ederisoud commented 3 years ago

Hello Peng

I checked and I do not have NA values in the input data.

DeMixT_GS function returns exactly the same error. Actually, the estimation of Pi does not begin at all in both function. When I add the line dataset=dataset+1, I got this error but when I skip this line, estimation of Pi runs perfectly.

Do you have any idea?

Thank you

Emilie

pengyang0411 commented 3 years ago

Hi Emilie,

I would suggest you not to add 1 for the input of DeMixT_GS, which will only take a subset of differentially expressed and highly identifiable genes for the proportion estimation.

However, if you expect as many as deconvolved gene expression, I suggest you can add 1 for the input of DeMixT_S2.

Let me know if this would help you. Thanks, Peng

ederisoud commented 3 years ago

Hello Peng

I tried to use your solution but I am still getting the error: Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed

Actually, I remember that I always have this error when I use the option if.filter = TRUE with my data.

Therefore, I do not think the problem was linked to the add of 1 but to my data in first place. The first time, I filtered out all 0 in my data. I do not know how to do if I need to use the option if.filter=TRUE What should I do? Continue with the option if.filter=FALSE which is the only one actually working or try to fix the issue but I do not know how.

Thank you

Emilie

pengyang0411 commented 3 years ago

Hi Emillie,

The error shows DeMixT function fails to estimate the likelihood and parameters. if.filter = TRUE means that only a subset of differentially expressed and informative genes will be selected for proportion estimations. And we recommend you set if.filter = TRUE. Besides, I think the problem may because the filder.sd is too strict in the default setting, which is 0.5. So we will filter genes from normal reference whose standard deviation, after log2-transformed, is greater than 0.5. The reason for this parameter is we want to pick up the genes whose normal references are consistent across samples. So I suggest you relax filder.sd to 0.8 or even 1 while keeping if.filter = TRUE.

Let me know if that helps, Peng

ederisoud commented 3 years ago

Hello

Thank you for your solution but neither filter.sd=0.8 or 1 did not avoid the issue...

Thank you

Emilie

ederisoud commented 3 years ago

Hello Peng With the bioinformatician from my team, we found the origin of the error (the one with id1) when adding 1 to my data to avoid 0.
When if.filter=TRUE, the function needs variance > 0 in data.N1 but when one gene is not expressed at all in the reference part, I replaced all 0 by 1 and therefore variances were equal to 0. As filtering removes genes without a variance >0, when it returns to the first matrix, there is a problem of dimension. This is why I had this error.

So I succeed to use if.filter=TRUE with my data by removing these genes (I let filter.sd=0.5 and as recommanded ngene.selected.for.pi = 0.2) but I had some warnings at the end of DeMixT_GS function:

Warning messages:
1: In theta.valid + sqrt(abs(qchisq(0.95, 1) * C)) :
  the size of a longer object is not a multiple of the size of a shorter object
2: In theta.valid - sqrt(abs(qchisq(0.95, 1) * C)) :
  the size of a longer object is not a multiple of the size of a shorter object

I think it is because some genes are removed by the filtering (in the input data I had 16714 rows and in the output have 16711 rows but when filtering is off, the output file contains 16714 rows) but I do not know why these genes are removing actually (variance are not equal to 0, no 0 in data). I will work on that next week with the bioinformatician. I will let you know.

Thank you for your help.

Emilie

pengyang0411 commented 3 years ago

Hi Emilie,

The warning may come out the dimension of Hessian matrix has been changed after you removed the genes with 0 counts. Therefore, I suggest you delete the Hessian folder under your working path and re-run DeMixT.

Also, I suggest you tune the parameter ngene.selected.for.pi = ngene.Profile.selected = 1,000/1,500/2,500 each time, which the proportion estimation will achieve best performance based on our simulation.

Let me know if that helps,

Peng

ederisoud commented 3 years ago

Hello Peng

Thank you for your answer.

We found out why we have the warnings. The warnings appear because some genes (in our case 3) are filtered by the function and therefore the function does not find them when the final table is created. If I understood correctly, when MuN - Mu <= 4 then the gene is filtered and this means that the deconvolution for this gene is "bad" and therefore the result is not reliable. Consequently, it means that for all unfiltered genes, the deconvolution done is rather reliable. Have we got it all figured out?

Regarding the parameters ngene.selected.for.pi and ngene.Profile.selected, I had in mind to put 0.2 as advised in my previous question. To improve the function, what do you think I should choose (0.2/1000/1500/2500/other) ? I tried to simulate data (with simulate_2comp) and do a DeMixT analysis but I'm not sure what to do next to check that the settings applied are the best. Currently my formula is :

Res2<-DeMixT(data.Y=data_ICM, data.N1=data_TE, data.N2 = NULL, niter = 10, nbin = 50, if.filter = TRUE, output.more.info = TRUE, ngene.selected.for.pi = 0.2, ngene.Profile.selected = 0.2, nspikein=0, nthread=8)

Thank you for your help and your time.

Emilie

pengyang0411 commented 3 years ago

Hi Emillie,

Yes, we currently have QC control for the deconvolution genes. And you can run some downstream analysis based on the remaining deconvoluted gene expression.

Regarding the parameters ngene.selected.for.pi and ngene.Profile.selected, I recommend you try to set both parameters equals to 1,500 or 2,500, which is what we used in TCGA pipeline.

Best, Peng

ederisoud commented 3 years ago

Hello Peng

Thank you for your answer.

I close this issue as everything is working so far.

Thank you for your time and help.

Emilie