Using DeMixT for RNAseq embryo mixed cell

ederisoud commented 4 years ago

Hello to all wwylab group

I am currently working on equine embryos. We collected embryos from mares of different conditions at the stage of the expanded blastocyst (D8). At this stage, the embryo is composed of 3 parts (as in the picture):

The inner cell mass (ICM), which is normally composed of one kind of cells at this stage and which will become the fetus
The trophoblast (TE), in black on the picture, which is composed of another single kind of cell at this stage and which will become the placenta.
The capsule (in black), which is composed of glycoproteins.

[Image: Capture]

I cut the embryo into two parts to obtain a part 1 with pure trophoblast and a part 2 rich in inner cell mass. Part 2 contains both ICM and TE cells, and I do not know the proportion of each. RNA-seq was performed on both parts. I would like to study the effect of our different conditions on pure TE and pure ICM, but since I do not know the number of cells of each type in part 2, this is not possible directly.

I would like to obtain estimated tables with pure ICM expression. Even if it is not a tumor, I think I can compare ICM cells to tumor cells and TE to normal tissue; and use your package.

Do you think it is possible?

So I wanted to use in input:

Data.Y = the matrix of normalized gene counts (in rows, with a number of G genes) with samples of part 2 with ICM and TE cells (samples in the column),
Data.comp1 = the matrix of normalized gene counts (in rows, with a number of G genes) with samples of part 1 with pure TE cells (samples in the column)

Am I right?

I have some other questions:

Could I specify in the DeMixT function that my mixed-cell and pure-cell samples come from the same embryo? I do not understand the algorithm well enough to judge whether this matters.
How does your function deal with 0 values? Should I filter my data before this analysis?
I have 5 embryos per condition, should I run the analysis altogether or in a separate manner according to the group?
Is it possible to obtain an estimation of TE expression in mixed cell samples to compare the results to the pure TE cell samples?

Thank you for your answer

Emilie

ShaolongCao commented 4 years ago

Hi Emilie,

Thanks for your questions. First of all, I suggest you use our latest version of DeMixT: https://github.com/wwylab/DeMixT. It has been updated recently and will continue to be updated.

For your questions:

You don't need to mention whether the pure and mixed cells are from the same embryo. If I understand your experiment correctly, the algorithm will be able to estimate the transcripts proportion and deconvolved expressions of ICM and TE cells from the mixed cells.
We have a filter that removes genes with 0 counts, so you don't need to worry about those. However, our algorithm favors genes with high read counts, and it will automatically filter out some low-quality genes before estimating the cell proportions. If a gene has very low read counts across samples, you may want to filter it out before passing the data to the model.
How many different conditions do you have? If the gene expression profile remains largely similar between conditions (only a few genes change expression), I would recommend running them together to increase the accuracy of the algorithm. Also, what is the sample size (number of columns) of your Data.Y and Data.comp1? The recommendation is more than 10 samples for Data.comp1 and more than 20 samples for Data.Y.
It is possible to get an estimation of TE expression in the mixed cell samples. Based on your input design, the deconvolved TE expression would be "ExprN1" of the final output file.
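
As an illustration of the pre-filtering step mentioned above, here is a minimal base-R sketch that drops lowly expressed genes from both matrices before they are passed to the model. The toy matrices and the `min_mean` threshold are assumptions made for the example; this is not DeMixT's internal filter.

```r
# Toy count matrices (genes in rows, samples in columns); all values are made up.
set.seed(1)
genes <- paste0("gene", 1:10)
data_ICM <- matrix(rpois(60, lambda = 50), nrow = 10,
                   dimnames = list(genes, paste0("mix", 1:6)))
data_TE  <- matrix(rpois(60, lambda = 50), nrow = 10,
                   dimnames = list(genes, paste0("TE", 1:6)))
data_ICM["gene3", ] <- 0                     # a gene with no reads in the mixed samples
data_TE["gene7", ]  <- c(1, 0, 0, 2, 0, 1)   # a lowly expressed gene in the pure samples

# Hypothetical threshold: keep genes whose mean count is at least 5 in BOTH
# matrices. The cut-off is an assumption, not a DeMixT default.
min_mean <- 5
keep <- rowMeans(data_ICM) >= min_mean & rowMeans(data_TE) >= min_mean

# Apply the same gene set to both matrices so their rows stay aligned.
data_ICM_filt <- data_ICM[keep, , drop = FALSE]
data_TE_filt  <- data_TE[keep, , drop = FALSE]
print(dim(data_ICM_filt))
```

Filtering both matrices with one logical vector keeps the gene order identical in the mixed and pure inputs.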

Feel free to let me know if you have further questions.

Thanks, Shaolong

ederisoud commented 4 years ago

Hello Shaolong

Thank you for this answer

From each embryo, I have just 2 parts (part 1 with pure TE and part 2 with TE + ICM) but I have 5 embryos per group and 3 groups in one case and 2 groups in the other case. Is it enough?

Thank you

Emilie

ShaolongCao commented 4 years ago

Yes. I think it would be better to merge the two cases and run them together, so that you have 25 pure TE and 25 mixed-cell samples. But it depends on how different the two cases are. If gene expression is similar for the pure TE samples in the two cases, I would suggest running them together. Otherwise, it may be better to run each case separately. For example, you can apply hierarchical clustering to the expression profiles of the pure TE samples from both cases together and see whether the two cases separate well.

Let me know if you have any questions or run into bugs in the software.
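
The clustering check suggested above could be sketched in base R as follows. The data are simulated, and the case labels, the Euclidean distance and the average linkage are assumptions chosen for illustration; real data would normally be log-transformed first.

```r
set.seed(42)
n_genes <- 200

# Toy log-scale expression of pure TE samples for two hypothetical cases,
# 5 samples each (genes in rows, samples in columns).
case1 <- matrix(rnorm(n_genes * 5, mean = 8, sd = 1), nrow = n_genes)
case2 <- matrix(rnorm(n_genes * 5, mean = 8, sd = 1), nrow = n_genes)
case2[1:50, ] <- case2[1:50, ] + 3   # simulate a strong case effect on 50 genes

te_log <- cbind(case1, case2)
colnames(te_log) <- c(paste0("case1_", 1:5), paste0("case2_", 1:5))

# Cluster samples, not genes: transpose so that samples are the rows.
hc <- hclust(dist(t(te_log)), method = "average")
clusters <- cutree(hc, k = 2)

# Cross-tabulate cluster membership against the known case labels.
case_label <- rep(c("case1", "case2"), each = 5)
print(table(cluster = clusters, case = case_label))

if (interactive()) plot(hc)   # dendrogram, only drawn in an interactive session
```

If the two cases fall into separate clusters, as in this simulated example, running each case separately is the safer choice.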

Best, Shaolong

ederisoud commented 4 years ago

Hello, I can't merge the two cases. They are too different for that (we checked with a PCA).

I am trying the DeMixT function on the first case and I get an error message. I could not find where the problem is. I expect it is linked to my data.

My code:

DeMixT(data_ICM, data_TE, data.comp2 = NULL, niter = 10, nbin = 50, if.filter = TRUE, output.more.info = TRUE)

With:
data_ICM = matrix with values of mixed cell samples (genes in rows and samples in columns)
data_TE = matrix with values of pure cell samples (genes in rows and samples in columns)
Both matrices have the same row and column size.

The message is:

Step 1: Estimation of Proportions
Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘assays’ for signature ‘"matrix"’

Could you help me please ?

Thank you

Emilie

ShaolongCao commented 4 years ago

DeMixT accepts input format of "SummarizedExperiment container" not "matrix". We will highlight this requirement in our next update.

I suggest you modify your code as follows:

data_ICM = SummarizedExperiment(assays=list(counts=data_ICM))
data_TE = SummarizedExperiment(assays=list(counts=data_TE))
DeMixT(data_ICM, data_TE, data.comp2 = NULL, niter = 10, nbin = 50, if.filter = TRUE, output.more.info = TRUE)
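
A slightly fuller sketch of this conversion, assuming the counts start out as data frames and that the SummarizedExperiment Bioconductor package is installed. The toy tables and object names are hypothetical; the wrapping step is skipped if the package is not available.

```r
# Toy count tables (genes in rows, samples in columns); values are made up.
set.seed(5)
genes <- paste0("gene", 1:6)
mixed_df <- as.data.frame(matrix(rpois(24, lambda = 40), nrow = 6,
                                 dimnames = list(genes, paste0("mix", 1:4))))
pure_df  <- as.data.frame(matrix(rpois(24, lambda = 40), nrow = 6,
                                 dimnames = list(genes, paste0("TE", 1:4))))

# Convert to numeric matrices and check that the gene rows line up.
counts_mixed <- as.matrix(mixed_df)
counts_pure  <- as.matrix(pure_df)
stopifnot(is.numeric(counts_mixed), is.numeric(counts_pure),
          identical(rownames(counts_mixed), rownames(counts_pure)))

# Wrap each matrix in a SummarizedExperiment container, as in the reply above.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se_mixed <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = counts_mixed))
  se_pure <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = counts_pure))
  print(dim(se_mixed))   # genes x samples
}
```

The two containers can then be passed to DeMixT in place of the raw matrices.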

Let me know if you have any issues.

Best, Shaolong

ederisoud commented 4 years ago

Dear Shaolong

I tried your solution and it fixed my issue.

I have a new problem.

When I run the DeMixT function, these messages appear:

Step 1: Estimation of Proportions
There are 17 normals and 17 tumors
Iteration 1: updating purities
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Iteration 1: updating parameters
Iteration 2: updating purities
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Iteration 2: updating parameters

I assume this is normal because it continues until iteration 10, but just after iteration 10 I get this error message:

Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed

I tried filtering my gene set so that no gene contains a 0 in the data, but it still does not work.

Do you have a solution? My code is:

Res<-DeMixT(data_ICM, data_TE, data.comp2 = NULL, niter = 10, nbin = 50, if.filter = FALSE, output.more.info = TRUE, nthread=8)

Moreover, I do not understand how to use OpenMP or what its role is in the algorithm. Could you explain it?

Thank you

Best regards

Emilie

ShaolongCao commented 4 years ago

Hi Emilie,

This may be a bug that I am not aware of, but my guess is that it happens because you turned off the default gene filter. I suggest changing your code to:

Res<-DeMixT(data_ICM, data_TE, data.comp2 = NULL, niter = 10, nbin = 50, if.filter = TRUE, output.more.info = TRUE, nthread=8)

OpenMP is used to implement parallel computing. It shouldn't affect the results.

Let me know if you still have the problem.

Best, Shaolong

ederisoud commented 4 years ago

Hi Shaolong

I tried to change if.filter=TRUE and I still have the error message. I do not know how to fix this issue...

Do you have an idea ?

Thank you

Emilie

ShaolongCao commented 4 years ago

Hi Emilie,

It may be caused by some genes that severely violate the model assumptions. You could try randomly selecting some genes as the input and see whether the error goes away. It would be easier for us to help figure out this issue if you could send us a de-identified version of the data that causes this bug. You can send the de-identified data for debugging to scao@mdanderson.org. By the way, we are releasing a new version of DeMixT very soon, probably this week or next week. I will keep you updated.
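
One way to try the random-subset idea is sketched below in base R, together with a few basic sanity checks on the input matrix. The cause of the error is not established in this thread, so the checks (missing values, all-zero genes, zero-variance genes) are assumptions about what commonly breaks numerical routines, and the toy data are made up.

```r
set.seed(7)
counts <- matrix(rpois(200, lambda = 30), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
counts[5, 2] <- NA   # deliberately inject a missing value for illustration
counts[9, ]  <- 0    # and a gene with no reads at all

# Flag genes (rows) that are likely to cause numerical problems.
has_na   <- rowSums(is.na(counts)) > 0
all_zero <- rowSums(counts, na.rm = TRUE) == 0
zero_var <- apply(counts, 1, function(x) isTRUE(var(x, na.rm = TRUE) == 0))
problem  <- has_na | all_zero | zero_var
print(rownames(counts)[problem])

clean <- counts[!problem, , drop = FALSE]

# Draw a random subset of the remaining genes; rerunning the analysis on
# several such subsets can show whether the error is tied to specific genes.
subset_genes  <- sample(rownames(clean), size = 10)
counts_subset <- clean[subset_genes, , drop = FALSE]
print(dim(counts_subset))
```

If the error disappears for some subsets and not for others, the genes that differ between them are the ones to inspect.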

Best, Shaolong

ShaolongCao commented 4 years ago

Hi Emilie,

The updated package is available now on Github: https://github.com/wwylab/DeMixT Please try the new one and see if your bug was fixed.

Best, Shaolong

ederisoud commented 4 years ago

Hi Shaolong

I tried to update the package.

At first, I had

>Res<-DeMixT(data.Y=data_ICM, # data with mixed cell types
+        data.N1=data_TE, 
+        data.N2 = NULL,
+        niter = 10,
+        nbin = 50, 
+        if.filter = TRUE, 
+        output.more.info = TRUE,
+        nthread=8)
Step 1: Estimation of Proportions

Gene selection starts
Error in filter2(inputdata1r, ngene.selected.for.pi) : 
  The argument ngene.selected.for.pi can only be 
           an integer or a percentage between 0 and 1

So I tried to turn off the filter. At first I did not understand everything, because it said:

Step 1: Estimation of Proportions

Initial of Proportions:
            PiN1
Sample 1  0.9235
Sample 2  0.9569
Sample 3  0.0221
Sample 4  0.8008
Sample 5  0.0521
Sample 6  0.5868
Sample 7  0.0254
Sample 8  0.6601
Sample 9  0.7791
Sample 10 0.5691
Sample 11 0.2999
Sample 12 0.1955
Sample 13 0.4849
Sample 14 0.2383
Sample 15 0.2104
Sample 16 0.1835
Sample 17 0.2367
Sample 18 0.5411
Sample 19 0.3701
Sample 20 0.6585
Sample 21 0.6941
Sample 22 0.7263
Sample 23 0.1630
There are  17 normals and 23 tumors

So I do not understand where the algorithm finds the 23 samples in the tumor data, because I have just 17 samples in both tables. Then it worked until step 2, where I get this error:

Step 2: Deconvolution of Expressions

Initial of Proportions:
            PiN1
Sample 1  0.6120
Sample 2  0.2610
Sample 3  0.0348
Sample 4  0.2311
Sample 5  0.5517
Sample 6  0.7934
Sample 7  0.0409
Sample 8  0.9235
Sample 9  0.5541
Sample 10 0.1122
Sample 11 0.7043
Sample 12 0.2543
Sample 13 0.4300
Sample 14 0.4780
Sample 15 0.7363
Sample 16 0.9349
Sample 17 0.4412
Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed

It looks like before. I do not understand why. Could you help me ?

Thank you

Emilie

pengyang0411 commented 4 years ago

Hi Emilie,

Thanks for your interest in our work.

First, regarding the sample sizes that do not match: we apply spike-in normals, i.e., simulated normal expression data, to correct a potential bias in the proportion estimation (the detailed method will be published soon). Those 6 extra samples are simulated spike-in normal samples; this is a normal step of the algorithm and should not affect the results. Since your sample size is only 17, I recommend setting nspikein = 0 in the DeMixT function, which will match your original sample size.

Second, may I ask how many genes are in your dataset? We usually run the DeMixT function on thousands of genes, but in general we assume a few hundred genes are enough. If your dataset contains enough genes, I suggest you set ngene.selected.for.pi = 0.2 as an input value, as follows:

Res<-DeMixT(data.Y=data_ICM, # data with mixed cell types
            data.N1=data_TE,
            data.N2 = NULL,
            niter = 10,
            nbin = 50,
            ngene.selected.for.pi = 0.2,
            nspikein = 0,
            if.filter = TRUE,
            output.more.info = TRUE,
            nthread=8)

In addition, I suggest you try our newly updated function to simulate two-component data, as follows:

# set ngenes to the number of genes
test.data = simulate_2comp(G = ngenes, My = 17, M1 = 17, output.more.info = FALSE)

and then test it with both the DeMixT and DeMixT_GS functions:

res <- DeMixT(data.Y = test.data$data.Y, data.N1 = test.data$data.N1, data.N2 = NULL, nspikein = NULL, gene.selection.method = 'GS', niter = 10, nbin = 50, if.filter = TRUE, ngene.selected.for.pi = 150, mean.diff.in.CM = 0.25, tol = 10^(-5))

and

res.GS <- DeMixT_GS(data.Y = test.data$data.Y, data.N1 = test.data$data.N1, niter = 10, nbin = 50, nspikein = NULL, if.filter = TRUE, ngene.Profile.selected = 150, mean.diff.in.CM = 0.25, ngene.selected.for.pi = 150, tol = 10^(-5))

Note that the DeMixT_GS function only returns the proportion estimates.

I believe working through simulated data is a good way to understand the performance of the DeMixT function. :)

Let me know if you have any questions.

Best, Peng

ederisoud commented 4 years ago

Hi Peng

Thank you for your answer.

I will try the simulation and your suggestions, and I will let you know when it is done. In my data, after filtering to remove zeros, I have 12846 genes.

Yesterday I managed to run the DeMixT function on my data. I changed 2 things:

First, I read the other conversation about problems with the package and quantile-normalized my data.
Second, I had been putting matrices into my SummarizedExperiment object, but I needed a data frame to do the normalization. So I put the data frame directly into the object instead of the matrices.

I don't know which one is the right solution but it worked.

However, normalizing my data made me think about the statistical analysis. I use DESeq2 for the statistical analysis, and its creators advise against using normalized data. Is normalization essential for your algorithm? If I keep the normalized data, I will have to redo my DESeq2 script.

pengyang0411 commented 4 years ago

Hi Emilie,

First, all DeMixT functions (DeMixT_GS, DeMixT_S1, DeMixT_S2) require the SummarizedExperiment format, which is also the output format of the function simulate_2comp. I therefore recommend trying a small simulated gene set from simulate_2comp, 500 genes for example, on your local machine, if you can install DeMixT successfully on your personal PC.

Second, it is not necessary to normalize your data before passing it to DeMixT; however, we expect normalization to reduce batch effects. We recommend normalizing the normal reference and the mixed samples together so that they are on the same scale, for example so that the sum of expression across all genes is equal for every sample. It has been shown that applying scale normalization to the mixed samples and the normal reference samples together yields robust estimation.

library(DSS)  # provides newSeqCountSet() and estNormFactors()

Quantile.Normalization.scale <- function(Count.matrix) {
  newt <- Count.matrix
  colnames(newt) <- NULL
  rownames(newt) <- NULL

  # Treat all samples as one group and estimate quantile normalization factors
  designs <- rep("0", ncol(Count.matrix))
  seqData <- newSeqCountSet(as.matrix(newt), designs)
  seqData <- estNormFactors(seqData, "quantile")
  k3 <- seqData@normalizationFactor
  mk3 <- median(k3)
  k3 <- k3 / mk3  # rescale the factors so that their median is 1

  # Divide each sample (column) by its normalization factor
  temp <- newt
  for (i in 1:ncol(newt)) {
    temp[, i] <- temp[, i] / k3[i]
  }
  Count.matrix.normalized <- temp
  colnames(Count.matrix.normalized) <- colnames(Count.matrix)
  rownames(Count.matrix.normalized) <- rownames(Count.matrix)

  return(Count.matrix.normalized)
}

Quantile.Normalization.scale(cbind(data.Y, data.N1))

Note that the data must be in matrix format when it is normalized, before it is encoded into the SummarizedExperiment format.
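
For comparison, the simpler idea of making the per-sample sums equal can be sketched in base R without any extra package. This is a plain total-count scaling, not the quantile method used in the function above; the toy matrices and the choice of the median library size as the common target are assumptions for illustration.

```r
set.seed(3)
genes <- paste0("gene", 1:8)
# Toy matrices (genes in rows, samples in columns); values are made up.
data.Y  <- matrix(rpois(40, lambda = 100), nrow = 8,
                  dimnames = list(genes, paste0("mix", 1:5)))
data.N1 <- matrix(rpois(40, lambda = 60), nrow = 8,
                  dimnames = list(genes, paste0("TE", 1:5)))

# Normalize mixed and reference samples together so they share one scale.
combined <- cbind(data.Y, data.N1)
lib_size <- colSums(combined)
target   <- median(lib_size)   # common total that every sample is scaled to

# Divide each column by its own scaling factor.
scaled <- sweep(combined, 2, lib_size / target, FUN = "/")

# Split back into the two inputs, still as matrices.
data.Y.norm  <- scaled[, colnames(data.Y), drop = FALSE]
data.N1.norm <- scaled[, colnames(data.N1), drop = FALSE]
print(round(colSums(scaled)))
```

After scaling, every column of `scaled` sums to the same total, so the mixed and pure samples can be compared directly.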

Third, DeMixT works better for high-count data, and we assume that the expression of each gene across samples follows a log2-normal distribution. So, for a specific gene, you can plot a histogram across all samples after log2-transforming the data and inspect its normality.
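
A minimal base-R sketch of this check for a single gene is given below. The expression values are simulated, and the Shapiro-Wilk test and Q-Q plot are additions for illustration that are not mentioned in the reply; with real counts, zeros would need handling before the log2 transform, since log2(0) is -Inf.

```r
set.seed(11)
# Toy expression of one gene across 17 samples, simulated so that the
# log2 values are normally distributed.
expr <- 2^rnorm(17, mean = 8, sd = 0.5)
log2_expr <- log2(expr)

# Visual checks, only drawn in an interactive session.
if (interactive()) {
  hist(log2_expr, breaks = 8,
       main = "log2 expression of one gene", xlab = "log2(expression)")
  qqnorm(log2_expr)
  qqline(log2_expr)
}

# Formal check: a small p-value suggests a departure from normality.
sw <- shapiro.test(log2_expr)
print(sw$p.value)
```

With only 17 samples the test has limited power, so the histogram and Q-Q plot are worth looking at alongside the p-value.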

Best, Peng
