wyguo / ThreeDRNAseq

A pipeline for differential expression and differential alternative splicing analysis
https://github.com/wyguo/ThreeDRNAseq
GNU General Public License v3.0
60 stars 27 forks source link

Trying to Replicate the 3D Code in R: Disparities #28

Closed danphillips28 closed 3 years ago

danphillips28 commented 3 years ago

Hi Wenbin,

I've recently been doing an online course which happens to cover much of the 3D pipeline, and I have been using my 3D results to check I'm doing things correctly. Whilst things generally look the same, I've noticed some points where differences are introduced. I'm hoping you can shine a light on some of them. The identical inputs have been used for both workflows.

Filtering

  1. On 3D I chose tximport with lengthScaledTPM, outputting 55291 raw genes. I appear to have replicated this in R no problem (no errors), outputting the exact same number of genes using;

Txi_gene <- tximport(path, type = "salmon", tx2gene = M25_Tx.to.Symbol, txOut = FALSE, countsFromAbundance = "lengthScaledTPM", ignoreTxVersion = FALSE)

Note that M25_Tx.to.Symbol maps transcript IDs to gene symbols, so symbols show up in 3D. I wonder if you could suggest people do this in the documentation, rather than to gene IDs, which requires work downstream to annotate.

  1. On 3D, I chose to filter by CPM ≥ 1 in ≥ 4 samples. Outputting 19612 significantly expressed genes. In R, I tried to replicate this using (where Txi_gene is my Tximport list);

myCounts <- Txi_gene$counts myDGEList <- DGEList(myCounts) log2.cpm <- edgeR::cpm(myDGEList, log = TRUE) keepers <- rowSums(log2.cpm>=1)>=4 myDGEList.filtered <- myDGEList[keepers,]

This resulted in 17712 genes. I also tried;

cpm <- edgeR::cpm(myDGEList) keepers <- rowSums(cpm>=1)>=4 myDGEList.filtered <- myDGEList[keepers,] log2.cpm.filtered <- cpm(myDGEList.filtered, log=TRUE)

This resulted 20067 genes. Switching from >= to > hasn't helped. Any ideas?

Clustering When I filtered by >=5 instead of 4 I got 19582, which I thought was close enough for now. Whilst obviously using different input, my limma results (voomWeights, p < .05, L2FC > 0, adj = "fdr") are very similar.

3D (5,743 unique DEGs) AL24.AL12 CR10.AL12 CR20.AL12 CR30.AL12 CR40.AL12 Down 0 1 482 1267 2636 Up 0 2 430 1761 2413

R (5,722 unique DEGs) AL24.AL12 CR10.AL12 CR20.AL12 CR30.AL12 CR40.AL12 Down 0 1 471 1269 2643 Up 0 2 424 1765 2395

I then tried replicating the clustering, and my results differed in important ways (mostly cluster 7). How did you do the clustering?

Figures for Thesis (2)

Sorry for the long message!

Thanks as always. Daniel

wyguo commented 3 years ago

Hi Wenbin,

I've recently been doing an online course which happens to cover much of the 3D pipeline, and I have been using my 3D results to check I'm doing things correctly. Whilst things generally look the same, I've noticed some points where differences are introduced. I'm hoping you can shine a light on some of them. The identical inputs have been used for both workflows.

Filtering

  1. On 3D I chose tximport with lengthScaledTPM, outputting 55291 raw genes. I appear to have replicated this in R no problem (no errors), outputting the exact same number of genes using;

Txi_gene <- tximport(path, type = "salmon", tx2gene = M25_Tx.to.Symbol, txOut = FALSE, countsFromAbundance = "lengthScaledTPM", ignoreTxVersion = FALSE)

Note that M25_Tx.to.Symbol maps transcript IDs to gene symbols, so symbols show up in 3D. I wonder if you could suggest people do this in the documentation, rather than to gene IDs, which requires work downstream to annotate.

  1. On 3D, I chose to filter by CPM ≥ 1 in ≥ 4 samples. Outputting 19612 significantly expressed genes. In R, I tried to replicate this using (where Txi_gene is my Tximport list);

myCounts <- Txi_gene$counts myDGEList <- DGEList(myCounts) log2.cpm <- edgeR::cpm(myDGEList, log = TRUE) keepers <- rowSums(log2.cpm>=1)>=4 myDGEList.filtered <- myDGEList[keepers,]

This resulted in 17712 genes. I also tried;

cpm <- edgeR::cpm(myDGEList) keepers <- rowSums(cpm>=1)>=4 myDGEList.filtered <- myDGEList[keepers,] log2.cpm.filtered <- cpm(myDGEList.filtered, log=TRUE)

This resulted 20067 genes. Switching from >= to > hasn't helped. Any ideas?

Clustering When I filtered by >=5 instead of 4 I got 19582, which I thought was close enough for now. Whilst obviously using different input, my limma results (voomWeights, p < .05, L2FC > 0, adj = "fdr") are very similar.

3D (5,743 unique DEGs) AL24.AL12 CR10.AL12 CR20.AL12 CR30.AL12 CR40.AL12 Down 0 1 482 1267 2636 Up 0 2 430 1761 2413

R (5,722 unique DEGs) AL24.AL12 CR10.AL12 CR20.AL12 CR30.AL12 CR40.AL12 Down 0 1 471 1269 2643 Up 0 2 424 1765 2395

I then tried replicating the clustering, and my results differed in important ways (mostly cluster 7). How did you do the clustering?

Figures for Thesis (2)

Sorry for the long message!

Thanks as always. Daniel

Hi Daniel,

Thanks for the suggestion of the gene symbol. We could include the improvement in the next update. The low expression filter in 3D is based on CPM, not log2 CPM. In your code I see you have taken log2 of the CPM values. log2CPM >=1 means CPM >=2, so your expressed gene number is lower the 3D result. The 3D App does not directly use the complexHeatmap package for the clustering and plot. I use my own code to make the clusters and do some resorting to make the similar color blocks to stay closer. Then the complexheatmap package take my cluster results to make the plot. So you will find the heatmap is different to directly use the package to make the plot. Best,

danphillips28 commented 3 years ago

Hi Wenbin,

Thanks for the quick reply. Regarding the filtering, I did also include my results from filtering on CPM as well as log2 CPM, which also didn't work. But I have since realised the issue. I was filtering at the gene level rather than at the isoform level (i.e. I can replicate 3D isoform numbers using the my code above). Silly me. I am however a bit stuck on how to use these to filter for genes with >=1 isoforms with CPM >=1 in >=4 samples, as in 3D. My only thought was to use their transcript IDs to filter the mapping file and then reimport with tximport, but this would change the lengthScaledTPM corrections. Any hints?

After seeing how easy it is to (partly) incorporate IsoformSwitchAnalyzeR into the workflow, I also became curious if you and Runxuan have discussed plans to expand 3D to include more tools?

Thanks again! Daniel

wyguo commented 3 years ago

Hi Daniel, Yes, we are applying fundings to make improvement and add more functions to the 3D App.

In the 3D RNA-seq App, the low expression filter is based on transcript level. A gene is assumed to be expressed if one of the transcripts is expressend. You are right. You can use the transcript-gene mapping table. You just need to subset the table with the remaining transcripts. The genes in the sub-table are the expressed genes. Please remember to remove the duplicated genes when you count the number of expressed genes. Best, Wenbin

danphillips28 commented 3 years ago

Thanks Wenbin, looking forward to seeing what's next!