yueqiw / shiny_cell_browser

Shiny browser for single cell RNAseq data
9 stars 3 forks source link

Using seurat v3 dataset - does not work with embedded reductions having dimension not equal to 2 #9

Closed gringer closed 3 years ago

gringer commented 3 years ago

I'm getting an error when loading my dataset into the shiny cell browser. I've created this issue as a record of my attempts to fix the problem(s). I'm concentrating on the labelling methods here, because that's my current understanding around the cause of this issue.

Container log:
2020-10-06 01:59:08 INFO::loading data...
Warning in mapvalues(as.integer(assign_clust[, 2]), from = 1:length(colors),  :
  NAs introduced by coercion
The following `from` values were not present in `x`: 1
Registered S3 method overwritten by 'cli':
  method     from    
  print.boxx spatstat
Warning: Error in : Column 5 must be named.
  80: <Anonymous>
Error : Column 5 must be named.
2020-10-21 01:50:48 INFO::loading data...
Warning in mapvalues(as.integer(assign_clust[, 2]), from = 1:length(colors),  :
  NAs introduced by coercion
The following `from` values were not present in `x`: 1
Warning: Error in : Column 5 must be named.
  80: <Anonymous>
Error : Column 5 must be named.

The dataset was created using the following command:

saveRDS(hc.no11, file = "/mnt/BigBirdStornext/R_data/RStudioProjects/Seurat/JACI/JACI_10k_400epochs_no11.rds");

The resulting file is about 1.2 GB in size. It contains preferred cell labels as a feature with the name "OL_clusters", which is a naming of clusters/Idents created using the following command:

hc.all %>%                                                                                                                                                 
    FindNeighbors(dims=1:50) %>%                                                                                                                           
    FindClusters(resolution=0.8) -> hc.all                                                                                                                 

The difference between hc.all and hc.no11 is that hc.no11 doesn't include cluster 11, which was found to have odd expression profiles and excess spreading in the UMAP plot:

hc.no11 <- subset(hc.all, idents="11", invert=TRUE);

An Excel table was used to match clusters with their preferred labels:

library(readxl);                                                                                                                                        

read_excel("Cluster_identification.xlsx") %>%                                                                                                           
    mutate(clusterID=sub("^.", "", `Cluster Number`),                                                                                                   
           label=paste(Group, Number, sep="_")) %>%                                                                                                     
    mutate(label=sub("_NA$", "", label)) -> clusters.tbl;                                                                                               

hc.no11[["OL_clusters"]] <-                                                                                                                             
    clusters.tbl$label[match(Idents(hc.no11), clusters.tbl$clusterID)];                                                                                 
gringer commented 3 years ago

Problem 1: I had the wrong ID in my JSON configuration file. The cluster config setting was set to res.1, but it should have been OL_clusters.

This changed the output error:

Warning in mapvalues(as.integer(assign_clust[, 2]), from = 1:length(colors),  :
  NAs introduced by coercion
The following `from` values were not present in `x`: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
Warning: Error in : Column 5 must be named.
  80: <Anonymous>
Error : Column 5 must be named.
gringer commented 3 years ago

Hmm... looks like the code is expecting numeric cluster names.

Changing the code from this:

colorVec = mapvalues(as.integer(assign_clust[, 2]), from = 1:length(colors), to = toupper(colors))

to this:

colorVec = mapvalues(assign_clust[, 2], from = unique(assign_clust[, 2]), to = toupper(colors))

Seemed to fix the NA issue, although I suspect that there's an easier/quicker way via factor / levels. The Column 5 issue still remains, though:

2020-10-21 04:39:24 INFO::loading data...
Warning: Error in : Column 5 must be named.
  80: <Anonymous>
Error : Column 5 must be named.
gringer commented 3 years ago

The code is also expecting the dimensional reduction to have two dimensions... and I did a three-dimensional UMAP reduction.

Changing the code from this:

df_plot = cbind(full_embedding, assign_clust[, 2], colorVec)

to this:

df_plot = cbind(full_embedding[,1:2], assign_clust[, 2], colorVec)

Has fixed up that issue, which gets past the error that started this issue!

yueqiw commented 3 years ago
  1. The dimension issue: right now only 2D visualization is supported. The best solution is probably to run a 2-dimensional UMAP for the purpose of visualization (otherwise the information along the 3rd dimension will not be shown on the plot).

  2. About the mapvalues issue, I think your change looks good. Basically, the from parameter should be the unique values of the original array. Somehow the code worked without problem for Seurat 2 due to the way they set Idents.

  3. I'm not sure what caused the Column 5 error. Hope you find a solution!

gringer commented 3 years ago

The Column 5 error was due to the dimensionality of the reduction. Taking the first two dimensions fixed this error.

yueqiw commented 3 years ago

@gringer were you able to get the app up and running? Let me know if there are other problems.

gringer commented 3 years ago

Yes, I've got it working, thanks. My other issues were due to not properly reading the instructions about the input / config file format. I discovered that both the "diff_ex_cluster" and "diff_ex_file" parameters were mandatory, and that creating the differential expression file wasn't too taxing. Here's the code I used, just in case it's useful / helpful:

Idents(hc.no11) <- hc.no11[["OL_clusters"]]; # this is the ident I specified in the config file

map_dfr(as.character(unique(Idents(hc.no11))), function(x){
    cat(x,"\n");
    FindMarkers(hc.no11, ident.1=x) %>%
        rownames_to_column("gene") %>%
        mutate(cluster=x);
}) -> marker.DE.tbl;

write_csv(marker.DE.tbl, "/mnt/BigBirdStornext/R_data/RStudioProjects/Seurat/JACI/DEtable_10k_no11.csv.gz");

And my config file, for completeness:

{
    "data": [
        {
            "name": "JACI SCT_10k_400epochs_noCl11",
            "file": "/app/data/JACI/JACI_10k_400epochs_no11.rds",
            "cluster": "OL_clusters",
            "embedding": "umap",
            "diff_ex_cluster": "T-cell_1",
            "diff_ex_file": "/app/data/JACI/DEtable_10k_no11.csv.gz",
            "pt_size": 2,
            "font_scale": 0.75
        }
    ],
    "config": {
        "ui_title": "Single Cell Browser",
        "title_link_text": "FR Group",
        "title_link_url": "http://www.malaghan.org.nz"
    }
}
yueqiw commented 3 years ago

Glad to hear it's working!

I'll think about improving the instructions for the differential expression file (maybe by showing an example table or something).