zeehio / ggmatrix

ggplot extension for matrices

Can you provide a normal geom_* function which can inherit data from ggplot2? #2

Open Yunuuuu opened 2 weeks ago

zeehio commented 1 week ago

I don't understand what you mean. Could you provide some sample code showing how you would expect this to work?

Yunuuuu commented 1 week ago

Thank you for your response. The current geom_matrix_raster requires a manually provided matrix and cannot inherit data directly from the ggplot function. Suppose I already have a long-formatted data frame, data:

data <- matrix(rnorm(100L), nrow = 10L)
data <- reshape2::melt(data, varnames = c("y", "x"))
library(ggplot2)
library(ggmatrix)
ggplot(data) +
  geom_matrix_raster(aes(x, y, fill = value)) # this will fail

Yunuuuu commented 1 week ago

Based on my testing, ggmatrix is the fastest option, and I’m considering integrating it into my package ggalign. The package already supports drawing complex heatmaps using ggplot2 syntax. If ggmatrix could work with inherited data, it would allow for even tighter integration with ggalign.

I’ve also tested ggalign (which internally uses geom_raster or geom_tile depending on the number of heatmap cells) and compared its performance against other heatmap packages, as seen in this benchmark: https://yunuuuu.github.io/ggalign/articles/performance.html.

A minimal example of how ggalign works:

library(ggalign)
set.seed(123)
small_mat <- matrix(rnorm(81), nrow = 9)
rownames(small_mat) <- paste0("row", seq_len(nrow(small_mat)))
colnames(small_mat) <- paste0("column", seq_len(ncol(small_mat)))

# initialize the heatmap layout, we can regard it as a normal ggplot object
ggheatmap(small_mat) +
    # we can directly modify geoms, scales and other ggplot2 components
    scale_fill_viridis_c() +
    # add an annotation at the top
    hmanno("top") +
    # in the top annotation, we add a dendrogram, and split observations into 3 groups
    align_dendro(aes(color = branch), k = 3) +
    # in the dendrogram we add a point geom
    geom_point(aes(color = branch, y = y)) +
    # change color mapping for the dendrogram
    scale_color_brewer(palette = "Dark2")

zeehio commented 1 week ago

I think it is worth explaining why this package works with a matrix before making further changes.

I originally avoided the need for melt() for performance reasons:

A matrix is just a long vector with a dimension attribute. This package takes a matrix as input and places a copy in a data frame: just the values, no x and no y, in column-wise order.

Because of how R works internally, that copy doesn't actually happen, since the matrix data is never modified and R uses copy-on-write.
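Both points are easy to verify from an R session (a small illustration, not ggmatrix code):

```r
m <- matrix(1:6, nrow = 2)
attributes(m)   # only a $dim attribute distinguishes it from a plain vector
length(m)       # 6: the underlying storage is one flat vector
m[5]            # linear (column-major) indexing: same element as m[1, 3]

# copy-on-write: assigning another name does not duplicate the data...
v <- m
v[1] <- 0L      # ...the copy only happens here, when v is modified; m is untouched
```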

All the code considered how to minimize memory allocations and moves/copies, because that's where ggplot wastes time.

When you use melt(), you create a data frame with as many rows as the matrix has elements, and three columns. That is at least two new allocations as large as your matrix ($x, $y), plus a copy of the values. ggplot then goes through all that data several times (checking plot limits, looking for missing values...), which adds significant overhead.
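The size difference is easy to measure. A sketch using plain base R instead of reshape2 (same column-major layout as melt() would produce):

```r
m <- matrix(rnorm(1000 * 1000), nrow = 1000)   # 1e6 doubles, ~8 MB

# equivalent of melt(): two new index columns plus a real copy of the values
long <- data.frame(
  y = rep(seq_len(nrow(m)), times = ncol(m)),  # row index, column-major order
  x = rep(seq_len(ncol(m)), each  = nrow(m)),  # column index
  value = as.vector(m)
)

object.size(m)     # ~8 MB
object.size(long)  # ~16 MB with integer indices; ~24 MB with double indices
```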

If you want my overhead analysis, which includes many pull requests (some already merged) and probably too much detail, you can check:

https://github.com/tidyverse/ggplot2/issues/4989

The first post already shows a big difference between using geom_raster() and the combination of annotation_raster() with converting the matrix to a nativeRaster object (check out https://coolbutuseless.github.io/package/nara/index.html to learn more about native raster objects!).

Maybe you can adapt the code from the "cutting corners" strategy explained in the first post of that issue to replace the slow geom_raster() calls in your package with a faster solution.

Your benchmark uses a 1000x1000 matrix. A double takes 8 bytes, so that matrix takes 8 MB. However, your benchmark shows you are allocating more than 2 GB: a 250x overhead. The data is temporarily copied, read, and written so many times by your package, by ggplot, and by some of their dependencies that the plot takes several seconds to render.
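The arithmetic behind that overhead figure, spelled out:

```r
matrix_bytes <- 1000 * 1000 * 8   # 1e6 doubles at 8 bytes each
matrix_bytes / 1e6                # 8 MB for the whole matrix
allocated    <- 2e9               # >2 GB reported by the benchmark
allocated / matrix_bytes          # 250: every byte of input handled ~250 times over
```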

zeehio commented 1 week ago

Using your package and ggplot:

Any function using a matrix as input can benefit from knowing its shape and knowing that the values are given in column-major order. For instance, if I have to draw the values in a rectangular shape and I know where the first value is drawn and the size of each value, I can easily compute where any value of the matrix will be drawn.
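That "easily compute" claim is just index arithmetic. A sketch, where `cell_position`, `x0`, `y0`, `dx`, and `dy` are hypothetical names for the origin and cell-size parameters:

```r
# For an n-row matrix stored column-major, 0-based element k maps to:
cell_position <- function(k, n, x0 = 0, y0 = 0, dx = 1, dy = 1) {
  i <- k %% n    # 0-based row
  j <- k %/% n   # 0-based column
  c(x = x0 + j * dx, y = y0 + i * dy)
}

m <- matrix(1:12, nrow = 3)
cell_position(7, nrow(m))   # element m[8], i.e. m[2, 3] -> x = 2, y = 1
```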

At this point you have lost the rectangular-shape constraint. It is risky and complicated for any general-purpose function that accepts a data frame to assume that its rows come from a melted matrix. The data frame is treated as a set of individual observations, and each observation's position in the plot has to be calculated individually. ggplot can't assume the data frame has n*m observations corresponding to a rectangular shape, nor that those observations are sorted in any specific order.

During the ggplot_build phase:

Based on the benchmarking in the ggplot issue I linked above, I estimate that at least 30% of the plotting time is spent here.

This happens when applying the scales. I have several pull requests in both the ggplot2 and farver packages to reduce the number of conversions, but they have been awaiting review for two years. I also don't need them anymore, so finding the time to respond to a code review wouldn't be easy.

During plot rendering phase:

This conversion does not assume that all rows and columns are present, or that they come from a rectangle-shaped matrix: basically, we have to undo the melt without knowing that the data came from a melt operation.
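A sketch of what "undoing the melt" without those assumptions looks like (`unmelt` is a hypothetical helper, not the package's actual code):

```r
# Rebuild a rectangular matrix from an arbitrary long data frame.
# Rows may arrive in any order and cells may be missing (filled with NA).
unmelt <- function(df) {
  xs  <- sort(unique(df$x))
  ys  <- sort(unique(df$y))
  out <- matrix(NA_real_, nrow = length(ys), ncol = length(xs))
  out[cbind(match(df$y, ys), match(df$x, xs))] <- df$value
  out
}

long <- data.frame(x = c(1, 2, 2), y = c(1, 1, 2), value = c(0.1, 0.2, 0.3))
unmelt(long)   # 2 x 2 matrix; the unobserved cell (2, 1) stays NA
```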

I wish we could skip the character colour representation and go straight to the native colour format.

Summary

I can't take a data frame as input and preserve performance, because accepting a data frame implies going through all of ggplot's positional scale mapping.

The positional scale mapping is useful, for instance, to set plot limits correctly. I assume in my package that you want to draw the whole matrix, so I skip all position calculations in ggplot. If you wanted to plot part of the matrix, I would expect you to give me only that subset as input, because I can't use ggplot to do the filtering and keep the current performance.

In your package you may be able to make assumptions about the data frame that you pass to ggplot, since your package is also responsible for melting the matrix, I believe.

You can take a copy of my .R file, put it in your package, and modify it so it fits your needs and benefits from your data-frame assumptions.

Yunuuuu commented 1 week ago

Thank you for your detailed explanation. I’ll take a look at your .R file and adapt it to align with these assumptions. This should help maintain performance while fitting the needs of my package.