ycphs / openxlsx

openxlsx - a fast way to read and write complex xlsx files
https://ycphs.github.io/openxlsx/

Merging cells scales badly #503

Open vpetzel opened 1 month ago

vpetzel commented 1 month ago

Using mergeCells can be very slow when doing many merges. The cause is the check for overlapping merges: openxlsx extracts every affected cell from all previous merges and checks whether any of them fall inside the cells affected by the new merge. All of this is done in a very expensive manner, since merges are stored as XML strings and therefore have to be parsed back to integers on every call. The intersection check itself is also expensive: openxlsx materialises a vector of all positions inside each rectangle and then compares these vectors member by member.

Even apart from the parsing overhead, this method has complexity Θ(n) for adding a new merge to n existing merges (assuming merges of bounded size), and thus complexity Θ(n²) for adding n merges.
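
To spell out the last step: the k-th merge is checked against the k − 1 merges already stored, so inserting n merges costs

$$\sum_{k=1}^{n} \Theta(k) = \Theta\Big(\frac{n(n-1)}{2}\Big) = \Theta(n^2).$$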

The following code showcases the issue:

library(stringi)

M1 <- character()
M2 <- data.frame(
    r1 = integer(), r2 = integer(), c1 = integer(), c2 = integer()
)
M3 <- character()

# Version 1: mimics the current openxlsx implementation. Merges live as XML
# strings in M1; every insert parses them back out, expands both rectangles
# to full cell lists, and intersects those lists.
add_merge1 <- function(r1, r2, c1, c2) {
    sqref <- openxlsx::getCellRefs(data.frame(x = c(r1, r2), 
        y = c(c1, c2)))
    exMerges <- regmatches(M1, regexpr("[A-Z0-9]+:[A-Z0-9]+", M1))
    if (length(exMerges) > 0) {  # regmatches() returns character(0), never NULL
        comps <- lapply(exMerges, function(rectCoords) {
            unlist(strsplit(rectCoords, split = ":"))
        })
        exMergedCells <- openxlsx:::build_cell_merges(comps = comps)
        newMerge <- unlist(openxlsx:::build_cell_merges(comps = list(sqref)))
        mergeIntersections <- sapply(exMergedCells, function(x) {
            any(x %in% newMerge)
        })
        if (any(mergeIntersections)) {
            stop(sprintf("Merge intersects with existing merged cells: \n\t\t%s.\nRemove existing merge first.", 
                stri_join(exMerges[mergeIntersections], collapse = "\n\t\t")))
        }
    }
    M1 <<- c(
        M1, 
        sprintf("<mergeCell ref=\"%s\"/>",
                stri_join(sqref, collapse = ":", sep = " "))
    )
}

# Version 2: alternative structure. Merges are stored as integer corner
# coordinates in a data frame (M2); overlap is decided by a rectangle test
# on the corners alone.
add_merge2 <- function(r1, r2, c1, c2) {
    # normalise corners so that r1 <= r2 and c1 <= c2
    if (r2 < r1) {
        .h <- r2
        r2 <- r1
        r1 <- .h
    }
    if (c2 < c1) {
        .h <- c2
        c2 <- c1
        c1 <- .h
    }

    # rectangles overlap iff they overlap in both the row and the column range
    intersections <- M2[
        ((M2$r1 <= r2) &
         (M2$r2 >= r1) &
         (M2$c1 <= c2) &
         (M2$c2 >= c1)),
    ]

    if (nrow(intersections) > 0) {
        stop(sprintf(
            "Merge intersects with existing merged cells: \n\t\t%s.\nRemove existing merge first.",
            paste(
                openxlsx::getCellRefs(intersections[c("r1", "c1")]),
                openxlsx::getCellRefs(intersections[c("r2", "c2")]),
                sep = ":",
                collapse = "\n\t\t"  # collapse belongs to paste(); it was passed to sprintf()
            )
        ))
    }

    M2 <<- rbind(
        M2, 
        data.frame(r1 = r1, r2 = r2, c1 = c1, c2 = c2)
    )
}

# Version 3: no intersection check at all; the merge string is appended as-is.
add_merge3 <- function(r1, r2, c1, c2) {
    sqref <- openxlsx::getCellRefs(data.frame(x = c(r1, r2), 
        y = c(c1, c2)))
    M3 <<- c(
        M3, 
        sprintf("<mergeCell ref=\"%s\"/>",
                stri_join(sqref, collapse = ":", sep = " "))
    )
}

take_time <- function(expr) {
    a <- Sys.time()
    force(expr)
    b <- Sys.time()
    as.numeric(b) - as.numeric(a)
}

# Time fun(x[i]) for each element of x; stop once the cumulative time exceeds
# `limit` seconds (remaining entries stay NA).
time_each <- function(x, fun, limit = 10, ...) {
    res <- rep(NA_real_, length(x))
    ttime <- 0
    for (i in seq_along(x)) {
        res[i] <- take_time(fun(x[i]))
        ttime <- ttime + res[i]
        if (ttime > limit) break
    }
    res
}

M <- c(2, 10, 1000)
N <- 3000

pdat <- dplyr::bind_rows(lapply(
    M,
    function(M) {
        M1 <<- character()
        M2 <<- data.frame(
            r1 = integer(), r2 = integer(), c1 = integer(), c2 = integer()
        )
        M3 <<- character()

        timing1 <- time_each(
            seq_len(N),
            function(i) {
                add_merge1(i, i, 1, M)
            }
        )

        timing2 <- time_each(
            seq_len(N),
            function(i) {
                add_merge2(i, i, 1, M)
            }
        )

        timing3 <- time_each(
            seq_len(N),
            function(i) {
                add_merge3(i, i, 1, M)
            }
        )

        data.frame(
            n = seq_along(timing1),
            t1 = cumsum(timing1),
            t2 = cumsum(timing2),
            t3 = cumsum(timing3),
            M = M
        )
    }
))

ggplot2::ggplot(pdat, ggplot2::aes(n)) +
    ggplot2::geom_point(ggplot2::aes(y = t1, color = "current")) +
    ggplot2::geom_point(ggplot2::aes(y = t2, color = "alternative")) +
    ggplot2::geom_point(ggplot2::aes(y = t3, color = "no_check")) +
    ggplot2::xlab("# Merges") +
    ggplot2::ylab("Total Time in Seconds") +
    ggplot2::theme_light() +
    ggplot2::facet_wrap(~ M)

[Plot: cumulative insertion time against the number of merges for the three variants, facetted by merge width M.]

The plot compares the current behaviour with an alternative that avoids all of the expensive string-conversion overhead (formatting would only need to happen when writing the xlsx file): merges are stored in a data frame of corner coordinates, and intersections are determined from the corner points alone. The third version does not check for intersections at all.

We see that the current implementation is really expensive and gets worse as the area of the merges increases. The alternative structure is still quadratic, but much faster, and its cost does not depend on the merge area. Not doing the check at all is of course very fast (and could be even faster if done vectorized). Extrapolating: even with an area of only 2 cells, the current implementation would need about 13 minutes for 10,000 merges and about 6 hours for 50,000 (over 5 days for an area of 1,000 cells), while the alternative would take about 2 minutes and the version without checks about 40 seconds (a vectorized version would take about 0.3 seconds).
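
To illustrate the vectorized figure: when all rectangles are known up front, as in this benchmark, the no-check variant collapses to a single vectorized call. A minimal sketch (the helper name is mine, not part of openxlsx):

add_merges_vectorized <- function(r1, r2, c1, c2) {
    # build every <mergeCell> string in one pass instead of one per call
    sprintf(
        "<mergeCell ref=\"%s:%s\"/>",
        openxlsx::getCellRefs(data.frame(r1, c1)),
        openxlsx::getCellRefs(data.frame(r2, c2))
    )
}

# e.g. all N width-2 row merges from the benchmark in one call:
M3 <- add_merges_vectorized(seq_len(N), seq_len(N), 1, 2)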

So I suggest adding a flag, say check_overlap = TRUE, that allows skipping the check. When set to FALSE it would be the user's responsibility to make sure the inputs are sane, in exchange for a large gain in performance.
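
For concreteness, usage could then look like the following (check_overlap is the hypothetical flag proposed above; createWorkbook, addWorksheet and mergeCells are the existing openxlsx API):

library(openxlsx)

wb <- createWorkbook()
addWorksheet(wb, "Sheet1")
for (i in seq_len(10000)) {
    # the caller guarantees non-overlapping regions, so the scan can be skipped
    mergeCells(wb, "Sheet1", cols = 1:2, rows = i, check_overlap = FALSE)  # hypothetical flag
}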

In the long term, switching from the current flow to one that does not require as many string transformations would also significantly speed up regular use.
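
As a sketch of that direction (the helper name is mine): keep merges as integer corner coordinates, exactly like M2 above, and render the XML in one vectorized pass only when the workbook is written:

render_merges_xml <- function(merges) {
    # merges: data.frame with columns r1, r2, c1, c2, as built by add_merge2
    sprintf(
        "<mergeCell ref=\"%s:%s\"/>",
        openxlsx::getCellRefs(merges[c("r1", "c1")]),
        openxlsx::getCellRefs(merges[c("r2", "c2")])
    )
}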

JanMarvin commented 1 month ago

Hi @vpetzel, if you want to, please add a PR with a flag that allows disabling the check. I’m not sure why anyone would want to add 10,000+ merges to a spreadsheet, but if we can step out of the way for this rare case, we can allow the user to disable the check.