ycphs / openxlsx

openxlsx - a fast way to read and write complex xslx files
https://ycphs.github.io/openxlsx/
Other
220 stars 74 forks source link

`saveWorkbook` produces invalid XML when manipulating libreoffice based XLSX files #403

Closed retostauffer closed 6 months ago

retostauffer commented 1 year ago

The issue:

I am using loadWorkbook(), writeData() and saveWorkbook() to modify an existing XLSX file, originally created with libreoffice on linux.

The first time (manipulating the XLSX file once with openxlsx) works well, the second time the XLSX gets corrupted. openxlsx::read.xlsx is still able to read the data, however, libreoffice only shows empty sheets, and readxl::read_excel refused to read the data (Error: unexpected end of data).

I may have found the problem (unclosed XML tag) as shown in the last part of this issue. Note that this error does not occur when I start with an XLSX file created (saved) by/via MS Excel.

System information:

Minimal:

I have put together a small minimal which uses a file test_libreoffice.xlsx, a simple XLSX file with two sheets. This XLSX file has been created using LibreOffice 7.3.7.2 30(Build:2) on an Ubuntu Ubuntu 22.04.1 LTS.

The R script below does the following:

  1. Tests if `"test_libreoffice.xlsx" is readable
  2. Manipulates "test_libreoffice.xlsx", stores modified version as "test1.xlsx"
  3. Tests readability of "test1.xlsx"
  4. Manipulates "test1.xlsx", stores modified version as "test2.xlsx"
  5. Tests readability of "test2.xlsx"

Testing readability (test_read()) tries to read the first sheet with both, openxlsx and readxl (testing). On both, my linux system as well as on my Windows 10 it fails in (5) where readxl::read_excel() is no longer to read the data. At this point, libreoffice will also only show empty sheets (no data).

library("readxl")
library("openxlsx")

# Reading sheet 1 with both openxlsx as well as readxl
test_read <- function(file) {
    cat("Testing readability of file \"", file, "\"\n")
    print(read.xlsx(file, sheet = 1))
    print(dim(read_excel(file, sheet = 1)))
}

# Small function which overwrites the content of the first sheet
modify_file <- function(src, dst) {
    # New (static) data set
    x  <- data.frame(a = 1:5, b = LETTERS[5:1], z = 5:1 + 100)
    # Loading, manipulating, and saving workbook
    wb <- loadWorkbook(src)
    writeData(wb, x, sheet = 1, startRow = 2, startCol = 1, colNames = FALSE)
    check <- saveWorkbook(wb, file = dst, overwrite = TRUE, returnValue = TRUE)
    stopifnot("problems saving workbook" = check)
}

# (1) Testing if original file is readable
test_read("test_libreoffice.xlsx")

# (2) Modify the file a first time, store it as test1.xlsx
modify_file("test_libreoffice.xlsx", "test1.xlsx")
# (3) Testing if still readable
test_read("test1.xlsx")

# (4) Modify the file a second time, store it as test2.xlsx
modify_file("test1.xlsx", "test2.xlsx")
# (5) Testing if still readable
test_read("test2.xlsx") # <- here is where the error occurs

Test files:

I am attaching test1.xlsx and test2.xlsx as created by the minimal above on my machine.

Cause of the problem:

I have been looking into the XLSX files to se where the two files "test1.xlsx" (which works as expected) and "test2.xlsx" (which fails) differ and found a problem in the XML definition of xl/worksheets/sheet1.xml (as well as xl/worksheets/sheet2.xml); a <sheetPr> tag that is not properly closed. Namely in this part:

# XML contains
<sheetPr><sheetPr filterMode="false"><pageSetUpPr fitToPage="false"/></sheetPr>
# I assume it must be this (works)                                             ! ------ !
<sheetPr><sheetPr filterMode="false"><pageSetUpPr fitToPage="false"/></sheetPr></sheetPr>

I've manually corrected this XML (in "test2.xlsx") which seem to resolve the problem. I can't attach the XML files here but they are located in xl/worksheets/ in both the file "test1.xlsx" and "test2.xlsx" (simply unzip).

I don't think this issue is related to #133 (#74, #81), hope opening this one is OK.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 7 days with no activity.