ycphs / openxlsx

openxlsx - a fast way to read and write complex xslx files
https://ycphs.github.io/openxlsx/
Other
220 stars 73 forks source link

Add ability to update a sheet in workbook without having to read the whole workbook #430

Open janlimbeck opened 1 year ago

janlimbeck commented 1 year ago

Is your feature request related to a problem? Please describe.

When I want to update one sheet in an existing and potentially quite big notebook I currently have to do something like the below:

workBook <- loadWorkbook(workbookPath)
writeData(workBook, sheet = sheetName, x = dataFrame)
saveWorkbook(workBook, workbookPath, overwrite = TRUE)

This can take several minutes for the notebooks I am handling.

Describe the solution you'd like It would be great if e.g. write.xlsx could be adapted such that it can update a sheet in an existing workbook without having to load/parse the complete notebook. Additionally I have observed that parsing/writing back the unchanged sheets can sometime still lead to breaking changes for more complicated sheets. Hence it would be great if those were not touched at all both from a performance and compatibility perspective.

Describe alternatives you've considered

In principle the approach with loadWorkbook, writeData, saveWorkbook works but it is slow and occasionally tends to alter content.

JanMarvin commented 1 year ago

Hi @janlimbeck , this is a reasonable user requirement, but it is not implemented in this project, nor is it provided for in any other project I know of. openxlsx imports the entire workbook and imports styles, drawings and so on for a reason. To allow changes and access to the whole workbook. Still, what you want might be possible, see below, but if the entire world economy collapses because of you simply following guides from strangers on the internet, just to save yourself a few minutes, I point you to the MIT license: WITHOUT WARRANTY OF ANY KIND. No, joking aside, make a few copies and check everything extensively (Oh, and use a database if saving takes a long time, databases were invented for a reason ...).

The following should come close to what you want: 1) It unzips an existing workbook (old) 2) It loads only a single sheet from old in openxlsx2 and saves the output (new) 3) New is unzipped and sheet1.xml from old is overwritten with new 4) The unzipped old with the replaced sheet from new is zipped again

This should work as long as you only have data on the sheet (hyperlinks, drawings, pivot tables and other references might differ, in old and new. Look for any r:id= part in the xml files. The relationship ids on the new sheet need to match the old sheet you are going to replace). Of course references on other sheets, formulas in this workbook etc. will all be broken if the file contains a calcChain.xml file, but at least broken in an interesting way (hint: you have to manually click into every cell to repair the calculation chain).

library(openxlsx2)

### create a directory
tmp <- paste0(tempdir(), "/unziptest")
# unlink(tmp, recursive = TRUE)
dir.create(tmp)
setwd(tmp)

tmp_old <- paste0(tmp, "/old_xlsx")
tmp_new <- paste0(tmp, "/new_xlsx")

xlsx_old <- "readTest.xlsx"
xlsx_new <- "Sheet1.xlsx"

xlsx_fin <- paste0(tmp, "/Updated.xlsx")

# download the old workbook
fl <- "https://github.com/JanMarvin/openxlsx-data/raw/main/readTest.xlsx"
curl::curl_download(fl, xlsx_old)

# extract the old workbook
unlink(tmp_old, recursive = TRUE)
unzip(xlsx_old, exdir = tmp_old)

# modify the new workbook
wb <- wb_load(xlsx_old, sheet = 1, data_only = TRUE)
wb$add_data(sheet = 1, dims = "A14", x = mtcars)
wb$save(xlsx_new)

# extract the new workbook
unlink(tmp_new, recursive = TRUE)
unzip(xlsx_new, exdir = tmp_new)

# replace a sheet from the old workbook with the new workbook
sheet_xml <- "/xl/worksheets/sheet1.xml"
old <- paste0(tmp_old, sheet_xml)
new <- paste0(tmp_new, sheet_xml)

file.copy(new, old, overwrite = TRUE)

# create the output
zip::zip(
  zipfile = xlsx_fin,
  files = list.files(tmp_old, full.names = FALSE),
  recurse = TRUE, 
  compression_level = 6,
  include_directories = FALSE,
  root = "old_xlsx",
  mode = "cherry-pick"
)

xl_open(xlsx_fin)
janlimbeck commented 1 year ago

@JanMarvin Many thanks for this! I will take a good look and see if anything breaks in my scenario. I see you are the author of openxlsx2. Any chance a feature like this may find its way - in a less hacky way - into your library?

JanMarvin commented 1 year ago

No, I'm not really interested in this feature, keeping track of references in a workbook is hard enough. Keeping track of references in the input file and the output file really seems to cause major headaches. But this is the openxlsx issue tracker and I had a spare moment to explain why it is not possible in openxlsx and provide a small example of what could be done instead. PS: You should probably copy styles.xml from new (if you're writing anything other than plain text and numbers) and certainly delete calcChain.xml in the old folder and reference it in [Content_Types].xml. Keeping the calculation string really tends to brick the output file.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

janlimbeck commented 1 month ago

Hi @janlimbeck , this is a reasonable user requirement, but it is not implemented in this project, nor is it provided for in any other project I know of. openxlsx imports the entire workbook and imports styles, drawings and so on for a reason. To allow changes and access to the whole workbook. Still, what you want might be possible, see below, but if the entire world economy collapses because of you simply following guides from strangers on the internet, just to save yourself a few minutes, I point you to the MIT license: WITHOUT WARRANTY OF ANY KIND. No, joking aside, make a few copies and check everything extensively (Oh, and use a database if saving takes a long time, databases were invented for a reason ...).

The following should come close to what you want:

  1. It unzips an existing workbook (old)
  2. It loads only a single sheet from old in openxlsx2 and saves the output (new)
  3. New is unzipped and sheet1.xml from old is overwritten with new
  4. The unzipped old with the replaced sheet from new is zipped again

This should work as long as you only have data on the sheet (hyperlinks, drawings, pivot tables and other references might differ, in old and new. Look for any r:id= part in the xml files. The relationship ids on the new sheet need to match the old sheet you are going to replace). Of course references on other sheets, formulas in this workbook etc. will all be broken if the file contains a calcChain.xml file, but at least broken in an interesting way (hint: you have to manually click into every cell to repair the calculation chain).

library(openxlsx2)

### create a directory
tmp <- paste0(tempdir(), "/unziptest")
# unlink(tmp, recursive = TRUE)
dir.create(tmp)
setwd(tmp)

tmp_old <- paste0(tmp, "/old_xlsx")
tmp_new <- paste0(tmp, "/new_xlsx")

xlsx_old <- "readTest.xlsx"
xlsx_new <- "Sheet1.xlsx"

xlsx_fin <- paste0(tmp, "/Updated.xlsx")

# download the old workbook
fl <- "https://github.com/JanMarvin/openxlsx-data/raw/main/readTest.xlsx"
curl::curl_download(fl, xlsx_old)

# extract the old workbook
unlink(tmp_old, recursive = TRUE)
unzip(xlsx_old, exdir = tmp_old)

# modify the new workbook
wb <- wb_load(xlsx_old, sheet = 1, data_only = TRUE)
wb$add_data(sheet = 1, dims = "A14", x = mtcars)
wb$save(xlsx_new)

# extract the new workbook
unlink(tmp_new, recursive = TRUE)
unzip(xlsx_new, exdir = tmp_new)

# replace a sheet from the old workbook with the new workbook
sheet_xml <- "/xl/worksheets/sheet1.xml"
old <- paste0(tmp_old, sheet_xml)
new <- paste0(tmp_new, sheet_xml)

file.copy(new, old, overwrite = TRUE)

# create the output
zip::zip(
  zipfile = xlsx_fin,
  files = list.files(tmp_old, full.names = FALSE),
  recurse = TRUE, 
  compression_level = 6,
  include_directories = FALSE,
  root = "old_xlsx",
  mode = "cherry-pick"
)

xl_open(xlsx_fin)

I have implemented the workaround described above but would still be curious if somebody has started working on a native solution.