ycphs / openxlsx

openxlsx - a fast way to read and write complex xslx files
https://ycphs.github.io/openxlsx/
Other
226 stars 75 forks source link

Convert a range to a number #326

Closed klar-C closed 1 year ago

klar-C commented 2 years ago

Hi -

I put this under bug because it is not really a feature request.

Is there any way to convert a range to a number in Excel?

Use case is that I have a long data frame which has certain rows with text - the other rows are numbers (stored as character). When writing the dataframe to Excel the numbers get interpreted as text and I need to convert them back within Excel. A solution would be to write the df piecewise, however that seems to be way slower than writing the df as a whole.

Any thoughts would be appreciated.

JanMarvin commented 2 years ago

Hi @klar-C , I assume a minimal example would be this? (You were asked to provide one ...)

x <- data.frame(x = "1", y = "text")
write.xlsx(x, "test.xlsx", TRUE)

In this case: that is no bug, you tell openxlsx that the input is of type character and it is written as such. Silently changing the input type would be a bug. You can change the column type to numeric or integer and in such case it will be written as number to the xlsx output.

klar-C commented 2 years ago

An example would be like this:

x <- data.frame(x=c('1','text','2','3','text','4'))

The column x has both numbers and text in character format. When writing it to Excel the numbers are interpreted as text by Excel. What I want is to let Excel know that those are really numbers. I could write the dataframe in pieces but that's way slower for large dataframes.

Agreed it isn't really a bug but also not a feature request so wasn't sure about the label.

JanMarvin commented 2 years ago

But they are already characters in R? You would have to convert them to numeric when writing? I mean, ofc it would be possible to implement a function that does the check prior to writing (Check if identical after as.character(as.numeric()) round trip, then write it as number), after all it does not matter to xlsx, but … well I'm not gonna do that for you :-)

klar-C commented 2 years ago

Yes they are characters in R.

The problem is that I have a large df with groups of rows that need "separators" (or headers for each section). The headers are text.

e.g. Header1 1 2 Header2 3

I tried the following but it is just way too slow: Write the df (with NA_dbl as separators in between the sections) and then create a loop to write each header section separately on top of the empty rows.

The loop is what makes it really slow - both when actually doing the writedata but also once the Excel is generated with savexlsx.

Yes definitely not asking you to do any work here - I can check out the source code by myself. But if you have any idea about how to tackle the problem efficiently in some other way it would be super helpful.

JanMarvin commented 2 years ago

Well, as always, it depends. I would try to keep them separate in R. For exampe in a report where you have annual numbers in the top (header1, 1 and 2) and monthly numbers in the bottom (header2, 3): Keep them as separate dataframes, with numbers as values. Then write rows 1:3 and later rows 5:6. For me, the process that generated the character data is faulty. Not the process that writes the data to the xlsx file.

klar-C commented 2 years ago

Totally agreed - it isn't clean. And they are separate initially. I'm actually putting them together because writing them separately is much slower. Across ~15K rows I have maybe 160 group rows - and writing them adds probably 1 minute on both the initial loop for writedata and then also on savexlsx. So just trying to play around to cut this time out.

JanMarvin commented 2 years ago

Haha 15k rows, oh boy :see_no_evil: (gosh, rows, I read columns for whatever reason)

I assume that simply takes time. I mean you are writing to xml and there are still a lot of conversions from R to xml.

JanMarvin commented 2 years ago

Just to be clear, 15k rows should be no problem at all. I was just thinking about 15k columns and how one would handle such a file later on in Excel. Can you provide some timings and dimensions? Like what are we talking about? Is it taking awfully long or is it just "i can go and grab a coffee and I want it to be done once I hit the button" long?

klar-C commented 2 years ago

It is around 60 columns and as said ~15K rows. What I did now was to create one large df (with intermediate NAs-rows) and then I fill in the section headers separately in case the underlying column is numeric). Then I apply the formatting (I keep track of all distinct formats) as one additional step. The R part is around 15-30 seconds - that's totally cool. Saving down the Excel file however takes ~80 seconds.

I guess one of the things I wanted to ask is whether openxlsx has performance issues with data overwriting ranges that have been written to in a previous command (during the final stage of creating the excel file).

JanMarvin commented 2 years ago

There are definitely performance problems. For example, one problem is that the workbook is not yet finalized when you hit saveWorkbook. There are a number of cleanup functions called, and writing the xml files takes some time as well. I'll look into this, but mostly out of curiosity, I have no intention of speeding up this writer. Use the time you are writing to pause and rest :-)

klar-C commented 2 years ago

Will do :). I'll post more performance numbers in case I come across other bottlenecks.

Awesome package!

JanMarvin commented 2 years ago

I just wanted to let you know that I have been playing around with a new version and the save time has improved dramatically. Below is the time difference for a dummy dataset (15,000 x 60) with 15 groups, each containing 1000 rows, from creation to final output. However, the program is still under development and there is no release schedule yet.

Time difference of 7.673866 secs
klar-C commented 2 years ago

Okay thank you.

JanMarvin commented 2 years ago

@klar-C https://github.com/JanMarvin/openxlsx2 is now publicly available for testing. Please be aware that it is not yet stable (aka do not include it into any production pipeline yet).

klar-C commented 2 years ago

Thank you! I will check it out asap.

JanMarvin commented 2 years ago

Just for reference:

> beg <- Sys.time()
> mm <- as.data.frame(matrix(1, ncol = 60, nrow = 15000))
> # create openxlsx2 workbook
> wb <- wb_workbook()$
+   # add a worksheet: "1"
+   add_worksheet(1)$
+   # add the data
+   add_data(1, mm)$
+   # add 15 groups. 14 are collapsed
+   group_rows(1, 1:1000, TRUE)$
+   group_rows(1, 1001:2000, TRUE)$
+   group_rows(1, 2001:3000, TRUE)$
+   group_rows(1, 3001:4000, TRUE)$
+   group_rows(1, 4001:5000, TRUE)$
+   group_rows(1, 5001:6000, TRUE)$
+   group_rows(1, 6001:7000, TRUE)$
+   group_rows(1, 7001:8000, TRUE)$
+   group_rows(1, 8001:9000, TRUE)$
+   group_rows(1, 9001:10000, TRUE)$
+   group_rows(1, 10001:11000, TRUE)$
+   group_rows(1, 11001:12000, TRUE)$
+   group_rows(1, 12001:13000, TRUE)$
+   group_rows(1, 13001:14000, TRUE)$
+   group_rows(1, 14001:15000, FALSE)$
+   # save file
+   save("grouping.xlsx")
> end <- Sys.time() - beg
> end
Time difference of 3.360859 secs
klar-C commented 2 years ago

Short question:

Does the new version allow for writing existing xlsm files?

With the old version I can open/modify existing xlsx files but as soon as there's a macro in it, it breaks.

JanMarvin commented 2 years ago

haven't tested it yet tbh. for questions regarding openxlsx2 please go to the other repo. [Edit:] But it works for me :)

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 7 days with no activity.