ycphs / openxlsx

openxlsx - a fast way to read and write complex xlsx files
https://ycphs.github.io/openxlsx/

Encodings in Column Names #454

Open clemenskuehn opened 11 months ago

clemenskuehn commented 11 months ago

Summary

When using read.xlsx() on an xlsx file whose column names partly contain non-ASCII (UTF-8) characters, the column names in the resulting data.frame end up with different encodings as well.

This can cause errors downstream, e.g. in data.table (see below).

To Reproduce

Create an xlsx file with unusual column names, e.g. three columns containing something like this:

the_good | the_bäd | the_ugly
1 | 4 | 7
2 | 5 | 8
3 | 6 | 9

The following code illustrates the problem (note that the mixed encoding in the column names also exists when not using as.data.table):

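library(openxlsx)
library(data.table)  # needed for as.data.table() and the [, j, by] syntax
library(stringi)

testo <- as.data.table(read.xlsx("Test.xlsx"))

# Sum a column with an ASCII-only name, then the column with the non-ASCII name
testo[, sum(the_good)]
testo[, sum(the_bäd)]

# The same two sums, grouped by the_ugly
testo[, sum(the_good), by = the_ugly]
testo[, sum(the_bäd), by = the_ugly]

# Show the encoding mark of each column name
stri_enc_mark(names(testo))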
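The encoding marks can also be checked with base R alone. The following is only a minimal sketch: `col_names` is a hypothetical stand-in for `names(testo)`, built by hand so that it runs without Test.xlsx, and it is not claimed to reproduce exactly what read.xlsx() returns. It relies on the documented behaviour of `Encoding()` that ASCII-only strings never carry a declared encoding, so a mix of marked and unmarked names may remain even after conversion with `enc2utf8()`.

```r
# Hypothetical stand-in for names(testo); "\u00e4" is the character "ä"
col_names <- c("the_good", "the_b\u00e4d", "the_ugly")

# Encoding() reports the declared encoding mark of each string.
# ASCII-only strings are never marked and are reported as "unknown".
marks <- Encoding(col_names)
print(marks)

# enc2utf8() converts every element to UTF-8; ASCII-only elements stay unmarked
utf8_names <- enc2utf8(col_names)
print(Encoding(utf8_names))

# All names are valid UTF-8 after the conversion, whatever their mark says
print(validUTF8(utf8_names))
```

If the downstream problem comes from names stored in a non-UTF-8 native encoding, applying `enc2utf8()` to the names of the imported table may serve as a workaround; this is an assumption and has not been verified against the file described here.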

Expected behavior

I would expect the problem to go away when all column names have the same encoding.

Additional context

If you think this is rather a problem of the data.table package, let me know. Although the problem itself is quite exotic, I would expect column names to have the same encoding throughout an imported table.