Summary
When using read.xlsx() on an xlsx file whose column names partly contain non-ASCII UTF-8 characters, the column names in the resulting data.frame end up with mixed encodings.
This can cause errors further downstream, e.g. in data.table, see below.
To Reproduce
Create an xlsx file with funny column names, e.g. three columns that contain something like this:
the_good | the_bäd | the_ugly
1 | 4 | 7
2 | 5 | 8
3 | 6 | 9
The following code illustrates the problem (note that the mixed encoding in the column names also exists when not using as.data.table):
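library(openxlsx)
library(data.table)  # needed for as.data.table()
library(stringi)     # needed for stri_enc_mark()

testo <- as.data.table(read.xlsx("Test.xlsx"))

# Sums without grouping
testo[, sum(the_good)]
testo[, sum(the_bäd)]

# Sums grouped by the_ugly
testo[, sum(the_good), by = the_ugly]
testo[, sum(the_bäd), by = the_ugly]

# Declared encoding of each column name
stri_enc_mark(names(testo))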
Expected behavior
I would expect the problem to go away when all column names have the same encoding.
Additional context
If you think this is rather a problem of the data.table package, let me know. Although the problem itself is quite exotic, I would expect column names to have the same encoding throughout an imported table.
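For anyone hitting the same thing, here is a rough base-R sketch of a possible workaround. It rests on an assumption that the report does not confirm: that the affected names hold valid UTF-8 bytes but carry no encoding mark. The helper `mark_names_utf8()` is hypothetical and not part of openxlsx or data.table, and the example data frame only simulates the situation instead of reading Test.xlsx.

```r
# Hypothetical helper (not part of openxlsx or data.table):
# declare unmarked column names as UTF-8 when their bytes are valid UTF-8.
mark_names_utf8 <- function(x) {
  nms <- names(x)
  # Names without an encoding mark whose bytes form valid UTF-8
  unmarked <- Encoding(nms) == "unknown" & validUTF8(nms)
  # Only the mark changes, the bytes stay as they are;
  # pure ASCII names are unaffected because R never marks them
  Encoding(nms[unmarked]) <- "UTF-8"
  names(x) <- nms
  x
}

# Simulated input: "the_bäd" as raw UTF-8 bytes, which rawToChar()
# returns without an encoding mark (assumption about what the import yields)
bad <- rawToChar(as.raw(c(0x74, 0x68, 0x65, 0x5f, 0x62, 0xc3, 0xa4, 0x64)))
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(df) <- c("the_good", bad, "the_ugly")

fixed <- mark_names_utf8(df)
Encoding(names(fixed))  # "unknown" "UTF-8" "unknown"
```

With a data.table, the same vector of names could be applied in place via `setnames(testo, ...)`. If the names turn out to be marked with a different encoding, e.g. latin1, re-marking would be the wrong fix and converting with `enc2utf8()` would be the thing to try instead.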