ycphs / openxlsx

openxlsx - a fast way to read and write complex xslx files
https://ycphs.github.io/openxlsx/
Other
226 stars 75 forks source link

read.xlsx() does not return the correct number of rows #304

Closed daattali closed 2 years ago

daattali commented 2 years ago

This could be either a bug or a feature request, depending on your view.

Suppose I have an Excel sheet with the following data:

Row Value
1 a
2 b
3
4
5
6 c

If I want to read rows 2-4, I expect to get a dataframe with 3 rows. But read.xlsx(file, sheet, rows = 2:4, cols = 1, skipEmptyRows = FALSE) will return only the first row "b" because the next two rows are empty. The same thing happens if I request rows 2:5, but if I ask for rows 2:6 then I'll correctly get all rows.

I consider this a bug because I don't see this behaviour documented in read.xlsx , although perhaps it's documented elsewhere.

Ideally I'd think the behaviour should change so that you can always know the exact dimensions of the returned dataframe. IMO request 3 rows should always return 3 rows. But in order to not have a breaking change, a new parameter can be added to specify whether or not to ignore trailing missing values.

JanMarvin commented 2 years ago

Hi @daattali ,

you requested 4 3 rows and the first one is always used as column name. This is documented here: https://github.com/ycphs/openxlsx/blob/1fa10d17f500863d86984f095440ab92632162ad/R/readWorkbook.R#L12

Changing this default behavior would break a lot of existing code. What you wanted is this:

> df <- read.xlsx("~/gh_issue_304.xlsx",
+                 rows = 2:4,
+                 cols = 1,
+                 colNames = FALSE)
> df
  X1
1  1
2  2
3  3

I close this as invalid.