willtrnr / pyxlsb

Excel 2007+ Binary Workbook (xlsb) reader for Python
GNU Lesser General Public License v3.0
90 stars 21 forks source link

Pulling Named Range or ListObject (e.g. Table) Values Into DataFrame or Nested List #19

Open pabloaperezfernandez opened 5 years ago

pabloaperezfernandez commented 5 years ago

Goal

  1. Read the values from an Excel named range into a nested list like [[1,2,3],[10,20,30]] or Pandas dataframe
  2. Read the values from an Excel table into a nested list or Pandas dataframe

Description It would be extremely useful to add a method to pull the contents of an entire name range or table (e.g. listobject in VBA) directly into a nested listed (a matrix such as [[1,2,3], [10,20,30]] or a Panda's dataframe.

I see reading data from XLSB files as one of the primary use cases of this library, and introducing what I am proposing would make using this package much easier. What I am describing can be done easily using XLWings, but this package is specific to Windows and requires Excel to be installed. This pretty much kills XLWings as an option for server-side applications or for deployments on Linux.

Suggested API Assume wsht and wbk have been initialized to a worksheet and workbook respectively.

DataFrameQ: Optional[bool] = True in each of the functions forces a dataframe to be return. By default, the functions should return nested lists.

I see the need for the following three methods:

  1. From a Worksheet Named Range - Should work with both the name of the ranges or a standard address Call: wsht.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False) Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

  2. From a Workbook Named Range - Should work with both the name of the ranges or a standard address Call: wbk.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False) Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

  3. From a Table - Need only work with table names Call: wsht.GetTableValues((tableName: str, DataFrameQ: Optional[bool] = False) Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

Sincerely,

Pablo

willtrnr commented 5 years ago

The binary structure for Table objects is rather big, but I'm pretty sure an equivalent named range is created for it so maybe we could get half way there. The current limited formula handling should actually be good enough to parse the range formula for the name.

Since names are scoped at the Workbook level I can see how a wbk.range(name) could abstract over wbk.sheet(name).rows() and cut the right shape at least until I add a way to address cells, this could even support 3D ranges with some effort.

For DataFrames, there's already #12 for integrating with Pandas.

pabloaperezfernandez commented 5 years ago

Correct. Simple VBA shows that you can pull the values in a table using the table's name as if it were a named range.

One note, names can be scoped at both the workbook and worksheet name. You have both kinds in Excel. You can see this when you declare a named range using the ribbon menu (see FORMULAS->DEFINE NAME->SCOPE). On the other hand, tables are always scoped at the worksheet level.

pabloaperezfernandez commented 5 years ago

One more point to bear in mind. While Excel allows you to access the values in the table using the table's name as that of a named range pointing to the table, the header row is excluded. In other words, the name automatically created with the same name as the table's, points to the table's name. It might be important to pull the header's values as well.

I encountered this problem in an application I am writing using XL Wings. XL Wings does not provide a table object (e.g. VBA's listobject), but the body of the table is accessible using the trick I have described.