willtrnr / pyxlsb

Excel 2007+ Binary Workbook (xlsb) reader for Python
GNU Lesser General Public License v3.0
90 stars 21 forks source link

String columns are read as None #43

Open jeffreykog opened 2 years ago

jeffreykog commented 2 years ago

Hi, i am attempting to read a file (also as attachment in this ticket). The integer column (titled Site) is read correctly, but the fields containing string values are all read as String.

I was not able to find a cause for this in the source code. This file is delivered to me on a regular basis, and we have no control over the format the file is provided in. So requesting to export it as xlsx or csv is not really an option.

As github does not support uploading xlsb files, i compressed in a zip. So the zip is not the direct content of the file.

file.zip

Reproducing:

wb = open_workbook('Netwerk modernisatie week 49.xlsb')
sheet = wb.get_sheet(1)
print(next(sheet.rows()))
[Cell(r=0, c=0, v=None), Cell(r=0, c=1, v=None), Cell(r=0, c=2, v=None), Cell(r=0, c=3, v=None), Cell(r=0, c=4, v=None)]

The first row of the file contains strings. The second row also contains some integer values, which are read correctly.

What am i doing wrong?

willtrnr commented 2 years ago

I had a quick look at it and had the same issue so I don't think you're doing anything wrong. I'll have to take a deeper look later.

For now, have you tried re-saving the file with Excel? The library should be able to handle it anyway, but just to check if it's a specific glitch with that file.

jeffreykog commented 2 years ago

Okay, i have finally got access to a Windows install with working Excel and it turns out that after re-saving the file with Excel the file size goes from 45kB to 65kB. Now pyxlsb is able to properly read the re-saved file.

Netwerk modernisatie week 49 - From Excel.xlsb.zip

willtrnr commented 2 years ago

Something similar happened in #23.

But since I have your "broken" file, this will help figure out what's weird with it and hopefully make pyxlsb more lenient.

jeffreykog commented 2 years ago

I have been looking at it and i'm on to something regarding the cause of this. I added some more debug logging to pyxlsb/reader.py, line 181, where a handler is retrieved for a given recid like this:

...
      recdata = self._fp.read(reclen)
      with RecordReader(recdata) as reader:
        # Addition start
        if recid not in self.handlers:
          print("Unknown handler " + str(recid) + " with len " + str(reclen) + " (line below)")
        # Addition end
        ret = (self.handlers.get(recid) or Handler()).read(reader, recid, reclen)
      if self._debug:
        print('{:08X}  {:04X}  {:<6} {} {}'.format(pos, recid, reclen, ' '.join('{:02X}'.format(b) for b in recdata), ret))
...

In my testing this reports a missing handler for recid 62 or 0x3E. The reclen can be calculated back to the contents of the cell. For example, the first column in my file is 30 characters long. Reclen is 85 (30 2 + 25). The last column in my file is always 6 characters long. Reclen there is 37 (6 2 + 25).

@willtrnr Do you have a binary specification you use commonly for this? I think the CellHandler could be added for recid 0x35. The only difficulty is the cell format. But it seems that the string characters have normal ascii/unicode codes seperated by an empty byte (0x00). Possibly UTF-16?

Debug logging for reference:

Unknown handler 62 with len 85 (line below)
00000281  003E  85     00 00 00 00 00 00 00 00 00 1E 00 00 00 54 00 69 00 6A 00 64 00 65 00 6C 00 69 00 6A 00 6B 00 20 00 4E 00 65 00 74 00 77 00 65 00 72 00 6B 00 20 00 4D 00 6F 00 64 00 65 00 72 00 6E 00 69 00 73 00 61 00 74 00 69 00 65 00 00 00 00 00 00 00 00 00 00 00 00 00 None
000002D8  0005  16     01 00 00 00 00 00 00 FF 00 00 00 00 00 24 A5 40 c(c=1, v=2706.0, f=None, style=4278190080)
000002EA  0005  16     02 00 00 00 01 00 00 FF 00 00 00 00 E0 BE E5 40 c(c=2, v=44535.0, f=None, style=4278190081)
000002FC  0005  16     03 00 00 00 01 00 00 FF 00 00 00 00 00 BF E5 40 c(c=3, v=44536.0, f=None, style=4278190081)
Unknown handler 62 with len 37 (line below)
0000030E  003E  37     04 00 00 00 00 00 00 00 00 06 00 00 00 37 00 30 00 32 00 31 00 48 00 44 00 00 00 00 00 00 00 00 00 00 00 00 00 None

On the last line, 0x06 is the length. And the string starts with 0x37

jeffreykog commented 2 years ago

https://github.com/willtrnr/pyxlsb/compare/v1.0.x...jeffreykog:fix-special-strings

That is what was required to get it working. The behavior of this cell type seems to be a mix between StringInstanceHandler (as it skips 1 byte before reading) and the normal CellHandler (because that one reads col id and style before parsing).

I have not opened this as a PR as i'm sure the quality of my 'fix' is not good enough to merge into upstream. But i'm sure this can be used as a reference for other people or as inspiration for a more permanent fix.

willtrnr commented 2 years ago

That's great, thanks a lot for looking into it.

The reference for this is MS-XLSB. I see that there's quite a few new major releases since I last checked up on the record sets, it's quite possible that this is a new-ish feature or that some Excel version decided to start using it.

Skimming through the docs, the record 62 seems to be related to string cells in shared workbooks. Looks like you were pretty close with your "special string" idea.

If you're up for it, you can rework your patch to better match the technical spec in MS-XLSB and I'd happy to pull it in.

Since this turns out to be a fixable "new feature", I'll re-open this.