zdavatz / spreadsheet

The Ruby Spreadsheet by ywesee GmbH
http://spreadsheet.ch
GNU General Public License v3.0
1.13k stars 240 forks source link

Empty cells when parsing certain kind of XLS files #225

Closed namelivia closed 5 years ago

namelivia commented 5 years ago

I'm trying to parse files provided by a Spanish bank backend using a script and when reading it's contents I've observed that cells come empty. I have also tried opening these files using Libreoffice and the cell contents are correctly displayed, also saving these files with Libreoffice as .XLS fixes them.

However I wanted to investigate further to see what's wrong with the files and to see if I could parse them correctly. I added lots of prints and followed the execution and this is what os making the rows to appear empty for these files:

While reading the worksheets using the method read_worksheet from reader on a "fixed" file :row ops are found first and the worksheet @offsets hash is filled with one entry for each row by the set_row_address method. However when reading the "non-fixed" file this is not happening (yet).

After that the cell data ops are read while reading the worksheet and as these are included in ROW_BLOCK_OPS the method set_missing_row_address will be executed. Now what I've seen this method does is to create entries on the @offsets hash for those who have indexes not already present on it. In the case of the "non-fixed" file this will populate the hash, on the "fixed file" it will just leave it as it is.

Finally on the "non-fixed" file it will read the :row ops and by calling set_row_address it will override the @offsets hash with different values. I have tried commenting the call to set_row_address and it actually makes the hash values to stay, read_row to be called and correctly parse my "non-fixed" files.

To be clear, on a "fixed" file :row ops come first, then cell data ops. On the "non-fixed" files the cell data ops come first, and then the :row ops.

My "solution" is pure coincidence and will probably break other files because I'm just focusing on this particular case, plus I have no Excel format knowledge, but if you consider this an issue that could be fixed I hope this explanation could be a starting point for a discussion on how to do it.

zdavatz commented 5 years ago

Can you provide a sample file?

namelivia commented 5 years ago

Thanks for the fast reply. Sure! Here is a sample of an original "unfixed" file: original.zip

zdavatz commented 5 years ago

Can you please also provide the script with witch you try to open and edit the files on you local computer.

Then, can you try the following please:

  1. Open the files you receive from the Spanish Bank via LibreOffice.
  2. Save the files with LibreOffice to a new Filename.
  3. Now try to read the file with Spreadsheet gem. Is the content of the cells now displayed correctly?
namelivia commented 5 years ago

Yes, opening with Libreoffice and saving them fixes the files, sorry if it wasn't clear on the issue description. When traversing this new files rows are checked first so the problem I described is not happening. The script is here: input_file.rb opens the file converter.rb traverses the rows transaction.rb reads the cells But is not doing anything more than opening the file accessing to the first worksheet and iterating the rows and accessing their cells using the [] operator.

zdavatz commented 5 years ago

Yeah, I would be interested to know, what software creates those exported XLS files in the first place. Some binary formatting is not properly done there, I think.

namelivia commented 5 years ago

I did alter some of the information there using a hex editor trying to preserve the structure in order to protect the data privacy. I had no time to work on this but I'm still interested in taking a look at it, I assumed the problem was that the file was not properly generated by the bank backend (I have no info on what tool do they use) but the fact that libreoffice or google sheets open the file with no errors makes me think there is something more. I will take a look myself as soon as I have some spare time for this. Thanks for the patience!

zdavatz commented 5 years ago

Google or LibreOffice have more resources. If you can edit - with spreadsheet gem - the file generated by the bank after you saved it with LibreOffice, then it is the Software that generates the file that has the issue. You are not the first person having this issue.

namelivia commented 5 years ago

I see your point and agree with you then, in that case is better not to alter the gem code. I will patch the method in the tool used to parse this specific kind of files as it is a specific file issue, it seems more appropriate. I'll close the issue now! Thank you for your interest and fast responses!

zdavatz commented 5 years ago

Thank you too!