Adds a skip_row callback in order to skip rows while the xlsx is still being parsed, saving memory before the final result is built.
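To illustrate the idea only (this is not this library's API; `parse_rows`, the XML shape, and the callback signature are all made up for the sketch), skipping during a streaming parse means a rejected row is dropped before it is ever accumulated:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a worksheet XML stream: two real rows,
# one whitespace-only row, one empty row.
SHEET_XML = b"""<sheet>
  <row><c>Alice</c><c>42</c></row>
  <row><c> </c><c></c></row>
  <row><c>Bob</c><c>7</c></row>
  <row><c></c><c></c></row>
</sheet>"""

def parse_rows(source, skip_row=None):
    """Stream rows; skipped rows are discarded immediately,
    so they never sit in memory waiting for a post-parse filter."""
    rows = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        values = [(c.text or "") for c in elem]
        if skip_row is None or not skip_row(values):
            rows.append(values)
        elem.clear()  # free the parsed element right away
    return rows

# Skip rows whose cells are all empty or whitespace-only ("" or " ").
blank = lambda row: all(not cell.strip() for cell in row)

rows = parse_rows(io.BytesIO(SHEET_XML), skip_row=blank)
print(rows)  # → [['Alice', '42'], ['Bob', '7']]
```

The point of the sketch is the ordering: the callback runs inside the parse loop, so memory usage tracks the rows you keep, not the rows the file contains.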
Why I needed this...
Some xlsx files that I have to parse are stored in gdocs; they are downloaded locally and then uploaded into our system.
I don't know if it is a gdocs error or editors messing something up, but some of these files are huge even though they only have around 200 useful lines, the remaining rows being empty strings "" or even blank spaces " ".
So empty_rows and blank_value didn't work for me.
When parsing without this PR, a 5 MB file (200 useful lines) would result in almost 1M lines, almost 2 GB of memory, and a crash on the Heroku dyno. With this change, the parsing kept only the 200 lines that mattered and memory stayed stable without huge peaks.
Why not just filter after...
It could be done, and the code was already prepared for this scenario, but the amount of memory the parse was taking crashed the server before the filtering code was ever reached.