yulqen / bcompiler-engine

MOVED: A Python library to alleviate the pain of using Excel spreadsheets to collect data from your stakeholders.
https://git.sr.ht/~yulqen/bcompiler-engine
MIT License

Memory leak - datamaps import #30

Closed: yulqen closed this issue 3 years ago

yulqen commented 3 years ago

A large memory leak was reported today using two small input template files: over 8GB of RAM used to produce the master. The problem originates with DfT emails from WG - test files are contained in datamaps_problem_jan21.zip.

yulqen commented 3 years ago

Millions of empty rows are causing the problem.

banillie commented 3 years ago

I think a good solution would be to use the datamap.csv to calculate how many rows the user expects to hold data on each tab. So if a tab's highest reference is E65, then datamaps only collects data up to row 65. Just an idea.
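Something like this, maybe - I'm guessing at the datamap column names, so treat it as a sketch of the idea rather than working code:

import csv
import re
from collections import defaultdict

# Sketch only: assumes each datamap row carries a sheet name and a cell
# reference. The "template_sheet" and "cell_reference" column names are
# guesses at the layout, not the real datamap schema.
def max_rows_per_sheet(datamap_path):
    ceilings = defaultdict(int)
    with open(datamap_path, newline="") as f:
        for row in csv.DictReader(f):
            # Pull the row number out of a reference such as "E65".
            match = re.match(r"^[A-Z]+(\d+)$", row["cell_reference"].strip().upper())
            if match:
                sheet = row["template_sheet"]
                ceilings[sheet] = max(ceilings[sheet], int(match.group(1)))
    return dict(ceilings)

The importer would then stop reading each tab at its ceiling, so a tab whose highest reference is E65 gets read only to row 65.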

yulqen commented 3 years ago

I've been thinking about whether we should use the datamap for this, and it's definitely a plausible way of doing it. There would be a minimal amount of upfront computation, and it definitely makes the process more resilient. I like it.

I was also considering adding a datamaps check template command, or something similar - essentially a separate command that targets a template file and checks for this problem (and perhaps other problems). There is already a datamaps check command - it could be added to that:

datamaps check --help

Usage: datamaps check [OPTIONS]

  Checks for a correctly-named blank template in the input directory and a
  datamap file in the input directory. Checks for correct headers in
  datamap.

Options:
  --help  Show this message and exit.

The user would just get a readout and warnings if it looked like there were excessive rows in a sheet. We could then add a flag such as --template-check to the datamaps import templates command that runs the check before the import, or do the opposite: check by default and provide a --no-template-check flag for anyone who wants to skip it. Checking the template does have some overhead, because it would need to open the file, and currently openpyxl can take several seconds to open these files, so I don't think I'd want it to run by default.
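Something along these lines for the check itself - a rough sketch with a placeholder threshold, leaning on the fact that openpyxl's recorded sheet dimensions expose the phantom rows without reading any cell values:

from openpyxl import load_workbook

# Rough sketch of the template check. ws.max_row reflects the sheet's
# recorded dimensions, so junk rows left behind by Excel show up here
# without us having to read any cell data.
def check_template(path, warn_above=500):
    wb = load_workbook(path, read_only=True)
    for name in wb.sheetnames:
        ws = wb[name]
        if ws.max_row and ws.max_row > warn_above:
            print(f"WARNING: sheet {name!r} reports {ws.max_row} rows - "
                  "possible junk rows from Excel")
    wb.close()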

Essentially, this is a problem with the template, caused by Excel - somehow. I don't actually know how it comes about, but when looking into your template I could see that it was created in 2015. It's no wonder really, considering all the many iterations, copy-and-pastings and whatever else it's undergone. The thing I like about having a separate check command in datamaps is that it teaches the user to use cleaner files. Using the datamap, as you're suggesting here, would make the process more resilient (definitely a good thing) - but it also perpetuates bad habits by accepting messy templates.

Maybe the ultimate solution is to have both?! A separate check templates command that reports sheet dimensions to the user AND a guard facility, using the datamap, which protects against overrunning. A bigger job, but definitely a more robust solution. Something to add to the ROADMAP as versions 1.0.7 and 1.0.8 perhaps? What do you think?

In the meantime, in lieu of me testing and fixing this, you should probably have a look at your template and see if you can remove the erroneous rows from the sheets and try running again - just a single file. Use top in another terminal whilst the import is running to monitor RAM usage.

banillie commented 3 years ago

Yes agree with all the above and think it will be best to have both. Will have a go at removing the rows and re-running now. Best, Will.

banillie commented 3 years ago

Ok, I removed the empty rows from the DCA and Risks tabs and it worked, but it still used around 5.5GB of memory for one return. So the positive is that it is these empty rows that are causing the problem. The negative is that there must be other tabs with this problem for it to be using up that much memory. Would you be able to check which tabs are a problem?

yulqen commented 3 years ago

I did notice that yesterday, but I didn't check in detail. It could be tab 5. I've got a test that helps to highlight which is the problematic sheet. I'll check this afternoon during our session.

Good work on establishing that this is indeed the problem. You'll need to share with me how you deleted the empty rows.

banillie commented 3 years ago

Yep, it's tabs 2 and 5 where the empty rows are hidden. Remove the empty rows and everything goes through fine.

Unfortunately I didn't find a method that worked for deleting the rows. I had to cut and paste the data into new tabs, and then delete the old tabs. I'll explore with the team whether we can find a better approach. I tried this:

Click the Home tab in the top menu bar in Excel, and click "Find & Select" on the right side. Select "Go to Special." A pop-up box will appear. ... Excel will then highlight all of the blank cells. ... Once all the blank rows are highlighted, go to the Home tab and find the "Delete" button on the right-hand side.

This is useful as it actually shows you, in the top-left box, all the rows being churned through. It takes a long time! But it also seems to highlight rows that we want to keep, and when you try to then delete the rows it won't let you. So other potential ways of doing this need to be explored.
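One scripted route we could explore rather than fighting the Excel UI - an untested openpyxl sketch that truncates each sheet after its last row containing data (it has to walk every row, so it would be slow on a sheet with millions of phantom rows):

from openpyxl import load_workbook

# Untested sketch: find the last row holding any data on each sheet and
# delete everything after it, then save to a new file.
def strip_trailing_rows(path, out_path):
    wb = load_workbook(path)
    for ws in wb.worksheets:
        last = 0
        for row in ws.iter_rows():
            if any(cell.value is not None for cell in row):
                last = row[0].row
        if last and ws.max_row > last:
            ws.delete_rows(last + 1, ws.max_row - last)
    wb.save(out_path)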

Further things to follow up on:

See you soon.

yulqen commented 3 years ago

Update:

I've done some work on limiting the number of rows that are ingested by the import process, based on the idea that we know from the datamap how many rows are expected in each template sheet.

It works, but there is a problem. As you know, ALL template cells are currently in scope during the initial processing - only later is the datamap applied to the data, which filters the required cells before outputting them to the master. The reason behind this design decision was to be able to collect the entire contents of a spreadsheet so it could be stored in another format and recreated at will, without reliance on having the source file to hand. The current implementation collects all the data but only keeps it in memory and does not write it to disk.

When you have over a million rows in your spreadsheet, this is why you run out of memory.

The problem with the idea of using the datamap to calculate the maximum number of rows per sheet is that the import process, as described above, eagerly processes all sheets in the file regardless of what's in the datamap. If the program processes a template file that has sheets not declared in the datamap - which happened when I was running tests yesterday - the limiting does not apply. We're therefore at a point where we have to choose a "datamap-first" approach anyway, rather than the total data ingestion we do currently, and if we do "datamap-first", we might as well filter the cells as we import them.
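For illustration, "datamap-first" would mean visiting only the cells the datamap names, something like the sketch below - the entry shape and helper are made up for the example, not the current datamaps code:

from openpyxl import load_workbook

# Illustration only - not how datamaps currently works. Visits just the
# cells the datamap names; sheets absent from the datamap are never touched.
def datamap_first_import(template_path, datamap_entries):
    wb = load_workbook(template_path, data_only=True)
    values = {}
    for key, sheet, cellref in datamap_entries:  # e.g. ("Total Cost", "DCA", "E65")
        if sheet in wb.sheetnames:
            values[key] = wb[sheet][cellref].value
    return values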

Perhaps a "datamap-first" implementation is actually a better design overall but we are where we are and to re-write/re-design the application now is not on the cards - at this point anyway.

I'm therefore going to implement a row limit by default (perhaps 500). This will be configurable in the config file, and the program will report what the current limit is in the log as it runs. It's somewhat crude, but I like the simplicity of it.
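Roughly this shape - the section and key names below are placeholders rather than the final config format:

import configparser
import logging

DEFAULT_ROW_LIMIT = 500

# Placeholder config handling: read the limit if one is set, fall back to
# the default, and report the value in the log as the program runs.
def get_row_limit(config_path):
    config = configparser.ConfigParser()
    config.read(config_path)
    limit = config.getint("import", "row_limit", fallback=DEFAULT_ROW_LIMIT)
    logging.info("Row limit set to %s rows per sheet", limit)
    return limit

The importer would then cap each sheet walk at that value, e.g. ws.iter_rows(max_row=limit).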

banillie commented 3 years ago

Hiya Matt, sounds super. Although it's quite crude, it should be a perfectly acceptable solution, certainly for now. I've just had a look at the team's current datamap.csv and the highest row it goes up to is around 130, and that is for a pretty large template, so I think 500 rows should be more than ample. 👍

banillie commented 3 years ago

Hi Matt, got this error when installing the new version:

ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: 'WHEEL'
Consider using the --user option or check the permissions.

yulqen commented 3 years ago

@banillie - v1.0.9 is up on PyPI, containing the fix for the 1.0.8 bug with the config command.

sudo pip install -U datamaps if using your system Python, or pip install datamaps with any virtualenv activated, and away you go.

Summary of changes

If the team needs a Windows build, they can download the zipped exe file from GitHub, unzip it and run as usual. I haven't tested the Windows file yet. Once tested, the file will go up on the Twenty Four Software site as normal.

Please test and report any further problems with this. If happy, please close the issue. :-)

banillie commented 3 years ago

Excellent thanks Matt. Will do later.

banillie commented 3 years ago

Yep, that all works fine. Just pip installed and it ran fine first time. I also tried the config show-file command and opened up the config file to have a look at it. So all good. I'll get the team to download the zip file and test the exe file on their machines. Cheers. 👍

banillie commented 3 years ago

Hi Matt, Joe got this error message when we tested the exe file on his standalone.

C:\Users\Standalone\Downloads>dm.exe import templates -m
Traceback (most recent call last):
  File "lib\site-packages\PyInstaller\loader\rthooks\pyi_rth_pkgres.py", line 13, in <module>
  File "c:\users\runneradmin\appdata\local\pypoetry\cache\virtualenvs\datamaps-pt2xlvq9-py3.7\lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 623, in exec_module
  File "lib\site-packages\pkg_resources\__init__.py", line 86, in <module>
ModuleNotFoundError: No module named 'pkg_resources.py2_warn'
[6272] Failed to execute script pyi_rth_pkgres

Fix is not urgent as we have the new PyPI version working on my machine.

Cheers, Will.

yulqen commented 3 years ago

I haven't had a chance to look at the Windows build yet - I will do as soon as I can.

Can he do pip install datamaps?

banillie commented 3 years ago

Yes he can. I'll get it installed that way on his machine, and others, shortly.

yulqen commented 3 years ago

That's great - let me know how they get on.

yulqen commented 3 years ago

@banillie - The Windows build of datamaps (dm.exe) is currently broken. It has always been something of a hack and I was never completely happy about it. It's building fine on GitHub, but as Joe has reported, it is not executing properly. In addition, new versions of Windows appear to banish it as soon as it is downloaded, reporting a virus or other threat. Obviously I know there is no threat from datamaps, but to get round that you have to start tinkering with security rules, which I wouldn't advocate.
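For the record, Joe's traceback looks like the known incompatibility between older PyInstaller releases and setuptools 45+, which removed pkg_resources.py2_warn. If I ever revisit the build, the commonly suggested workarounds (untested here) are to pin setuptools in the build environment:

pip install "setuptools<45"

or to declare pkg_resources.py2_warn as a hidden import via PyInstaller's --hidden-import option. Even with that fixed, though, the virus-flagging problem would remain.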

At this point, I don't know enough about Windows to be able to make a proper installable build of datamaps - if that's even properly possible with a Python program - and I'm not minded to invest a lot of time in it.

I therefore strongly recommend that you equip your guys with Python, pip and a decent terminal on their standalones (search for Windows Terminal in the Windows Store). All are easily installable nowadays on Windows and work fine for the job. The only thing you might have to do for them is set their PATH variable so that they can run the program without typing in a horrendously long file path. If you're having problems doing that, let me know.
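As an example of the PATH bit - the exact Scripts folder varies with the Python version and how it was installed (pip prints the folder in a warning when it isn't already on the PATH), but it will be something like:

setx PATH "%PATH%;%APPDATA%\Python\Python37\Scripts"

or they can add that folder through the Environment Variables dialog instead, which avoids setx's 1024-character truncation. Open a new terminal afterwards and the program should be found.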

NB: Windows, of course, adds the .exe extension to everything, so when you have done pip install datamaps and added it to the PATH, you have to run datamaps.exe --help instead of datamaps --help like in proper operating systems.... :-)

banillie commented 3 years ago

@yulqen yes, will do. It would be useful to discuss some of this before taking it to the team, as I'll need to be able to provide them with very clear instructions and I'm not au fait with any of it at the moment.

yulqen commented 3 years ago

Yeah, whenever you need.