Closed: yulqen closed this issue 3 years ago.
Millions of empty rows are causing the problem.
I think a good solution would be to use the datamap.csv to calculate how many rows the user expects to hold data on each tab. So if said tab's highest reference is E65, then datamaps only collects data up to row 65. Just an idea.
I've been thinking about whether we should use the datamap for this, and this is definitely a plausible way of doing it. There would be a minimal amount of upfront computation involved and it would definitely make the process more resilient. I like it.
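Something like this is what I have in mind - a minimal sketch only (the `sheet` and `cellref` column names are illustrative and may not match the real datamap.csv headers):

```python
import csv
import re
from collections import defaultdict

def max_rows_per_sheet(datamap_path):
    """Return {sheet_name: highest row referenced in the datamap}.

    Assumes the datamap CSV has 'sheet' and 'cellref' columns -
    hypothetical names, adjust to the real headers.
    """
    limits = defaultdict(int)
    with open(datamap_path, newline="") as f:
        for line in csv.DictReader(f):
            # Pull the trailing digits out of a reference like "E65".
            match = re.search(r"(\d+)$", line["cellref"].strip())
            if match:
                row = int(match.group(1))
                limits[line["sheet"]] = max(limits[line["sheet"]], row)
    return dict(limits)
```

The idea being that the import would then refuse to read beyond the highest row this returns for each sheet.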
I was also considering adding a `datamaps check template` command, or something similar. Essentially a separate command that targets a template file and checks for this problem (and other problems, perhaps). There is already a `datamaps check` command - it could be added to that:
```
datamaps check --help
Usage: datamaps check [OPTIONS]

  Checks for a correctly-named blank template in the input directory and a
  datamap file in the input directory. Checks for correct headers in
  datamap.

Options:
  --help  Show this message and exit.
```
The user would just get a readout and warnings if it looked like there were excessive rows in a sheet. We could then add a flag such as `--template-check` to the `datamaps import templates` command that runs the check before the import, or do the opposite: check by default and have a `--no-template-check` flag if you want to avoid it. Checking the template does have some overhead because it would need to open the file, and currently openpyxl can take several seconds to open these files. I don't think I'd want it to run by default.
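For illustration, the check itself could be little more than opening the template read-only with openpyxl and reporting each sheet's apparent dimensions - a sketch only, not actual datamaps code, and the 500-row threshold is just an arbitrary example:

```python
from openpyxl import load_workbook

def report_sheet_dimensions(template_path, warn_above=500):
    """Print each sheet's apparent size and warn if it looks bloated.

    warn_above is an arbitrary example threshold, not a datamaps setting.
    """
    wb = load_workbook(template_path, read_only=True)
    for ws in wb.worksheets:
        rows = ws.max_row or 0       # may be None for some files in read-only mode
        cols = ws.max_column or 0
        print(f"{ws.title}: {rows} rows x {cols} columns")
        if rows > warn_above:
            print(f"  WARNING: {ws.title} claims {rows} rows - possible junk rows")
    wb.close()
```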
Essentially, this is a problem with the template, caused by Excel - somehow. I don't actually know how it comes about, but when looking into your template I could see that it was created in 2015. It's no wonder really, considering all the many iterations, copy-and-pastings and whatever else it's undergone. The thing I like about having a separate `check` command in datamaps is that it teaches the user to use cleaner files. But using the datamap, something like you're suggesting here, would make the process more resilient (definitely a good thing) - but it also perpetuates bad habits by accepting messy templates.
Maybe the ultimate solution is to have both?! A separate `check templates` command that reports sheet dimensions to the user AND a guard facility, using the datamap, which protects against overrunning. Bigger job, but definitely a more robust solution. Something to add to the ROADMAP as versions 1.0.7 and 1.0.8 perhaps? What do you think?
In the meantime, in lieu of me testing and fixing this, you should probably have a look at your template and see if you can remove the erroneous rows from the sheets and try running again - just a single file. Use `top` in another terminal whilst the import is running to monitor RAM usage.
Yes agree with all the above and think it will be best to have both. Will have a go at removing the rows and re-running now. Best, Will.
Ok, I removed the empty rows from the DCA and Risks tab and it worked, but it still used up around 5.5GB of memory for one return. So the positive is that it is these empty rows that are causing the problem. The negative is that there must be other tabs with this problem, to be using up the memory. Would you be able to check which tabs are a problem?
I did notice that yesterday, but I didn't check in detail. It could be tab 5. I've got a test that helps to highlight which is the problematic sheet. Will check this afternoon during our session.
Good work on establishing that this is indeed the problem. You'll need to share with me how you deleted the empty rows.
Yeap, it's tabs 2 and 5 where the empty rows are hidden. Remove the empty rows and everything goes through fine.
Unfortunately I didn't find a method that worked for deleting the rows. I had to cut and paste the data into new tabs, and then delete the old tabs. I'll explore with the team whether we can find a better approach. I tried this:
Click the Home tab in the top menu bar in Excel, and click "Find & Select" on the right side. Select "Go to Special." A pop-up box will appear. ... Excel will then highlight all of the blank cells. ... Once all the blank rows are highlighted, go to the Home tab and find the "Delete" button on the right-hand side.
This is useful as it actually shows you, in the top-left box, all the rows being churned through. It takes a long time! But it also seems to highlight rows that we want to keep, and when you try to then delete the rows it won't let you. So other potential ways of doing this need to be explored.
Further things to follow up on:
See you soon.
Update:
I've done some work on limiting the number of rows that are ingested by the `import` process, based on the idea that we know from the datamap how many rows are expected in each template sheet.

It works, but there is a problem. As you know, ALL template cells are currently in scope during the initial processing of the file - only later is the datamap applied to the data, which filters the required cells before outputting them to the master. The reason behind this design decision was to be able to collect all of the data in a spreadsheet so it could be stored in another format, and recreated at will, without reliance on having the source file to hand. The current implementation collects all the data but only keeps it in memory and does not write it to disk. This is why you run out of memory when your spreadsheet has over a million rows.
The problem with the idea of using the datamap to calculate the maximum number of rows per sheet is that the import process, as described above, eagerly processes all sheets in the file regardless of what's in the datamap. If the program processes a template file that has sheets which are not declared in the datamap - which happened when I was running tests yesterday - the limiting does not apply. We're therefore at a point where we would have to choose a "datamap-first" approach anyway, and not do the total data ingestion as we do currently; and if we do "datamap-first", we might as well filter the cells as we import them.

Perhaps a "datamap-first" implementation is actually a better design overall, but we are where we are, and to re-write/re-design the application now is not on the cards - at this point anyway.
I'm therefore going to implement a row limit by default (perhaps 500). This will be configurable in the config file and the program will report what the current limit is in the log as it runs. This is somewhat crude but I like the simplicity of it.
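To make that concrete, the guard would be roughly this shape - a minimal sketch only, not the actual datamaps code, and the function names and the hard-coded 500 here are illustrative (the real value would come from the config file):

```python
from openpyxl import load_workbook

def ingest_sheet(ws, row_limit=500):
    """Collect cell values from a worksheet, never reading past row_limit.

    Illustrative only - the real import code is structured differently.
    """
    data = {}
    for row in ws.iter_rows(min_row=1, max_row=row_limit):
        for cell in row:
            if cell.value is not None:
                data[cell.coordinate] = cell.value
    return data

def ingest_template(path, row_limit=500):
    """Apply the same row limit to every sheet in a template file."""
    wb = load_workbook(path, read_only=True, data_only=True)
    return {ws.title: ingest_sheet(ws, row_limit) for ws in wb.worksheets}
```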
Hiya Matt, sounds super. Although it's quite crude, it should be a perfectly acceptable solution, certainly for now. Just had a look at the team's current datamap.csv and the highest row it goes up to is around 130, and that is for a pretty large template, so I think 500 rows should be more than ample. 👍
Hi Matt, got this error when installing the new version:

```
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: 'WHEEL'
Consider using the `--user` option or check the permissions.
```
@banillie - v1.0.9 is up on PyPI containing the fix for the 1.0.8 bug with the `config` command.

`sudo pip install -U datamaps` if using your system Python, or `pip install datamaps` with any virtualenv activated, and away you go.
- The row limit used by `datamaps import templates -m` is currently set to 500. This can be changed in the `config.ini` file if necessary. This will prevent the program crashing and locking up your machine due to all RAM being eaten by Excel files that contain millions of hidden rows in worksheets.
- `datamaps config restart` will reset configuration to default settings. You may need to do this when you upgrade if your `config.ini` file does not contain the `TEMPLATE_ROW_LIMIT` setting (there's a short sketch after this list showing how such a setting can be read from an ini file). You should be warned about this in the terminal output when trying to run a datamaps command, if this is the case.
- `datamaps config show-file` will point out where `config.ini` lives on your system.
- `datamaps --help`, `datamaps config --help`, `datamaps config restart --help` and `datamaps config show-file --help` should be instructive.
- If the team needs a Windows build, they can download the zipped exe file from GitHub, unzip it and run as usual. I haven't tested the Windows file yet. Once tested, the file will go up on the Twenty Four Software site as normal.
Please test and report any further problems with this. If happy, please close the issue. :-)
Excellent thanks Matt. Will do later.
Yeap that all works fine. Just pip installed and ran fine first time. I tried the config show-file command and opened up the config file to have a look at it, also. So all good. I'll get the team to download the zip file and test the exe file on their machines. Cheers. 👍
Hi Matt, Joe got this error message when we tested the exe file on his standalone.
```
C:\Users\Standalone\Downloads>dm.exe import templates -m
Traceback (most recent call last):
  File "lib\site-packages\PyInstaller\loader\rthooks\pyi_rth_pkgres.py", line 13, in <module>
  File "c:\users\runneradmin\appdata\local\pypoetry\cache\virtualenvs\datamaps-pt2xlvq9-py3.7\lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 623, in exec_module
  File "lib\site-packages\pkg_resources\__init__.py", line 86, in <module>
ModuleNotFoundError: No module named 'pkg_resources.py2_warn'
[6272] Failed to execute script pyi_rth_pkgres
```
Fix is not urgent as we have the new PyPI version working on my machine.
Cheers, Will.
I haven't had a chance to look at the Windows build yet - I will do as soon as I can.
Can he do `pip install datamaps`?
Yes he can. I'll get it installed that way on his machine, and others, shortly.
That's great - let me know how they get on.
@banillie - The Windows build of datamaps (dm.exe) is currently broken. It has always been something of a hack and I was never completely happy about it. It's building fine on GitHub but, as Joe has reported, it is not executing properly. In addition, new versions of Windows appear to immediately banish it as soon as it is downloaded, reporting a virus or other threat. Obviously, I know there is no threat from datamaps, but to get round it you have to start tinkering with security rules, which I wouldn't advocate.

At this point, I don't know enough about Windows to be able to make a proper installable build of datamaps - it is barely technically possible to do that with a Python program anyway - and I'm not minded to invest a lot of time in it.
I therefore strongly recommend that you equip your guys with Python, pip and a decent terminal on their standalones (search for Windows Terminal in the Windows Store). All are easily installable nowadays on Windows and work fine for the job. The only thing you might have to do for them is set their PATH variable so that they can run the program without typing in a horrendously long file path. If you're having problems doing that, let me know.
NB: Windows, of course, adds the `.exe` extension to everything, so when you have done `pip install datamaps` and added it to the PATH, you have to run `datamaps.exe --help` instead of `datamaps --help` like in proper operating systems... :-)
@yulqen yes, will do. It would be useful to discuss some of this prior to discussing it with the team, as I'll need to be able to provide them with very clear instructions and I'm not au fait with any of it at the moment.
Yeah, whenever you need.
A large memory leak was reported today using two small input template files. Over 8GB of RAM was used to produce the master. The problem originates from DfT emails from WG - test files are contained in datamaps_problem_jan21.zip.