ksunden opened this issue 2 years ago
Update:
We have found two sources of higher-than-expected memory usage by the Bluesky Run Engine:
- bluesky/bluesky#1513 removes a cache of status tasks that was largely unnecessary; every point added to a collection that was never emptied over the course of a scan, causing a steady increase in memory usage during long acquisitions that eventually caused the Run Engine to fail.
- bluesky/bluesky#1543 uses a more memory-efficient representation of grid scans which do not snake their axes; the previous representation caused a large jump in memory usage when starting large grid scans (a couple hundred MB for an 11x101x101 scan). See the sketch below.
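To build intuition for the second fix, here is a minimal sketch (not the actual bluesky internals changed in #1543) contrasting a fully materialized list of grid points with a lazy iterator; the real change concerns how bluesky builds the trajectory for non-snaked axes, but the memory trade-off is of this flavor.

```python
# Hypothetical illustration, not bluesky code: the cost of materializing every
# point of a non-snaked 11x101x101 grid up front versus generating points
# lazily with itertools.product as the scan consumes them.
import itertools
import sys

shape = (11, 101, 101)
axes = [range(n) for n in shape]

# Eager: all 112211 (i, j, k) tuples live in memory for the whole scan.
eager = list(itertools.product(*axes))
print(f"eager points: {len(eager)}, list pointer overhead alone: "
      f"{sys.getsizeof(eager) / 1e6:.1f} MB")

# Lazy: points are produced one at a time; memory stays flat.
lazy = itertools.product(*axes)
print(f"lazy iterator object size: {sys.getsizeof(lazy)} bytes")
```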
I am attempting to use `grid_scan_wp` with shape 11x101x101 as a memory-performance test (on the test computer). After the aforementioned fixes (bluesky at commit 4cb0c2), the scan ran for ~99k data points on the test computer (~88% complete).
[Figure: re-manager memory usage over the course of the scan]
The re-manager log reported `Unrecognized input header: 99` after the last recorded data point. The Docker desktop app reported seeing no containers, but the containers could still be queried for status through the CLI (I imagine they just became unresponsive to the Docker app, so it thought they weren't there).
I will try to perform this same test on bluesky v1.9 to get a sense of how much of the problem has been alleviated.
That is a rate of ~2 MB/min, which is evidently enough to be problematic, but significantly better than before.
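As an aside, memory traces like the one above can be collected by periodically sampling `docker stats`; here is a minimal polling sketch (the container name `re-manager` and the one-minute interval are assumptions, adjust to your deployment):

```python
# Periodically sample a container's memory usage via the docker CLI.
# Container name and polling interval are placeholders for this deployment.
import subprocess
import time

def poll_mem_usage(container="re-manager", interval_s=60):
    """Print the container's memory usage once per interval."""
    while True:
        out = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
            capture_output=True, text=True, check=True,
        )
        print(f"{time.strftime('%H:%M:%S')}  {out.stdout.strip()}")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_mem_usage()
```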
I am curious whether `count` sees the same accumulation or not; that would narrow down where to look for this one. (Also built-in `grid_scan` vs `grid_scan_wp`, but I honestly don't see how that would happen.)
I can confirm the recent bluesky fixes have helped memory leakage greatly. Here is the `grid_scan_wp` result from bluesky 1.9.0. In this run, the scan "failed" after ~2h40m (~79k points). The failure was not a crash, but rather the data acquisition slowed by more than an order of magnitude.
| | bluesky v1.9.0 | bluesky v1.10.0 |
|---|---|---|
| nominal memory (MB) | 230 | 220 |
| memory increase at start (MB) | 182 | 66 |
| memory at crash (MB) | 3060 | 764 |
| scan percentage at "crash" | 70% | 88% |
| scan runtime | ~2h40m | ~4h |
| MB gained / min | 16.5 | 2 |
I ran a 112211-point `count` plan under bluesky 1.10.0. Memory grew steadily from 230 MB at the start of the scan to 1300 MB at the end. Notably, memory did not go down after the scan finished; it stayed at 1280 MB.
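For context, a minimal sketch of that kind of long `count` test, using simulated hardware rather than the lab configuration (the detector, the bare RunEngine setup, and the absence of the queueserver here are simplifications):

```python
# Long count-plan memory test with simulated hardware; the lab test ran
# through the queueserver against real hardware, so this is only a sketch.
from bluesky import RunEngine
from bluesky.plans import count
from ophyd.sim import det

RE = RunEngine({})

# 112211 points matches the 11 * 101 * 101 grid shape discussed above;
# watch process memory (e.g. with docker stats or a task manager) as it runs.
RE(count([det], num=112211))
```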
Current suspicions:
- LiveTable (via `BestEffortCallback`)... note that the most obvious place for a leak seems not to be the major problem, but I still think it's worth disabling the live table entirely to see if we can nail the problem down to something involving it.
- The logging system somewhere (which would be greatly exacerbated by the live table, as that greatly adds to the logged output; but if it were more fundamental logging, it may still present a problem over long-running sessions even without the constant stream of logs during scans).

Regardless, the next step is to try this without the live table/BEC enabled (a sketch of disabling the table is below): if the problem still presents, then probably neither of these is the culprit (meaning back to the drawing board a bit), but if it goes away then we have a short-term workaround and some very specific areas to look at.
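A minimal sketch of turning off the table output of the `BestEffortCallback` (assuming the usual setup where the BEC is subscribed directly to the RunEngine; the alternative is simply not subscribing it at all):

```python
# Keep the BestEffortCallback but suppress its LiveTable output; plots can
# be suppressed as well. Alternatively, skip RE.subscribe(bec) entirely.
from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback

RE = RunEngine({})
bec = BestEffortCallback()
bec.disable_table()   # no LiveTable rows printed per point
bec.disable_plots()   # optional: no live plots either
RE.subscribe(bec)
```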
I have a test without the BEC loaded running. I will note that, for a mix of technical and laziness reasons, it also disables the insertion of documents into mongo, so that should be noted as a possible additional cause if the problem is alleviated. It is not as clean an "if I do X then it must be Y" test as I would like, but not that far off either.
A test with a script, ophyd.sim hardware, and the built-in bluesky grid_scan showed virtually no memory accumulation (a little settling time at the start, but completely stable after that, including with the BEC enabled, admittedly with the change to its retention of output lines).
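For reference, a sketch of that kind of standalone test; the simulated motors, detector, and grid bounds here are placeholders, not the exact script used:

```python
# Standalone memory test: built-in grid_scan with simulated hardware and no
# queueserver involved. Motors, bounds, and detector are placeholders.
from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback
from bluesky.plans import grid_scan
from ophyd.sim import det4, motor1, motor2, motor3

RE = RunEngine({})
bec = BestEffortCallback()
RE.subscribe(bec)

# Same 11 x 101 x 101 shape as the failing scans, with no snaking.
RE(grid_scan([det4],
             motor1, -1, 1, 11,
             motor2, -1, 1, 101,
             motor3, -1, 1, 101,
             snake_axes=False))
```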
So I think that brings me to: it's not likely the RE worker itself that is accumulating memory; we probably need to dig into queueserver.
Executive Summary
This is an ongoing investigation into recurring symptoms that seem very similar to #41: namely, some scans, particularly large ones (such as 3D scans), fail to complete.
Initial report, symptoms, and troubleshooting
On or about 2022-08-05, we began receiving reports that scans were halting, preventing overnight queues from completing.
The initial scan which failed was itself a 3D scan and was enqueued following a 3D scan.
The system was restarted and the queue set to run.
A few hours later the system was again in the halted state; there was no indication of an error in the re-manager logs, but it had been several minutes since the last data point had been printed to the table in the logs.
Some basic troubleshooting was performed:
- `qserver status`: returned expected output, that the queue was running
- `docker stats --no-stream`: nothing crazy on memory usage; re-manager was spinning at 100% CPU usage
- `qserver re pause immediate`: `Failed to pause the running plan: Timeout occurred`
- `qserver manager stop safe off`: caused Watchdog messages similar to #41

At this point, the docker containers were restarted and started as expected.
@ddkohler noted:
@emkaufman reported:
When I arrived in lab on 2022-08-08, I took a quick look and confirmed that the fixes to #41 were in place, so this is not seemingly the same issue (though it is similar). I noted that the Docker desktop app indicated the containers had been built "22 minutes ago" at ~9:50 and inquired whether they had been rebuilt. Nobody had rebuilt them, and they had been running for 15 hours at that point, so it later became evident that the "22 minutes ago" was simply never updated.
On 2022-08-11 @emkaufman updated:
Actions taken:
- While investigation is ongoing, avoid taking the scans which fail on the lab machine as much as possible.
- Construct a test Windows machine using an old office computer that can hopefully reproduce the symptoms we see and provide a mechanism to address these concerns with minimal interruption to the lab machines when possible. @pcruzparri is in charge of bringing that machine online and doing initial reproduction testing.