wright-group / bluesky-in-a-box

Bluesky services in docker containers, for use in Wright Group experimental orchestration.

Potential memory problems on Waldo #42

Open ksunden opened 2 years ago

ksunden commented 2 years ago

Executive Summary

This is an ongoing investigation into recurring symptoms that seem very similar to #41. Namely, some scans, particularly large ones (such as 3D scans), fail to complete.

Initial report, Symptoms, troubleshooting

On or about 2022-08-05, reports came in that scans were halting, preventing overnight queues from completing.

The initial scan that failed was itself a 3D scan and had been enqueued following another 3D scan.

The system was restarted and the queue set to run.

The halted state recurred a few hours later. At that point there was no indication of an error in the re-manager logs, but several minutes had passed since the last data point had been printed to the table in the logs.

Some basic troubleshooting was performed:

At this point, the Docker containers were restarted and came back up as expected.

@ddkohler noted:

  • last line of log lists twin1 position as 7517 nm, but twin1 had already progressed to 7541 nm (i.e. twin1 made it to the next data point)
  • last line of log lists hp/filter position as 549 nm. I did see the filter was still at this position. I did not check if the hp moved or not.

These three motors are part of the inner loop of the acquisition, so the presence or absence of movement helps identify where the RE tripped up. Just a suggestion to track these numbers the next time we see a stall.

@emkaufman reported:

I ended up not adding a long scan to run overnight as I wanted to start a new scan first thing tomorrow morning. I observed the problem one more time today. I noticed that I have only seen this issue occur when I run scans with 3 axes.

When I arrived in lab on 2022-08-08, I took a quick look and confirmed that the fixes for #41 were in place, so this does not appear to be the same issue (though it is similar). I noted that the Docker desktop app indicated the containers had been built "22 minutes ago" at ~9:50 and inquired whether they had been rebuilt. Nobody had rebuilt them, and they had been running for 15 hours at that point, so it later became evident that the "22 minutes ago" was simply never updated.

On 2022-08-11 @emkaufman updated:

Multiple 2D scans have been stalling on Bluesky recently. An observation about the 2D scans that are stalling: they always have another scan queued AFTER them. I'm trying to remember if this was the case with the 3D scans that were stalling as well.

Actions taken:

While the investigation is ongoing, avoid running the failure-prone scans on the lab machine as much as possible.

Construct a test Windows machine from an old office computer that can hopefully reproduce the symptoms we see, providing a way to investigate with minimal interruption to the lab machines. @pcruzparri is in charge of bringing that machine online and doing the initial reproduction testing.

ksunden commented 2 years ago

Update:

We have found two sources of higher than expected memory usage by the Bluesky Run Engine:

bluesky/bluesky#1513 removes a cache of status tasks that was largely unnecessary and caused every point to add an entry to a collection that was never emptied over the course of a scan. This caused a steady increase in memory usage during long acquisitions, which eventually caused the Run Engine to fail.

bluesky/bluesky#1543 uses a more memory-efficient representation for grid scans that do not snake their axes. The previous representation caused a large increase in memory usage when starting large grid scans (a couple hundred MB for an 11x101x101 scan).
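To illustrate the general pattern behind the second fix (this is only a sketch of the eager-vs-lazy trade-off, not the actual bluesky internals): when a grid does not snake its axes, each point is fully determined by its index, so points can be generated on demand instead of being materialized up front.

import itertools
import sys

# Illustrative only: not bluesky's implementation, just the memory pattern
# that a more compact grid representation avoids.
shape = (11, 101, 101)                   # the 112,211-point scan discussed here
axes = [list(range(n)) for n in shape]   # stand-ins for the real axis positions

# Eager: build the full list of points before the scan starts.
eager = list(itertools.product(*axes))
approx_bytes = sys.getsizeof(eager) + sum(sys.getsizeof(p) for p in eager)
print(f"materialized {len(eager)} points, ~{approx_bytes / 1e6:.0f} MB of tuples alone")

# Lazy: yield one point at a time; the memory cost stays roughly constant.
for point in itertools.product(*axes):
    pass  # hand each point to the scan machinery as it is needed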

ddkohler commented 2 years ago

I am using grid_scan_wp with shape 11x101x101 as a test of memory performance on the test computer. With the aforementioned fixes (bluesky commit 4cb0c2), the scan ran for ~99k data points (~88% complete).

[figure: re-manager memory usage over the course of the scan]

The re-manager log reported "Unrecognized input header: 99" after the last recorded data point. The Docker app reported seeing no containers, but the containers could still be queried for status through the CLI (I imagine they just became unresponsive to the Docker app, so it thought they weren't there).

I will try to perform this same test on bluesky v1.9 to get a sense of how much of the problem has been alleviated.
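For reference, a memory trace like the one above can be collected without touching the containers themselves. The sketch below is one possible approach; the container name "re-manager" is an assumption and should be adjusted to whatever `docker ps` reports for this compose setup.

import csv
import subprocess
import time

# Poll Docker for the re-manager container's memory usage once per minute
# and append it to a CSV for later plotting.
CONTAINER = "re-manager"  # assumed container name

with open("re_manager_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "mem_usage"])
    while True:
        result = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
            capture_output=True,
            text=True,
            check=True,
        )
        writer.writerow([time.time(), result.stdout.strip()])
        f.flush()
        time.sleep(60)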

ksunden commented 2 years ago

That is a rate of ~2 MB/min, which is evidently enough to be problematic, but significantly better than before.

ksunden commented 2 years ago

I am curious whether count sees the same accumulation or not; that would narrow down where to look for this one. (Also the built-in grid_scan vs grid_scan_wp, but I honestly don't see how that would matter.)

ddkohler commented 2 years ago

I can confirm the recent bluesky fixes have greatly reduced the memory leakage. Here is the grid_scan_wp result from bluesky 1.9.0: in this run, the scan "failed" after ~2h40m (~79k points). The failure was not a crash; rather, data acquisition slowed by more than an order of magnitude. The table below compares the two versions.

                               bluesky v1.9.0   bluesky v1.10.0
nominal memory (MB)            230              220
memory increase at start (MB)  182              66
memory at crash (MB)           3060             764
scan percentage at "crash"     70%              88%
scan runtime                   ~2h40m           ~4h
MB gained / min                16.5             2

ddkohler commented 2 years ago

I ran a 112,211-point count plan under bluesky 1.10.0. Memory grew steadily from 230 MB at the start of the scan to 1300 MB at the end. Notably, memory did not go down after the scan finished; it stayed at 1280 MB.
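A standalone approximation of that test (outside queueserver and Docker) might look like the sketch below. This is not the script that was actually run; the simulated detector and the use of psutil for the measurement are assumptions.

import psutil
from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback
from bluesky.plans import count
from ophyd.sim import det

RE = RunEngine({})
RE.subscribe(BestEffortCallback())   # live table enabled, as on the lab machine

# Record the process resident set size before and after a long count plan.
proc = psutil.Process()
print(f"RSS before: {proc.memory_info().rss / 1e6:.0f} MB")
RE(count([det], num=112211))         # same point count as the 11x101x101 grid
print(f"RSS after:  {proc.memory_info().rss / 1e6:.0f} MB")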

ksunden commented 2 years ago

Current suspicions:

Regardless, the next step is to try this out without the live table/BEC enabled. If the problem still presents, then probably neither of these is the culprit (meaning back to the drawing board a bit), but if it goes away then we have a short-term workaround and some very specific areas to look at.

ksunden commented 2 years ago

I have a test running without BEC loaded. I will note that, for some technical and some laziness reasons, it also disables the insertion of documents into mongo, so that should be considered a possible additional cause if the problem is alleviated. It is not as clean an "if I do X then it must be Y" test as I would like, but not that far off either.
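For context, both the live table/BEC and the mongo insertion are callbacks subscribed to the Run Engine in the worker startup, which is why this test ends up toggling both at once. A minimal sketch of that subscription pattern follows; the mongo callback here is a stand-in, not the exact serializer used in bluesky-in-a-box.

from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback

RE = RunEngine({})

# Both consumers are RE subscriptions; leaving a subscribe() call out
# removes that consumer from the test.
bec = BestEffortCallback()
bec_token = RE.subscribe(bec)                 # omit to run without BEC/live table

def mongo_serializer(name, doc):
    """Stand-in for the callback that writes documents to mongo."""
    pass

mongo_token = RE.subscribe(mongo_serializer)  # omit to skip mongo insertion

# Either consumer can also be detached later:
# RE.unsubscribe(bec_token)
# RE.unsubscribe(mongo_token)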

ksunden commented 2 years ago

A test with a standalone script, ophyd.sim hardware, and the built-in bluesky grid_scan showed virtually no memory accumulation (a little bit of settling at the start, but completely stable after that, including with BEC enabled, admittedly with the change to its retention of output lines).
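The script was roughly of the following shape (a reconstruction under assumptions, not the exact script; the detector choice and scan bounds are placeholders). Memory can be watched externally, e.g. with the psutil or docker-stats snippets earlier in this thread.

from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback
from bluesky.plans import grid_scan
from ophyd.sim import det, motor1, motor2, motor3

RE = RunEngine({})
RE.subscribe(BestEffortCallback())

# Built-in grid_scan over simulated motors, matching the 11x101x101 shape
# used in the earlier tests, with no snaking of axes.
RE(
    grid_scan(
        [det],
        motor1, 0, 10, 11,
        motor2, 0, 10, 101,
        motor3, 0, 10, 101,
        snake_axes=False,
    )
)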

So I think that brings me to "it's not likely the RE worker itself that is accumulating memory; I probably need to dig into queueserver."