wright-group / bluesky-in-a-box

Bluesky services in docker containers, for use in Wright Group experimental orchestration.

Potential memory problems on Waldo #42

Open ksunden opened 2 years ago

ksunden commented 2 years ago

Executive Summary

This is an ongoing investigation into recurring symptoms that seem very similar to #41. Namely, some scans, particularly large ones (such as 3D scans), fail to complete.

Initial report, Symptoms, troubleshooting

On or about 2022-08-05, reports came in that scans were halting, preventing overnight queues from completing.

The initial scan that failed was itself a 3D scan and had been enqueued following another 3D scan.

The system was restarted and the queue set to run.

The halted state recurred a few hours later. At that point there was no indication of an error in the re-manager logs, but several minutes had passed since the last data point had been printed to the table in the logs.

Some basic troubleshooting was performed:

At this point, the Docker containers were restarted and came back up as expected.

@ddkohler noted:

  • last line of log lists twin1 position as 7517 nm, but twin1 had already progressed to 7541 nm (i.e. twin1 made it to the next data point)
  • last line of log lists hp/filter position as 549 nm. I did see the filter was still at this position. I did not check if the hp moved or not.

These three motors are part of the inner loop of the acquisition, so the presence or absence of movement helps identify where the RE tripped up. Just a suggestion to track these numbers the next time we see a stall.

@emkaufman reported:

I ended up not adding a long scan to run overnight as I wanted to start a new scan first thing tomorrow morning. I observed the problem one more time today. I noticed that I have only seen this issue occur when I run scans with 3 axes.

When I arrived in lab on 2022-08-08, I took a quick look and confirmed that the fixes for #41 were in place, so this does not appear to be the same issue (though it is similar). I noted that the Docker desktop app indicated the containers had been built "22 minutes ago" at ~9:50 and inquired whether they had been rebuilt. Nobody had rebuilt them, and they had been running for 15 hours at that point, so it later became evident that the "22 minutes ago" was simply never updated.

On 2022-08-11 @emkaufman updated:

Multiple 2D scans have been stalling on Bluesky recently. An observation about the 2D scans that are stalling: they always have another scan queued AFTER them. I'm trying to remember if this was the case with the 3D scans that were stalling as well.

Actions taken:

While the investigation is ongoing, avoid running the failure-prone scans on the lab machine as much as possible.

Construct a test Windows machine from an old office computer that can hopefully reproduce the symptoms we see, providing a way to investigate with minimal interruption to the lab machines. @pcruzparri is in charge of bringing that machine online and doing the initial reproduction testing.

ksunden commented 2 years ago

Update:

We have found two sources of higher than expected memory usage by the Bluesky Run Engine:

bluesky/bluesky#1513 removes a cache of status tasks that was largely unnecessary and caused every point to add an entry to a collection that was never emptied over the course of a scan. This caused a steady increase in memory usage during long acquisitions, which eventually caused the Run Engine to fail.

bluesky/bluesky#1543 uses a more memory-efficient representation for grid scans that do not snake their axes. The previous representation caused a large increase in memory usage when starting large grid scans (a couple hundred MB for an 11x101x101 scan).
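To illustrate the general pattern behind the second fix (this is only a sketch of the eager-vs-lazy trade-off, not the actual bluesky internals): when a grid does not snake its axes, each point is fully determined by its index, so points can be generated on demand instead of being materialized up front.

import itertools
import sys

# Illustrative only: not bluesky's implementation, just the memory pattern
# that a more compact grid representation avoids.
shape = (11, 101, 101)                   # the 112,211-point scan discussed here
axes = [list(range(n)) for n in shape]   # stand-ins for the real axis positions

# Eager: build the full list of points before the scan starts.
eager = list(itertools.product(*axes))
approx_bytes = sys.getsizeof(eager) + sum(sys.getsizeof(p) for p in eager)
print(f"materialized {len(eager)} points, ~{approx_bytes / 1e6:.0f} MB of tuples alone")

# Lazy: yield one point at a time; the memory cost stays roughly constant.
for point in itertools.product(*axes):
    pass  # hand each point to the scan machinery as it is needed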

ddkohler commented 2 years ago

I am using grid_scan_wp with shape 11x101x101 as a test of memory performance on the test computer. With the aforementioned fixes (bluesky commit 4cb0c2), the scan ran for ~99k data points (~88% complete).

[figure: re-manager memory usage over the course of the scan]

The re-manager log reported "Unrecognized input header: 99" after the last recorded data point. The Docker app reported seeing no containers, but the containers could still be queried for status through the CLI (I imagine they just became unresponsive to the Docker app, so it thought they weren't there).

I will try to perform this same test on bluesky v1.9 to get a sense of how much of the problem has been alleviated.
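For reference, a memory trace like the one above can be collected without touching the containers themselves. The sketch below is one possible approach; the container name "re-manager" is an assumption and should be adjusted to whatever `docker ps` reports for this compose setup.

import csv
import subprocess
import time

# Poll Docker for the re-manager container's memory usage once per minute
# and append it to a CSV for later plotting.
CONTAINER = "re-manager"  # assumed container name

with open("re_manager_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "mem_usage"])
    while True:
        result = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
            capture_output=True,
            text=True,
            check=True,
        )
        writer.writerow([time.time(), result.stdout.strip()])
        f.flush()
        time.sleep(60)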

ksunden commented 2 years ago

That is a rate of ~2 MB/min, which is evidently enough to be problematic, but significantly better than before.

ksunden commented 2 years ago

I am curious whether count sees the same accumulation or not; that would narrow down where to look for this one. (Also the built-in grid_scan vs grid_scan_wp, but I honestly don't see how that would matter.)

ddkohler commented 2 years ago

I can confirm the recent bluesky fixes have greatly reduced the memory leakage. Here is the grid_scan_wp result from bluesky 1.9.0: in this run, the scan "failed" after ~2h40m (~79k points). The failure was not a crash; rather, data acquisition slowed by more than an order of magnitude. The table below compares the two versions.

                               bluesky v1.9.0   bluesky v1.10.0
nominal memory (MB)            230              220
memory increase at start (MB)  182              66
memory at crash (MB)           3060             764
scan percentage at "crash"     70%              88%
scan runtime                   ~2h40m           ~4h
MB gained / min                16.5             2

ddkohler commented 2 years ago

I ran a 112,211-point count plan under bluesky 1.10.0. Memory grew steadily from 230 MB at the start of the scan to 1300 MB at the end. Notably, memory did not go down after the scan finished; it stayed at 1280 MB.
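A standalone approximation of that test (outside queueserver and Docker) might look like the sketch below. This is not the script that was actually run; the simulated detector and the use of psutil for the measurement are assumptions.

import psutil
from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback
from bluesky.plans import count
from ophyd.sim import det

RE = RunEngine({})
RE.subscribe(BestEffortCallback())   # live table enabled, as on the lab machine

# Record the process resident set size before and after a long count plan.
proc = psutil.Process()
print(f"RSS before: {proc.memory_info().rss / 1e6:.0f} MB")
RE(count([det], num=112211))         # same point count as the 11x101x101 grid
print(f"RSS after:  {proc.memory_info().rss / 1e6:.0f} MB")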

ksunden commented 2 years ago

Current suspicions:

Regardless, the next step is to try this out without the live table/BEC enabled. If the problem still presents, then probably neither of these is the culprit (meaning back to the drawing board a bit), but if it goes away then we have a short-term workaround and some very specific areas to look at.

ksunden commented 2 years ago

I have a test running without BEC loaded. I will note that, for some technical and some laziness reasons, it also disables the insertion of documents into mongo, so that should be considered a possible additional cause if the problem is alleviated. It is not as clean an "if I do X then it must be Y" test as I would like, but not that far off either.
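For context, both the live table/BEC and the mongo insertion are callbacks subscribed to the Run Engine in the worker startup, which is why this test ends up toggling both at once. A minimal sketch of that subscription pattern follows; the mongo callback here is a stand-in, not the exact serializer used in bluesky-in-a-box.

from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback

RE = RunEngine({})

# Both consumers are RE subscriptions; leaving a subscribe() call out
# removes that consumer from the test.
bec = BestEffortCallback()
bec_token = RE.subscribe(bec)                 # omit to run without BEC/live table

def mongo_serializer(name, doc):
    """Stand-in for the callback that writes documents to mongo."""
    pass

mongo_token = RE.subscribe(mongo_serializer)  # omit to skip mongo insertion

# Either consumer can also be detached later:
# RE.unsubscribe(bec_token)
# RE.unsubscribe(mongo_token)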

ksunden commented 2 years ago

A test with a standalone script, ophyd.sim hardware, and the built-in bluesky grid_scan showed virtually no memory accumulation (a little bit of settling at the start, but completely stable after that, including with BEC enabled, admittedly with the change to its retention of output lines).
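The script was roughly of the following shape (a reconstruction under assumptions, not the exact script; the detector choice and scan bounds are placeholders). Memory can be watched externally, e.g. with the psutil or docker-stats snippets earlier in this thread.

from bluesky import RunEngine
from bluesky.callbacks.best_effort import BestEffortCallback
from bluesky.plans import grid_scan
from ophyd.sim import det, motor1, motor2, motor3

RE = RunEngine({})
RE.subscribe(BestEffortCallback())

# Built-in grid_scan over simulated motors, matching the 11x101x101 shape
# used in the earlier tests, with no snaking of axes.
RE(
    grid_scan(
        [det],
        motor1, 0, 10, 11,
        motor2, 0, 10, 101,
        motor3, 0, 10, 101,
        snake_axes=False,
    )
)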

So I think that brings me to "it's not likely the RE worker itself that is accumulating memory; I probably need to dig into queueserver."