xtdb / xtdb

An immutable SQL database for application development, time-travel reporting and data compliance. Developed by @juxt
https://xtdb.com
Mozilla Public License 2.0

Investigate process hanging on Azure benchmark #3770

Open tggreene opened 2 weeks ago

tggreene commented 2 weeks ago

In between the OOMKilled issues (#3769), we've also encountered a strange scenario on Azure benchmark runs while attempting to run for 24 hours. Essentially:

See some context from the comment here.

Graphs

"node-3" in these following graphs - can see it stops reporting auctionmark meters but continues reporting memory usage: Image Image

Thread states of the node: [Image: thread states]

TODO

danmason commented 2 weeks ago

On a 24-hour run we encountered this issue again (after bumping the memory reservation). A set of thread stacks is in the Slack channel, but of particular interest are the "Blocked" threads:

xtdb-tx-subscription-pool-1-thread-2  Blocked CPU use on sample: 0ms
  java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
  com.github.benmanes.caffeine.cache.BoundedLocalCache.remap(BoundedLocalCache.java:2853)
  com.github.benmanes.caffeine.cache.BoundedLocalCache.compute(BoundedLocalCache.java:2803)
  com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:409)
  com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:295)
  xtdb.buffer_pool$update_evictor_key.invokeStatic(buffer_pool.clj:316)
  xtdb.buffer_pool$update_evictor_key.invoke(buffer_pool.clj:314)
  xtdb.buffer_pool.RemoteBufferPool.getBuffer(buffer_pool.clj:333)
  xtdb.buffer_pool$open_record_batch.invokeStatic(buffer_pool.clj:533)
  xtdb.buffer_pool$open_record_batch.invoke(buffer_pool.clj:532)
  xtdb.operator.scan.ArrowMergePlanPage.load_page(scan.clj:349)
  ...

This isn't the first time we've seen issues in there. Perhaps some kind of race condition - I wonder whether something has changed in that code, or whether something in Azure is causing issues.
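To make the "Blocked" state above concrete: ConcurrentHashMap.compute (which backs Caffeine's asMap view) holds a lock on the key's hash bin while the remapping function runs, so if that function stalls, every other thread computing against the same key parks behind it. A minimal, self-contained sketch of that behaviour (illustrative only - none of this is xtdb code):

(import '(java.util.concurrent ConcurrentHashMap CountDownLatch)
        '(java.util.function BiFunction))

;; wrap a Clojure fn as a BiFunction for ConcurrentHashMap.compute
(defn remap ^BiFunction [f]
  (reify BiFunction (apply [_ k v] (f k v))))

(let [m     (ConcurrentHashMap.)
      latch (CountDownLatch. 1)
      ;; t1's remapping function stalls on the latch while holding the bin lock
      t1    (Thread. #(.compute m "buffer-key"
                                (remap (fn [_ _] (.await latch) :done))))
      ;; t2 computes against the same key and has to wait for that lock
      t2    (Thread. #(.compute m "buffer-key" (remap (fn [_ v] v))))]
  (.start t1)
  (Thread/sleep 100)
  (.start t2)
  (Thread/sleep 100)
  (println (.getState t2)) ;; => BLOCKED, like the xtdb-tx-subscription thread above
  (.countDown latch)
  (.join t1)
  (.join t2))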

danmason commented 2 weeks ago

Update from this morning:

Some graphs of them all - you can see they still report memory usage and the like, but there is no auctionmark activity, and storage usage is non-existent as well. [Images: memory usage, auctionmark activity and storage usage graphs]

Next steps from me:

danmason commented 2 weeks ago

Collected logs & thread stacks for all of the hanging nodes on this branch - didn't really capture much else from Yourkit, so scaled down the failed benchmark run and node pool.

Observations from the Logs/Stacks

Some notes from these logs/stacks:

Node 3 Error Stack

Oct 09, 2024 9:34:22 PM com.github.benmanes.caffeine.cache.LocalAsyncCache lambda$handleCompletion$7
WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171)
    at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
    at java.base/java.nio.file.Files.readAttributes(Files.java:1853)
    at java.base/java.nio.file.Files.size(Files.java:2462)
    at xtdb.util$size_on_disk.invokeStatic(util.clj:279)
    at xtdb.util$size_on_disk.invoke(util.clj:278)
    at xtdb.buffer_pool.RemoteBufferPool$fn__8619$fn__8628.invoke(buffer_pool.clj:350)
    at clojure.lang.FnInvokers.invokeOO(FnInvokers.java:247)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
    ... 8 more
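For reference, the failing frame in xtdb.util$size_on_disk bottoms out in a plain java.nio.file.Files/size call, which throws exactly this exception when the file has already been removed from the disk cache directory by the time its size is queried. A minimal sketch of that behaviour (illustrative only - the path is made up):

(import '(java.nio.file Files Paths NoSuchFileException))

;; Files/size throws NoSuchFileException once the underlying file is gone -
;; e.g. if it was deleted out from under the reader between the cache lookup
;; and the size check. The path here is illustrative only.
(try
  (Files/size (Paths/get "/tmp/definitely-missing.arrow" (into-array String [])))
  (catch NoSuchFileException e
    (println "file already gone:" (.getFile e))))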

Further log investigation

Gathering the times each node stopped reporting auctionmark gauges:

3:49pm:

10:35pm:

9:35pm:

danmason commented 1 week ago

Just wanted to check/validate the size of the local disk cache, to see if we're running up against its limits - I can confirm we're only using 24% of the volume claim in total, so I don't think any of the nodes would have been attempting to evict from their local cache.

Post chatting to Jeremy, going to kick off a 24-hour bench run with:

Shall observe how these run and whether the nodes still hang or have issues somewhere else. Will take a short look on Saturday (essentially just "did it break or did it not?", and point YourKit at it if it did).

danmason commented 1 week ago

Update from the benchmark run at the end of last week - I kicked off the run described above (with the Caffeine-based local disk cache evictor entirely removed) and observed the following after 8 hours of runtime (prior to me spinning down the node pool so it didn't run all weekend):

Now, given that none of them had a hanging process after around 8 hours of run time (compared to the previous run, where pretty much all of them had hanging processes by that point, if not much earlier), I'm at least somewhat confident in saying that "the hanging process issue lies within the local disk cache evictor", since we do not seem to see it without one.

I believe we cannot just remove the evictor and release that - but it is worth us either:

danmason commented 1 week ago

Following discussion around this issue and some additional looking through the thread stacks, we're fairly sure there are situations which can cause deadlocks within the local disk cache evictor code. As such, it's useful for us to dig back into that implementation and take note of what currently takes out locks against the evictor, and what may potentially cause issues.

Overview of the Local Disk Cache Evictor

For some context, I want to go into what exactly the evictor is for, what it does and how it does it - after which point we can talk in terms of locking/synchronization.

Under the hood, for the purpose of keeping the local disk cache underneath a specified "max size", we make use of an async Caffeine cache that represents the local disk cache and evicts any files not currently in use when we go above our configured size:
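(The actual snippet isn't reproduced here; the following is only a rough sketch of the general shape of such a cache - the weigher, the removal handling and max-cache-bytes are assumptions for illustration, not the real buffer-pool configuration.)

(import '(com.github.benmanes.caffeine.cache Caffeine Weigher RemovalListener)
        '(java.nio.file Files))

;; Rough sketch only - not the real xtdb.buffer-pool configuration.
(defn ->disk-cache-evictor [^long max-cache-bytes]
  (-> (Caffeine/newBuilder)
      (.maximumWeight max-cache-bytes)
      ;; weigh each entry by the size of the on-disk file it tracks
      (.weigher (reify Weigher
                  (weigh [_ _k entry] (int (:file-size entry)))))
      ;; when Caffeine evicts an entry, delete the corresponding local file
      ;; (the real code only evicts files that are not currently in use)
      (.removalListener (reify RemovalListener
                          (onRemoval [_ _k entry _cause]
                            (some-> (:path entry) (Files/deleteIfExists)))))
      ;; async cache - its .asMap view is what the compute calls below go through
      (.buildAsync)))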

NOTE: We use update-evictor-key for these operations, which is essentially a higher-order function that fetches the underlying map from the cache and calls compute on it, applying some "update function" (see the sketch below).
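As a rough idea of its shape (assumed for illustration - the real definition lives in xtdb.buffer-pool, per the stack trace above):

(import '(com.github.benmanes.caffeine.cache AsyncCache)
        '(java.util.function BiFunction))

;; Assumed shape of the helper, for illustration only: it runs the update
;; function against the cache's map view, i.e. under the per-key compute lock.
(defn update-evictor-key [^AsyncCache evictor k f]
  (.compute (.asMap evictor) k
            (reify BiFunction
              (apply [_ k v] (f k v)))))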

Knowing the above, let's dive into the points at which we take out a lock on the local disk cache evictor, and what we're doing at each point.

Concerns

Of note, I'm concerned about the atom watcher / weight-adjusting part of this, particularly the part that adjusts the cache weight within another compute call.
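To make that concrete, this is roughly the shape being described (illustrative only - the function name is made up, and it reuses the assumed update-evictor-key sketch above):

(import '(com.github.benmanes.caffeine.cache AsyncCache))

;; Illustrative shape of the concern, not the actual xtdb code. The remapping
;; function runs while the cache holds the per-key compute lock; adjusting the
;; maximum weight from inside it can kick off eviction work, which may need to
;; remove other entries - potentially ones that are themselves under compute on
;; other threads. Under the wrong interleaving, each thread ends up waiting on
;; a lock another thread holds, and they all park as "Blocked".
(defn risky-weight-adjust! [^AsyncCache evictor k new-max-weight]
  (update-evictor-key evictor k
    (fn [_ entry]
      ;; calling back into the cache's eviction policy while already holding
      ;; the compute lock for `k` - the suspected deadlock shape
      (some-> (.. evictor synchronous policy eviction)
              (.orElse nil)
              (.setMaximum (long new-max-weight)))
      entry)))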

danmason commented 2 days ago

Following the above, I spent some time in the code pulling the aforementioned "max weight adjustment" code out of the buffer-release-fn, where it was nested inside of another compute call (to avoid taking a lock from inside another lock).
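Roughly, the shape of that change is as follows (an illustrative sketch, not the actual diff - the names and the Caffeine policy call are assumptions):

(import '(com.github.benmanes.caffeine.cache AsyncCache))

;; Illustrative sketch of the change, not the actual diff. The maximum-weight
;; adjustment now happens after the compute has returned and its per-key lock
;; has been released, rather than inside the remapping function.
(defn buffer-release! [^AsyncCache evictor k new-max-weight]
  ;; update the entry for `k` under its compute lock (bookkeeping elided)...
  (update-evictor-key evictor k (fn [_ entry] entry))
  ;; ...then adjust the cache's maximum weight outside of any compute call
  (some-> (.. evictor synchronous policy eviction)
          (.orElse nil)
          (.setMaximum (long new-max-weight))))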

After this, I kicked off a 24-hour run over the weekend - during the course of this, I made the following observations:

Compared to the previous hanging nodes we observed (see above), these both took quite a while to run into a process hang, and one DID run the whole time - which is at least somewhat encouraging, though given the timing factor of these kinds of issues I'd be wary of calling it an obvious improvement.

Again, I took the logs and thread stacks from these nodes prior to scaling them down, and these are available on a branch of my XTDB fork. Shall spend some time reading through the thread stacks in particular, see if the issue is indeed the same/related and if there are any other steps I can take.

Observations

From a quick look through the thread stacks, I can see the following (on both nodes with a hanging process):