tggreene opened 2 weeks ago
On a 24-hour run, we have encountered this issue again (after bumping the memory reservation). In the Slack channel there is a set of thread stacks, but of particular interest are the "Blocked" threads:
xtdb-tx-subscription-pool-1-thread-2 Blocked CPU use on sample: 0ms
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
com.github.benmanes.caffeine.cache.BoundedLocalCache.remap(BoundedLocalCache.java:2853)
com.github.benmanes.caffeine.cache.BoundedLocalCache.compute(BoundedLocalCache.java:2803)
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:409)
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:295)
xtdb.buffer_pool$update_evictor_key.invokeStatic(buffer_pool.clj:316)
xtdb.buffer_pool$update_evictor_key.invoke(buffer_pool.clj:314)
xtdb.buffer_pool.RemoteBufferPool.getBuffer(buffer_pool.clj:333)
xtdb.buffer_pool$open_record_batch.invokeStatic(buffer_pool.clj:533)
xtdb.buffer_pool$open_record_batch.invoke(buffer_pool.clj:532)
xtdb.operator.scan.ArrowMergePlanPage.load_page(scan.clj:349)
...
This is not the first time we've seen issues in there. Perhaps some kind of race condition - I wonder whether something has changed in the code there, or whether something in Azure is causing issues.
Update from this morning:
Some graphs of them all - we can see they still report memory usage and the like, but there is no auctionmark activity, and storage usage is non-existent as well:
Next steps from me:
Collected logs & thread stacks for all of the hanging nodes on this branch - didn't really capture much else from YourKit, so scaled down the failed benchmark run and node pool.
Some notes from these logs/stacks:
- `compute` and `computeIfPresent` calls within `getBuffer` - I believe we have in the past encountered race conditions/issues with this, though I'd have to dig through Slack for those.

Oct 09, 2024 9:34:22 PM com.github.benmanes.caffeine.cache.LocalAsyncCache lambda$handleCompletion$7
WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171)
at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.base/java.nio.file.Files.readAttributes(Files.java:1853)
at java.base/java.nio.file.Files.size(Files.java:2462)
at xtdb.util$size_on_disk.invokeStatic(util.clj:279)
at xtdb.util$size_on_disk.invoke(util.clj:278)
at xtdb.buffer_pool.RemoteBufferPool$fn__8619$fn__8628.invoke(buffer_pool.clj:350)
at clojure.lang.FnInvokers.invokeOO(FnInvokers.java:247)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
... 8 more
Gathering the times each node stopped reporting auctionmark gauges:
3:49pm:
10:35pm:
9:35pm:
Just wanted to check/validate the size of the local disk cache, to see if we're running up against its limits - I can confirm we're only using 24% of the volume claim in total, so I don't think any of the nodes would have been attempting to evict from their local cache.
Post chatting to Jeremy, going to kick off a 24-hour bench run with:
Shall observe how these run, and whether the nodes still hang/have issues somewhere else. Will take a short look on Saturday (essentially just "did it break or did it not?", and point YourKit at it if it did).
Update from the benchmark run at the end of last week - I kicked off the run described above (with the Caffeine-based local disk cache evictor entirely removed), and observed the following after 8 hours of runtime (prior to me spinning down the nodepool so it didn't run all weekend):
Now, given that none of them had a hanging process after around 8 hours of run time (compared to the previous run, where pretty much all of them had hanging processes by that point, if not much earlier), I'm at least somewhat confident in saying that the hanging-process issue lies within the local disk cache evictor, given that we do not seem to see it without one.
I believe we cannot just remove the evictor and release that - but it is worth us either:
Post discussion around this issue and some additional looking through the thread stacks, we're fairly sure there are some situations which can cause deadlocks within the local disk cache evictor code. As such, it's useful for us to dig back into that implementation and take note of what currently takes out locks against the evictor, and what may potentially cause issues.
For some context, I want to go into what exactly the evictor does / is for, and how it does it - after which point we can talk in terms of locking/synchronization.
Under the hood, for the purpose of keeping the local disk cache under a specified "max size", we make use of an async Caffeine cache representing the local disk cache, evicting any files not currently in use when we go above our configured size:
- Files which are currently in use ("pinned") are given a `weight` of `0`, and so will not be evicted.
- Files are marked `pinned?` when we fetch the buffer for a key (`getBuffer`), pinning just prior to them being memory mapped.
- Files are released via a `release-fn` passed into the allocation-manager they are using.
- We use the underlying ConcurrentHashMap implemented by Caffeine whenever we make changes to a value, utilizing the behaviour of `compute` to ensure that two concurrent edits do not happen (essentially, we implicitly block/wait on futures, see the code here).
NOTE: We use `update-evictor-key` for these operations, which is essentially a higher-order function that fetches the underlying map from the cache and calls `compute`, applying some "update function".
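To make that concrete, here is a minimal sketch - not the actual XTDB implementation; the names (`->disk-cache-evictor`, `update-evictor-key*`) and the exact value shape are assumptions - of an async Caffeine cache whose pinned entries weigh 0, plus an `update-evictor-key`-style helper that runs an update function under the per-key `compute` lock:

```clojure
(ns evictor.sketch
  (:import [com.github.benmanes.caffeine.cache AsyncCache Caffeine Weigher]
           [java.util.concurrent CompletableFuture]
           [java.util.function BiFunction]))

;; Entries are assumed to be maps like {:pinned? bool :file-size long ...}.
;; Pinned entries weigh 0, so Caffeine will never evict them; everything else
;; is weighed by its size on disk.
(defn ->disk-cache-evictor ^AsyncCache [max-weight-bytes]
  (-> (Caffeine/newBuilder)
      (.maximumWeight (long max-weight-bytes))
      (.weigher (reify Weigher
                  (weigh [_ _path entry]
                    (if (:pinned? entry) 0 (int (:file-size entry))))))
      (.buildAsync)))

;; Sketch of an `update-evictor-key`-style helper: `compute` on the cache's
;; ConcurrentMap view runs `f` under the per-key lock, implicitly joining any
;; future already stored for that key before applying the update function.
(defn update-evictor-key* [^AsyncCache evictor k f]
  (.compute (.asMap evictor) k
            (reify BiFunction
              (apply [_ _k fut]
                (let [current (some-> ^CompletableFuture fut .join)]
                  (CompletableFuture/completedFuture (f current)))))))
```

The real `update-evictor-key` lives in `buffer_pool.clj`; this sketch is only to fix the mental model for the locking discussion below.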
Knowing the above, let's dive into the points at which we take a lock on the local disk cache evictor, and what we're doing at each point.
`getBuffer`:
- We call `compute` on that key/path in the local disk cache evictor. Effectively, we take a lock over that particular key. We hold the lock over this key as we do the following:
  - `atomic-move` the temp file to its actual place on the local disk cache.
  - Return a map with the `buffer-cache-path` and `previously-pinned?` set to false.
- With the `buffer-cache-path` and `previously-pinned?` map in hand, we set about allocating the arrow buffer with the memory-mapped file, and pin it within the evictor:
  - This is done via another `compute`. The value is set to `{:pinned? true :file-size file-size :ctx {:buf buf :previously-pinned? previously-pinned?}}` - pinning the file in the evictor and setting the file-size.
  - After the `compute`, we release the lock on the key.
- We update the `!evictor-max-weight` atom - decreasing it by the file size. This will cause a response from the "atom watcher". (A sketch of the pinning step follows this list.)
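As a rough illustration only (building on the hypothetical `update-evictor-key*` helper above; `pin-buffer!` and its arguments are assumed names), the pinning step looks roughly like this - the value swap happens under the per-key `compute` lock, and the weight bookkeeping happens only after `compute` returns:

```clojure
;; Sketch: pin a newly mapped buffer in the evictor, then adjust the weight
;; budget once the per-key compute lock has been released.
(defn pin-buffer! [evictor !evictor-max-weight k buf file-size previously-pinned?]
  (update-evictor-key* evictor k
    (fn [_current]
      {:pinned? true
       :file-size file-size
       :ctx {:buf buf :previously-pinned? previously-pinned?}}))
  ;; compute has returned -> key lock released; the swap! wakes the atom watcher
  (swap! !evictor-max-weight - file-size))
```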
`buffer-release-fn`:
- We call `compute` on that key/path in the local disk cache evictor, taking a lock over that particular key.
- Within the `compute` function we are also updating the `!evictor-max-weight` atom - increasing it by the file size. This will cause a response from the "atom watcher". (See the sketch after this list.)
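Sketched with the same hypothetical names as above (the unpinning itself is my assumption of what the release does), the shape described here puts the atom update inside the `compute`'s update function, i.e. while the per-key lock is still held:

```clojure
;; Sketch of the release path as described above: note the swap! happens
;; *inside* the compute fn, so the atom watcher fires while the per-key lock
;; is still held.
(defn release-buffer-nested! [evictor !evictor-max-weight k]
  (update-evictor-key* evictor k
    (fn [current]
      (swap! !evictor-max-weight + (:file-size current 0))
      (assoc current :pinned? false))))
```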
The `!evictor-max-weight` atom watcher:
- We take the `synchronous` view of the cache, which we then fetch the eviction policy of, and call `setMaximum` with the max of either `0` or the current value of the max-size atom.
- `setMaximum` will create a lock over the `evictionLock` of the cache. (A sketch follows this list.)
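A rough sketch of that watcher - the `add-watch` wiring and names are assumed, while `synchronous`, `policy`, `eviction` and `setMaximum` are the actual Caffeine API:

```clojure
(import '[com.github.benmanes.caffeine.cache AsyncCache Policy$Eviction])

;; Sketch of the atom watcher: push the new weight budget into Caffeine via
;; the synchronous view's eviction policy. setMaximum takes the cache's
;; evictionLock internally and may evict entries on the calling thread.
(defn watch-evictor-weight! [^AsyncCache evictor !evictor-max-weight]
  (add-watch !evictor-max-weight ::resize
    (fn [_key _ref _old new-weight]
      (let [^Policy$Eviction eviction (-> evictor .synchronous .policy .eviction (.orElse nil))]
        (when eviction
          (.setMaximum eviction (long (max 0 new-weight))))))))
```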
`openArrowWriter`:
- We call `compute` on the key/path which we will write to locally, effectively creating a lock on that key. Within it:
  - We `atomic-move` the temp file `openArrowWriter` has written to, to the actual `buffer-cache-path`.
  - The value is set to `{:pinned? if-file-previously-pinned-in-evictor :file-size file-size-on-disk}`.
  - We then finish the `compute`, releasing the lock on the key.

Of note, I'm concerned about the atom watcher / weight adjusting part of this, particularly the one that is adjusting the cache weight within another `compute` call.
Following the above, I spent some time in the code pulling the aforementioned "max weight adjustment" code in the `buffer-release-fn` out of the `compute` call it was nested inside (to avoid taking a lock from inside another lock). A sketch of the resulting shape follows.
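With the same hypothetical names as the earlier sketches (and the file size passed in for brevity), the change amounts to something like this:

```clojure
;; After the change: unpin under the per-key lock, then adjust the weight
;; budget only once compute has returned and the key lock has been released,
;; so the atom watcher (and Caffeine's evictionLock) never runs under it.
(defn release-buffer! [evictor !evictor-max-weight k file-size]
  (update-evictor-key* evictor k #(assoc % :pinned? false))
  (swap! !evictor-max-weight + file-size))
```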
After this, I kicked off a 24-hour run over the weekend - during the course of this, I made the following observations:
Compared to the previous hanging nodes we observed (see above), these both took quite a while to run into a process hang, and one DID run the whole time - which is at least somewhat encouraging, though due to the timing factor of these kinds of issues I'd be wary of calling it an obvious improvement.
Again, I took the logs and thread stacks from these nodes prior to scaling them down; these are available on a branch of my XTDB fork. I shall spend some time reading through the thread stacks in particular, to see whether the issue is indeed the same/related and whether there are any other steps I can take.
From a quick look through the thread stacks, we can see the following (on both nodes with a hanging process):
- Threads blocked in `awaitTx`, called when making a query - this is likely waiting to catch up on some transaction.
- `xtdb-tx-subscription-pool-1-thread-2` is blocked on `xtdb.buffer_pool.RemoteBufferPool.getBuffer` - namely, the future within there, under `update-evictor-key` and `compute`. See previous stack examples, searching for `update_evictor_key` invocations.
In between OOMKilled issues (#3769), we've also encountered a strange scenario on Azure benchmark runs while attempting to run for 24 hours. Essentially:
See some context from the comment here.
Graphs
"node-3" in these following graphs - can see it stops reporting auctionmark meters but continues reporting memory usage:
Thread states of the node:
TODO