xtdb / xtdb

An immutable SQL database for application development, time-travel reporting and data compliance. Developed by @juxt
https://xtdb.com
Mozilla Public License 2.0

Investigate process hanging on Azure benchmark #3770

Open tggreene opened 2 weeks ago

tggreene commented 2 weeks ago

In between the OOMKilled issues (#3769), we've also encountered a strange scenario on Azure benchmark runs while attempting to run for 24 hours. Essentially:

See some context from the comment here.

Graphs

"node-3" in these following graphs - can see it stops reporting auctionmark meters but continues reporting memory usage: Image Image

Thread states of the node: [Image: thread states]

TODO

danmason commented 2 weeks ago

On a 24-hour run we encountered this issue again (after bumping the memory reservation). A set of thread stacks is in the Slack channel, but of particular interest are the "Blocked" threads:

xtdb-tx-subscription-pool-1-thread-2  Blocked CPU use on sample: 0ms
  java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
  com.github.benmanes.caffeine.cache.BoundedLocalCache.remap(BoundedLocalCache.java:2853)
  com.github.benmanes.caffeine.cache.BoundedLocalCache.compute(BoundedLocalCache.java:2803)
  com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:409)
  com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.compute(LocalAsyncCache.java:295)
  xtdb.buffer_pool$update_evictor_key.invokeStatic(buffer_pool.clj:316)
  xtdb.buffer_pool$update_evictor_key.invoke(buffer_pool.clj:314)
  xtdb.buffer_pool.RemoteBufferPool.getBuffer(buffer_pool.clj:333)
  xtdb.buffer_pool$open_record_batch.invokeStatic(buffer_pool.clj:533)
  xtdb.buffer_pool$open_record_batch.invoke(buffer_pool.clj:532)
  xtdb.operator.scan.ArrowMergePlanPage.load_page(scan.clj:349)
  ...

This isn't the first time we've seen issues in there. Perhaps some kind of race condition - I wonder whether something has changed in that code, or whether something in Azure is causing issues.
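To make the "Blocked" state above concrete: ConcurrentHashMap.compute (which backs Caffeine's asMap view) holds a lock on the key's hash bin while the remapping function runs, so if that function stalls, every other thread computing against the same key parks behind it. A minimal, self-contained sketch of that behaviour (illustrative only - none of this is xtdb code):

(import '(java.util.concurrent ConcurrentHashMap CountDownLatch)
        '(java.util.function BiFunction))

;; wrap a Clojure fn as a BiFunction for ConcurrentHashMap.compute
(defn remap ^BiFunction [f]
  (reify BiFunction (apply [_ k v] (f k v))))

(let [m     (ConcurrentHashMap.)
      latch (CountDownLatch. 1)
      ;; t1's remapping function stalls on the latch while holding the bin lock
      t1    (Thread. #(.compute m "buffer-key"
                                (remap (fn [_ _] (.await latch) :done))))
      ;; t2 computes against the same key and has to wait for that lock
      t2    (Thread. #(.compute m "buffer-key" (remap (fn [_ v] v))))]
  (.start t1)
  (Thread/sleep 100)
  (.start t2)
  (Thread/sleep 100)
  (println (.getState t2)) ;; => BLOCKED, like the xtdb-tx-subscription thread above
  (.countDown latch)
  (.join t1)
  (.join t2))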

danmason commented 2 weeks ago

Update from this morning:

Some graphs of them all - you can see they still report memory usage and the like, but there is no auctionmark activity, and storage usage is non-existent as well. [Images: memory usage, auctionmark activity and storage usage graphs]

Next steps from me:

danmason commented 2 weeks ago

Collected logs & thread stacks for all of the hanging nodes on this branch - didn't really capture much else from Yourkit, so scaled down the failed benchmark run and node pool.

Observations from the Logs/Stacks

Some notes from these logs/stacks:

Node 3 Error Stack

Oct 09, 2024 9:34:22 PM com.github.benmanes.caffeine.cache.LocalAsyncCache lambda$handleCompletion$7
WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.nio.file.NoSuchFileException: /var/lib/xtdb/buffers/disk-cache-3/tables/public$item/data/log-l01-fr53cfa70-nr5578a86-rs41028.arrow
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171)
    at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
    at java.base/java.nio.file.Files.readAttributes(Files.java:1853)
    at java.base/java.nio.file.Files.size(Files.java:2462)
    at xtdb.util$size_on_disk.invokeStatic(util.clj:279)
    at xtdb.util$size_on_disk.invoke(util.clj:278)
    at xtdb.buffer_pool.RemoteBufferPool$fn__8619$fn__8628.invoke(buffer_pool.clj:350)
    at clojure.lang.FnInvokers.invokeOO(FnInvokers.java:247)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
    ... 8 more
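For reference, the failing frame in xtdb.util$size_on_disk bottoms out in a plain java.nio.file.Files/size call, which throws exactly this exception when the file has already been removed from the disk cache directory by the time its size is queried. A minimal sketch of that behaviour (illustrative only - the path is made up):

(import '(java.nio.file Files Paths NoSuchFileException))

;; Files/size throws NoSuchFileException once the underlying file is gone -
;; e.g. if it was deleted out from under the reader between the cache lookup
;; and the size check. The path here is illustrative only.
(try
  (Files/size (Paths/get "/tmp/definitely-missing.arrow" (into-array String [])))
  (catch NoSuchFileException e
    (println "file already gone:" (.getFile e))))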

Further log investigation

Gathering the times each node stopped reporting auctionmark gauges:

3:49pm:

10:35pm:

9:35pm:

danmason commented 1 week ago

Just wanted to check/validate the size of the local disk cache, to see if we're running up against its limits - I can confirm we're only using 24% of the volume claim in total, so I don't think any of the nodes would have been attempting to evict from their local cache.

Post chatting to Jeremy, going to kick off a 24-hour bench run with:

Shall observe how these run and whether the nodes still hang or have issues somewhere else. Will take a short look on Saturday (essentially just "did it break or did it not?", and point YourKit at it if it did).

danmason commented 1 week ago

Update from the benchmark run at the end of last week - I kicked off the run described above (with the Caffeine-based local disk cache evictor entirely removed) and observed the following after 8 hours of runtime (prior to me spinning down the node pool so it didn't run all weekend):

Now, given that none of them had a hanging process after around 8 hours of run time (compared to the previous run, where pretty much all of them had hanging processes by that point, if not much earlier), I'm at least somewhat confident in saying that "the hanging process issue lies within the local disk cache evictor", since we do not seem to see it without one.

I believe we cannot just remove the evictor and release that - but it is worth us either:

danmason commented 1 week ago

Following discussion around this issue and some additional looking through the thread stacks, we're fairly sure there are situations which can cause deadlocks within the local disk cache evictor code. As such, it's useful for us to dig back into that implementation and take note of what currently takes out locks against the evictor, and what may potentially cause issues.

Overview of the Local Disk Cache Evictor

For some context, I want to go into what exactly the evictor is for, what it does and how it does it - after which point we can talk in terms of locking/synchronization.

Under the hood, for the purpose of keeping the local disk cache underneath a specified "max size", we make use of an async Caffeine cache that represents the local disk cache and evicts any files not currently in use when we go above our configured size:
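(The actual snippet isn't reproduced here; the following is only a rough sketch of the general shape of such a cache - the weigher, the removal handling and max-cache-bytes are assumptions for illustration, not the real buffer-pool configuration.)

(import '(com.github.benmanes.caffeine.cache Caffeine Weigher RemovalListener)
        '(java.nio.file Files))

;; Rough sketch only - not the real xtdb.buffer-pool configuration.
(defn ->disk-cache-evictor [^long max-cache-bytes]
  (-> (Caffeine/newBuilder)
      (.maximumWeight max-cache-bytes)
      ;; weigh each entry by the size of the on-disk file it tracks
      (.weigher (reify Weigher
                  (weigh [_ _k entry] (int (:file-size entry)))))
      ;; when Caffeine evicts an entry, delete the corresponding local file
      ;; (the real code only evicts files that are not currently in use)
      (.removalListener (reify RemovalListener
                          (onRemoval [_ _k entry _cause]
                            (some-> (:path entry) (Files/deleteIfExists)))))
      ;; async cache - its .asMap view is what the compute calls below go through
      (.buildAsync)))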

NOTE: We use update-evictor-key for these operations, which is essentially a higher-order function that fetches the underlying map from the cache and calls compute on it, applying some "update function" (see the sketch below).
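As a rough idea of its shape (assumed for illustration - the real definition lives in xtdb.buffer-pool, per the stack trace above):

(import '(com.github.benmanes.caffeine.cache AsyncCache)
        '(java.util.function BiFunction))

;; Assumed shape of the helper, for illustration only: it runs the update
;; function against the cache's map view, i.e. under the per-key compute lock.
(defn update-evictor-key [^AsyncCache evictor k f]
  (.compute (.asMap evictor) k
            (reify BiFunction
              (apply [_ k v] (f k v)))))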

Knowing the above, let's dive into the points at which we take out a lock on the local disk cache evictor, and what we're doing at each point.

Concerns

Of note, I'm concerned about the atom watcher / weight-adjusting part of this, particularly the part that adjusts the cache weight within another compute call.
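To make that concrete, this is roughly the shape being described (illustrative only - the function name is made up, and it reuses the assumed update-evictor-key sketch above):

(import '(com.github.benmanes.caffeine.cache AsyncCache))

;; Illustrative shape of the concern, not the actual xtdb code. The remapping
;; function runs while the cache holds the per-key compute lock; adjusting the
;; maximum weight from inside it can kick off eviction work, which may need to
;; remove other entries - potentially ones that are themselves under compute on
;; other threads. Under the wrong interleaving, each thread ends up waiting on
;; a lock another thread holds, and they all park as "Blocked".
(defn risky-weight-adjust! [^AsyncCache evictor k new-max-weight]
  (update-evictor-key evictor k
    (fn [_ entry]
      ;; calling back into the cache's eviction policy while already holding
      ;; the compute lock for `k` - the suspected deadlock shape
      (some-> (.. evictor synchronous policy eviction)
              (.orElse nil)
              (.setMaximum (long new-max-weight)))
      entry)))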

danmason commented 2 days ago

Following the above, I spent some time in the code pulling the aforementioned "max weight adjustment" code out of the buffer-release-fn, where it was nested inside of another compute call (to avoid taking a lock from inside another lock).
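Roughly, the shape of that change is as follows (an illustrative sketch, not the actual diff - the names and the Caffeine policy call are assumptions):

(import '(com.github.benmanes.caffeine.cache AsyncCache))

;; Illustrative sketch of the change, not the actual diff. The maximum-weight
;; adjustment now happens after the compute has returned and its per-key lock
;; has been released, rather than inside the remapping function.
(defn buffer-release! [^AsyncCache evictor k new-max-weight]
  ;; update the entry for `k` under its compute lock (bookkeeping elided)...
  (update-evictor-key evictor k (fn [_ entry] entry))
  ;; ...then adjust the cache's maximum weight outside of any compute call
  (some-> (.. evictor synchronous policy eviction)
          (.orElse nil)
          (.setMaximum (long new-max-weight))))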

After this, I kicked off a 24-hour run over the weekend - during the course of this, I made the following observations:

Compared to the previous hanging nodes we observed (see above), these both took quite a while to run into a process hang, and one DID run the whole time - which is at least somewhat encouraging, though given the timing factor of these kinds of issues I'd be wary of calling it an obvious improvement.

Again, I took the logs and thread stacks from these nodes prior to scaling them down, and these are available on a branch of my XTDB fork. Shall spend some time reading through the thread stacks in particular, see if the issue is indeed the same/related and if there are any other steps I can take.

Observations

From a quick look through the thread stacks, I can see the following (on both nodes with a hanging process):