zio / zio

ZIO — A type-safe, composable library for async and concurrent programming in Scala
https://zio.dev
Apache License 2.0

High cpu utilization of Promise.await and app degradation over time #4395

Open jsfwa opened 3 years ago

jsfwa commented 3 years ago

Hello, this issue is the result of investigating a related problem in https://github.com/zio/zio-kafka/issues/238.

For some reason, during continuous recreation of Promises, the inner state of each new Promise (created after the old one has already completed) slowly grows and starts to consume too much CPU time, which in turn significantly affects the performance of the entire application.

The actual problem is in the function shown below and its List of joiners: it starts at 0.1-2% of CPU time and keeps growing until you reset the parent effect or the app stops responding (usually up to 30%+ of CPU):

[profiler screenshots: CPU sampling of the Promise joiner list]
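
To make the suspicion concrete, here is a minimal, hypothetical model of that hot path (ToyPromise and everything in it are invented for illustration and are not ZIO's actual internals): each await registers a callback ("joiner") in a list held by the promise, and removing a waiter on interruption rebuilds that list with a linear scan.

import java.util.concurrent.atomic.AtomicReference

// Hypothetical model only, not ZIO source: waiters are kept in an immutable
// list inside an atomic reference.
final class ToyPromise[A] {
  private val joiners = new AtomicReference[List[A => Unit]](Nil)

  // Registering a joiner is O(1): prepend to the list. The returned thunk is
  // the cleanup run on interruption: it is O(n), because it rebuilds the whole
  // list to drop a single callback.
  def await(callback: A => Unit): () => Unit = {
    joiners.updateAndGet(list => callback :: list)
    () => { joiners.updateAndGet(list => list.filterNot(_ eq callback)); () }
  }

  // Completing the promise runs and clears all registered joiners.
  def succeed(value: A): Unit =
    joiners.getAndSet(Nil).foreach(_(value))
}

If many waiters pile up, or the O(n) cleanup runs very often, the time spent in that list grows accordingly, which matches the shape of the profile above.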

How it looks in a real application:

[monitoring screenshot]

Reproducible scenario (it takes at least 1-3 hours to see the full effect, but the growing time spent on the inner joiners list is visible in the CPU sampler from the beginning):

testM("stream should not go brrr") {

      def newJob(queue: Queue[Promise[Option[Throwable], Chunk[Int]]]) = ZStream {
        ZManaged.succeed {
          for {
            p <- Promise.make[Option[Throwable], Chunk[Int]]
            _ <- queue.offer(p)
            r <- p.await
          } yield r
        }
      }

      (for {
        jobQueue  <- Queue.unbounded[ZStream[Any, Throwable, Int]]
        feedQueue <- Queue.unbounded[Promise[Option[Throwable], Chunk[Int]]]

        //Actual stream of substreams
        stream = ZStream
          .fromQueue(jobQueue)
          .mapMParUnordered(50) { s =>
            //Just to make the degradation show up faster
            s.mapMPar(10)(Task(_))
              .groupedWithin(1000, 1.millis)
              .buffer(2)
              .bufferSliding(1)
              .runDrain
          }
          .runDrain

        //Feed to fulfill promises
        feed = ZStream
          .fromQueue(feedQueue)
          .mapMPar(10) { x =>
            x.succeed(Chunk(1))
          }
          .runDrain

        fb1 <- stream.fork
        fb2 <- feed.fork

        _   <- Chunk.fromIterable(0.to(10)).mapM(_ => jobQueue.offer(newJob(feedQueue)))

        _   <- fb1.zip(fb2).join
      } yield assert(true)(equalTo(true))).provideCustomLayer(zio.clock.Clock.live)
}

TL;DR: We have a stream of substreams that creates and fulfills promises one by one. Over time it starts to work very slowly and uses as much CPU as it can get; after we recreate the stream, performance returns to normal and then starts degrading again from the beginning.

iravid commented 3 years ago

Nice work investigating this @jsfwa. Looking into this now.

iravid commented 3 years ago

Having a hard time reproducing this locally with ZIO 1.0.3. Running the following entrypoint:

object Problem extends App {
  def run(args: List[String]): zio.URIO[zio.ZEnv, ExitCode] = {
    def newJob(queue: Queue[Promise[Option[Throwable], Chunk[Int]]]) =
      ZStream.repeatEffectChunkOption {
        for {
          p <- Promise.make[Option[Throwable], Chunk[Int]]
          _ <- queue.offer(p)
          r <- p.await
        } yield r
      }

    (for {
      jobQueue  <- Queue.unbounded[ZStream[Any, Throwable, Int]]
      feedQueue <- Queue.unbounded[Promise[Option[Throwable], Chunk[Int]]]

      //Actual stream of substreams
      stream = ZStream
                 .fromQueue(jobQueue)
                 .mapMParUnordered(50) { s =>
                   //Just to make the degradation show up faster
                   s.mapMPar(10)(Task(_))
                     .groupedWithin(1000, 1.millis)
                     .buffer(2)
                     .bufferSliding(1)
                     .runDrain
                 }
                 .runDrain

      //Feed to fulfill promises
      feed = ZStream
               .fromQueue(feedQueue)
               .mapMPar(10) { x =>
                 x.succeed(Chunk(1))
               }
               .runDrain

      fb1 <- stream.fork
      fb2 <- feed.fork

      _ <- Chunk.fromIterable(0.to(10)).mapM(_ => jobQueue.offer(newJob(feedQueue)))

      _ <- fb1.zip(fb2).join
    } yield ()).exitCode
  }
}

Results in pretty much constant CPU usage, and no Promise methods showing up on the profiler. Which JVM are you running this on?

iravid commented 3 years ago

I did note #4402, though I don't yet understand how that scenario (a very large number of joiners on a single promise) could happen in this snippet.
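
For reference, a minimal, hypothetical sketch of the kind of scenario #4402 describes (the ManyJoiners object and the counts are invented for illustration): many fibers awaiting a single promise each register one joiner, and interrupting them one at a time removes one joiner per interruption from that shared list.

import zio._

// Hypothetical illustration only, not taken from #4402 or the repro above:
// 100,000 fibers awaiting one promise register 100,000 joiners; each interrupt
// then has to remove its own joiner from the shared list.
object ManyJoiners extends App {
  def run(args: List[String]): URIO[ZEnv, ExitCode] =
    (for {
      p      <- Promise.make[Nothing, Unit]
      fibers <- ZIO.foreach((1 to 100000).toList)(_ => p.await.fork)
      _      <- ZIO.foreach(fibers)(_.interrupt)
    } yield ()).exitCode
}

Nothing in the minimized snippet obviously awaits one promise from that many fibers at once, which is why it is unclear how this path would be hit here.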

jsfwa commented 3 years ago

It's even more frustrating that you were not able to reproduce the problem, since we reproduce it in all our environments. But I have a lot more details now. In all cases we are using:

zio 1.0.3
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.7+10, mixed mode)

JAVA_OPTS: "
  -server
  -XX:MaxInlineLevel=18
  -XX:InitialRAMPercentage=40
  -XX:MaxRAMPercentage=75
  -XX:MaxMetaspaceSize=256M
  -XX:+UseG1GC
  -XX:+UseNUMA
  -XX:InitiatingHeapOccupancyPercent=30
  -XX:+ExitOnOutOfMemoryError
  -XX:-UseBiasedLocking
  -XX:+ScavengeBeforeFullGC
  -XX:MaxGCPauseMillis=200
  -XX:+ExplicitGCInvokesConcurrent
  -Djava.awt.headless=true
  -Dcom.sun.management.jmxremote.port=9999
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.local.only=false
  -Dcom.sun.management.jmxremote.rmi.port=9999
  -Dcom.sun.management.jmxremote=true
  -Djava.rmi.server.hostname=127.0.0.1
  "

First scenario

  1. When we use the default ZIO setup (tracing enabled), the number of WeakHashMap entries grows by ~2k per hour, e.g. from
    java.util.WeakHashMap$Entry[]   3,479 (1.5%)    532,456 B (2.8%)    n/a
    to
    java.util.WeakHashMap$Entry[]   5,677 (1.8%)    866,552 B (3.6%)    n/a
    • CPU usage also grows by 4-6% per hour.
    • The number of pending Promises also grows (slowly, a couple per hour), and some of them become GC roots (maybe this is important).
    • The number of Promise lambdas grows by about 10 per hour (the ones shown in the screenshot below), and that explains the number of WeakHashMap entries:
    [screenshot: the growing Promise lambda instances]

What really puzzles me is that whenever we restart the parent effect (the substreams), CPU drops back to normal and then starts increasing again, and the resources get cleaned up as well.

So in this case no single joiner collection is growing; rather, the count of distinct joiners is increasing and interruptJoiner starts executing too often (at least that's the only explanation I have right now).

Second scenario

  1. When we disable tracing, the app mentioned above becomes very stable, until we start playing with the load:
    override val platform: Platform = Platform.default.withTracing(Tracing.disabled)

    In this scenario, when we increase the load and then decrease it back to normal, CPU usage does not fully come back down until we restart the parent effect again. Still, without tracing it works much more stably. Here is what changing the load and then resetting looks like (a sketch of where this override goes follows this list):

    [screenshot: CPU while changing load and resetting the stream]
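
For reference, a minimal sketch of where that override lives in a ZIO 1.x application (the object name is invented; the override line is the one quoted above):

import zio._
import zio.internal.{Platform, Tracing}

// Minimal sketch, assuming ZIO 1.x: disable tracing for the whole runtime by
// overriding the platform of the App's bootstrap runtime.
object ProblemNoTracing extends App {
  override val platform: Platform = Platform.default.withTracing(Tracing.disabled)

  def run(args: List[String]): URIO[ZEnv, ExitCode] =
    ZIO.unit.exitCode // replace with the actual program

}

With tracing disabled, the WeakHashMap growth from the first scenario presumably goes away as well, which would match the improved stability.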

I will try to run the test with a different JDK today; I wonder if that helps.

iravid commented 3 years ago

Hey @jsfwa - just to clarify, are these graphs from the zio-kafka program or from the minimized snippet?

jsfwa commented 3 years ago

The VisualVM data is from the snippet, since I can only take a heap dump locally.

All the Grafana cases are from zio-kafka - a simple consume/drain without business logic.

iravid commented 3 years ago

@jsfwa Can you reach out on Discord? My username is iravid

aartigao commented 3 years ago

I suspect that we are suffering from the same issue here:

[monitoring screenshot]

Curiously, we use the same JVM for our instances: OpenJDK 11. @iravid, did you have a chance to talk with @jsfwa about it?

If it helps, I am snapshotting heap dumps for one of our instances, which will eventually degrade until it is killed by the system.

In our case we see not only the CPU increase but also a heap allocation increase over time, with the final degradation hitting old-generation GC cycles.

iravid commented 3 years ago

@aartigao The root cause is https://github.com/zio/zio/issues/4395

aartigao commented 3 years ago

@aartigao The root cause is #4395

Did you mean to link to another issue? 😄 Or are you recursively trolling me? 😝

iravid commented 3 years ago

Oh sorry I thought we were in zio-kafka :-) @jsfwa proposed a fix for this, but it's a bit of a workaround and does not solve the actual underlying problem. @jdegoes was looking into this.