Open jsfwa opened 3 years ago
Nice work investigating this @jsfwa. Looking into this now.
Having a hard time reproducing this locally with ZIO 1.0.3. Running the following entrypoint:
```scala
import zio._
import zio.duration._
import zio.stream._

object Problem extends App {
  def run(args: List[String]): zio.URIO[zio.ZEnv, ExitCode] = {
    def newJob(queue: Queue[Promise[Option[Throwable], Chunk[Int]]]) =
      ZStream.repeatEffectChunkOption {
        for {
          p <- Promise.make[Option[Throwable], Chunk[Int]]
          _ <- queue.offer(p)
          r <- p.await
        } yield r
      }

    (for {
      jobQueue  <- Queue.unbounded[ZStream[Any, Throwable, Int]]
      feedQueue <- Queue.unbounded[Promise[Option[Throwable], Chunk[Int]]]
      // Actual stream of substreams
      stream = ZStream
        .fromQueue(jobQueue)
        .mapMParUnordered(50) { s =>
          // Just to degrade performance faster
          s.mapMPar(10)(Task(_))
            .groupedWithin(1000, 1.millis)
            .buffer(2)
            .bufferSliding(1)
            .runDrain
        }
        .runDrain
      // Feed to fulfill the promises
      feed = ZStream
        .fromQueue(feedQueue)
        .mapMPar(10) { x =>
          x.succeed(Chunk(1))
        }
        .runDrain
      fb1 <- stream.fork
      fb2 <- feed.fork
      _   <- Chunk.fromIterable(0.to(10)).mapM(_ => jobQueue.offer(newJob(feedQueue)))
      _   <- fb1.zip(fb2).join
    } yield ()).exitCode
  }
}
```
Results in pretty much constant CPU usage, and no Promise methods showing up on the profiler. Which JVM are you running this on?
I did note #4402, though I don't yet understand how that scenario (a very large number of joiners on a single promise) could happen in this snippet.
It's even more frustrating that you were not able to reproduce the problem, since we can reproduce it in all our environments. But I have a lot more details now. In all cases we are using:
zio 1.0.3
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.7+10, mixed mode)
JAVA_OPTS: "
-server
-XX:MaxInlineLevel=18
-XX:InitialRAMPercentage=40
-XX:MaxRAMPercentage=75
-XX:MaxMetaspaceSize=256M
-XX:+UseG1GC
-XX:+UseNUMA
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ExitOnOutOfMemoryError
-XX:-UseBiasedLocking
-XX:+ScavengeBeforeFullGC
-XX:MaxGCPauseMillis=200
-XX:+ExplicitGCInvokesConcurrent
-Djava.awt.headless=true
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.rmi.port=9999
-Dcom.sun.management.jmxremote=true
-Djava.rmi.server.hostname=127.0.0.1
"
First scenario
java.util.WeakHashMap$Entry[] 3,479 (1.5%) 532,456 B (2.8%) n/a
to
java.util.WeakHashMap$Entry[] 5,677 (1.8%) 866,552 B (3.6%) n/a
What really puzzles me is that whenever we restart the parent effect (the substreams), CPU drops back to normal and then starts increasing again; resources get cleaned up too.
So in this case it's not that a single joiner collection is growing; rather, the number of distinct joiners keeps increasing, and interruptJoiner starts executing too often (at least that's the only explanation I have right now).
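To make the suspected cost concrete, here is a sketch of the cost model (illustrative only, not ZIO's actual `Promise` source): if pending waiters live in an immutable `List` and each interruption removes one waiter by rebuilding that list, every `interruptJoiner`-style removal is O(n), so frequent interruptions against a long list add up to quadratic work:

```scala
// Sketch of the suspected cost model, NOT ZIO's real Promise internals.
// Joiners are modeled as plain Ints; the real code stores callbacks.
object InterruptJoinerCost {
  def main(args: Array[String]): Unit = {
    var joiners: List[Int] = Nil

    // Each `await` prepends a joiner: O(1) per registration.
    (1 to 5000).foreach(id => joiners = id :: joiners)

    // Each interruption rebuilds the whole list: O(n) per call, so
    // interrupting all 5000 joiners one by one costs O(n^2) traversals.
    var traversed = 0L
    (1 to 5000).foreach { id =>
      traversed += joiners.length        // cost of one filter pass
      joiners = joiners.filter(_ != id)  // remove a single joiner
    }

    println(s"elements traversed: $traversed") // 5000+4999+...+1 = 12502500
    assert(joiners.isEmpty)
  }
}
```

This matches the profile shape described: the per-interruption cost grows with the number of live joiners, so CPU climbs until the parent effect (and its joiner lists) is recreated.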
Second scenario
```scala
override val platform: Platform = Platform.default.withTracing(Tracing.disabled)
```
In this scenario, when we increase the load and then decrease it back to normal, the CPU does not fully go back down until we restart the parent effect again. But without tracing it works much more stably. Here is what changing the load and resetting looks like:
I will try to run test with different jdk today, wonder if it helps
Hey @jsfwa - just to clarify, these graphs are from the zio-kafka program or the minimized snippet?
The VisualVM data is from the snippet, since I can only take a heap dump locally.
All Grafana cases are from zio-kafka: a simple consume/drain without business logic.
@jsfwa Can you reach out on Discord? My username is iravid
I suspect that we are suffering the same issue here:
Curious that we use the same JVM for our instances: OpenJDK 11. @iravid did you have a chance to talk with @jsfwa about it?
In case it helps, I am snapshotting heap dumps from one of our instances, which eventually degrades until it is killed by the system.
In our case what we see is not only the CPU increase, but also a heap allocation increase over time, with the final degradation hitting Old GC cycles.
@aartigao The root cause is https://github.com/zio/zio/issues/4395
> @aartigao The root cause is #4395
Did you mean to link to another issue? 😄 Or recursive-trolling me? 😝
Oh sorry I thought we were in zio-kafka :-) @jsfwa proposed a fix for this, but it's a bit of a workaround and does not solve the actual underlying problem. @jdegoes was looking into this.
Hello, this issue is the result of investigating a related problem in https://github.com/zio/zio-kafka/issues/238
For some reason, during continuous recreation of Promises, the inner state of each new Promise (after the old one has already completed) slowly grows and starts to consume too much CPU time, which significantly degrades the performance of the entire application.
The actual problem is in the following function and its List of joiners: it starts at 0.1-2% of CPU time and grows until you reset the parent effect or the app stops responding (usually up to 30%+ of CPU):
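As a mental model of where that list lives, here is a simplified sketch (hypothetical names, not ZIO's actual internals, assuming a ZIO-1.x-style design): a promise is either pending, holding a `List` of registered joiners, or done; `await` prepends to the list and completion drains it:

```scala
// Simplified model of a promise whose pending state holds a List of joiners.
// All names here are illustrative; this is not ZIO's actual code.
final class ModelPromise {
  private sealed trait State
  private final case class Pending(joiners: List[Int => Unit]) extends State
  private final case class Done(value: Int) extends State

  private var state: State = Pending(Nil)

  // `await` registers a callback (a "joiner") while pending; the list
  // grows by one entry per concurrent awaiter.
  def await(cb: Int => Unit): Unit = state match {
    case Pending(js) => state = Pending(cb :: js)
    case Done(v)     => cb(v)
  }

  // Completing the promise drains and invokes every registered joiner.
  def succeed(v: Int): Unit = state match {
    case Pending(js) =>
      state = Done(v)
      js.foreach(_(v))
    case Done(_) => ()
  }
}

object ModelPromiseDemo {
  def main(args: Array[String]): Unit = {
    val p = new ModelPromise
    var sum = 0
    (1 to 3).foreach(_ => p.await(v => sum += v)) // three joiners queue up
    p.succeed(7)                                  // each joiner receives 7
    println(sum) // 21
  }
}
```

In this model the cost of managing joiners is proportional to the list length, which is why a workload that creates and interrupts many awaiters shows up as growing time in the joiner list.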
How it looks in a real application
Reproducible scenario (takes at least 1-3 hours to see the effect), but the growing time spent in the inner joiners list is visible in the CPU sampler from the beginning:
TL;DR: We have a stream of substreams that generates and fulfills promises one by one. Over time it starts to work very slowly and uses as much CPU as it can; after we recreate the stream, performance returns to normal and then starts degrading again from the beginning.