wyhaines / opentelemetry-api.cr

The core of open telemetry instrumentation is the OpenTelemetry API/SDK. The initial aim of this shard is to implement the OpenTelemetry specification for metrics, traces, and logs.
Apache License 2.0
12 stars 1 forks source link

More memory leaks... #17

Closed wyhaines closed 2 years ago

wyhaines commented 2 years ago

This issue appears to be resolved, but I am creating a ticket just so that there is a record of this.

Nutshell -- after the prior round of memory leak fixes, it turned out that there were still leaks if one hit the app in the right way (i.e. a fairly normal HTTP 1.1 request that doesn't use Keep-Alive for subsequent requests).

image


https://github.com/wyhaines/opentelemetry-api.cr/blob/v0.3.2/src/opentelemetry-api/context.cr#L9

So, there are a few memory-pressure-related fixes. One of them involves this line, which is ultimately what was causing the largest memory leak.

If one was making a lot of requests that were each coming in a completely different request (versus a Keep-alive where it was all in the same fiber), then each new request made a new fiber, and each new fiber made a new Context. But there was nothing to delete the old ones, for the fibers that had done their work and then died. Eventually, with a few thousand or a few hundred thousand Context objects in memory, RAM usage starts getting ugly.

I didn't see it earlier because my tests were using Keepalive and thus it wasn't triggering it.

I switched this so that the Context gets created on the fiber itself, so it gets cleaned up when the fiber gets cleaned up.

Also, I discovered that there were problems getting Span objects finalized, seemingly related to the fact that they were basically being managed and iterated as a tree, and for whatever reason, this appeared to mean that sometimes they would stick around a long time, or forever.

There was no actual need to store them as a tree, though, so I simplified it so that it just uses a Deque to store all spans with are Sampled (will be sent to an exporter). This cleared that problem up. And then finally I just switched a few things that were classes but could be structs to be a struct instead. There is a bit of risk to that change. It didn't break any specs, so I think it's OK, but it is possible that I just don't have strong enough spec coverage and it isn't OK. Fortunately, it's a 30 second thing to switch it back if it does turn out to have been a mistake.

End result is that heap size is looking a lot better, even when I am making sure to never use a keepalive request in my little test api: (edited)

image

wyhaines commented 2 years ago

I am letting a test run overnight to ensure no surprises, but the fix for this should be released in the morning.

wyhaines commented 2 years ago

image

This looks better. Prepping the release.

wyhaines commented 2 years ago

These fixes are released. Closing.