The core of OpenTelemetry instrumentation is the OpenTelemetry API/SDK. The initial aim of this shard is to implement the OpenTelemetry specification for metrics, traces, and logs.
This issue appears to be resolved, but I am creating a ticket just so that there is a record of this.
Nutshell -- after the prior round of memory-leak fixes, it turned out that there were still leaks if one hit the app in the right way (e.g. a fairly normal HTTP/1.1 request that doesn't use Keep-Alive for subsequent requests).
So, there are a few memory-pressure-related fixes. One of them involves this line (https://github.com/wyhaines/opentelemetry-api.cr/blob/v0.3.2/src/opentelemetry-api/context.cr#L9), which is ultimately what was causing the largest memory leak.
If one was making a lot of requests that were each coming in on a completely different connection (versus Keep-Alive, where they were all handled in the same fiber), then each new request made a new fiber, and each new fiber made a new Context. But there was nothing to delete the old ones for the fibers that had done their work and then died. Eventually, with a few thousand or a few hundred thousand Context objects in memory, RAM usage starts getting ugly.
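The leak can be sketched like this (a hypothetical reconstruction with illustrative names, not the shard's actual code): a module-level Hash keyed by Fiber hands each fiber its own Context, but nothing ever removes an entry when its fiber dies, so every short-lived request fiber leaves a Context behind.

```crystal
# Illustrative sketch of the leaky pattern: a global registry keyed
# by Fiber that only ever grows.
class Context
  # One entry per fiber that ever touched the registry -- entries for
  # dead fibers are never removed.
  @@registry = {} of Fiber => Context

  def self.current : Context
    @@registry[Fiber.current] ||= Context.new
  end

  def self.registry_size
    @@registry.size
  end
end

# Simulate many non-Keep-Alive requests: each one runs in a fresh,
# short-lived fiber that touches the Context once and exits.
done = Channel(Nil).new
100.times do
  spawn do
    Context.current
    done.send(nil)
  end
end
100.times { done.receive }

# All 100 fibers are dead, but their Context objects are still
# reachable through the class-level Hash.
puts Context.registry_size
```

Because the class variable keeps every Fiber (and its Context) strongly reachable, the GC can never reclaim them, which matches the symptom described above.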
I didn't see it earlier because my tests were using Keep-Alive, and thus they weren't triggering it.
I switched this so that the Context gets created on the fiber itself, so it gets cleaned up when the fiber gets cleaned up.
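The shape of that fix might look something like the following sketch (the accessor name and helper are illustrative assumptions, not the shard's actual code): reopen Fiber so each fiber carries its own lazily created Context, with no global registry left to clean up.

```crystal
class Context
end

class Fiber
  # Reopen Fiber to hang the context off the fiber instance itself.
  # When a request fiber dies and is collected, its Context goes
  # with it. (Property name is illustrative.)
  property otel_context : Context?
end

# Hypothetical helper: fetch (or lazily create) the current fiber's
# Context.
def current_context : Context
  Fiber.current.otel_context ||= Context.new
end

ctx_a = current_context
ctx_b = current_context
puts ctx_a.same?(ctx_b) # same fiber => same Context instance

# A different fiber gets its own, separate Context.
chan = Channel(Context).new
spawn { chan.send(current_context) }
ctx_c = chan.receive
puts ctx_a.same?(ctx_c) # different fiber => different Context
```

The key design point is that the Context's lifetime is now tied to the fiber's lifetime, so the GC handles cleanup automatically.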
Also, I discovered that there were problems getting Span objects finalized, seemingly related to the fact that they were being managed and iterated as a tree; for whatever reason, this appeared to mean that sometimes they would stick around for a long time, or forever.
There was no actual need to store them as a tree, though, so I simplified it so that it just uses a Deque to store all spans which are sampled (i.e. will be sent to an exporter). This cleared that problem up. And then finally, I switched a few things that were classes but could be structs to be structs instead. There is a bit of risk to that change. It didn't break any specs, so I think it's OK, but it is possible that I just don't have strong enough spec coverage and it isn't OK. Fortunately, it's a 30-second thing to switch it back if that turns out to have been a mistake.
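A minimal sketch of that flat-storage idea, with assumed type shapes (the shard's actual Span and trace types will differ): sampled spans are appended to a Deque as they finish, rather than being linked into a parent/child tree that can pin them in memory, and Span is a value-type struct rather than a class.

```crystal
# Span as a struct (value type): copied on assignment, no heap
# allocation per span. Fields are illustrative.
struct Span
  getter name : String
  getter sampled : Bool

  def initialize(@name, @sampled = true)
  end
end

class Trace
  # Flat Deque of finished spans instead of a tree; only sampled
  # spans are retained for export.
  getter span_buffer = Deque(Span).new

  def record(span : Span)
    @span_buffer << span if span.sampled
  end
end

trace = Trace.new
trace.record Span.new("db.query")
trace.record Span.new("debug", sampled: false)
puts trace.span_buffer.size # => 1
```

With no inter-span references, each span's reachability is governed solely by the buffer, so nothing keeps a finished, unsampled span alive.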
End result is that heap size is looking a lot better, even when I make sure to never use a Keep-Alive request in my little test API.