Telemetry seems to have emerged as the de facto tool for instrumentation. Has any thought been given to adding this? Would there be any interest in a PR that adds this?
What does "adding this" mean? What benefits do you see, and how do you see it being integrated?
A good example is Phoenix. It's basically a pub/sub system, similar to Cachex's hooks but with very little overhead. From Phoenix's viewpoint, all they are doing is publishing the event; the subscription side of things doesn't matter to them. It becomes the interface for passing instrumentation or event data from one system to another.
The benefit of telemetry is that it's already a familiar instrumentation tool for a lot of folks. I already have telemetry integrated into my app, and I have metrics being pushed somewhere from my subscriptions (:telemetry.attach/4). If Cachex supported telemetry, all I'd need to do is see what the event name and data look like and plug those into my metrics tool (Prometheus).
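To give a rough idea of what my subscription side looks like (the [:cachex, :action] event name here is just a guess at what Cachex might emit, and the handler id and reporting step are placeholders):

# Hypothetical subscriber: the event name and the reporting step are
# assumptions for illustration, not an existing Cachex integration.
defmodule MyApp.CacheMetrics do
  def attach do
    :telemetry.attach(
      "cachex-action-metrics",     # unique handler id
      [:cachex, :action],          # event name (assumed)
      &__MODULE__.handle_event/4,
      nil                          # handler config
    )
  end

  # Runs synchronously in whichever process executed the event.
  def handle_event([:cachex, :action], measurements, metadata, _config) do
    # this is where I'd push to Prometheus
    IO.inspect({measurements, metadata}, label: "cachex event")
  end
end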
As far as integration, most likely it's the addition of :telemetry.execute/3 calls near the same code that currently handles the hook notifications.
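For example, something along these lines (the event name and payload are purely illustrative, not an existing Cachex event):

# Purely illustrative call site; Cachex defines no such event today.
duration = 42  # e.g. microseconds measured around the cache call

:telemetry.execute(
  [:cachex, :get],                      # event name
  %{duration: duration},                # measurements (a map of numbers)
  %{cache: :my_cache, key: "user:1"}    # metadata for handlers to match on
)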
To rephrase a little, I guess what I mean is: what's special about this that means Cachex needs to include something specific for it? Why does it need to be integrated as part of the core library?
I'm not a fan of baking in things that people might not care about or want, so even if it shipped with Cachex it would definitely be in the form of a hook. As that is the case (it would take a very strong argument to change that perspective), it's debatable whether it should be in Cachex at all or just a hook distributed as a separate package that people can include at cache startup. Does this make sense/do you agree with this line of thought?
Gotcha. My biggest concern is the performance limitation of GenServer (which hooks use, right?). We've hit up against this in production. Maybe you've implemented hooks in such a way that this isn't an issue?
The overhead of telemetry is low (it's basically synchronous, iterating through a list of subscribers and executing callbacks).
Have you seen any performance issues with GenServer? Have you considered switching to something that supports back-pressure like GenStage, if integrating Telemetry sounds unappealing?
So a couple of things on that then...
My biggest concern is the performance limitation of GenServer (which hooks use, right?).
Yes, hooks use processes, so the impact of each hook is pretty much the overhead of sending a single Erlang message. Some hooks listen for a response, but your case doesn't require it so the cost is basically a single message being sent.
Conceptually, this message carries a very minimal cost, considering most production applications have thousands of these calls happening at any given time. If you don't have thousands running at once, then you likely don't have the throughput needs that would justify optimizing any single one of them.
A more general way to summarize the above is "If a hook calling Telemetry is slow, it's because of Telemetry".
The overhead of telemetry is low (it's basically synchronous, iterating through a list of subscribers and executing callbacks).
This is the first I've heard of this library, but I'm looking at the source, and it seems that making good use of it would require at least one event/handler registered per cache action. Is this correct?
Assuming this is the case, there's potential for a large slowdown in cache throughput, as the handler list that :telemetry has to walk over would cover more than ~20 actions (I don't know the exact count off the top of my head). This seems like a fairly large iteration to put into Cachex itself, which generally has quite minimal looping.
Although this removes the cost of a message call, we'd be introducing a potentially slow iteration into the main library. It's also important to note that you can opt out of including a hook with zero overhead, but you cannot opt out of something inside Cachex itself without overhead (even if it's just the branch to check whether it's enabled).
Have you seen any performance issues with GenServer?
In general, no, not when used correctly. It's almost guaranteed that there are other places worth optimizing in your application before you start looking at the cost of the actual communication with a GenServer. Of course I've had issues due to the code running inside a GenServer, but never related to its pure existence or cost of passing messages itself.
Have you considered switching to something that supports back-pressure like GenStage?
There's no real point in using something like GenStage due to the nature of what a hook is. You're listening to the app to trigger something else, or to record things like metrics - these are not huge tasks. If you want something that consumes more cycles, you'd delegate to another process elsewhere, or a pool of processes. Very little work should ever happen in the hook itself. It's also worth remembering that tools like GenStage do not offer cost-free back-pressure, so there is again a cost for those who do not need or want it.
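To illustrate the delegation point roughly (MyApp.TaskSupervisor and report_somewhere/1 are placeholders, not part of Cachex or Telemetry):

# Sketch: whatever listens to the event only fires off a task; the heavy
# work happens in another process. MyApp.TaskSupervisor is assumed to be
# a Task.Supervisor started under your own supervision tree.
defmodule MyApp.Cachex.ReportingExample do
  def on_cache_event(event) do
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
      report_somewhere(event)
    end)
  end

  defp report_somewhere(event) do
    # imagine an HTTP call or heavy aggregation happening here
    IO.inspect(event, label: "reported")
  end
end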
If integrating Telemetry sounds unappealing?
It's not really that it's unappealing in and of itself, don't get me wrong. It's more that it can be done with a hook already, and I don't really see a reason to move it into Cachex itself. Cachex tries hard to be as efficient as possible unless you go out of your way to turn something additional on. At best, if this integration were in the library itself, there would need to be a new flag added and an additional condition check on every cache request.
I don't have time to do it myself currently, but it feels like it would be good to create a hook that integrates Cachex and Telemetry. I don't know too much about Telemetry, but it seems like it should be a fairly small amount of code to do so. From there, we can revisit if anything needs to change on behalf of Cachex. If not, we're done!
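Presumably it would take a shape something like this (a rough sketch only, assuming the hook's handle_notify/3 callback, with MyApp.TelemetryHook and the [:cachex, :action] event name being entirely arbitrary):

# Rough sketch: a hypothetical hook module that forwards every cache
# notification to :telemetry under a single, arbitrary event name.
defmodule MyApp.TelemetryHook do
  use Cachex.Hook

  # Called once per cache action this hook is subscribed to.
  def handle_notify(action, result, state) do
    :telemetry.execute(
      [:cachex, :action],
      %{count: 1},
      %{action: action, result: result}
    )

    {:ok, state}
  end
end

Registering it would then just be the usual hooks option at cache startup.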
Yes, hooks use processes, so the impact of each hook is pretty much the overhead of sending a single Erlang message. Some hooks listen for a response, but your case doesn't require it so the cost is basically a single message being sent.
While this is generally true, message passing isn't free. The message has to be copied onto the receiving process's heap (I think?), and the bigger the message, the more resources that copy takes. We had a situation where our messages grew to be rather large, and in production we were seeing GenServer timeouts (the default 5s) with as little as 20 requests per second to the cowboy server.
Conceptually, this message carries a very minimal cost, considering most production applications have thousands of these calls happening at any given time. If you don't have thousands running at once, then you likely don't have the throughput needs that would justify optimizing any single one of them.
Thousands of these calls at any given time is exactly our use case ;)
This is the first I've heard of this library, but I'm looking at the source, and it seems that making good use of it would require at least one event/handler registered per cache action. Is this correct?
Probably not an event/handler per cache action. Actually, most certainly not. You'd get more bang for the buck by doing something like this (leveraging the measurement and metadata arguments):
:telemetry.execute([:cachex, :action], %{count: 1}, %{action: :purge, result: {:ok, 1}})
That's only one event ([:cachex, :action]).
I don't have time to do it myself currently, but it feels like it would be good to create a hook that integrates Cachex and Telemetry. I don't know too much about Telemetry, but it seems like it should be a fairly small amount of code to do so. From there, we can revisit if anything needs to change on behalf of Cachex. If not, we're done!
Your hooks API is fantastic. Easy to use and reason about, but it's using GenServer which I see as a significant downside. But let me put my money where my big mouth is. I'll be load-testing this app at scale (10k/sec) soon, and if I notice any performance difference I'll report back.
Thanks for replying today, I appreciate your work on this library. Really well done.
While this is generally true, message passing isn't free. The message has to be copied onto the receiving process's heap (I think?), and the bigger the message, the more resources that copy takes. We had a situation where our messages grew to be rather large, and in production we were seeing GenServer timeouts (the default 5s) with as little as 20 requests per second to the cowboy server.
Sure, but these messages are typically quite small (although it's variable based on your keys, so I can't define any exact amount).
Thousands of these calls at any given time is exactly our use case ;)
I do have a hook in production with tens of thousands of calls running through it at peak, with no noticeable issues. It's fairly small in scope, though; it basically monitors stats in memory and occasionally reports them to an internal service. Essentially it's a bunch of counters.
Probably not an event/handler per cache action.
Got it, that API does make a lot more sense to me now :)
Even though that is the case, it would have to iterate all handlers registered in the global application, right? So things like Phoenix would be in the same list iteration as the cache event? I'm just a little tentative here, because I can't reliably give a performance metric if this is the case (and we integrate it).
Your hooks API is fantastic. Easy to use and reason about
Thanks, means a lot :)
But it's using GenServer which I see as a significant downside.
I'm curious what you would use instead, or do you mean that it's just due to the nature of the feature? I'm all for trying something else if you have any ideas, since the internals are hidden from the public API. If we do things without a separate process though, it would have to land in a major release, in case people are relying on things like ordering.
I'll be load testing this app at scale (10k / sec) soon and if I can notice any performance difference I'll report back.
Sweet! Regardless of this issue, the performance overhead of a single hook is a good metric to have anyway.
If you don't end up writing a hook, I'll probably write one up when I get a chance, in case it's useful for anyone else. We do have other hooks in Cachex that ship by default, so I'm not particularly opposed to including it in the library either, should it prove popular (depending on the associated maintenance costs). There's nothing stopping it from being an external package initially before it's absorbed, either. It's mainly the ability to freely opt out that I'm most concerned about, so as long as we have that, we're good!
Feel free to re-open this if you think we need any work done. Even if not, feel free to comment back if you have any other questions/concerns/statistics - always willing to lend an ear!
Dropping a quick note to report my load-testing results.
Long story short: there was no statistically significant difference between using a hook and not using one.
Thanks again for this great library, @whitfin
One quick question @robmadole!
I forgot to ask before, but if you have actually written a hook for integrating with Telemetry, would you consider publishing it to Hex? You can then drop it here for reference, so people asking the same thing might find it.
No pressure of course, since it comes with maintenance costs :p
This is where we landed. The example might be somewhat useful but it's not really something publishable to Hex. So no use of telemetry; we just go straight to Prometheus.
https://gist.github.com/robmadole/36172031a2504402a69323cae1f6908a