zeek / broker

Zeek's Messaging Library
https://docs.zeek.org/projects/broker

Metrics scalability performance issue in 6.0 #366

Open awelzel opened 1 year ago

awelzel commented 1 year ago

This has been reported during Zeek 6.0 rc1/rc2 testing as manager memory growing steadily until crashing due to OOM.

Further, while the manager's memory usage is growing, it no longer serves requests to curl http://localhost:9911/metrics. The connection is accepted, but no response is produced.

The following supervisor setup reproduces the issue here. It creates two counter families with 200 counter instances for each family. With 8 workers and 3 other processes, this results in enough metrics to cause manager overload.
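
To put rough numbers on it with the defaults below: 2 families x 200 counters gives 400 counter instances per node, and with 11 Zeek endpoints (8 workers plus manager, proxy and logger) the manager ends up merging on the order of 4,400 instances every second.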

Increasing nworkers, ncounters_per_family or nfamilies should reproduce the issue if you have a more powerful system.

Deployments with 64 or 128 workers may run into this even if the number of instances per metric is small (10 or 20), since it's multiplied by the number of endpoints if I read the code right -- let alone setups with 500+ workers (#352).

On the Zeek side we could consider removing the metrics for log streams/writers and/or event invocations for 6.0, or reducing the synchronization interval, but this seems mostly like a performance bug.

Further down the road, it may also be more efficient to query workers on demand rather than having workers publish their metrics every second (which most of the time will simply be overwritten again), at the expense of a small delay.

$ cat my-supervisor.zeek
@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = F;

redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

global nworkers = 8;

event zeek_init()
        {
        # print Cluster::local_node_type();
        if ( ! Supervisor::is_supervisor() )
                return;

        Broker::listen("127.0.0.1", 9999/tcp);

        local cluster: table[string] of Supervisor::ClusterEndpoint;
        cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
        cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];
        cluster["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10002/tcp];

        local worker_port_offset = 10100;
        local i = 0;
        while ( i < nworkers )
                {
                ++i;
                local name = fmt("worker-%03d", i);
                cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
                }

        for ( n, ep in cluster )
                {
                local sn = Supervisor::NodeConfig($name=n);
                sn$cluster = cluster;
                sn$directory = n;
                sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "1");

                if ( ep?$interface )
                        sn$interface = ep$interface;

                print "starting",  sn$name;
                local res = Supervisor::create(sn);
                if ( res != "" )
                        print fmt("supervisor failed to create node '%s': %s", sn, res);
                }
        }

@if ( ! Supervisor::is_supervisor() )
@load ./telemetry.zeek
@endif

$ cat telemetry.zeek 
redef Broker::disable_ssl = T;
global update_interval: interval = 1sec;

global nfamilies = 2;
global ncounters_per_family = 200;

type Counters: record {
  f: Telemetry::CounterFamily;
  counters: vector of Telemetry::Counter;
};

global my_families: vector of Counters;
global counters: vector of Telemetry::Counter;

event update_telemetry() {
        schedule update_interval { update_telemetry() };

        for ( _, f in my_families ) {
                for ( _, c in f$counters ) {
                        Telemetry::counter_inc(c, rand(10));
                }
        }
}

event zeek_init() {
        local i = 0;
        while ( i < nfamilies ) {
                local f = Counters(
                        $f=Telemetry::register_counter_family([
                                $prefix="zeek",
                                $name="test",
                                $unit="stuff",
                                $help_text=fmt("stuff %d", i),
                                $labels=vector("label1", "label2"),
                        ]),
                        $counters=vector(),
                );
                my_families[i] = f;
                local j = 0;
                while ( j < ncounters_per_family ) {
                        local labels = vector(cat(i), cat(j));
                        f$counters += Telemetry::counter_with(f$f, labels);
                        ++j;
                }
                ++i;
        }

        schedule update_interval { update_telemetry() };

}

8-workers-supervisor

ckreibich commented 1 year ago

On the Zeek side we could consider removing the metrics for log streams/writers and/or event invocations for 6.0, or reducing the synchronization interval, but this seems mostly like a performance bug.

Agreed -- I'm relieved because we have so many options here. The find_if() looks expensive because it's implementing an identity comparison for a set of labels (yes?), so we should first see if we can optimize that. My next suggestion would be to dial down the synchronization interval (and make it configurable, if possible), and as a last resort to remove the new metrics. Thoughts on this welcome, of course.

I also like the idea of implementing proper request-driven scraping in 6.1 instead of constant push from all nodes to the manager.

ckreibich commented 1 year ago

@Neverlord would be great to hear your thoughts here -- could metric_scope gain a map to point directly from labels to a matching instance, so we can drop the vector scan? Or could instances itself become a map?

Neverlord commented 1 year ago

I'll look into it.

Neverlord commented 1 year ago

I also like the idea of implementing proper request-driven scraping in 6.1 instead of constant push from all nodes to the manager.

I'm not convinced that this is a good route for a distributed system with loose coupling like Broker. At the pub/sub layer, we don't know how many responses we should expect. There is no central registry of metric sources. Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation. A loosely coupled push model like we have now is much more robust.

If the once-per-second updates introduce significant traffic, I think we can instead optimize that update. I haven't looked at the actual metrics yet, but are all workers actually using all the metrics? Maybe we could omit metrics with zero values or use some sort of "compression" / better encoding.

But let's fix the obvious performance bugs first. πŸ™‚

Neverlord commented 1 year ago

Agreed -- I'm relieved because we have so many options here. The find_if() looks expensive because it's implementing an identity comparison for a set of labels (yes?), so we should first see if we can optimize that. My next suggestion would be to dial down the synchronization interval (and make it configurable, if possible), and as a last resort to remove the new metrics. ... Or could instances itself become a map?

Full ACK on the strategy. πŸ‘

PR #367 is getting rid of the find_if and treats instances like a map (via std::lower_bound lookups). Since the vector is just a list of pointers, std::lower_bound on a sorted vector should be even faster than an actual std::map (because the map has to do pointer chasing on a tree). Let's see if this improves performance enough.

The interval is already configurable via the broker.metrics.export.interval option (BROKER_METRICS_EXPORT_INTERVAL as an environment variable and Broker::metrics_export_interval from Zeek scripts), so we can fine-tune that as well if necessary.
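
For example, a deployment could lengthen it from a Zeek script (the 5sec here is just an arbitrary example value):

redef Broker::metrics_export_interval = 5sec;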

awelzel commented 1 year ago

I'm not convinced that this is a good route for a distributed system with loose coupling like Broker. At the pub/sub layer, we don't know how many responses we should expect.

On the Zeek level, the nodes we expect metrics from are fixed. All nodes should also have consistent metrics (types, help texts, labels). In fact, in much larger setups it might be best to forgo the centralization aspect of either the push-based or the pull-based approach altogether and use configuration management to set up Prometheus for the individual nodes accordingly.

Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

That seems quite okay and pragmatic. If a node fails to provide metrics (in 1 or 2 seconds), then it timed out and that's a signal, too.


I have prototyped the request-response/pull-based approach here: https://github.com/awelzel/zeek-js-metrics

This triggers metric collection as Zeek events over broker and collects the results (pre-rendered Prometheus lines) before replying to an HTTP request handled in JavaScript on the manager.
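
Roughly, the pattern looks like this in plain Zeek script -- the module, event names and helper are hypothetical and this is only a sketch, not the actual zeek-js-metrics code (which handles the HTTP side in JavaScript):

@load base/frameworks/cluster

module MetricsPull;

export {
        ## Hypothetical request event, broadcast by the manager for one scrape.
        global get_metrics_request: event(id: string);
        ## Hypothetical response event carrying one node's rendered metrics.
        global get_metrics_response: event(id: string, node: string, lines: string);
}

# Manager side: fan the request out to all workers for one scrape.
function request_metrics(id: string)
        {
        Broker::publish(Cluster::worker_topic, get_metrics_request, id);
        }

# Worker side: render the local metrics (elided here) and reply to the manager.
event get_metrics_request(id: string)
        {
        local rendered = "";  # e.g. assembled from Telemetry::collect_metrics()
        Broker::publish(Cluster::manager_topic, get_metrics_response, id, Cluster::node, rendered);
        }

# Manager side: accumulate responses for `id` until all expected nodes have
# answered or a timeout fires, then serve the combined text over HTTP.
event get_metrics_response(id: string, node: string, lines: string)
        {
        # Aggregation and timeout handling elided in this sketch.
        }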

With 24 workers and a large number of metrics, there is zero overhead and no extra cluster communication when no scraping happens, and significantly lower overhead on the manager when scraping happens at 1-second intervals (still high, but comparable to broker's default). In this artificial setup, broker's centralization causes the manager to use 30% CPU by default; with the pull approach, usage is only at 10%.

This doesn't necessarily mean we should require JS for this, but I think it's reasonable to use it to compare approaches and expectations.

Neverlord commented 1 year ago

Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

That seems quite okay and pragmatic. If a node fails to provide metrics (in 1 or 2 seconds), then it timed out and that's a signal, too.

I disagree with this. Very much.

By opening up the system via WebSocket and ultimately via ALM, we are no longer limited to the rigid structure of a Zeek cluster. To me, this is not a pragmatic solution. On the contrary, it would tie down Broker and directly contradict many design decisions and idioms of a loosely coupled, distributed system.

It also brings more issues with it. Tying up a scraper for up to 1-2s because a single node is lagging behind is unacceptable. It probably times out at that point, ultimately ending up with no metrics at all instead of at least having the n-1 metrics.

I very much appreciate your efforts in quantifying the problem. But please let's not commit to problematic workarounds that compromise the system architecture and come with their own bag of problems. Let's fix the actual problem here: poor performance in Broker. This is purely a performance bug. Currently, the metrics are heavily nested. Broker is really, really bad at efficiently handling this (unfortunately). #368 could be a big part of a solution here, to make this simply a non-issue. Pre-rendering the metrics instead of shipping "neat" broker::data is also an option.

The central collection of metrics was something I've put together a while ago after some internal discussion that this would be nice to have. Then it basically didn't get used until you tried it after adding hundreds of metrics to Zeek. My design was assuming maybe a couple dozen metrics per node. Let me fix that. πŸ™‚

We have already disabled central collection by default again, right? Is this still something we would consider urgent?

awelzel commented 1 year ago

We have already disabled central collection by default again, right?

Yes, it's disabled.

bbannier commented 1 year ago

I have been observing this discussion from the sidelines, just a few comments.

Even if there was one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation.

That seems quite okay and pragmatic. If a node fails to provide metrics (in 1 or 2 seconds), then it timed out and that's a signal, too.

I disagree with this. Very much.

By opening up the system via WebSocket and ultimately via ALM, we are no longer limited to the rigid structure of a Zeek cluster. To me, this is not a pragmatic solution. On the contrary, it would tie down Broker and directly contradict many design decisions and idioms of a loosely coupled, distributed system.

For me, reading metrics from the manager is a convenience feature for users who do not want to set up proper scraping of individual, decoupled components. Re: your comments on dynamic cluster layouts, we should have discoverability tooling for working with such clusters anyway, so there should (eventually?) be a way to set up such scraping; if anything is missing here, I'd focus on that.

The current implementation has some issues:

It also brings more issues with it. Tying up a scraper for up to 1-2s because a single node is lagging behind is unacceptable. It probably times out at that point, ultimately ending up with no metrics at all instead of at least having the n-1 metrics.

I'd argue that if this a concern for users they should scrape individual nodes.

Neverlord commented 1 year ago

It also brings more issues with it. Tying up a scraper for up to 1-2s because a single node is lagging behind is unacceptable. It probably times out at that point, ultimately ending up with no metrics at all instead of at least having the n-1 metrics.

I'd argue that if this a concern for users they should scrape individual nodes.

In general: yes and full ACK on all of your other observations.

The issue we are trying to tackle is that configuring a Zeek cluster with metrics should be as easy as possible, to make it simple to actually get the metrics. In a Zeek cluster, we have users that run > 60 workers and (much) more. It gets very fiddly to configure a port for each of the Zeek processes. And then you also have to configure your Prometheus accordingly, which usually runs on a different host. You quickly end up with hundreds of HTTP ports that you need to make accessible over the network and configure in at least two places (Zeek's cluster config plus Prometheus). The setup just becomes much more manageable if there is a single port to get a Zeek cluster's metrics from.

However, maybe we thought about this from the wrong perspective and are tackling the problem on the wrong layer. Instead of having the cluster collect the metrics and having the user pick ports, we could actually turn this around: when booting up the manager with metrics enabled, it could instruct all of the workers/proxies/loggers to turn on metrics too, with a port chosen by the OS. Then, the manager could collect all ports and make the list of available endpoints/ports available to Prometheus: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config.

The manager would only have to respond to a trivial HTTP GET request that tells Prometheus which ports to scrape; the response is basically just a JSON list. That way, Prometheus would scrape all of the individual processes with minimal setup by the user. If the actual Prometheus runs on a different node and the user doesn't want to open up that many ports to the network, I think federation could help with that.
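
For illustration (hosts and ports made up, not output from an actual implementation), the HTTP SD response is a JSON list of target groups in the same shape as Prometheus' file-based service discovery:

[
  { "targets": [ "127.0.0.1:10101", "127.0.0.1:10102" ], "labels": { "job": "zeek" } }
]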

With this setup, we would have to manage a few extra steps when spinning up the cluster and need to teach Broker (or Zeek) to open up an extra HTTP port where Prometheus can collect the scrape targets. I think that should be very minimal extra overhead and no extra work is performed if no Prometheus is actually scraping.

Thoughts?

Neverlord commented 11 months ago

To tackle this, we need to serve HTTP replies from Zeek. @timwoj It seems like prometheus-cpp isn't available in Zeek yet (for the 3rd-party HTTP server). Is this on its way? If not, do we just pull in civetweb-cpp ourselves or do we delay this issue?

timwoj commented 11 months ago

To tackle this, we need to serve HTTP replies from Zeek. @timwoj It seems like prometheus-cpp isn't available in Zeek yet (for the 3rd-party HTTP server). Is this on its way? If not, do we just pull in civetweb-cpp ourselves or do we delay this issue?

We're not planning for any telemetry work to land until at least 7.0, so if there's something else we can do in the meantime, go for it. I'm also still planning to bring in OpenTelemetry, since we'd like to tinker with the tracing functionality at some point too, and writing this once for prometheus-cpp and then having to rewrite it again for OpenTelemetry isn't worth it when I could just do it for otel first.