zeek / broker

Zeek's Messaging Library
https://docs.zeek.org/projects/broker

Broker message provenance tracking #404

Open MP-Corelight opened 4 months ago

MP-Corelight commented 4 months ago

As a user of a high-volume Zeek instance that sometimes gets overloaded, I would like to know where messages are coming from, where they're being sent, and how many are being sent, so that I can isolate the source of load issues and address them.

timwoj commented 4 months ago

I moved this over into the broker repo, since it makes more sense to track a feature like this over there.

MP-Corelight commented 3 months ago

The team discussed this issue yesterday and agrees that the feature would be valuable.

We should check the work @awelzel has previously done on metadata in Broker events.

Some other questions the team had:

  1. How granular are we trying to be in identifying the source of the messages? Namespace? Module? Line of code?
  2. Do we need to deal with all Broker message types or just events?
  3. Can we track the events as they're processed, or is the goal to troubleshoot events that are never processed?

@pauldokas, can you answer the above?

As @timwoj alluded to above, there are a couple of angles we can look at this from: putting more metrics into Broker itself (which can probably be read even if a Zeek process is overloaded) or into Zeek (where we might be able to add more detailed and useful information). Once we have the answers to the above questions, @ckreibich and @Neverlord can sit down and dig into the potential solutions a bit more.

MP-Corelight commented 1 month ago

@vpax mentioned today that he's specifically looking to estimate the message load incurred by new packages under development. The problem with doing this at the Zeek level is that some of the messaging is implicit, i.e. synchronized store use causes Broker traffic that would be missed if we only instrumented the BIF. We might be able to break this into a simple "raw count" metric for today and enhance it later; we don't even necessarily need to tie it to the specific script layer right now.
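To make the "raw count" idea concrete, here is a rough sketch of what such a tally could look like. It is plain Python with no Broker or Zeek bindings; `MessageTally`, the `kind` labels, and the node and topic names are all hypothetical and only illustrate counting per sender, topic, and message kind, so that implicit store traffic is counted alongside explicit events.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MessageTally:
    """Hypothetical raw-count tally; not part of Broker's or Zeek's API."""

    counts: Counter = field(default_factory=Counter)

    def record(self, sender: str, topic: str, kind: str = "event") -> None:
        # `kind` separates explicit events from implicit traffic such as
        # data store synchronization (label values are assumptions).
        self.counts[(sender, topic, kind)] += 1

    def by_sender(self) -> Counter:
        # Collapse topic and kind to get one total per sender.
        totals = Counter()
        for (sender, _topic, _kind), n in self.counts.items():
            totals[sender] += n
        return totals

    def top(self, n: int = 3):
        # Busiest (sender, topic, kind) combinations first.
        return self.counts.most_common(n)


if __name__ == "__main__":
    tally = MessageTally()
    # Node and topic names below are made up for illustration.
    tally.record("worker-1", "example/topic/events")
    tally.record("worker-1", "example/topic/events")
    tally.record("worker-2", "example/topic/store", kind="store")
    for key, n in tally.top():
        print(key, n)
    print(dict(tally.by_sender()))
```

Counting at the point where messages are actually sent, rather than at the script-level call, is what would let a tally like this pick up the implicit store traffic; tying counts back to a script, module, or line could be layered on later.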

@ckreibich will add some more info about telemetry here to further the discussion.