yggdrasil-network / yggdrasil-go

An experiment in scalable routing as an encrypted IPv6 overlay network
https://yggdrasil-network.github.io

High CPU utilization when multicast peer not in AllowedPublicKeys #1141

Open jgoerzen opened 5 months ago

jgoerzen commented 5 months ago

Hello,

I had a situation in which one laptop on my network was showing constant high CPU utilization from Yggdrasil even when Yggdrasil was effectively idle (confirmed with tcpdump/iftop on both the tun interface and the host's interface). The CPU utilization was about 50%.

This particular laptop has MulticastInterfaces defined with a password. It also has some entries in AllowedPublicKeys. My understanding was that AllowedPublicKeys was not consulted for connections established via multicast.
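For reference, the relevant parts of a configuration like the one described might look something like this (key names as in recent Yggdrasil releases; the password and public key are placeholders):

```hjson
{
  MulticastInterfaces: [
    {
      Regex: ".*"
      Beacon: true
      Listen: true
      Port: 0
      Password: "example-multicast-password"
    }
  ]
  # Public keys of remote peers allowed to connect inbound.
  # The question in this issue is whether this list should also
  # gate peers discovered via multicast.
  AllowedPublicKeys: [
    "0000000000000000000000000000000000000000000000000000000000000000"
  ]
}
```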

Finally, upon running strace on yggdrasil, I saw it was repeatedly accepting connections from two multicast peers on the LAN. Those two peers knew the multicast password, but were not listed in AllowedPublicKeys. yggdrasilctl getpeers showed the IPv6 link-local (fe80::) address in the URI column, but the IP address column was blank.

After adding them both to AllowedPublicKeys on the laptop, the CPU utilization issue went away and they were then listed with Yggdrasil IPs in getpeers.

So I think there are two bugs here:

  1. There is no backoff when a client repeatedly connects to a multicast peer and has the connection dropped because its key is not in AllowedPublicKeys
  2. The documentation isn't clear about how AllowedPublicKeys relates to MulticastInterfaces (or perhaps the implementation doesn't follow the documentation)

Thanks again for Yggdrasil!

neilalexander commented 5 months ago

Thanks for the report, I'll take a look into this.

waseigo commented 1 month ago

I had the same issue. Two nodes (VMs) A and B were in the same subnet, both with the default multicast settings in yggdrasil.conf, but only one of them (B) had a single entry in AllowedPublicKeys, for a third Yggdrasil node (C) at a different site.

/var/log/syslog on node A grew to fill the disk, as it was logging the connection attempts to B. CPU usage on both A and B went to 100%, as did network traffic (judging by the stats that virt-manager shows).

So, I concur with @jgoerzen regarding the backoff.

neilalexander commented 1 month ago

Please check with the latest develop commits if possible!