zerotier / ZeroTierOne

A Smart Ethernet Switch for Earth
https://zerotier.com

packet loss and low throughput from repeating bad_cast error #2119

Open rakundig opened 1 year ago

rakundig commented 1 year ago

While this occurs I have run pings from various devices on that network, and the pings time out while this 'reset' of the interfaces is happening as I tail syslog.

```
$ zerotier-cli -v
1.12.1
```

laduke commented 1 year ago

Thanks for reporting. We've been testing with iperf but so far haven't been able to reproduce.

rakundig commented 1 year ago

Problem remains. I have stopped and disabled avahi (it isn't really needed anyhow) in case it was contributing to the issue, but that does not seem to have made a difference.

Below is some more info:

When this happens it is evident in a number of ways.

1) Things slow down. ZT continues to work, but doing anything over that network is quite slow. Example: loading a webpage on a remote server takes forever. I don't know how long, because I gave up. It must be transferring some data, because it doesn't time out, but it doesn't complete either.

2) When checking status via the CLI we get:

```
$ zerotier-cli status
zerotier-cli: missing port and zerotier-one.port not found in /var/lib/zerotier-one
```

Restarting ZT fixes it temporarily: restarting zerotier-one.service resolves that issue and everything is working and snappy again.

When I got that error I cat'd the file /var/lib/zerotier-one/zerotier-one.port and it was empty.

After restart it has 9993 again.
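Since the CLI only fails because it can't read the port from that file, a possible stopgap (a sketch under my own assumptions: the path and the 9993 default are from this report, the helper name is mine, and `zerotier-cli`'s `-p<port>` switch is how I understand the CLI to work) is to resolve the port yourself and pass it explicitly:

```shell
#!/bin/sh
# Sketch (not official guidance): resolve the ZeroTier control port,
# falling back to 9993 when zerotier-one.port is empty, as seen here.
resolve_port() {
    port="$(cat "$1" 2>/dev/null)"
    if [ -n "$port" ]; then echo "$port"; else echo 9993; fi
}

# Demonstrate the fallback on a deliberately empty file:
tmp="$(mktemp)"
PORT="$(resolve_port "$tmp")"
rm -f "$tmp"
echo "resolved port: $PORT"
# On the real box you could then try, e.g.:
#   zerotier-cli -p"$(resolve_port /var/lib/zerotier-one/zerotier-one.port)" status
```

This doesn't fix the crash, of course; it only keeps the CLI usable while the daemon is in the bad state.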

Logs:

_Note: I noticed that the 14:15 crash has a 'bad_cast' and the restart of ZT at 14:20 does not, but the later, more frequent ones have it again._

Syslog entries:

Sep 21 14:15:25  zerotier-one[1682037]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 14:15:25  zerotier-one[1682037]:   what():  std::bad_cast
Sep 21 14:15:25  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 14:15:25  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 14:15:25  systemd[1]: zerotier-one.service: Consumed 31min 30.129s CPU time.
Sep 21 14:15:25  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 1.
Sep 21 14:15:25  systemd[1]: Stopped ZeroTier One.
Sep 21 14:15:25  systemd[1]: zerotier-one.service: Consumed 31min 30.129s CPU time.
Sep 21 14:15:25  systemd[1]: Started ZeroTier One.
Sep 21 14:15:25  zerotier-one[1876000]: Starting Control Plane...
Sep 21 14:15:25  zerotier-one[1876000]: Starting V6 Control Plane...
Sep 21 14:15:25  systemd-udevd[1875994]: Using default interface naming scheme 'v247'.
Sep 21 14:15:25  systemd-udevd[1875994]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
...
Sep 21 14:20:03 systemd[1]: Stopping ZeroTier One...
Sep 21 14:20:03 zerotier-one[1876000]: Control Plane Stopped
Sep 21 14:20:03 zerotier-one[1876000]: V6 Control Plane Stopped
Sep 21 14:20:04 systemd[1]: zerotier-one.service: Succeeded.
Sep 21 14:20:04 systemd[1]: Stopped ZeroTier One.
Sep 21 14:20:04 systemd[1]: zerotier-one.service: Consumed 32.556s CPU time.
Sep 21 14:20:04 systemd[1]: Started ZeroTier One.
Sep 21 14:20:04 zerotier-one[1879262]: Starting V6 Control Plane...
Sep 21 14:20:04  zerotier-one[1879262]: Starting Control Plane...
Sep 21 14:20:04 systemd-udevd[1879257]: Using default interface naming scheme 'v247'.
Sep 21 14:20:04 systemd-udevd[1879257]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 14:20:04 systemd-udevd[1879256]: Using default interface naming scheme 'v247'.
Sep 21 14:20:04 systemd-udevd[1879256]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 14:20:04 systemd-udevd[1879307]: Using default interface naming scheme 'v247'.
Sep 21 14:20:04 systemd-udevd[1879307]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
...
Sep 21 15:04:04  zerotier-one[1879262]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:04:04  zerotier-one[1879262]:   what():  std::bad_cast
Sep 21 15:04:04  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:04:04  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:04:04  systemd[1]: zerotier-one.service: Consumed 4min 58.453s CPU time.
Sep 21 15:04:04  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 1.
Sep 21 15:04:04  systemd[1]: Stopped ZeroTier One.
Sep 21 15:04:04  systemd[1]: zerotier-one.service: Consumed 4min 58.453s CPU time.
Sep 21 15:04:04  systemd[1]: Started ZeroTier One.
Sep 21 15:04:04  zerotier-one[1908786]: Starting Control Plane...
Sep 21 15:04:04  zerotier-one[1908786]: Starting V6 Control Plane...
Sep 21 15:04:04  systemd-udevd[1908780]: Using default interface naming scheme 'v247'.
Sep 21 15:04:04  systemd-udevd[1908780]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:04:04  systemd-udevd[1908781]: Using default interface naming scheme 'v247'.
Sep 21 15:04:04  systemd-udevd[1908781]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:04:04  systemd-udevd[1908832]: Using default interface naming scheme 'v247'.
Sep 21 15:04:04  systemd-udevd[1908832]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
...
Sep 21 15:18:46  zerotier-one[1908786]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:18:46  zerotier-one[1908786]:   what():  std::bad_cast
Sep 21 15:18:46  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:18:46  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:18:46  systemd[1]: zerotier-one.service: Consumed 1min 42.191s CPU time.
Sep 21 15:18:46  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 2.
Sep 21 15:18:46  systemd[1]: Stopped ZeroTier One.
Sep 21 15:18:46  systemd[1]: zerotier-one.service: Consumed 1min 42.191s CPU time.
Sep 21 15:18:46  systemd[1]: Started ZeroTier One.
Sep 21 15:18:46  zerotier-one[1921874]: Starting Control Plane...
Sep 21 15:18:46  zerotier-one[1921874]: Starting V6 Control Plane...
Sep 21 15:18:46  systemd-udevd[1921868]: Using default interface naming scheme 'v247'.
Sep 21 15:18:46  systemd-udevd[1921868]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:18:46  systemd-udevd[1921869]: Using default interface naming scheme 'v247'.
Sep 21 15:18:46  systemd-udevd[1921869]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:18:46  systemd-udevd[1921920]: Using default interface naming scheme 'v247'.
Sep 21 15:18:46  systemd-udevd[1921920]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:18:51  zerotier-one[1921874]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:18:51  zerotier-one[1921874]:   what():  std::bad_cast
Sep 21 15:18:51  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:18:51  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:18:51  systemd[1]: zerotier-one.service: Consumed 2.091s CPU time.
Sep 21 15:18:52  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 3.
Sep 21 15:18:52  systemd[1]: Stopped ZeroTier One.
Sep 21 15:18:52  systemd[1]: zerotier-one.service: Consumed 2.091s CPU time.
Sep 21 15:18:52  systemd[1]: Started ZeroTier One.
Sep 21 15:18:52  zerotier-one[1921991]: Starting Control Plane...
Sep 21 15:18:52  zerotier-one[1921991]: Starting V6 Control Plane...
Sep 21 15:19:02  zerotier-one[1921991]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:19:02  zerotier-one[1921991]:   what():  std::bad_cast
Sep 21 15:19:02  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:19:02  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:19:02  systemd[1]: zerotier-one.service: Consumed 4.136s CPU time.
Sep 21 15:19:02  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 4.
Sep 21 15:19:02  systemd[1]: Stopped ZeroTier One.
Sep 21 15:19:02  systemd[1]: zerotier-one.service: Consumed 4.136s CPU time.
Sep 21 15:19:02  systemd[1]: Started ZeroTier One.
Sep 21 15:19:02  zerotier-one[1922166]: Starting Control Plane...
Sep 21 15:19:02  zerotier-one[1922166]: Starting V6 Control Plane...
Sep 21 15:19:02  systemd-udevd[1922160]: Using default interface naming scheme 'v247'.
Sep 21 15:19:02  systemd-udevd[1922160]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:02  systemd-udevd[1922159]: Using default interface naming scheme 'v247'.
Sep 21 15:19:02  systemd-udevd[1922159]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:02  systemd-udevd[1922210]: Using default interface naming scheme 'v247'.
Sep 21 15:19:02  systemd-udevd[1922210]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:07  zerotier-one[1922166]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:19:07  zerotier-one[1922166]:   what():  std::bad_cast
Sep 21 15:19:07  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:19:07  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:19:07  systemd[1]: zerotier-one.service: Consumed 2.084s CPU time.
Sep 21 15:19:07  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 5.
Sep 21 15:19:07  systemd[1]: Stopped ZeroTier One.
Sep 21 15:19:07  systemd[1]: zerotier-one.service: Consumed 2.084s CPU time.
Sep 21 15:19:07  systemd[1]: Started ZeroTier One.
Sep 21 15:19:07  zerotier-one[1922283]: Starting Control Plane...
Sep 21 15:19:07  zerotier-one[1922283]: Starting V6 Control Plane...
Sep 21 15:19:07  systemd-udevd[1922278]: Using default interface naming scheme 'v247'.
Sep 21 15:19:07  systemd-udevd[1922278]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:07  systemd-udevd[1922277]: Using default interface naming scheme 'v247'.
Sep 21 15:19:07  systemd-udevd[1922277]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:07  systemd-udevd[1922327]: Using default interface naming scheme 'v247'.
Sep 21 15:19:07  systemd-udevd[1922327]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:12  zerotier-one[1922283]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:19:12  zerotier-one[1922283]:   what():  std::bad_cast
Sep 21 15:19:12  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:19:12  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:19:12  systemd[1]: zerotier-one.service: Consumed 1.971s CPU time.
Sep 21 15:19:12  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 6.
Sep 21 15:19:12  systemd[1]: Stopped ZeroTier One.
Sep 21 15:19:12  systemd[1]: zerotier-one.service: Consumed 1.971s CPU time.
Sep 21 15:19:12  systemd[1]: Started ZeroTier One.
Sep 21 15:19:12  zerotier-one[1922403]: Starting Control Plane...
Sep 21 15:19:12  zerotier-one[1922403]: Starting V6 Control Plane...
Sep 21 15:19:12  systemd-udevd[1922399]: Using default interface naming scheme 'v247'.
Sep 21 15:19:12  systemd-udevd[1922399]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:12  systemd-udevd[1922398]: Using default interface naming scheme 'v247'.
Sep 21 15:19:12  systemd-udevd[1922398]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:19:12  systemd-udevd[1922447]: Using default interface naming scheme 'v247'.
Sep 21 15:19:12  systemd-udevd[1922447]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:20:17  zerotier-one[1922403]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:20:17  zerotier-one[1922403]:   what():  std::bad_cast
Sep 21 15:20:17  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:20:17  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:20:17  systemd[1]: zerotier-one.service: Consumed 14.644s CPU time.
Sep 21 15:20:18  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 7.
Sep 21 15:20:18  systemd[1]: Stopped ZeroTier One.
Sep 21 15:20:18  systemd[1]: zerotier-one.service: Consumed 14.644s CPU time.
Sep 21 15:20:18  systemd[1]: Started ZeroTier One.
Sep 21 15:20:18  zerotier-one[1923279]: Starting Control Plane...
Sep 21 15:20:18  zerotier-one[1923279]: Starting V6 Control Plane...
Sep 21 15:20:18  systemd-udevd[1923274]: Using default interface naming scheme 'v247'.
Sep 21 15:20:18  systemd-udevd[1923274]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:20:18  systemd-udevd[1923273]: Using default interface naming scheme 'v247'.
Sep 21 15:20:18  systemd-udevd[1923273]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:20:18  systemd-udevd[1923326]: Using default interface naming scheme 'v247'.
Sep 21 15:20:18  systemd-udevd[1923326]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 21 15:20:33  zerotier-one[1923279]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 15:20:33  zerotier-one[1923279]:   what():  std::bad_cast
Sep 21 15:20:33  systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
Sep 21 15:20:33  systemd[1]: zerotier-one.service: Failed with result 'signal'.
Sep 21 15:20:33  systemd[1]: zerotier-one.service: Consumed 6.650s CPU time.
Sep 21 15:20:33  systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 8.
Sep 21 15:20:33  systemd[1]: Stopped ZeroTier One.
Sep 21 15:20:33  systemd[1]: zerotier-one.service: Consumed 6.650s CPU time.
Sep 21 15:20:33  systemd[1]: Started ZeroTier One.
Sep 21 15:20:33  zerotier-one[1923498]: Starting V6 Control Plane...
Sep 21 15:20:33  zerotier-one[1923498]: Starting Control Plane...

journalctl -u zerotier-one.service:

Sep 21 14:15:25 zerotier-one[1682037]: terminate called after throwing an instance of 'std::bad_cast'
Sep 21 14:15:25 zerotier-one[1682037]:   what():  std::bad_cast
Sep 21 14:15:25 systemd[1]: zerotier-one.service: Main process exited, code=killed, status=6/ABRT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit zerotier-one.service has exited.
░░
░░ The process' exit code is 'killed' and its exit status is 6.
Sep 21 14:15:25 systemd[1]: zerotier-one.service: Failed with result 'signal'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit zerotier-one.service has entered the 'failed' state with result 'signal'.
Sep 21 14:15:25 systemd[1]: zerotier-one.service: Consumed 31min 30.129s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit zerotier-one.service completed and consumed the indicated resources.
Sep 21 14:15:25 systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 1.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Automatic restarting of the unit zerotier-one.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Sep 21 14:15:25  systemd[1]: Stopped ZeroTier One.
░░ Subject: A stop job for unit zerotier-one.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit zerotier-one.service has finished.
░░
░░ The job identifier is 93475 and the job result is done.
Sep 21 14:15:25 systemd[1]: zerotier-one.service: Consumed 31min 30.129s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit zerotier-one.service completed and consumed the indicated resources.
Sep 21 14:15:25  systemd[1]: Started ZeroTier One.
░░ Subject: A start job for unit zerotier-one.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit zerotier-one.service has finished successfully.
░░
░░ The job identifier is 93475.
Sep 21 14:15:25 zerotier-one[1876000]: Starting Control Plane...
Sep 21 14:15:25  zerotier-one[1876000]: Starting V6 Control Plane...
Sep 21 14:20:03  systemd[1]: Stopping ZeroTier One...
░░ Subject: A stop job for unit zerotier-one.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit zerotier-one.service has begun execution.
░░
░░ The job identifier is 93611.
Sep 21 14:20:03  zerotier-one[1876000]: Control Plane Stopped
Sep 21 14:20:03  zerotier-one[1876000]: V6 Control Plane Stopped
Sep 21 14:20:04  systemd[1]: zerotier-one.service: Succeeded.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit zerotier-one.service has successfully entered the 'dead' state.
Sep 21 14:20:04 systemd[1]: Stopped ZeroTier One.
░░ Subject: A stop job for unit zerotier-one.service has finished
Sep 21 16:12:40 systemd[1]: zerotier-one.service: Scheduled restart job, restart counter is at 84.

... 84 restart jobs so far.

Ping:

```
64 bytes from 192.168.192.113: icmp_seq=201 ttl=64 time=27.131 ms
64 bytes from 192.168.192.113: icmp_seq=202 ttl=64 time=31.551 ms
Request timeout for icmp_seq 203
64 bytes from 192.168.192.113: icmp_seq=204 ttl=64 time=33.131 ms
64 bytes from 192.168.192.113: icmp_seq=205 ttl=64 time=35.822 ms
64 bytes from 192.168.192.113: icmp_seq=206 ttl=64 time=27.665 ms
64 bytes from 192.168.192.113: icmp_seq=207 ttl=64 time=27.355 ms
Request timeout for icmp_seq 208
64 bytes from 192.168.192.113: icmp_seq=209 ttl=64 time=36.238 ms
64 bytes from 192.168.192.113: icmp_seq=210 ttl=64 time=27.313 ms
64 bytes from 192.168.192.113: icmp_seq=211 ttl=64 time=27.291 ms
64 bytes from 192.168.192.113: icmp_seq=212 ttl=64 time=26.351 ms
64 bytes from 192.168.192.113: icmp_seq=213 ttl=64 time=29.436 ms
Request timeout for icmp_seq 214
64 bytes from 192.168.192.113: icmp_seq=215 ttl=64 time=30.112 ms
64 bytes from 192.168.192.113: icmp_seq=216 ttl=64 time=27.853 ms
Request timeout for icmp_seq 217
64 bytes from 192.168.192.113: icmp_seq=218 ttl=64 time=29.163 ms
64 bytes from 192.168.192.113: icmp_seq=219 ttl=64 time=26.628 ms
64 bytes from 192.168.192.113: icmp_seq=220 ttl=64 time=28.632 ms
Request timeout for icmp_seq 221
64 bytes from 192.168.192.113: icmp_seq=222 ttl=64 time=28.864 ms
64 bytes from 192.168.192.113: icmp_seq=223 ttl=64 time=29.901 ms
Request timeout for icmp_seq 224
64 bytes from 192.168.192.113: icmp_seq=225 ttl=64 time=33.486 ms
64 bytes from 192.168.192.113: icmp_seq=226 ttl=64 time=28.631 ms
64 bytes from 192.168.192.113: icmp_seq=227 ttl=64 time=27.899 ms
64 bytes from 192.168.192.113: icmp_seq=228 ttl=64 time=30.795 ms
64 bytes from 192.168.192.113: icmp_seq=229 ttl=64 time=34.670 ms
64 bytes from 192.168.192.113: icmp_seq=230 ttl=64 time=50.057 ms
64 bytes from 192.168.192.113: icmp_seq=231 ttl=64 time=29.791 ms
64 bytes from 192.168.192.113: icmp_seq=232 ttl=64 time=33.005 ms
Request timeout for icmp_seq 233
```

There does not seem to be a common activity, action, job, or error prior to each 'bad_cast' event. They also aren't happening at regular intervals: the first two are 5m apart, but later they are back to back.
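One low-tech way to hunt for a common trigger is to pull the log lines immediately preceding each crash (a sketch using standard grep; the unit name is the one from the logs above, and the sample input here is canned for illustration):

```shell
#!/bin/sh
# Sketch: show the line(s) of context before each bad_cast event.
# On the real box you would feed journalctl output instead, e.g.:
#   journalctl -u zerotier-one.service --no-pager | grep -B5 bad_cast
# Demonstrated here on a two-line canned sample:
CTX="$(grep -B1 'bad_cast' <<'EOF'
Sep 21 15:04:03 some earlier log line
Sep 21 15:04:04 zerotier-one[1879262]: terminate called after throwing an instance of 'std::bad_cast'
EOF
)"
echo "$CTX"
```

Comparing the captured context across several crashes may reveal a shared preceding event even when eyeballing the full log does not.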

The primary workload: Nextcloud hosting about 600 GB of files, but load is low and I am not seeing any correlation between the errors and Nextcloud load whenever ZT has a problem.

Version:

```
$ zerotier-cli status
200 info xxxxxxxxx 1.12.2 ONLINE
```

OS: Linux 5.10.0-25-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64

If you need anything else please let me know.

rakundig commented 1 year ago

Since you said you ran iperf, I decided to do so as well.

```
$ iperf3-darwin -c 192.168.192.113 -p 5201
Connecting to host 192.168.192.113, port 5201
[  7] local 192.168.192.79 port 64846 connected to 192.168.192.113 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
[  7]   0.00-1.00   sec  3.82 MBytes  32.1 Mbits/sec  13740   2.07 MBytes   27ms
[  7]   1.00-2.00   sec  0.00 Bytes   0.00 bits/sec   1665288  34.9 KBytes  28ms
[  7]   2.00-3.00   sec  10.1 MBytes  84.9 Mbits/sec  0        556 KBytes   40ms
[  7]   3.00-4.00   sec  6.51 MBytes  54.7 Mbits/sec  19236    419 KBytes   36ms
[  7]   4.00-5.00   sec  10.5 MBytes  87.8 Mbits/sec  0        462 KBytes   44ms
[  7]   5.00-6.00   sec  10.8 MBytes  90.4 Mbits/sec  0        483 KBytes   39ms
[  7]   6.00-7.00   sec  10.5 MBytes  88.0 Mbits/sec  0        523 KBytes   54ms
[  7]   7.00-8.00   sec  9.59 MBytes  80.5 Mbits/sec  5496     435 KBytes   54ms
[  7]   8.00-9.00   sec  10.4 MBytes  87.6 Mbits/sec  0        489 KBytes   39ms
[  7]   9.00-10.00  sec  10.7 MBytes  89.3 Mbits/sec  0        519 KBytes   34ms
[ ID] Interval           Transfer     Bitrate         Retr
[  7]   0.00-10.00  sec  82.9 MBytes  69.5 Mbits/sec  1703760             sender
[  7]   0.00-10.04  sec  82.3 MBytes  68.7 Mbits/sec                  receiver

iperf Done.
```

rakundig commented 1 year ago

Ran iperf (with -V) again after a 'bad_cast', with status showing the port issue, and got:

```
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
[  7]   0.00-1.00   sec  4.97 MBytes  41.6 Mbits/sec  2000544    706 KBytes   27ms
[  7]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec  98928    639 KBytes   27ms
[  7]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec  90684   1.06 MBytes   42ms
[  7]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  96180   1.62 MBytes   28ms
[  7]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec  93432   1.62 MBytes   28ms
[  7]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec  96180   1.62 MBytes   27ms
[  7]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec  98928   1.62 MBytes   31ms
[  7]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec  87936   1.60 MBytes   30ms
[  7]   8.00-9.00   sec  2.18 MBytes  18.3 Mbits/sec  68700    588 KBytes   37ms
[  7]   9.00-10.00  sec  1.11 MBytes  9.28 Mbits/sec  68700    174 KBytes   33ms
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  7]   0.00-10.00  sec  8.26 MBytes  6.93 Mbits/sec  2800212             sender
[  7]   0.00-10.03  sec  8.03 MBytes  6.71 Mbits/sec                  receiver
CPU Utilization: local/sender 0.8% (0.2%u/0.6%s), remote/receiver 0.0% (0.0%u/0.0%s)
rcv_tcp_congestion cubic
```

Odd thing is, I am ssh'd into the node over the ZT network... so ZT is working even with the errors. It is slow, with a lot of little 'freezes' while I am typing or cat'ing a file on the remote node, but it does work.

I did a restart of ZT again and the port error was gone; then I ran iperf again and got:

```
[  7] local 192.168.192.79 port 65243 connected to 192.168.192.113 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
[  7]   0.00-1.00   sec  4.52 MBytes  37.8 Mbits/sec  1690020    859 KBytes   56ms
[  7]   1.00-2.00   sec   912 KBytes  7.46 Mbits/sec  684252   1.28 MBytes   27ms
[  7]   2.00-3.00   sec  4.86 MBytes  40.9 Mbits/sec  41220    445 KBytes   36ms
[  7]   3.00-4.00   sec  10.6 MBytes  88.8 Mbits/sec    0    488 KBytes   42ms
[  7]   4.00-5.00   sec  10.6 MBytes  88.8 Mbits/sec    0    510 KBytes   51ms
[  7]   5.00-6.00   sec  5.51 MBytes  46.2 Mbits/sec  13740    107 KBytes   36ms
[  7]   6.00-7.00   sec  3.72 MBytes  31.2 Mbits/sec  1797192    762 KBytes   27ms
[  7]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec  96180   1.69 MBytes   28ms
[  7]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec  98928   1.69 MBytes   26ms
[  7]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec  96180   1.69 MBytes   28ms
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  7]   0.00-10.00  sec  40.7 MBytes  34.1 Mbits/sec  4517712             sender
[  7]   0.00-10.03  sec  39.2 MBytes  32.8 Mbits/sec                  receiver
CPU Utilization: local/sender 2.1% (0.4%u/1.7%s), remote/receiver 0.9% (0.0%u/0.9%s)
rcv_tcp_congestion cubic
```

laduke commented 1 year ago

Thank you. We need to figure out what is triggering it. Does it happen on any of your other nodes?

rakundig commented 1 year ago

First I'll say ZeroTier is overall great.

As for other nodes with the issue: not that I can find, but this is Debian and the other nodes are Ubuntu. Not a huge difference, but maybe enough?

Also the other nodes are installed via snap vs downloading and installing.

So I don't have an exact comparison.

The node with this issue is an “older” design/setup (older in this case is just over a year), and since then we’ve moved to Ubuntu and snap install for ZT.

I'm not saying snaps are great or that they solve it. We've tried them for a few services and I don't really like them in general, but the ZT snap hasn't had any issues so far, so I've let it be. I'm just pointing out that that's how ZT is installed on the other nodes, and why they aren't a good comparison.

We've considered installing ZT via download/install alongside a snap install, but there are obvious concerns about conflicts between two ZeroTier overlays on the same node, and we have yet to attempt it in the lab.

As for why we don't just zap the node and set it up like the others without issues: it's deployed at a remote customer site and is business critical for them, so they are extremely reluctant to have it wiped and reinstalled with new software.

Their contract isn't up for almost a year, so I can't do much until then unless they agree. We've even offered to swap in a newer box, but they declined and prefer we fix it rather than swap it. They worry the swap would cause issues; it wouldn't, and for us it's pretty simple, but they're concerned about changes.

Silver lining: that means it's a critical tool they rely on, so that's a plus 😁; otherwise they wouldn't care about it breaking or being swapped.

Unless I can resolve it somehow, I'll need to do whatever I can to convince them to let me reset it and set it up like the others. It's failing too much and causing unavailability on an almost daily basis.

In the meantime I've got a workaround: a restart script run via cron that checks the port file and restarts ZT if it's empty. It's helped, but it's a bandaid.
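For anyone wanting the same bandaid, here is a minimal dry-run sketch of that cron check (the path is from this thread; the restart command is an assumption, and it is only echoed here rather than executed):

```shell
#!/bin/sh
# Dry-run sketch of the cron workaround: if the port file exists but is
# empty, the daemon is in the bad state described in this issue.
PORTFILE="${PORTFILE:-/var/lib/zerotier-one/zerotier-one.port}"
if [ -f "$PORTFILE" ] && [ ! -s "$PORTFILE" ]; then
    ACTION="systemctl restart zerotier-one.service"
else
    ACTION="none"
fi
echo "action: $ACTION"   # in the real cron job, run the command instead of echoing it
```

Run every few minutes from root's crontab, this keeps the outage window short without masking the underlying crash.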

These issues began recently, after upgrading from 1.10 to 1.12.

Downgrading to 1.10 is being looked at, but again it’s at a remote client site and as everyone knows it’s a risk to fiddle with networking from a remote location.

So we're developing scripts to perform the downgrade, accounting for every scenario we can think of, so that once we remove the current version and lose remote access, the script installs 1.10 and restores connectivity without issue.

We're also looking through our excess/old hardware for something like the HW in that node, so we can install the same OS and software for testing and troubleshooting in the lab. Most everything in the lab is VMs; we can try to emulate it there, but it isn't the same as bare metal, so the test is only conclusive if it fails in exactly the same way, which would indicate it's purely software and HW isn't a factor.

Unfortunately there's no apparent cause beyond a possible bug, perhaps one only exposed on that OS at that version/kernel/patch level, or on that hardware spec and firmware. There's no pattern to the restarts: there are quiet periods, periods with a few restarts and varying time between them, and sometimes continuous back-to-back restarts for hours.

It can happen at midnight or noon, or not at all. It doesn't appear load related, not obviously anyhow.

And the logs don't show any correlating event before or near the restarts. Often there is no logged activity between the more frequent restarts, and the time between less frequent ones holds only normal INFO entries of various sorts.

Turning up logging didn’t provide any new insights, just larger logs.

We haven’t tried a trace yet, but that’s about all that hasn’t been tried.

While it's at a client's site, it's fully managed by us; they can't even log into the node itself. Just to clarify, it's not a "too many cooks in the kitchen" scenario.

And since it's one of only a couple deployed outside our lab, and a POC, it's "special": continual monitoring, configuration controls, and lots of eyeballs watching to make sure it goes OK and to learn from any issues, so we can improve each deployment until we're comfortable making it standard.

Since it's a POC, it's at a very friendly client that's informed of its testing nature and the risks, and they're being great about the recent issues, but even so, that'll only last so long.

Oh, and it's a "vanilla" install, i.e. no ZT policies etc. in place. It's a couple of nodes (on-site and off-site for DR), 4 client Windows devices that use it, and a few of our MacBooks for management, and nothing else. We didn't plan to start adding policies until we knew it was working OK with a default install, without any changes or policies that might affect how it works, or anything we did that could mess it up, like adding a buggy policy.

It was installed on-site going on 10 months ago, after running in the shop for a bit more than a month. No issues until recently.

The only difference between the prior nine-ish months and now is the upgrade to the latest ZT version. That's the only change, aside from a few apt update/upgrades on the node.

Wish I had more. If there’s some way to dial up ZT logging to get lots more info on its internal workings or any other tricks to get better details I’m all ears.

Alternatively, if you have downgrade recommendations that minimize the risk of losing connectivity if something goes awry, I welcome the advice. At the moment we're considering loading another agent to establish a backup tunnel just in case. Not perfect, but it's a 90% solution. We've occasionally seen issues with multiple overlay softwares on the same node, leading to one failing, intermittent connections, or routing hiccups, so it's no guarantee.

ZT is great software and we've been using it, and expanding its use internally, for years without any real issues. This is the first of two nodes in "PoC" use outside ourselves and the lab. It mostly just works, except this one node.


laduke commented 1 year ago

Thanks for writing that up. We've been looking into it more but no solid leads yet.

jcleng commented 4 months ago

Also seeing this in Docker:

```
04:30:45 root@vps ~ _ docker exec -it myzerotier bash
root@vps:/# zerotier-one
Starting V6 Control Plane...
Starting Control Plane...
terminate called after throwing an instance of 'std::bad_cast'
  what():  std::bad_cast
Aborted (core dumped)
```

laduke commented 4 months ago

Please provide steps to reproduce.