grapexy opened this issue 2 years ago
Noticed this comment by @joseph-henry on a similar use case:
My use case involves having the edge router on a mobile platform, which means the signal strength (and bandwidth on my WAN interfaces on the edge) will vary depending on the location. Do you still recommend using balance-xor for this 2-to-1 setup (where the edge has 2 physical WAN interfaces and the server has 1 physical WAN + 1 sub-interface)?
Yes I'd use the `balance-xor` for this. Our `balance-aware` mode needs some work before it would be useful in this case.
Is `balance-aware` supposed to work properly for many-to-one asymmetric, dynamic links now, or does it still need more work? Also curious: what kind of work was/is required?
After a few days of testing it got better, though I'm not sure what changed. Now I'm getting around 100-120 Mbps of aggregated bandwidth, but that's only marginally more than the speed of a single link.
I've also noticed that multiple paths are used for single-stream TCP connections (e.g. rsync, or iperf3 without parallel connections). This is the case for `balance-xor`, or `balance-aware` with `flow-dynamic` or `flow-static`. tcpdump shows that ZeroTier is using both links with a roughly 50/50 allocation under any of these settings. I was under the impression that hashing `src_port ^ dst_port ^ proto` would result in only a single path being used in this case.
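To spell out that expectation: if the policy really keyed path selection off a hash of `src_port ^ dst_port ^ proto`, a single TCP stream would always map to the same link, since all three inputs are constant for the life of the connection. A minimal sketch (the function names and port numbers are illustrative, not ZeroTier's actual code):

```cpp
#include <cstdint>

// Illustrative flow hash as described in the thread: src_port ^ dst_port ^ proto.
uint32_t flowHash(uint16_t srcPort, uint16_t dstPort, uint8_t proto) {
    return static_cast<uint32_t>(srcPort) ^ dstPort ^ proto;
}

// A hash-based policy would pin a flow to hash % numLinks. A single TCP
// stream has constant ports and protocol, so the hash -- and therefore the
// chosen link -- should never change mid-stream.
int linkForFlow(uint32_t hash, int numLinks) {
    return static_cast<int>(hash % static_cast<uint32_t>(numLinks));
}
```

So a 50/50 split across both links for one TCP stream suggests the hash is not actually being consulted.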
Another observation is that CPU usage jumps from around 5-10% to about 60% as soon as any bonding policy is enabled.
Found the culprit. If any custom policy (i.e. one with a custom name) is used, even with default settings, flow assignment does not happen (I noticed there were no `assign in-flow` debug messages in the logs with custom policies). Changing `defaultBondingPolicy` to a predefined policy properly assigns traffic to a single flow. This means that custom policies are effectively broken.
For example, the following fails to produce any `assign in-flow` etc. debug messages, and flow hashing does not happen, which also results in same-hash flows going in and out through multiple links:
```json
"defaultBondingPolicy": "custom-balance-aware",
"policies": {
    "custom-balance-aware": {
        "basePolicy": "balance-aware"
    }
}
```
This, however, works (produces logs, and same-hash flows stay on a single path):

```json
"defaultBondingPolicy": "balance-aware"
```
And since using standard policy names for custom policies is itself an error condition (`error: custom policy (balance-aware) will be ignored, cannot use standard policy names for custom policies`), any custom configuration of policies is currently impossible. The docs should probably also be updated so their examples don't trigger this error.
Thanks for reporting this. I'll take a peek today. Can you tell me which branch you're using? I often use a custom policy and don't see this issue.
I think there was a similar issue a while back but that was fixed.
@joseph-henry both peers are on 1.10.1
Saw the commit related to link selection and can say that this still persists on dev.
I'm guessing that flows don't go through the standard path selection in the bonding layer and instead go in and out on random paths, hence the missing `assign out-flow` / `assign in-flow` debug messages whenever a custom policy is active. I don't know much C, so I couldn't really find out why, though.
@joseph-henry I believe the issue here is that flow hashing is never enabled for custom policies that need it.
When bonds are initialized, `_defaultPolicy` is set to 0 for custom policies:

https://github.com/zerotier/ZeroTierOne/blob/04d1862e3ae0d916f78779a9fc0f058b25fd469d/node/Bond.hpp#L335-L353
https://github.com/zerotier/ZeroTierOne/blob/04d1862e3ae0d916f78779a9fc0f058b25fd469d/service/OneService.cpp#L2047-L2048
So when `setBondParameters` is called, `_defaultPolicy` and `_policy` always evaluate to 0:
And when the lines responsible for enabling flow hashing are reached, we're evaluating the defaults for `ZT_BOND_POLICY_NONE` (0) rather than for the custom policy's base policy, because the real policy is only set further down, which does nothing to enable flow hashing:

https://github.com/zerotier/ZeroTierOne/blob/04d1862e3ae0d916f78779a9fc0f058b25fd469d/node/Bond.cpp#L1791-L1794
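The failure mode described above can be condensed into a small sketch. This is a hypothetical, simplified reconstruction: the names mirror the thread, but it is not the actual ZeroTier code.

```cpp
#include <map>
#include <string>

// Simplified model of the reported bug: flow hashing is decided from the
// numeric default policy, which is NONE (0) whenever a custom (named)
// policy is configured, so the base policy's need for flow hashing is
// never consulted.
enum Policy { NONE = 0, ACTIVE_BACKUP, BROADCAST, BALANCE_RR, BALANCE_XOR, BALANCE_AWARE };

struct Bond {
    int _policy = NONE;
    bool _flowHashing = false;

    void setBondParameters(int defaultPolicy, const std::string& basePolicyName) {
        // Bug: evaluated while defaultPolicy is still NONE for custom policies.
        _flowHashing = (defaultPolicy == BALANCE_XOR || defaultPolicy == BALANCE_AWARE);

        // The effective policy is resolved later from the base policy name,
        // but _flowHashing is never revisited.
        static const std::map<std::string, int> byName = {
            { "balance-xor", BALANCE_XOR },
            { "balance-aware", BALANCE_AWARE },
        };
        auto it = byName.find(basePolicyName);
        if (it != byName.end())
            _policy = it->second;
    }
};
```

In this model, a bond built from a custom `balance-aware` policy ends up with `_policy == BALANCE_AWARE` but `_flowHashing == false`, which matches the missing `assign in-flow` log messages.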
I have Many-to-One multipath configured with 2 WAN links on site A and 1 WAN link on site B.
Both of site A links are symmetric 100 Mbps. Site B link is 1 Gbps.
Site B is configured as the default gateway for all Site A outbound connections, and the only NAT is ZT > WAN. There is no NAT for the tunnel.
Latency from Site A to Site B is ~40ms and both are using ZT version 1.10.1.
With `balance-xor` or `balance-aware` on both sides, I'm able to get 200 Mbps with iperf3 UDP:
However, iperf3 TCP (10 parallel connections) with the same settings is extremely slow:
Testing a single-stream wget download is better, but still unable to achieve even a single link's throughput:
Site A local.conf:
Site B local.conf:
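(The actual local.conf contents aren't reproduced above. For context, a minimal bonding section in a ZeroTier `local.conf` has this shape; this is illustrative only, using just the keys mentioned in this thread under the standard top-level `settings` object, not the poster's real configuration:)

```json
{
  "settings": {
    "defaultBondingPolicy": "balance-aware"
  }
}
```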
OPNsense on both ends; however, I've also tried OpenWRT on Site A and got the same results.
I've also tried adding two more links of dynamic speed (LTE) and still got the exact same results.
The `bond show` command displays 2 links being used, port 9993 is open on both ends, and trace logs show that all links are being used.

Just as a note, I was previously using MPTCP (openmptcprouter) and was able to achieve 200 Mbps with the default configuration, and up to 400 Mbps (with some tinkering) after adding 2 more dynamic links. I was hoping ZT could do roughly the same, since ZT allows for far more advanced configurations. Are these results expected?