xdp-project / xdp-tutorial


Performance degradation of XDP??? #292

Closed huang195 closed 2 years ago

huang195 commented 2 years ago

So, today we did a measurement of network performance comparing native performance vs. the simplest XDP program, and it was a bit of a shocker. The results were obtained running iperf3 with all default parameters. The client and server processes were created in separate network namespaces with separate veth pairs and are connected via a Linux bridge in the host network namespace (pretty much the same as what testenv.sh does).

Without any eBPF programs, we get about 45 Gbps, and with an eBPF program attached to the XDP hook point, we get 8 Gbps. The eBPF program is just:

SEC("xdp_stats")
int xdp_stats_prog(struct xdp_md *ctx)
{
    return XDP_PASS;
}

and it's loaded and attached using xdp-loader like so:

xdp-loader load -p /sys/fs/bpf/veth-ns1 -s xdp_stats veth-ns1 xdp_kern.o

Since we're attaching additional code in the packet path, I expected a very minor performance hit, but going from 45 Gbps to 8 Gbps was rather unexpected.

I'm running the experiments on a Fedora 35 VM in VirtualBox on a Mac. I've also tried running the same experiment in other setups; the results are not exactly the same, but I see a significant drop in throughput in all of them. Anything I might have overlooked?

huang195 commented 2 years ago

One other interesting thing to note is that since I didn't specify a -m parameter to xdp-loader, it defaults to native mode. However, if I add -m skb to the above xdp-loader command, I get 20 Gbps. This is very counter-intuitive, and it is still very far from the native throughput of 45 Gbps.
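For reference, the skb-mode attach is just the same command with the mode flag added (same paths and names as above):

xdp-loader load -m skb -p /sys/fs/bpf/veth-ns1 -s xdp_stats veth-ns1 xdp_kern.o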

tohojo commented 2 years ago

and it's loaded and attached using xdp-loader like so:

xdp-loader load -p /sys/fs/bpf/veth-ns1 -s xdp_stats veth-ns1 xdp_kern.o

So you're running a veth-to-veth test? Running XDP on a veth causes TSO superpackets to be split up into individual packets, and may cause data copying because the frames need to be linearised. This is the cause of the slowdown; if you're doing things on veth devices, I'd suggest using TC-bpf instead of XDP.
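If you want to see which offloads are in play on the veth, something like this will show them (standard ethtool; exact feature names can vary between kernel versions):

ethtool -k veth-ns1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'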

huang195 commented 2 years ago

So you're running a veth-to-veth test?

Yes, I created 2 network namespaces, each with its own veth pair. The xdp program is attached to the veth ends that sit in the host network namespace and are connected via a Linux bridge.

Running XDP on a veth causes TSO superpackets to be split up into individual packets, and may cause data copying because the frames need to be linearised. This is the cause of the slowdown; if you're doing things on veth devices, I'd suggest using TC-bpf instead of XDP.

Thank you for pinpointing exactly what the issue was. Initially, I suspected it was something related to GRO or GSO. Since the iperf3 client generates almost all of the traffic, I detached the XDP program from the iperf3 client side, and the iperf3 throughput went back up to native performance. This is a good indication that TSO is the culprit here.
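For the record, detaching was done with something along these lines (assuming the program was attached on veth-ns1; the --all flag removes every XDP program on the interface):

xdp-loader unload veth-ns1 --all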

if you're doing things on veth devices, I'd suggest using TC-bpf instead of XDP.

Yep, I will run similar experiments on TC hook points next and see if there is anything unexpected. But are you also hinting that XDP programs should NOT be attached to veth devices? I guess that makes sense, since packets sent from a container network namespace will always be received on the other end of the veth pair in the host network namespace, and this makes XDP programs that are meant for the ingress packet path receive packets that are logically on the egress path. Is this the correct way to look at this, or are there any valid use cases for attaching XDP programs to veth devices?

You mentioned attaching an XDP program causes "data copying because frames need to be linearized". Can you elaborate a bit more on this? Doesn't that interfere with offloading on the packet receiving path, i.e., GRO?

tohojo commented 2 years ago

Yep, I will run similar experiments on TC hook points next and see if there is anything unexpected. But are you also hinting that XDP programs should NOT be attached to veth devices? I guess that makes sense, since packets sent from a container network namespace will always be received on the other end of the veth pair in the host network namespace, and this makes XDP programs that are meant for the ingress packet path receive packets that are logically on the egress path. Is this the correct way to look at this, or are there any valid use cases for attaching XDP programs to veth devices?

If you also run XDP on a physical interface, and use XDP_REDIRECT to send frames into a veth device, they will be preserved in XDP frame form as they traverse through the veth; so if you run another XDP program on the other end of the veth pair, you can process the XDP frame directly, before the kernel builds an SKB out of it.
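A minimal sketch of the physical-NIC side of that pattern (the interface names and ifindex are illustrative assumptions; a real setup would typically use a devmap and look up the veth ifindex at load time):

/* Hypothetical: attached to the physical NIC (e.g. eth0), this redirects
 * every frame into the host-side veth, so the XDP program on the peer end
 * of the veth pair sees raw XDP frames instead of skbs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define VETH_HOST_IFINDEX 4  /* assumed ifindex of veth-host; check `ip link` */

SEC("xdp")
int xdp_redirect_to_veth(struct xdp_md *ctx)
{
    /* bpf_redirect() returns XDP_REDIRECT on success */
    return bpf_redirect(VETH_HOST_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";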

In all other cases, packets traversing a veth already have an skb assigned, so you don't really get any speedup from XDP (most of the speedup comes from not building an skb at all). You can still run XDP for compatibility reasons (running the same code on physical and virtual devices, for instance), but since the skb is already there you might as well use tc-bpf, so you can actually access it and its metadata fields.

You mentioned attaching an XDP program causes "data copying because frames need to be linearized". Can you elaborate a bit more on this? Doesn't that interfere with offloading on the packet receiving path, i.e., GRO?

It's only done when needed: XDP assumes a frame is linear, but depending on how the skb was built it may not be (header split in hardware, TCP copy-on-write, etc). See: https://elixir.bootlin.com/linux/latest/source/drivers/net/veth.c#L703

huang195 commented 2 years ago

Thank you for that explanation; that made things clearer for me. In the example you gave, if the setup is:

eth0: host NIC device
veth-host: veth end that sits in the host namespace
veth-guest: veth end that sits in the container namespace

What you were saying is an XDP program should be attached to eth0 and veth-guest, so that when a packet is received on eth0, it can be XDP_REDIRECT'ed to veth-host, which will send the packet to veth-guest, and when veth-guest's XDP program receives it, the XDP frame will remain intact. This would allow opportunities to optimize by not sending the packet up the network stack. Did I understand you correctly?

huang195 commented 2 years ago

This is the cause of the slowdown; if you're doing things on veth devices, I'd suggest using TC-bpf instead of XDP.

So I tried attaching the simplest eBPF program to the TC ingress and egress hooks (i.e., it just returns TC_ACT_OK), and there is again a pretty significant throughput drop in iperf3. It's not as bad as with XDP, but it's still noticeable: in this particular VM, veth-to-veth throughput dropped from 36 Gbps to 30 Gbps. TBH, I wasn't really expecting any performance drop, since eBPF programs are supposed to run inline in kernel context. Could this still be caused by things like TSO, GSO, etc.?
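The program is essentially just this (a sketch reconstructed from the description above; the section names match the tc commands below):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Minimal TC programs: accept every packet on both ingress and egress */
SEC("tc_ingress")
int tc_ingress_prog(struct __sk_buff *skb)
{
    return TC_ACT_OK;
}

SEC("tc_egress")
int tc_egress_prog(struct __sk_buff *skb)
{
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";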

I used the following commands to load and attach my eBPF program:

.output/iproute2/tc qdisc add dev veth-ns1 clsact
.output/iproute2/tc filter add dev veth-ns1 ingress bpf da obj .output/tc_kern.o sec tc_ingress
.output/iproute2/tc filter add dev veth-ns1 egress bpf da obj .output/tc_kern.o sec tc_egress

This is done for both iperf3 client and server sides, and for both ingress and egress.

tohojo commented 2 years ago

What you were saying is an XDP program should be attached to eth0 and veth-guest, so that when a packet is received on eth0, it can be XDP_REDIRECT'ed to veth-host, which will send the packet to veth-guest, and when veth-guest's XDP program receives it, the XDP frame will remain intact. This would allow opportunities to optimize by not sending the packet up the network stack. Did I understand you correctly?

Yes, exactly. Depends on your use case if this is something that makes sense, of course...

TBH, I wasn't really expecting any performance drop, since eBPF programs are supposed to run inline in kernel context.

Well, it runs in kernel context, but it's still code that's running, so the overhead is not 0. There's a static branch being enabled when you install clsact, and there's some overhead of executing the BPF program as well. The latter can be significant depending on your kernel config: it uses an indirect call, which becomes fairly expensive if you've enabled spectre mitigations.

Could this still be caused by things like TSO, GSO, etc.?

I would expect the overhead would be larger if it's caused by lack of TSO/GSO...

huang195 commented 2 years ago

The latter can be significant depending on your kernel config: it uses an indirect call, which becomes fairly expensive if you've enabled spectre mitigations.

Oh I did not know that. My VM looks like this:

root@fedora:~/xdp/src$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
root@fedora:~/xdp/src$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 
Mitigation: Retpolines, STIBP: disabled, RSB filling

Does this mean running BPF programs on this machine will be expensive?

Through this whole exercise of doing performance testing on simple BPF programs attached to different places in the network stack, it feels like they can significantly impact performance because of i) things like TSO/GSO, ii) running BPF programs can sometimes be expensive due to things like Spectre mitigations, and iii) the overhead of the clsact qdisc required for attaching to TC hook points, which otherwise wouldn't even be needed. I wonder how network solutions that use BPF, like Cilium or Calico, deal with issues like these, as runtime environments can differ significantly from one another.

The point of using BPF programs in the network stack is to shorten code execution paths, but it feels like doing so also adds overhead, so there seems to be a fine balance to strike.

tohojo commented 2 years ago

The latter can be significant depending on your kernel config: it uses an indirect call, which becomes fairly expensive if you've enabled spectre mitigations.

Oh I did not know that. My VM looks like this:

root@fedora:~/xdp/src$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
root@fedora:~/xdp/src$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Retpolines, STIBP: disabled, RSB filling

Does this mean running BPF programs on this machine will be expensive?

The retpolines make indirect calls a lot more expensive than they would be otherwise. Not all BPF program types use indirect calls, but TC-BPF is one of the ones that do.

Through this whole exercise of doing performance testing on simple BPF programs attached to different places in the network stack, it feels like they can significantly impact performance because of i) things like TSO/GSO, ii) running BPF programs can sometimes be expensive due to things like Spectre mitigations, and iii) the overhead of the clsact qdisc required for attaching to TC hook points, which otherwise wouldn't even be needed. I wonder how network solutions that use BPF, like Cilium or Calico, deal with issues like these, as runtime environments can differ significantly from one another.

The point of using BPF programs in the network stack is to shorten code execution paths, but it feels like doing so also adds overhead, so there seems to be a fine balance to strike.

As I like to remind people: BPF is not magic fairy dust :)

Like any technology it has benefits and drawbacks and you'll need to evaluate them for your use case. While the overhead of BPF is quite low, it's not 0; in many cases the capabilities outweigh any overhead (for instance, achieving the same thing you can do with BPF by other means can be just as expensive if not more), but, well, that's up to you to evaluate...

huang195 commented 2 years ago

@tohojo thank you for the information!

huang195 commented 2 years ago

@tohojo I just heard a talk you co-hosted today, and one of the slides compared XDP vs. the Linux kernel stack when forwarding from a 100 Gbps link to a 10 Gbps link, and it showed XDP getting only 2-3 Gbps due to the lack of queueing in the XDP path. I'm wondering if the lack of queueing in the XDP path plays a role in what I was observing, or is TSO the main culprit here?

tohojo commented 2 years ago

@tohojo I just heard a talk you co-hosted today, and one of the slides compared XDP vs. the Linux kernel stack when forwarding from a 100 Gbps link to a 10 Gbps link, and it showed XDP getting only 2-3 Gbps due to the lack of queueing in the XDP path. I'm wondering if the lack of queueing in the XDP path plays a role in what I was observing, or is TSO the main culprit here?

No, I don't think your issue has anything to do with the lack of queueing: when you're returning XDP_PASS, the packets go straight to the networking stack and any queueing will be handled there... The thing Frey showed on that graph only happens when you forward packets using XDP_REDIRECT...

huang195 commented 2 years ago

Thank you for the clarification. Great presentation!