xdp-project / bpf-examples

Making eBPF programming easier via build env and examples

Can the XDP program attached to the CPUMAP entry redirect the packet to AF_XDP instead of networking stack? #105

Open arukshani opened 9 months ago

arukshani commented 9 months ago

Hi,

We are running our application inside a namespace, with an XDP program on a veth that redirects packets to a user-space AF_XDP program. We see that our application and the IRQ processing end up on the same core, which limits throughput. To separate the two, is it possible to attach an XDP program to a CPUMAP entry so that it redirects packets to AF_XDP instead of passing them to the normal networking stack?

Thank you.
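
For context, a minimal sketch (not from this thread) of the kind of veth-attached program described in the question: it redirects every packet into an XSKMAP so the AF_XDP socket in user space receives it, and falls back to the stack when no socket is bound. Map and function names are illustrative.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* XSKMAP keyed by rx queue index; user space inserts the AF_XDP
 * socket fd for the queue it is bound to. */
struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_xsk(struct xdp_md *ctx)
{
	__u32 qid = ctx->rx_queue_index;

	/* Fall back to the normal stack if no socket is bound to this queue. */
	return bpf_redirect_map(&xsks_map, qid, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```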

tohojo commented 9 months ago

Hmm, that's a good question! I'm not actually sure - @magnus-karlsson will hopefully know :)

magnus-karlsson commented 9 months ago

Never tried anything like that. But if you redirect something with a CPUMAP, how will you be able to redirect it to an XSKMAP too? I would instead just change the IRQ affinity mask so that the IRQ processing cannot run on the same core as your application. Or the opposite: make sure your user-space program has a CPU mask that does not include the core the IRQ runs on.
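
A minimal sketch of the second option mentioned here, assuming the IRQ/softirq processing is steered to some other core and the AF_XDP application should stay on its own; the core number is illustrative.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Restrict the calling process to a single core so it can never share a
 * core with the (separately steered) IRQ processing. */
static int pin_to_core(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);

	/* 0 == the calling process */
	return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	if (pin_to_core(1)) {	/* core 1 is just an example */
		perror("sched_setaffinity");
		return 1;
	}
	/* ... set up the AF_XDP socket and run the packet loop here ... */
	return 0;
}
```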

tohojo commented 9 months ago

Magnus Karlsson writes:

Never tried anything like that. But if you redirect something with a CPUMAP, how will you be able to redirect it to an XSKMAP too?

CPUMAP supports running a second XDP program after the redirect (as does devmap); this can be used as a kind of software RSS if your NIC doesn't support the kind of hashing you want, for example. That second program can itself return XDP_REDIRECT, into another map. Not sure what would happen with an XSKMAP in this case, though, as you're no longer in driver context, so I dunno if the RXQ binding works at all there? That's what I was hoping you'd know ;)
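
An untested sketch of the chain being discussed, assuming a reasonably recent kernel and libbpf (SEC("xdp/cpumap") loads the second program with expected_attach_type BPF_XDP_CPUMAP): the program on the veth redirects into a CPUMAP, and the program attached to the CPUMAP entry then tries to redirect into an XSKMAP. Whether the XSKMAP redirect actually works outside driver context is exactly the open question here; all names and indices are illustrative.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks_map SEC(".maps");

/* First hop: runs on the veth, moves further processing to another CPU. */
SEC("xdp")
int xdp_to_cpu(struct xdp_md *ctx)
{
	__u32 target_cpu = 3;	/* illustrative */

	return bpf_redirect_map(&cpu_map, target_cpu, XDP_PASS);
}

/* Second hop: runs on the target CPU, after the CPUMAP redirect. */
SEC("xdp/cpumap")
int xdp_cpumap_to_xsk(struct xdp_md *ctx)
{
	/* Index 0 is illustrative; it is unclear (per the discussion above)
	 * whether the usual rx-queue-based XSKMAP lookup applies here. */
	return bpf_redirect_map(&xsks_map, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```

On the user-space side, the CPUMAP entry would be updated with a struct bpf_cpumap_val whose qsize sets the per-CPU queue depth and whose bpf_prog.fd points at the second program.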

magnus-karlsson commented 9 months ago

Yes, you are right. I forgot about that one, though I have no idea what would happen in this case. Try it out, or just use the affinities as in the previous mail. That does work.

arukshani commented 9 months ago

Hi, I changed the IRQ affinity mask via rps_cpus for both veth pairs, but it only worked for the veth that does not have any XDP program on it. For example, the following is our setup. When I changed rps_cpus for both veth0 and veth1, it only works for veth0 (for ACK packets). Packets received by veth1 from veth0 (softirq) still end up being processed on the iperf core. [veth1 has an XDP program to direct packets to AF_XDP.] @tohojo @magnus-karlsson

(image: setup)

However, when there are no XDP programs on the veths, RPS works perfectly fine. We confirmed that with the following setup. Is there a reason why rps_cpus can't work when there is an XDP program running on the veth? Is this a bug? Thank you!

(image: no_xdp_stuff)
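
For reference, the rps_cpus setting described above is just a per-queue CPU bitmask in sysfs; a minimal sketch, assuming veth1's queue 0 should be steered to core 3 (bit 3, i.e. mask 0x8):

```c
#include <stdio.h>

int main(void)
{
	/* Equivalent to: echo 8 > /sys/class/net/veth1/queues/rx-0/rps_cpus */
	const char *path = "/sys/class/net/veth1/queues/rx-0/rps_cpus";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%x\n", 1 << 3);	/* bitmask: bit 3 set => core 3 only */
	fclose(f);
	return 0;
}
```
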
magnus-karlsson commented 9 months ago

I have never used RPS in conjunction with this. But regardless, let me check that I understand this. You have 2 cores in each namespace, and you would like the NIC interrupt processing plus veth1 to run on one core and the rest (veth0 + iperf) on the other core? And what you are seeing is that only the NIC runs on the first core and the rest on the second?

arukshani commented 9 months ago

@magnus-karlsson No, not exactly. The following is what we want:

- Iperf client is pinned to Core 1
- Veth0 IRQ processing is set to Core 2 (via /sys/class/net/veth0/queues/rx-0/rps_cpus)
- Veth1 IRQ processing is set to Core 3 (via /sys/class/net/veth1/queues/rx-0/rps_cpus)

The physical NIC is not a problem; its IRQ happens on a different core anyway.

The problem is that veth1 IRQ processing happens on the iperf client core (Core 1 instead of Core 3). We want the veth1 IRQ processing to happen on Core 3, as set by the rps_cpus value. This works fine when there is no XDP program running on veth1, but when there is an XDP program on veth1, the IRQ processing ends up on the iperf client core. We want to separate the iperf client core from veth1 IRQ processing.

Thank you!

magnus-karlsson commented 9 months ago

Thanks for the info. This likely has something to do with the fact that the execution model of veth changes when you install an XDP program. It starts to use NAPI, which runs in softirq context, and a softirq is basically a deferred interrupt that can be executed at a number of places in the kernel. But I do not know the veth driver in any detail, so I do not know how it handles all this. With a regular NIC driver, you always have an interrupt you can steer, and that triggers the processing.

I saw that the veth driver uses napi_schedule(). When I implemented AF_XDP zero-copy support in a number of drivers, I could never use this, as it moved driver execution around between cores when I wanted it to stay on one core, just like you do. Instead I had to rely on sending an IPI to the driver, which always ended up triggering an interrupt and NAPI processing on the right core. But for veth, that does not exist, I believe. Maybe you should take this to someone who knows the veth driver for some better insight?

So I would say that this has nothing to do with AF_XDP per se. Installing an XDP program should be enough to trigger this change in behavior. AF_XDP is only some extra code after XDP (at least in the copy mode you are using here).

Just note that all this is a theory as I have not run anything myself.

arukshani commented 9 months ago

@magnus-karlsson Thank you so much for the clarification. I was able to route the veth IRQ processing to a different core by applying the SO_BUSY_POLL socket option to the AF_XDP socket. Thank you!
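
A minimal sketch of the kind of setsockopt calls involved, assuming the AF_XDP socket fd already exists (e.g. from xsk_socket__fd()); the timeout and budget values are illustrative, and SO_PREFER_BUSY_POLL / SO_BUSY_POLL_BUDGET (added in kernel 5.11) go beyond the bare SO_BUSY_POLL that the comment above confirms using.

```c
#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69	/* uapi value; missing from older headers */
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET	70	/* uapi value; missing from older headers */
#endif

/* Enable busy polling on an already-created AF_XDP socket so that rx/tx
 * processing is driven from the application's own syscalls (and core)
 * instead of from softirq context elsewhere. */
static int enable_busy_poll(int xsk_fd)
{
	int timeout_us = 20;	/* illustrative */
	int prefer = 1;
	int budget = 64;	/* illustrative */

	if (setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL,
		       &timeout_us, sizeof(timeout_us)))
		return -1;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &prefer, sizeof(prefer)))
		return -1;
	return setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
			  &budget, sizeof(budget));
}
```

For busy polling to actually move the processing, the application also has to drive the socket with recvfrom()/sendto() (or poll()) from its own core; the kernel's AF_XDP busy-poll documentation additionally suggests tuning napi_defer_hard_irqs and gro_flush_timeout on the interface.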