xdp-project / xdp-tutorial

XDP tutorial

advanced03-AF-XDP batch processing/transmission #304

Open lancewolves opened 2 years ago

lancewolves commented 2 years ago

Hey everybody, I'm currently trying to modify the af_xdp_user.c program to process and transmit packets in batches, as per the hint in the process_packet() function.

Has anyone looked into this? I am struggling to understand how to implement this without slowing down the earlier packets in the batch.

lancewolves commented 2 years ago

So I have implemented some batch processing in the user-space application, and when testing with a ping flood I can't see any difference in performance at all. Since my debug messages tell me that at no time is more than one packet being processed, I guess the virtual Ethernet interfaces can't actually send and receive packets in parallel, and I will only see a difference if I use hardware interfaces. Am I correct in this?

Further, I would like to understand where this supposed speed increase would come from. In both designs the sendto() syscall is only issued once all received packets have been processed. The only difference I see is that in the original, every packet is written to the TX ring individually, while in my batched version xsk_ring_prod__reserve() and xsk_ring_prod__submit() are called only once. Does memory allocation for individual packets need more resources overall than for an array of packets?
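For context, here is a pseudocode-level sketch of what such a batched TX path could look like. It is not a drop-in patch and will not compile on its own: the xsk_socket_info wrapper struct and the buffer bookkeeping are assumed from the tutorial's af_xdp_user.c, and the xsk_ring_prod__reserve() / xsk_ring_prod__tx_desc() / xsk_ring_prod__submit() / xsk_ring_prod__needs_wakeup() helpers come from libxdp's xsk.h.

```c
/* Hypothetical batched TX path, modeled on af_xdp_user.c (sketch only).
 * addrs/lens describe n already-processed frames in the UMEM. */
static void tx_batch(struct xsk_socket_info *xsk,
                     uint64_t *addrs, uint32_t *lens, unsigned int n)
{
	uint32_t tx_idx = 0;

	/* One reserve for the whole batch instead of one per packet. */
	if (xsk_ring_prod__reserve(&xsk->tx, n, &tx_idx) != n)
		return; /* TX ring full; caller should retry or drop */

	for (unsigned int i = 0; i < n; i++) {
		struct xdp_desc *desc =
			xsk_ring_prod__tx_desc(&xsk->tx, tx_idx + i);
		desc->addr = addrs[i];
		desc->len  = lens[i];
	}

	/* One submit and (at most) one wakeup syscall for n packets. */
	xsk_ring_prod__submit(&xsk->tx, n);
	if (xsk_ring_prod__needs_wakeup(&xsk->tx))
		sendto(xsk_socket__fd(xsk->xsk), NULL, 0,
		       MSG_DONTWAIT, NULL, 0);
}
```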

zzxgzgz commented 1 year ago

Hi @lancewolves, thank you very much for sharing.

I am also interested in this question. The rcvd variable is always 1, and I'm not sure under which circumstances it would be more than one.

Also, I would like to know if there is a way to add multi-threading to this AF_XDP program. Right now it is single-threaded, and the performance I got is not that good. I wonder whether adding multi-threading is feasible and whether it would improve performance.

lancewolves commented 1 year ago

Hi @zzxgzgz,

If you have the af_xdp_user.c program running on one machine and ping flood it from another, for example, then the af_xdp machine might receive multiple packets in the time it takes to pack and send the ICMP echo reply. In that case more than one packet would be received, and batching does make sense.

I haven't looked into multi-threading, but in the LPC paper on AF_XDP, @magnus-karlsson and @bjoto write about packing frames on one core and polling the kernel for TX/RX operations on another.

Also, my follow-up question didn't really make sense, because there is no memory allocation after the UMEM is created. I'm just leaving this issue open in case someone wants to write up a few clarifications about the points I mentioned. Might be helpful.

magnus-karlsson commented 1 year ago

> Hey everybody, I'm currently trying to modify the af_xdp_user.c program to process and transmit packets in batches, as per the hint in the process_packet() function.
>
> Has anyone looked into this? I am struggling to understand how to implement this without slowing down the earlier packets in the batch.

The trade-off between latency and throughput is fundamental; there is nothing you can do about it. If you batch, you get better throughput but worse latency, and vice versa. The key here is finding the balance that provides the most bang for the buck in your app.

magnus-karlsson commented 1 year ago

> So I have implemented some batch processing in the user-space application, and when testing with a ping flood I can't see any difference in performance at all. Since my debug messages tell me that at no time is more than one packet being processed, I guess the virtual Ethernet interfaces can't actually send and receive packets in parallel, and I will only see a difference if I use hardware interfaces. Am I correct in this?
>
> Further, I would like to understand where this supposed speed increase would come from. In both designs the sendto() syscall is only issued once all received packets have been processed. The only difference I see is that in the original, every packet is written to the TX ring individually, while in my batched version xsk_ring_prod__reserve() and xsk_ring_prod__submit() are called only once. Does memory allocation for individual packets need more resources overall than for an array of packets?

The performance benefit of batching comes from amortizing costs that are independent of the number of packets. An example of this is the sendto() syscall overhead: no matter how many packets you send, the syscall overhead (the time it takes to perform a NULL syscall) is the same. Not seeing any performance benefit might just mean that your current perf bottleneck is somewhere else. For example, if you do not load the CPU close to 80-100%, decreasing the number of instructions will not matter much.

tohojo commented 1 year ago

Also, a ping flood is rather slow, so it's not a very good performance test. You need something that can produce millions of packets per second. The in-kernel pktgen tool is one option, the TRex packet generator is another. But you'd probably want to run either of those on a separate physical machine anyway...