Closed rayoluo closed 4 years ago
oluoluo notifications@github.com writes:
So, does this mean that I can't use xdp to redirect tcp network traffic between two containers, or just I made some mistakes? Wanna help!! thx!!
First off, did you check if it works without XDP? I.e., if you just enable regular kernel forwarding between the two containers? Your packet dump indicates that the packets do arrive but nothing is replying to them...
Yes, it works without XDP. But once I attach XDP, TCP Retransmission happens. I can't figure out why this appears.
Hmm, have you looked at what return codes the XDP program(s) are returning?
You can also try using xdpdump (from xdp-tools: https://github.com/xdp-project/xdp-tools ) to dump the packets before and after your XDP program processes them and see if anything looks odd...
thanks, I'll try using xdpdump for a check.
You need a newer clang...
met some other problems again... 😢
Hmm, what kernel version are you running? xdpdump needs a quite new kernel to work, unfortunately...
[root@centos8 packet03-redirecting]# uname -a
Linux centos8.novalocal 4.18.0-193.6.3.el8_2.x86_64 #1 SMP Wed Jun 10 11:09:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Which kernel version can xdpdump run normally on? Is 5.4.0 okay?
oluoluo notifications@github.com writes:
[root@centos8 packet03-redirecting]# uname -a Linux centos8.novalocal 4.18.0-193.6.3.el8_2.x86_64 #1 SMP Wed Jun 10 11:09:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Ohh, you're running the stock centos kernel? That won't work until CentOS 8.3 is released (XDP will be enabled with RHEL 8.3). I suspect that may be why things are not working for you at all...
Which kernel version can xdpdump run normally on? Is 5.4.0 okay?
Nope, you'll need 5.7+ for xdpdump...
I did the same experiment in a VMware virtual machine, and it didn't work there either. Its kernel version is:
root@ubuntu:/home/luo# uname -a
Linux ubuntu 5.4.0-48-generic #52~18.04.1-Ubuntu SMP Thu Sep 10 12:50:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Could it be the same problem?
Hmm, the basic redirect ought to work on 5.4 at least... Looking at which status codes your XDP program actually returns would be a good first step. You can do that with the "stats" program that is part of the tutorial, even without xdpdump
Thanks for your advice! After running "xdp_stats", it shows that the XDP program returns the expected value "XDP_REDIRECT". Here is the output of "xdp_stats":
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002721
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002722
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002719
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002726
XDP_REDIRECT 7 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002732
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002385
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002386
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002386
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002386
XDP_REDIRECT 7 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002387
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001837
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001834
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001835
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001835
XDP_REDIRECT 7 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001835
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002254
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002177
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002144
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002116
XDP_REDIRECT 8 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002090
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001461
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001490
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001504
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001515
XDP_REDIRECT 8 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001526
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001769
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001811
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001946
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.001992
XDP_REDIRECT 8 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002008
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.004167
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.004178
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.004062
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.004033
XDP_REDIRECT 9 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.004032
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002167
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002158
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002159
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002162
XDP_REDIRECT 9 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002164
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002270
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002276
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002273
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002270
XDP_REDIRECT 9 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.002268
There are two types of XDP programs. The XDP program attached to the "veth15a38ef" interface only modifies the destination MAC address in the link-layer header. The other type, attached to the "eth0" interface inside each Docker container's namespace, only returns XDP_PASS. One container can ping the other, but curl fails: when the TCP SYN packet arrives at the "eth0" interface of the target container, the container does not respond with a SYN-ACK. Very strange...
The result of tcpdump at the target container's namespace:
root@ubuntu:/home/luo/xdp-tutorial/packet03-redirecting# ip netns exec tomcat02 tcpdump -i eth0 -ev
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
04:06:53.439331 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50870, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x9914), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076087700 ecr 0,nop,wscale 7], length 0
04:06:54.441073 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50871, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x952a), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076088702 ecr 0,nop,wscale 7], length 0
04:06:56.457771 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50872, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x8d4a), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076090718 ecr 0,nop,wscale 7], length 0
04:06:58.665668 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has ubuntu tell 172.17.0.2, length 28
04:06:58.665708 02:42:ac:11:00:03 (oui Unknown) > 02:42:ac:11:00:02 (oui Unknown), ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply ubuntu is-at 02:42:ac:11:00:03 (oui Unknown), length 28
04:07:00.713382 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50873, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x7caa), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076094974 ecr 0,nop,wscale 7], length 0
04:07:08.905609 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50874, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x5caa), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076103166 ecr 0,nop,wscale 7], length 0
04:07:25.033508 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 50875, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.44498 > ubuntu.http: Flags [S], cksum 0x5856 (incorrect -> 0x1daa), seq 1956521919, win 64240, options [mss 1460,sackOK,TS val 2076119294 ecr 0,nop,wscale 7], length 0
But you do see the SYN arrive at the destination? Did you inspect it with wireshark and check that things like the destination 5-tuple match and there are no checksum issues?
Ah, well there's a hint: "cksum 0x5856 (incorrect -> 0x9914)"
On 21 October 2020 13:24:57 CEST, oluoluo notifications@github.com wrote:
The result of tcpdump at the target container's namespace:
But this seems not to make sense, because I only changed the destination MAC address of the data packet, and it seems that there is no need to recalculate the checksum. When I unload the XDP program and curl the target container again, the result of tcpdump in the target container's namespace also has an incorrect checksum... 😂
root@ubuntu:/home/luo/xdp-tutorial/packet03-redirecting# ip netns exec tomcat02 tcpdump -i eth0 -ev
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C04:33:05.655205 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 54988, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.55278 > ubuntu.http-alt: Flags [S], cksum 0x5856 (incorrect -> 0x458a), seq 2841509789, win 64240, options [mss 1460,sackOK,TS val 2077659916 ecr 0,nop,wscale 7], length 0
04:33:05.655287 02:42:ac:11:00:03 (oui Unknown) > 02:42:ac:11:00:02 (oui Unknown), ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
ubuntu.http-alt > 172.17.0.2.55278: Flags [S.], cksum 0x5856 (incorrect -> 0x39a3), seq 925366983, ack 2841509790, win 65160, options [mss 1460,sackOK,TS val 928292602 ecr 2077659916,nop,wscale 7], length 0
04:33:05.655364 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 54989, offset 0, flags [DF], proto TCP (6), length 52)
172.17.0.2.55278 > ubuntu.http-alt: Flags [.], cksum 0x584e (incorrect -> 0x6502), ack 1, win 502, options [nop,nop,TS val 2077659916 ecr 928292602], length 0
04:33:05.659255 02:42:ac:11:00:02 (oui Unknown) > 02:42:ac:11:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 145: (tos 0x0, ttl 64, id 54990, offset 0, flags [DF], proto TCP (6), length 131)
172.17.0.2.55278 > ubuntu.http-alt: Flags [P.], cksum 0x589d (incorrect -> 0xa54a), seq 1:80, ack 1, win 502, options [nop,nop,TS val 2077659920 ecr 928292602], length 79: HTTP, length: 79
GET / HTTP/1.1
Host: 172.17.0.3:8080
User-Agent: curl/7.58.0
Accept: */*
04:33:05.659277 02:42:ac:11:00:03 (oui Unknown) > 02:42:ac:11:00:02 (oui Unknown), ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 37071, offset 0, flags [DF], proto TCP (6), length 52)
ubuntu.http-alt > 172.17.0.2.55278: Flags [.], cksum 0x584e (incorrect -> 0x64a4), ack 80, win 509, options [nop,nop,TS val 928292606 ecr 2077659920], length
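A side note on the cksum values in these captures: 0x5856, 0x584e and 0x589d are consistent with TX checksum offload, where the sender leaves only the pseudo-header sum in the TCP checksum field and expects the device to finish the calculation. The sketch below is plain userspace C, not BPF code. The function names are made up for illustration, and the addresses assume that tomcat01 is 172.17.0.2 and tomcat02 is 172.17.0.3, as the captures suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * One's-complement sum of the IPv4/TCP pseudo-header (source address,
 * destination address, protocol, TCP length), NOT complemented.
 * Addresses are in host byte order, e.g. 172.17.0.2 == 0xac110002.
 */
uint16_t tcp_pseudo_sum(uint32_t saddr, uint32_t daddr, uint16_t tcp_len)
{
    uint32_t sum = 0;

    sum += saddr >> 16;
    sum += saddr & 0xffff;
    sum += daddr >> 16;
    sum += daddr & 0xffff;
    sum += 6;        /* IPPROTO_TCP */
    sum += tcp_len;  /* TCP header plus payload, in bytes */
    return fold16(sum);
}

/* Compare against the "incorrect" values printed by tcpdump above. */
void check_against_capture(void)
{
    const uint32_t src = 0xac110002u; /* 172.17.0.2 (assumed tomcat01) */
    const uint32_t dst = 0xac110003u; /* 172.17.0.3 (assumed tomcat02) */

    assert(tcp_pseudo_sum(src, dst, 40) == 0x5856);  /* SYN, IP length 60 */
    assert(tcp_pseudo_sum(src, dst, 32) == 0x584e);  /* ACK, IP length 52 */
    assert(tcp_pseudo_sum(src, dst, 111) == 0x589d); /* GET, IP length 131 */
}
```

With TCP lengths of 40, 32 and 111 bytes (IP total length minus the 20-byte IP header), this reproduces the three values that tcpdump flags as incorrect, both with and without the XDP program attached.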
oluoluo notifications@github.com writes:
But this seems not to make sense, because I only changed the destination MAC address of the data packet, and it seems that there is no need to recalculate the checksum.
I think the issue here is that if you're not using XDP, the sender side will mark the packet to say that verifying the checksum is unnecessary, and because the same skb survives all the way to the receiver side, that information is carried with it. But when there's an XDP program in the loop, the skb is destroyed and re-created, and the full checksum validation kicks in, causing the packet to be dropped. So when using XDP in this way, you're going to have to fix up the TCP header checksum even though you don't modify the header itself...
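To make "fix up the TCP header checksum" concrete, here is a minimal userspace sketch of the full calculation over the pseudo-header and the TCP segment. It is ordinary C for checking the arithmetic, not something the BPF verifier would accept as is (the loop bound depends on the packet), and the function and array names are hypothetical. The test bytes are reconstructed from the first SYN in the tcpdump output earlier in the thread (ports, sequence number, window and options as printed), assuming the usual Linux option layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Full TCP checksum over the IPv4 pseudo-header and the TCP segment.
 * seg points at the TCP header, seg_len is header plus payload in bytes,
 * addresses are in host byte order. When computing a new checksum, the
 * checksum field (bytes 16..17) must be zero. When verifying a segment
 * that already carries a correct checksum, the result is 0.
 */
uint16_t tcp_full_checksum(uint32_t saddr, uint32_t daddr,
                           const uint8_t *seg, size_t seg_len)
{
    uint32_t sum = 0;
    size_t i;

    assert(seg_len >= 20 && seg_len <= 0xffff);

    /* Pseudo-header: addresses, protocol (6 = TCP), TCP length. */
    sum += (saddr >> 16) + (saddr & 0xffff);
    sum += (daddr >> 16) + (daddr & 0xffff);
    sum += 6;
    sum += (uint32_t)seg_len;

    /* Segment, read as big-endian 16-bit words; pad an odd tail with 0. */
    for (i = 0; i + 1 < seg_len; i += 2)
        sum += ((uint32_t)seg[i] << 8) | seg[i + 1];
    if (i < seg_len)
        sum += (uint32_t)seg[i] << 8;

    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Zero the checksum field, recompute, and store it in network byte order. */
void tcp_fix_checksum(uint32_t saddr, uint32_t daddr,
                      uint8_t *seg, size_t seg_len)
{
    uint16_t csum;

    seg[16] = 0;
    seg[17] = 0;
    csum = tcp_full_checksum(saddr, daddr, seg, seg_len);
    seg[16] = (uint8_t)(csum >> 8);
    seg[17] = (uint8_t)(csum & 0xff);
}

/* First SYN from the capture, checksum field zeroed (reconstructed). */
const uint8_t syn_from_capture[40] = {
    0xad, 0xd2, 0x00, 0x50,             /* ports 44498 -> 80 */
    0x74, 0x9e, 0x27, 0xbf,             /* seq 1956521919 */
    0x00, 0x00, 0x00, 0x00,             /* ack 0 */
    0xa0, 0x02, 0xfa, 0xf0,             /* data offset 10, SYN, win 64240 */
    0x00, 0x00, 0x00, 0x00,             /* checksum (zero), urgent pointer */
    0x02, 0x04, 0x05, 0xb4,             /* mss 1460 */
    0x04, 0x02,                         /* sackOK */
    0x08, 0x0a, 0x7b, 0xbe, 0x95, 0x94, /* TS val 2076087700 */
    0x00, 0x00, 0x00, 0x00,             /* TS ecr 0 */
    0x01, 0x03, 0x03, 0x07              /* nop, wscale 7 */
};
```

Calling `tcp_full_checksum(0xac110002, 0xac110003, syn_from_capture, 40)` gives 0x9914, the value tcpdump reports as the correct checksum for that packet.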
Did you mean that I need to recalculate the TCP checksum and write it to the TCP checksum field before redirecting the packet?
It seems that bpf_l4(3)_csum_replace or bpf_csum_diff can't be used here since they calculate the checksum incrementally, which means that the old checksum value needs to be provided, but the old checksum value is incorrect. So I plan to recalculate the checksum in a more "violent" (brute-force) way, like the following:
static __always_inline __u16 checksum(
__u16 *buf,
int bufsize,
__u32 saddr,
__u32 daddr
)
{
__u32 sum = 0;
int i;
// while (bufsize > 1) {
// sum += *(__u16 *)buf++;
// bufsize -= 2;
// }
#pragma clang loop unroll(full)
for (; bufsize > 1; bufsize -= 2) {
sum += *(__u16 *)buf++;
}
if (bufsize > 0) {
__u8 left_overs[2] = {0};
left_overs[0] = *buf & 0xff;
sum += *(__u16 *)left_overs;
}
__u16 *psd = (__u16 *)&saddr;
sum += *psd++;
sum += *psd;
psd = (__u16 *)&daddr;
sum += *psd++;
sum += *psd;
sum += bpf_htons((__u16)bufsize);
sum += bpf_htons(0x0006);
// while (sum >> 16) {
// sum = (sum & 0xffff) + (sum >> 16);
// }
sum = (sum & 0xffff) + (sum >> 16);
sum = (sum & 0xffff) + (sum >> 16);
return ~sum;
}
// add new section
SEC("xdp_redirect_new")
int xdp_redirect_new_func(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct hdr_cursor nh;
struct ethhdr *eth;
int eth_type;
int action = XDP_PASS;
unsigned char *dst;
struct iphdr *iph;
struct tcphdr *tcph;
/* These keep track of the next header type and iterator pointer */
nh.pos = data;
/* Parse Ethernet and IP/IPv6 headers */
eth_type = parse_ethhdr(&nh, data_end, &eth);
if (eth_type != bpf_htons(ETH_P_IP))
goto out;
/* Do we know where to redirect this packet? */
dst = bpf_map_lookup_elem(&redirect_params, eth->h_source);
if (!dst)
goto out;
/* Set a proper destination address */
memcpy(eth->h_dest, dst, ETH_ALEN);
bpfprint("point 1\n");
iph = (struct iphdr *)(eth + 1);
if ((void *)(iph + 1) > data_end) {
bpfprint("point 2\n");
goto out;
}
bpfprint("iph->protocol: %d\n", iph->protocol);
if (iph->protocol == IPPROTO_TCP) {
bpfprint("point 3\n");
tcph = (struct tcphdr *)(iph + 1);
int tcplen = iph->tot_len - iph->ihl * 4;
if ((void *)(tcph + 1) > data_end) {
bpfprint("point 4\n");
goto out;
}
tcph->check = 0;
__u16 newCheck = checksum((__u16 *)tcph, tcplen, iph->saddr, iph->daddr);
tcph->check = newCheck;
bpfprint("calculate checksum value: %x\n", tcph->check);
}
bpfprint("point 5\n");
action = bpf_redirect_map(&tx_port, 0, 0);
out:
return xdp_stats_record_action(ctx, action);
}
But when I try to load this BPF program into the kernel, an error occurs. I found that loops with an arbitrary number of iterations are not allowed in BPF programs, which is probably what causes the error when loading this section. Could you please suggest a solution to this problem? I appreciate your help! 😅
oluoluo notifications@github.com writes:
Did you mean that I need to recalculate the TCP checksum and write it to the TCP checksum field before redirecting the packet? It seems that bpf_l4(3)_csum_replace or bpf_csum_diff
You can use bpf_csum_diff() with a "before" csum of 0 :)
Thanks for the feedback! Now I use bpf_csum_diff() to do a full checksum instead of an incremental checksum calculation. When I load it, the verifier reports an error: "R4 min value is negative, either use unsigned or 'var &= const'". The code is shown below. Is there any way to fix it? (this is quite like this issue & this one):
__attribute__((__always_inline__))
static inline void ipv4_l4_csum(void *data_start, __u32 data_size,
__u64 *csum, struct iphdr *iph) {
__u32 tmp = 0;
*csum = bpf_csum_diff(0, 0, &iph->saddr, sizeof(__be32), *csum);
*csum = bpf_csum_diff(0, 0, &iph->daddr, sizeof(__be32), *csum);
// __builtin_bswap32 equals to htonl()
tmp = __builtin_bswap32((__u32)(iph->protocol));
*csum = bpf_csum_diff(0, 0, &tmp, sizeof(__u32), *csum);
tmp = __builtin_bswap32((__u32)(data_size));
*csum = bpf_csum_diff(0, 0, &tmp, sizeof(__u32), *csum);
*csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
*csum = csum_fold_helper(*csum);
}
...
int tcplen = bpf_ntohs(iph->tot_len) - iph->ihl * 4;
bpfprint("tcplen value: %d\n", tcplen);
tcph->check = 0;
cs = 0;
// ipv4_l4_csum(tcph, 20, &cs, iph);
ipv4_l4_csum((void *)tcph, (__u32)tcplen, &cs, iph);
tcph->check = cs;
...
// err info
293: (bf) r1 = r8
294: (dc) r1 = be32 r1
295: (63) *(u32 *)(r10 -40) = r1
296: (b7) r1 = 0
297: (b7) r2 = 0
298: (bf) r3 = r7
299: (b7) r4 = 4
300: (bf) r5 = r0
301: (85) call bpf_csum_diff#28
last_idx 301 first_idx 293
regs=4 stack=0 before 300: (bf) r5 = r0
regs=4 stack=0 before 299: (b7) r4 = 4
regs=4 stack=0 before 298: (bf) r3 = r7
regs=4 stack=0 before 297: (b7) r2 = 0
last_idx 301 first_idx 293
regs=10 stack=0 before 300: (bf) r5 = r0
regs=10 stack=0 before 299: (b7) r4 = 4
302: (b7) r1 = 0
303: (b7) r2 = 0
304: (79) r3 = *(u64 *)(r10 -48)
305: (bf) r4 = r8
306: (bf) r5 = r0
307: (85) call bpf_csum_diff#28
last_idx 307 first_idx 293
regs=4 stack=0 before 306: (bf) r5 = r0
regs=4 stack=0 before 305: (bf) r4 = r8
regs=4 stack=0 before 304: (79) r3 = *(u64 *)(r10 -48)
regs=4 stack=0 before 303: (b7) r2 = 0
R4 min value is negative, either use unsigned or 'var &= const'
processed 312 insns (limit 1000000) max_states_per_insn 0 total_states 19 peak_states 19 mark_read 8
libbpf: -- END LOG --
libbpf: failed to load program 'xdp_redirect_new'
libbpf: failed to load object 'xdp_prog_kern.o'
ERR: loading BPF-OBJ file(xdp_prog_kern.o) (-22): Invalid argument
ERR: loading file: xdp_prog_kern.o
oluoluo notifications@github.com writes:
Thanks for the feedback! Now I use bpf_csum_diff() to do a full checksum instead of an increment checksum calculation. As I load it, an error occurs from the verifier, "R4 min value is negative, either use unsigned or 'var &= const'".
I suspect that particular error is because you're using an 'int' for tcplen instead of an unsigned value :)
Hmm... After I updated this statement to __u32 tcplen = (__u32)(bpf_ntohs(iph->tot_len) - iph->ihl * 4);, the error remains the same... 😢
last_idx 304 first_idx 287
regs=4 stack=0 before 303: (bf) r5 = r0
regs=4 stack=0 before 302: (b7) r4 = 4
regs=4 stack=0 before 301: (bf) r3 = r8
regs=4 stack=0 before 300: (b7) r2 = 0
last_idx 304 first_idx 287
regs=10 stack=0 before 303: (bf) r5 = r0
regs=10 stack=0 before 302: (b7) r4 = 4
305: (bf) r1 = r7
306: (dc) r1 = be32 r1
307: (63) *(u32 *)(r10 -40) = r1
308: (b7) r1 = 0
309: (b7) r2 = 0
310: (bf) r3 = r8
311: (b7) r4 = 4
312: (bf) r5 = r0
313: (85) call bpf_csum_diff#28
last_idx 313 first_idx 305
regs=4 stack=0 before 312: (bf) r5 = r0
regs=4 stack=0 before 311: (b7) r4 = 4
regs=4 stack=0 before 310: (bf) r3 = r8
regs=4 stack=0 before 309: (b7) r2 = 0
last_idx 313 first_idx 305
regs=10 stack=0 before 312: (bf) r5 = r0
regs=10 stack=0 before 311: (b7) r4 = 4
314: (b7) r1 = 0
315: (b7) r2 = 0
316: (79) r3 = *(u64 *)(r10 -48)
317: (bf) r4 = r7
318: (bf) r5 = r0
319: (85) call bpf_csum_diff#28
last_idx 319 first_idx 305
regs=4 stack=0 before 318: (bf) r5 = r0
regs=4 stack=0 before 317: (bf) r4 = r7
regs=4 stack=0 before 316: (79) r3 = *(u64 *)(r10 -48)
regs=4 stack=0 before 315: (b7) r2 = 0
R4 min value is negative, either use unsigned or 'var &= const'
processed 324 insns (limit 1000000) max_states_per_insn 0 total_states 19 peak_states 19 mark_read 8
libbpf: -- END LOG --
libbpf: failed to load program 'xdp_redirect_new'
libbpf: failed to load object 'xdp_prog_kern.o'
ERR: loading BPF-OBJ file(xdp_prog_kern.o) (-22): Invalid argument
ERR: loading file: xdp_prog_kern.o
It seems the problem is related to the variable tcplen: when I change ipv4_l4_csum((void *)tcph, tcplen, &cs, iph); to ipv4_l4_csum((void *)tcph, 20, &cs, iph);, everything works fine...
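The verifier message hints at the usual remedy: give the size an explicit upper bound before passing it to a helper, so that it is no longer an arbitrary value derived from packet data. The sketch below shows only the bounding logic in plain userspace C. MAX_TCP_LEN and the function names are assumptions for illustration, and whether the verifier then accepts a variable-size bpf_csum_diff() call on packet memory is not verified here. Note also that the helper documentation requires the sizes passed to bpf_csum_diff() to be multiples of 4, so a segment of, say, 111 bytes would need its tail handled separately.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1500-byte MTU minus a 20-byte IPv4 header without options. */
#define MAX_TCP_LEN 1480u

/*
 * TCP segment length from IPv4 header fields (tot_len in host byte order,
 * ihl in 32-bit words). Returns 0 when the fields are inconsistent or the
 * result exceeds MAX_TCP_LEN, so a caller can bail out early.
 */
uint32_t bounded_tcp_len(uint16_t tot_len, uint8_t ihl)
{
    uint32_t hdr_len = (uint32_t)ihl * 4;
    uint32_t len;

    if (ihl < 5 || tot_len < hdr_len + 20)
        return 0;
    len = tot_len - hdr_len;
    if (len > MAX_TCP_LEN)
        return 0;
    return len;
}

/* Largest multiple of 4 that fits in len; the remaining 0..3 bytes are the tail. */
uint32_t csum_aligned_part(uint32_t len)
{
    return len & ~3u;
}

/* Lengths taken from the packets shown earlier in the thread. */
void bounded_len_self_check(void)
{
    assert(bounded_tcp_len(60, 5) == 40);    /* SYN */
    assert(bounded_tcp_len(131, 5) == 111);  /* HTTP GET */
    assert(csum_aligned_part(111) == 108);   /* 3 tail bytes left over */
    assert(bounded_tcp_len(1501, 5) == 0);   /* over the assumed bound */
}
```

In a BPF program the same checks would sit between parsing the IP header and calling the checksum helper, with an early exit when the length is rejected.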
Hi @tohojo, I fixed the incorrect checksum by running ethtool -K eth0 tx off in the containers' namespace. Then I used iperf3 to test the TCP throughput in the container, and the result is as follows:
Is there any way to achieve higher throughput? Could allocating more rx queues be helpful (for example with ip link add veth0 numrxqueues 20 numtxqueues 20 type veth peer name veth1 netns tomcat01 numrxqueues 20 numtxqueues 20)? Are there any other methods I don't know about? I'm quite new to XDP... Thanks!
oluoluo notifications@github.com writes:
Hi @tohojo, I fix the incorrect checksum by running ethtool -K eth0 tx off in the containers' namespace.
Ah, cool! We should probably include that in the setup script...
Then I use iperf3 to test the TCP traffic speed in the container, and the result is as followed: Is there any way to achieve higher throughput? could allocating more rx queues be helpful? (such as using this command ip link add veth0 numrxqueues 20 numtxqueues 20 type veth peer name veth1 netns tomcat01 numrxqueues 20 numtxqueues 20). And is there any other methods I don't know. I'm quite new to xdp... Thanks!
More txqueues are not going to help you with a single TCP flow. Using XDP on veth causes TSO to stop working, and packets will be linearised. This hurts TCP throughput compared to just regular forwarding between two veths. Nothing to be done about this for now, although there is some work in progress to fix this in the kernel. But until then, using XDP_REDIRECT on TCP traffic between veth devices is just going to be slower...
@tohojo this thread is from almost 2 years ago, and we're running into similar issues with having to do software checksumming on the sender side for the packets to be not dropped on the receiver side. This impacts performance significantly (throughput went from 20Gbps to 4Gbps between 2 containers on the same machine) I'm wondering what is the status of the kernel work to fix this problem? Also, are there any workarounds that you can suggest while waiting for fix(es) in the kernel?
@huang195 see this talk at LPC next week (by @netoptimizer) for current status on adding support for hardware offloads to XDP: https://lpc.events/event/16/contributions/1362/
More work will be needed to support it on the TX side as well, but the hints described in that talk will be a prerequisite for this.
If you're forwarding traffic between containers/veths you could use the TC BPF hook instead...
@tohojo thanks for the heads up. The talk looks super interesting and it's coming up in a week or so. The reason we're not using tc hook points is because we're using afxdp, and since user level program cannot change skb field, we're kind of stuck with checksumming.
More txqueues are not going to help you with a single TCP flow. Using XDP on veth causes TSO to stop working, and packets will be linearised. This hurts TCP throughput compared to just regular forwarding between two veths. Nothing to be done about this for now, although there is some work in progress to fix this in the kernel. But until then, using XDP_REDIRECT on TCP traffic between veth devices is just going to be slower...
@tohojo I'm running an experiment comparing the performance of XDP and the Linux bridge on Linux 6.2.0-36-generic, using iperf between two nodes. I find that the bandwidth using the Linux bridge can reach 85 Gbit/s, but the bandwidth using XDP can only reach 16 Gbit/s. I'm not sure whether this is caused by the same problem. Is there any new progress on fixing this?
Ah, cool! We should probably include that in the setup script...
Thanks to this issue, ethtool -K eth0 tx off just fixed the TCP redirect problem I encountered too.
Like magic. It's a bit incredible.
@linuxholic @rayoluo somehow, even with "ethtool -K eth0 tx off" the same verifier problem persists.
any idea why?
After completing tutorial Packet03, I tried to build a datapath between two Docker containers (using the tomcat image) named tomcat01 and tomcat02 using XDP. I followed the instructions in Assignment 3 (Extend to a bidirectional router) in Packet03. After that, tomcat01 could ping tomcat02, and vice versa. However, when I ran "curl 172.17.0.3:8080" (the IP of tomcat02) inside tomcat01, nothing showed in the terminal. The following are the details:
ping worked but curl did not. After that, I ran tcpdump -i eth0 -ev -w tomcat.cap in tomcat02; the following is the result shown in Wireshark:
So, does this mean that I can't use xdp to redirect tcp network traffic between two containers, or just I made some mistakes? Wanna help!! thx!!