xdp-project / xdp-tutorial


Multiple userland sockets to use the same interface / queue #270

Closed: cyanide-burnout closed this issue 2 years ago

cyanide-burnout commented 2 years ago

We have found that only one PF_XDP socket in the whole system can access a given NIC queue at a time. Can that be solved somehow?

stefansaraev commented 2 years ago
test ~ # cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

test ~ # uname -a
Linux test 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

test ~ # lspci -nnkv
...
04:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
        DeviceName: NIC1
        Subsystem: Dell NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [1028:001f]
        Flags: bus master, fast devsel, latency 0, IRQ 17, NUMA node 0, IOMMU group 24
        Memory at 92c30000 (64-bit, prefetchable) [size=64K]
        Memory at 92c40000 (64-bit, prefetchable) [size=64K]
        Memory at 92c50000 (64-bit, prefetchable) [size=64K]
        Expansion ROM at 90000000 [disabled] [size=256K]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
        Capabilities: [ac] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Device Serial Number 00-00-4c-d9-8f-5b-b1-50
        Capabilities: [150] Power Budgeting <?>
        Capabilities: [160] Virtual Channel
        Kernel driver in use: tg3
        Kernel modules: tg3
...
test ~ # lshw -class network
  *-network:0               
       description: Ethernet interface
       product: NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
       vendor: Broadcom Inc. and subsidiaries
       physical id: 0
       bus info: pci@0000:04:00.0
       logical name: eth0
       version: 00
       serial: 4c:d9:8f:5b:b1:50
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm vpd msi msix pciexpress bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=tg3 driverversion=5.10.0-11-amd64 duplex=full firmware=FFV21.40.2 bc 5720-v1.39 ip=x.x.x.x latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:17 memory:92c30000-92c3ffff memory:92c40000-92c4ffff memory:92c50000-92c5ffff memory:90000000-9003ffff

Here's what we are testing on: Debian bullseye, tg3. Let us know if you need more details.

cyanide-burnout commented 2 years ago

The code:

  channel->handle = socket(PF_XDP, SOCK_RAW, 0);

  channel->memory.headroom   = 0;
  channel->memory.chunk_size = FRAME_SIZE;
  channel->memory.len        = channel->buffer.length * channel->memory.chunk_size;

  channel->data        = (uint8_t*)memalign(getpagesize(), channel->memory.len);
  channel->memory.addr = (__u64)channel->data;

  address.sxdp_family   = PF_XDP;
  address.sxdp_flags    = state->flags;
  address.sxdp_ifindex  = state->interface;
  address.sxdp_queue_id = number;

  if ((setsockopt(channel->handle, SOL_XDP, XDP_UMEM_REG, &channel->memory, sizeof(struct xdp_umem_reg)) != 0) ||
      (length = state->length) && (setsockopt(channel->handle, SOL_XDP, XDP_TX_RING,              &length, sizeof(int)) != 0) ||
      (length = state->length) && (setsockopt(channel->handle, SOL_XDP, XDP_RX_RING,              &length, sizeof(int)) != 0) ||
      (length = state->length) && (setsockopt(channel->handle, SOL_XDP, XDP_UMEM_FILL_RING,       &length, sizeof(int)) != 0) ||
      (length = state->length) && (setsockopt(channel->handle, SOL_XDP, XDP_UMEM_COMPLETION_RING, &length, sizeof(int)) != 0) ||
      (length = sizeof(struct xdp_mmap_offsets)) && (getsockopt(channel->handle, SOL_XDP, XDP_MMAP_OFFSETS, &positions, &length) != 0))
  {
    error = strerror(errno);
    state->table.report(LOG_ERR, "Error initializing XDP buffer (1): %s (%i)\n", error, errno);
    return -1;
  }

  channel->ring1.size   = positions.tx.desc + state->length * sizeof(struct xdp_desc);
  channel->ring2.size   = positions.cr.desc + state->length * sizeof(__u64);
  channel->ring1.memory = (uint8_t*)mmap(NULL, channel->ring1.size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, channel->handle, XDP_PGOFF_TX_RING);
  channel->ring2.memory = (uint8_t*)mmap(NULL, channel->ring2.size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, channel->handle, XDP_UMEM_PGOFF_COMPLETION_RING);

  if ((channel->ring1.memory == MAP_FAILED) ||
      (channel->ring2.memory == MAP_FAILED) ||
      (bind(channel->handle, (struct sockaddr*)&address, sizeof(struct sockaddr_xdp)) != 0))
  {
    error = strerror(errno);
    state->table.report(LOG_ERR, "Error initializing XDP buffer (2): %s (%i)\n", error, errno);
    return -1;
  }

On the second socket, when we try to open the same NIC + queue, we get Error initializing XDP buffer (2): Device or resource busy (16).

magnus-karlsson commented 2 years ago

If you want to bind multiple sockets to the same netdev + queue_id, you need to use the XDP_SHARED_UMEM mode, which you indicate with a flag of the same name in the bind call. I strongly suggest that you use libxdp to do the setup for you. Is there any reason that you do this manually? Is there anything missing from libxdp?
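For reference, here is a minimal sketch of that setup with the xsk API shipped in libxdp (an older, now deprecated copy lives in libbpf). The function names are the real ones; NUM_FRAMES, the NULL default configs, and the omitted error cleanup are just for illustration:

  #include <stdlib.h>
  #include <unistd.h>
  #include <xdp/xsk.h>               /* libxdp xsk API; the deprecated libbpf copy is <bpf/xsk.h> */

  #define NUM_FRAMES 4096
  #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

  static struct xsk_umem *umem;
  static struct xsk_ring_prod fill;  /* the single fill ring of the shared UMEM */
  static struct xsk_ring_cons comp;  /* the single completion ring of the shared UMEM */

  /* Create one UMEM and bind two sockets to the same ifname/queue on top of it.
   * Each socket gets its own Rx/Tx rings; both reuse the UMEM's fill/completion
   * rings, and the library sets XDP_SHARED_UMEM on the second bind for us.
   * Distributing packets between the two sockets still needs a custom XDP
   * program with an XSKMAP (see samples/bpf/xdpsock_kern.c). */
  static int create_two_sockets(const char *ifname, __u32 queue,
                                struct xsk_socket **a, struct xsk_ring_cons *rx_a, struct xsk_ring_prod *tx_a,
                                struct xsk_socket **b, struct xsk_ring_cons *rx_b, struct xsk_ring_prod *tx_b)
  {
    void *area;
    int error;

    if (posix_memalign(&area, getpagesize(), NUM_FRAMES * FRAME_SIZE) != 0)
      return -1;

    error = xsk_umem__create(&umem, area, NUM_FRAMES * FRAME_SIZE, &fill, &comp, NULL);
    if (error != 0)
      return error;

    error = xsk_socket__create_shared(a, ifname, queue, umem, rx_a, tx_a, &fill, &comp, NULL);
    if (error != 0)
      return error;

    return xsk_socket__create_shared(b, ifname, queue, umem, rx_b, tx_b, &fill, &comp, NULL);
  }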

Take a look at the xdpsock_user.c and xdpsock_kern.c in samples/bpf/ in the Linux repo for how to use this mode to bind multiple sockets to a single netdev and queue id. If you want to know how to use the low level APIs, then take a look at the libxdp code, but I suggest that you use the library instead. One thing to note is that you can only have a single fill ring and completion ring tied to that shared umem, but you have one Tx and Rx ring per socket.
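At the raw socket API level, sharing boils down to the second socket skipping XDP_UMEM_REG and passing the fd of the UMEM-owning socket at bind time. A sketch of just that bind, assuming the second socket has already had its own RX/TX rings configured via setsockopt (the fd names and helper are hypothetical):

  #include <string.h>
  #include <sys/socket.h>
  #include <linux/if_xdp.h>

  /* Bind a second AF_XDP socket to the same ifindex/queue as an existing one,
   * reusing that socket's UMEM. The second socket must not register a UMEM of
   * its own and, when sharing the same netdev/queue, keeps using the fill and
   * completion rings of the socket that owns the UMEM. */
  static int bind_shared(int second_handle, int umem_owner_handle, int interface, __u32 number)
  {
    struct sockaddr_xdp address;

    memset(&address, 0, sizeof(address));

    address.sxdp_family         = PF_XDP;
    address.sxdp_flags          = XDP_SHARED_UMEM;    /* share instead of registering a new UMEM */
    address.sxdp_ifindex        = interface;
    address.sxdp_queue_id       = number;
    address.sxdp_shared_umem_fd = umem_owner_handle;  /* fd of the socket that owns the UMEM */

    return bind(second_handle, (struct sockaddr*)&address, sizeof(struct sockaddr_xdp));
  }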

cyanide-burnout commented 2 years ago

OK, I get it. So in the case of multiple processes there is no way to use the same NIC and queue except some kind of gymnastics with a shared UMEM and shared queues. Please fix the documentation: https://www.kernel.org/doc/html/latest/networking/af_xdp.html. It says "The UMEM can be shared between processes, if desired", but there is nothing about a mandatory requirement to use a shared UMEM to get multiple sockets working.

We also found some issues with leaking descriptors when the number of submitted packets is greater than the number of configured descriptors (ethtool -g). The documentation also says nothing about that.

Finally, about libxdp: there are two reasons why we avoid using it. It does nothing except wrap the same socket/mmap APIs, and it is not supplied in the standard Debian distribution. How can it help with solving this issue? Anyway, sorry for the criticism, and thank you for the opportunity and the fast response.

cyanide-burnout commented 2 years ago

I have another question due to this limitation: I have several instances of a daemon, each of which can be bound to different RX queues. I also have a shared eBPF program (the fds are transferred over Unix sockets, which works well). Is it possible to redirect packets between RX queues inside the eBPF program before redirecting them to an AF_XDP socket? I wrote the eBPF program two years ago and have almost forgotten the possibilities.

magnus-karlsson commented 2 years ago

OK, I get it. So in the case of multiple processes there is no way to use the same NIC and queue except some kind of gymnastics with a shared UMEM and shared queues. Please fix the documentation: https://www.kernel.org/doc/html/latest/networking/af_xdp.html. It says "The UMEM can be shared between processes, if desired", but there is nothing about a mandatory requirement to use a shared UMEM to get multiple sockets working.

This is not correct; I was probably not clear enough. Your statement above is true only if you are trying to bind two or more sockets to the same netdev and queue_id. If you bind two or more sockets to different netdev/queue_id tuples, then you are not required to share a UMEM (but you can if you want). So binding one socket to eth0/queue0 and another one to eth0/queue1 does not require any sharing of the umem. By far the most common case is to bind sockets to different netdev/queue_id tuples and use HW packet steering in the NIC. If you bind to the same netdev/queue_id, you have to use an XDP program to steer your packets.

Note that AF_XDP was designed with zero-copy foremost in mind and in this mode it is impossible to support non-shared UMEMs on the same netdev and queue_id. In copy-mode, on the other hand, it would be possible to copy out the buffer to the "right" UMEM in case there was more than one possible. But no one asked or wanted this support, so this has not been implemented.
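As an illustration of "use an XDP program to steer your packets", the kernel sample xdpsock_kern.c does roughly the following: it round-robins packets from the queue into an XSKMAP whose slots userspace fills with the socket fds. A simplified sketch (MAX_SOCKS and the section/map names are assumptions):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define MAX_SOCKS 4  /* must be a power of two for the mask below */

  /* Userspace inserts each AF_XDP socket fd into a slot of this map. */
  struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, MAX_SOCKS);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
  } xsks_map SEC(".maps");

  static unsigned int round_robin;

  SEC("xdp")
  int steer_to_sockets(struct xdp_md *context)
  {
    /* Spread packets arriving on this queue across the sockets in the map;
     * drop the packet if the chosen slot holds no socket. */
    round_robin = (round_robin + 1) & (MAX_SOCKS - 1);
    return bpf_redirect_map(&xsks_map, round_robin, XDP_DROP);
  }

  char _license[] SEC("license") = "GPL";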

We also found some issues with leaking descriptors when the number of submitted packets is greater than the number of configured descriptors (ethtool -g). The documentation also says nothing about that.

That sounds like a bug. Could you please provide more details and a reproducer?

Finally, about libxdp: there are two reasons why we avoid using it. It does nothing except wrap the same socket/mmap APIs, and it is not supplied in the standard Debian distribution. How can it help with solving this issue? Anyway, sorry for the criticism, and thank you for the opportunity and the fast response.

Got it. It would be nice if libxdp got into Debian. Does Debian have libbpf? If so, you can use that instead. Note, though, that the AF_XDP support in libbpf is currently deprecated in mainline; Debian likely ships an older version, if it has one at all.

magnus-karlsson commented 2 years ago

I have another question due to this limitation: I have several instances of a daemon, each of which can be bound to different RX queues. I also have a shared eBPF program (the fds are transferred over Unix sockets, which works well). Is it possible to redirect packets between RX queues inside the eBPF program before redirecting them to an AF_XDP socket? I wrote the eBPF program two years ago and have almost forgotten the possibilities.

Please see the previous reply. Briefly, in zero-copy mode this is impossible and in copy-mode this support has not been implemented. If your packet enters on queue X, it has to go to a socket bound to queue X. Though this is fixable (in copy-mode) if you are willing to submit some patches :-).

cyanide-burnout commented 2 years ago

Yes, libbpf is supplied with Debian. Unfortunately I cannot share my code; it is part of a huge project under NDA, and its XDP modules cannot work without the main part of the system. In the end I chose to use a range of queues per process and to route inbound traffic before the eBPF program using traffic control / N-tuple filters (thanks, ethtool, but its sources are not so clear). Everything works fine except errno 16 from time to time... Mostly, limiting the number of "in flight" descriptors helps (ethtool -g).