weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

High rate of packet loss and latency in a UDP multicast scenario #3601

Open · masantiago opened this issue 5 years ago

masantiago commented 5 years ago

What happened?

High rate of packet loss and latency in a UDP Multicast Scenario with low bitrate.

How to reproduce it?

Two scenarios:

The MTU is set by default to 1376, as the underlying machine is a VM whose adapters have an MTU of 1500.

Is there a bandwidth limitation that prevents more than one consumer from listening to multicast? I have verified that all "n" consumers are doing the JOIN successfully. Should I configure anything?
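
If it helps to reproduce this without the video stack, a similar multicast load can be generated with iperf (version 2, which supports multicast; the group 239.255.0.1 and the 10 Mbit/s rate below are just stand-ins for my real stream):

# Consumer pod(s): join the group and report loss/jitter every second
# (iperf 2.x -- iperf3 does not support multicast)
$ iperf -s -u -B 239.255.0.1 -i 1

# Producer pod: 10 Mbit/s UDP multicast stream for 60 s, multicast TTL 3
$ iperf -c 239.255.0.1 -u -b 10M -t 60 -T 3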

Versions:

$ weave version
2.5.1
$ docker version
18.06.0-ce
$ uname -a
Linux k8s-master3 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
1.13.2

Logs:

$ kubectl logs -n kube-system <weave-net-pod> weave

Nothing relevant in logs

murali-reddy commented 5 years ago
  • Producer runs in node 1 and consumers run in node 2. With 1 consumer, everything is ok. But if I move to 10 consumers, all the veth interfaces drop quite a few packets.

@masantiago Weave implements multicast using broadcast underneath. So each node receives the multicast packets regardless of whether any pod running on that node has joined the multicast group. Ideally it should not matter whether there is 1 consumer or n consumers. Are these 10 consumers running on the same node? Do you see packet drops even if they are spread across the cluster? When do you see most packet drops? Any other pattern/information would be useful.
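
For instance, one way to see this flooding behaviour (just a sketch; 239.255.0.1 and the default bridge name weave are assumptions) is to capture on the weave bridge of a node that runs no consumers at all; if multicast is broadcast regardless of group membership, the packets should still show up there:

# On a node with zero consumers, watch for the multicast stream on the weave bridge
$ sudo tcpdump -ni weave -c 20 udp and dst host 239.255.0.1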

masantiago commented 5 years ago

Thank you for your answer, @murali-reddy.

Find below two scenarios:

vethwepl4a8b280 Link encap:Ethernet  HWaddr 9e:f9:37:d7:4a:32  
          inet6 addr: fe80::9cf9:37ff:fed7:4a32/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1376  Metric:1
          RX packets:3 errors:0 dropped:0 overruns:0 frame:0
          TX packets:506406 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:150 (150.0 B)  TX bytes:417741116 (417.7 MB)

The statistics for that veth are (using nload):

Avg: 8.93 Mbit/s Min: 6.11 Mbit/s Max: 11.69 Mbit/s

vethwepl4a8b280 Link encap:Ethernet  HWaddr 9e:f9:37:d7:4a:32  
          inet6 addr: fe80::9cf9:37ff:fed7:4a32/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1376  Metric:1
          RX packets:3 errors:0 dropped:0 overruns:0 frame:0
          TX packets:917815 errors:0 dropped:609 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:150 (150.0 B)  TX bytes:754565754 (754.5 MB)

The statistics remain more or less the same (allowing for the video encoding rate varying): Avg: 8.27 Mbit/s Min: 3.45 Mbit/s Max: 11.78 Mbit/s

If I look at the rest of the veth interfaces of the other pods, they all have dropped packets.

Therefore, this is the case for the producer and consumers in separate nodes. What is weirder is the case when I co-locate the consumers and the producer on the same node: there are no drops, but the latency of the UDP packets increases and my video service cannot work. I can go into more detail on this scenario in another post.

To give the full context, I am testing this with two Kubernetes nodes that are actually 2 x VirtualBox VMs with a bridged adapter. Also, just in case, I increased the Linux socket buffers to 50 MB:

sudo sysctl net.core.rmem_max=52428800
sudo sysctl net.core.rmem_default=52428800
sudo sysctl net.core.wmem_default=52428800
sudo sysctl net.core.wmem_max=52428800
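
To check whether those buffers are actually the limit, I can watch the kernel's UDP error counters while the stream is running, something like:

# Socket-level UDP drop counters; "receive buffer errors" growing during the
# stream would mean the consumers cannot drain their sockets fast enough
$ netstat -su | grep -i error

# The same counters straight from /proc (RcvbufErrors column)
$ cat /proc/net/snmp | grep -w Udp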

I do not know where the bottleneck is: Weave, the VM, or something else. Hope you can help me out.

murali-reddy commented 5 years ago

Thanks for the details.

I do not know where the bottleneck is: Weave, the VM, or something else. Hope you can help me out.

Weave's implementation of multicast is not optimal at the moment (see https://github.com/weaveworks/weave/issues/178). There is unnecessary overhead on nodes where there are no receivers; that is the only known issue, and it does not apply to your setup.

Weave configures OVS (to broadcast across the nodes) and uses the L2 multicast Ethernet address to broadcast across the pods connected to the weave bridge. This should scale easily to the data rates you are sending, given there is enough processing power.
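
If it helps to narrow things down, the inter-node leg can also be watched on the underlay adapter; with fastdp the broadcast copies travel VXLAN-encapsulated between the peers on UDP port 6784 (a rough sketch; enp0s3 stands in for your VirtualBox bridged adapter name):

# Encapsulated Weave traffic between the peers on the underlay NIC
# (fastdp data goes over UDP 6784; the sleeve fallback uses 6783)
$ sudo tcpdump -ni enp0s3 -c 20 udp port 6784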

I am not sure about the configuration of the VirtualBox VMs, but I would rule out any issues due to VirtualBox by running on bare metal and seeing how it compares.

masantiago commented 5 years ago

I do agree, testing on bare metal would be key.

Honestly, what drove me to think of a problem in Weave was the third scenario I put forward at the very beginning:

  • 1 producer in node B. 10 consumers in node B. Both are co-located. In this case, there are neither dropped packets nor overrun. But the result is that the consumers are receiving the UDP packets with a high delay. Those consumers are packaging video and it is clear how the outcome is delivered very late.

Does it make sense? How can it have anything to do with the VM if everything happens inside the same machine? Do you see any implication for Weave in the fact that 10 pods are receiving the same multicast on every veth adapter?

Take for granted that the machine is healthy in terms of both CPU and RAM.

Thank you very much for all your help.

murali-reddy commented 5 years ago
  • 1 producer in node B. 10 consumers in node B. Both are co-located. In this case, there are neither dropped packets nor overrun. But the result is that the consumers are receiving the UDP packets with a high delay. Those consumers are packaging video and it is clear how the outcome is delivered very late.

What is the latency you are observing?

In the co-located scenario, all the pods are connected to the same weave bridge; Weave does not even configure anything on the bridge for multicast. Weave configures the OVS datapath only for packets that are to be sent out to other nodes. So in this case it is just the Linux bridge handling the L2 multicast Ethernet packets. It is a soft switch, so there will be some overhead (depending on pps), but it should not be high for the data rates you are sending.
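
One bridge-level setting that may be worth checking in this case (only a sketch; behaviour varies with kernel version) is IGMP snooping on the weave bridge: with snooping enabled the bridge forwards group traffic only to ports that have sent IGMP joins, and with it disabled it floods every port.

# 1 = bridge tracks IGMP joins per port, 0 = multicast is flooded to all ports
$ cat /sys/class/net/weave/bridge/multicast_snooping

# Temporarily flood unconditionally, to rule snooping out as a factor
$ echo 0 | sudo tee /sys/class/net/weave/bridge/multicast_snooping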

Does it make sense? How can it have anything to do with the VM if everything happens inside the same machine?

Maybe not. But we need a reference to compare against.

Do you see any implication for Weave in the fact that 10 pods are receiving the same multicast on every veth adapter?

No. At least in this case it is just the Linux bridge that is involved.

Does making the producer send at a lower pps have any effect, in terms of latency in this case and in terms of dropped packets in the inter-node case?

masantiago commented 5 years ago

Hi again. After some analysis, I found out that the co-located scenario was actually failing because of a hard-disk bottleneck.

Therefore, all my effort is now focused on the scenario with different nodes: 1 producer in node A, 10 consumers in node B. As said, at around 100 Mbps (10 x 10 Mbps per channel) all the veth interfaces start dropping packets.

I have changed to using the host network in the pods, for both the producer and the consumers. There are no dropped packets in this case. That demonstrates that it has something to do with the Weave overlay network. Do you suggest looking at anything in particular in the logs?

I have even taken care of the MTU size. It is now configured to MTU = 1410 on the veth, to give room to the headers of the overlay. Moreover, the UDP packets are sent at 1344 bytes (7 x 188 bytes of payload + 8-byte UDP header + 20-byte IP header).
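
To double-check those sizes (7 x 188 = 1316 bytes of payload + 8 bytes of UDP + 20 bytes of IP header = 1344 bytes, well within the 1410-byte MTU), I can run a don't-fragment ping between two pods, along these lines (10.47.0.5 stands in for one of my pod IPs):

# 1382-byte ICMP payload + 8-byte ICMP header + 20-byte IP header = 1410 bytes,
# i.e. exactly the overlay MTU; -M do sets the don't-fragment bit
$ ping -M do -s 1382 -c 3 10.47.0.5

# One byte more should fail with "message too long" if the path MTU is 1410
$ ping -M do -s 1383 -c 3 10.47.0.5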

murali-reddy commented 5 years ago

I have changed to using the host network in the pods, for both the producer and the consumers. There are no dropped packets in this case. That demonstrates that it has something to do with the Weave overlay network. Do you suggest looking at anything in particular in the logs?

Do you see any dropped packets on vethwe-datapath or vethwe-bridge? Weave ensures a copy of the multicast packet is sent over the vethwe-datapath <-> vethwe-bridge veth pair connecting to the weave bridge. Afterwards it is the Linux bridge that does the job of sending the packets out on the veth interfaces to the pods.
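
To see whether that pair is where the loss happens, the drop counters can be watched live while the stream runs, for example:

# Refresh the statistics of the weave plumbing once a second; a steadily
# growing "dropped" count during the stream localises where packets are lost
$ watch -n 1 "ip -s link show vethwe-datapath; ip -s link show vethwe-bridge"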

Just to see if the bridge is the bottleneck, can we spread the consumers across two different nodes?

masantiago commented 5 years ago

Do you see any dropped packets on vethwe-datapath or vethwe-bridge? Weave ensures a copy of the multicast packet is sent over the vethwe-datapath <-> vethwe-bridge veth pair connecting to the weave bridge. Afterwards it is the Linux bridge that does the job of sending the packets out on the veth interfaces to the pods.

Yes, on both: in RX on one and in TX on the other. Do not pay too much attention to the exact numbers, because I may have rebooted or updated the MTU in between. Find below the data for the node where the consumers are located.

vethwe-bridge Link encap:Ethernet  HWaddr 4a:14:8a:06:42:43  
          inet6 addr: fe80::4814:8aff:fe06:4243/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1410  Metric:1
          RX packets:73101219 errors:0 **dropped:14298** overruns:0 frame:0
          TX packets:370469 errors:0 dropped:11 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:94164964198 (94.1 GB)  TX bytes:1327008051 (1.3 GB)

vethwe-datapath Link encap:Ethernet  HWaddr fe:84:d1:e1:6b:c1  
          inet6 addr: fe80::fc84:d1ff:fee1:6bc1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1410  Metric:1
          RX packets:370469 errors:0 dropped:22 overruns:0 frame:0
          TX packets:73101219 errors:0 **dropped:7149** overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1327008051 (1.3 GB)  TX bytes:94164964198 (94.1 GB)

Nevertheless, on the same node the weave interface itself is free of errors and drops.

weave     Link encap:Ethernet  HWaddr da:77:d3:9b:d6:85  
          inet addr:10.47.0.0  Bcast:10.47.255.255  Mask:255.240.0.0
          inet6 addr: fe80::d877:d3ff:fe9b:d685/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1410  Metric:1
          RX packets:73315637 errors:0 dropped:0 overruns:0 frame:0
          TX packets:491727 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:93471805701 (93.4 GB)  TX bytes:1392778738 (1.3 GB)

Does it make any sense to you?

bboreham commented 5 years ago

Hi, can you get the overall status out of one of the Weave Net daemons, as outlined at https://www.weave.works/docs/net/latest/kubernetes/kube-addon/#troubleshooting ?

(There are two options there - one via a script and one via curl)

Then also do weave status connections or the equivalent curl ... /status/connections
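
For the curl option, the router's HTTP status endpoint should be listening on localhost port 6784 on each node (the default), so something like this, run on the node or inside the weave container:

$ curl http://127.0.0.1:6784/status
$ curl http://127.0.0.1:6784/status/connections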

masantiago commented 5 years ago

Hi, can you get the overall status out of one of the Weave Net daemons, as outlined at https://www.weave.works/docs/net/latest/kubernetes/kube-addon/#troubleshooting ?

kubectl exec -n kube-system weave-net-g49kt -c weave -- /home/weave/weave --local status connections
-> 172.31.1.117:6783     established fastdp 3a:29:40:a7:14:76(k8s-master2) mtu=1410
-> 172.31.1.123:6783     established fastdp ba:fc:80:e9:b1:ef(k8s-master1) mtu=1410
-> 172.31.1.119:6783     failed      cannot connect to ourself, retry: never

Any other logs to check?