reproducible kernel panic w. 4.19.0 & parallel iperf threads P>8, weave/2.5.*

fgeorgatos commented 5 years ago

What you expected to happen?

No kernel crash for parallel network streams: iperf3 -c <server_cni_ip> -P <n>, with n set in turn to 1, 2, 4, 8, 16, 32, 64, 128. In other words, the receiving end should tolerate multiple parallel network streams, including P >= ~8.

My request for comments: is this reproducible on any other installation? The 4.19.x kernel series is very popular across a number of distributions (e.g. CentOS 7 + ELRepo), and I have seen it in many other k8s deployments. Testing is cheap, since it is a one-liner run on 2 pods.

What happened?

Kernel panic, reproducible with iperf3 for P of about 16 or greater, sometimes also for P=8.

IMPORTANT: the bug is NOT reproducible without involving a CNI (more precisely: it only shows up when traffic goes over the openvswitch layer).

How to reproduce it?

Run iperf3 -s inside a test pod on a node with kernel 4.19.0 (or a "bug-compatible" one), then pick a client pod and run iperf3 -c <server_cni_ip> -P <n> with n set in turn to 1, 2, 4, 8, 16, 32, 64, 128. On a problematic kernel, the panic occurs about midway through this sequence.

A convenient one-liner: echo 1 2 4 8 16 32 64 128|xargs -n1 iperf3 -c <server_cni_ip> -P
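The same sweep can be written as a small function that stops at the first parallelism level where the client fails, which makes it easier to note where the receiver went away. This is a minimal sketch and not part of the original report: the name `sweep_parallel` and the `IPERF_BIN` override are hypothetical, and a failed client run only suggests (it does not prove) that the server node panicked.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run iperf3 with increasing -P and stop at the first failure.
# IPERF_BIN is an illustrative override so the loop can be dry-run with a stub.
IPERF_BIN="${IPERF_BIN:-iperf3}"

sweep_parallel() {
    local server="$1"; shift
    local p
    for p in "$@"; do
        echo "== -P $p =="
        # A non-zero exit from the client is treated as "receiver no longer answering".
        if ! "$IPERF_BIN" -c "$server" -P "$p"; then
            echo "client failed at -P $p" >&2
            return 1
        fi
    done
    echo "all levels completed"
}
```

Usage from the client pod would be `sweep_parallel <server_cni_ip> 1 2 4 8 16 32 64 128`, with the server pod running `iperf3 -s` as described above.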

N.B. the crashing system is always the traffic receiving server that listens to that CNI ip.

Anything else we need to know?

The configuration tested here uses an OpenStack QEMU back-end, deployed via Rancher.

I mention it because it could conceivably be a factor, or even the cause of the bug. My bigger question, though, is whether having the FASTDP service enabled could be a factor: I have noticed that if it is disabled, traffic throughput drops and the kernel panic ceases (i.e. there is a correlation, but not necessarily a causal relationship).

Versions:

$ weave version

/home/weave # ./weave --local version
weave 2.5.0

(I have also tried a later version; the panic is still reproducible)

$ docker version

Docker version 18.09.0, build 4d60db4

(I have also tried a later version; the panic is still reproducible)

$ uname -a

Linux zzzz-newtype1 4.19.0-1.el7.elrepo.x86_64 #1 SMP Mon Oct 22 10:40:32 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

(other kernel versions do not seem to have the issue)

$ kubectl version

> kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:52:13Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

the kernel ultimately dies with:

kernel panic - not syncing: fatal exception in interrupt

stacktrace:

[241580.714445] general protection fault: 0000 [#1] SMP PTI
[241580.722835] CPU: 4 PID: 32 Comm: ksoftirqd/4 Kdump: loaded Not tainted 4.19.0-1.el7.elrepo.x86_64 #1
[241580.725964] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[241580.729146] RIP: 0010:dev_hard_start_xmit+0x38/0x220
[241580.730801] Code: 55 41 54 53 48 89 fb 48 83 ec 28 48 85 ff 48 89 55 c0 48 89 4d b0 0f 84 be 01 00 00 48 8d 86 90 00 00 00 49 89 f4 48 89 45 b8 <48> 8b 03 48 c7 03 00 00 00 00 48 85 c0 48 89 45 c8 48 8b 05 10 04
[241580.736457] RSP: 0018:ffffc90000d8b648 EFLAGS: 00010202
[241580.738132] RAX: ffff8801eaab7a00 RBX: dead000000000100 RCX: ffff8801eaab7a00
[241580.740882] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000286
[241580.743499] RBP: ffffc90000d8b698 R08: ffff88022fef1b00 R09: 0000000000000019
[241580.746177] R10: 0000000000000004 R11: 000000000000088c R12: ffff880233abe000
[241580.748896] R13: 0000000000000000 R14: 000000000000818e R15: ffff88022fef1b00
[241580.751536] FS:  0000000000000000(0000) GS:ffff880237b00000(0000) knlGS:0000000000000000
[241580.754388] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[241580.756298] CR2: 00007feef7243c60 CR3: 000000000220a002 CR4: 00000000003606e0
[241580.758883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[241580.761474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[241580.764088] Call Trace:
[241580.765256]  __dev_queue_xmit+0x6b3/0x900
[241580.766720]  ? dev_hard_start_xmit+0xa7/0x220
[241580.768227]  dev_queue_xmit+0x10/0x20
[241580.769657]  ovs_vport_send+0xcf/0x170 [openvswitch]
[241580.771307]  do_output+0x53/0xf0 [openvswitch]
[241580.772826]  do_execute_actions+0xb88/0xbc0 [openvswitch]
[241580.774522]  ? dev_queue_xmit+0x10/0x20
[241580.775934]  ? do_output+0x53/0xf0 [openvswitch]
[241580.777505]  ? do_execute_actions+0xb88/0xbc0 [openvswitch]
[241580.779257]  ovs_execute_actions+0x4c/0x120 [openvswitch]
[241580.780958]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
[241580.782690]  ? ovs_ct_update_key+0x9f/0xe0 [openvswitch]
[241580.784381]  ovs_vport_receive+0x73/0xd0 [openvswitch]
[241580.786030]  ? ovs_dp_process_packet+0x84/0x120 [openvswitch]
[241580.787805]  ? ovs_ct_update_key+0x9f/0xe0 [openvswitch]
[241580.789487]  ? ovs_vport_receive+0x73/0xd0 [openvswitch]
[241580.791174]  ? xfrm4_rcv+0x3b/0x40
[241580.792496]  ? xfrm4_esp_rcv+0x39/0x70
[241580.793892]  ? __kmalloc_node_track_caller+0x190/0x280
[241580.795571]  netdev_frame_hook+0xd9/0x160 [openvswitch]
[241580.797236]  __netif_receive_skb_core+0x211/0xb30
[241580.798806]  ? skb_copy_bits+0x15f/0x280
[241580.800254]  ? __pskb_pull_tail+0x81/0x460
[241580.801758]  __netif_receive_skb_one_core+0x3b/0x80
[241580.803368]  __netif_receive_skb+0x18/0x60
[241580.804818]  netif_receive_skb_internal+0x45/0xf0
[241580.806383]  ? tcp4_gro_complete+0x86/0x90
[241580.807855]  napi_gro_complete+0x73/0x90
[241580.809275]  dev_gro_receive+0x65f/0x670
[241580.810690]  napi_gro_receive+0x38/0xf0
[241580.812095]  gro_cell_poll+0x5c/0x90
[241580.813444]  net_rx_action+0x289/0x3d0
[241580.814825]  __do_softirq+0xd1/0x287
[241580.816181]  run_ksoftirqd+0x2b/0x40
[241580.817529]  smpboot_thread_fn+0x11f/0x180
[241580.818977]  kthread+0x105/0x140
[241580.820267]  ? sort_range+0x30/0x30
[241580.821623]  ? kthread_bind+0x20/0x20
[241580.822992]  ret_from_fork+0x35/0x40
[241580.824351] Modules linked in: xt_esp esp4 xfrm4_mode_transport xt_policy iptable_mangle dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nsh nf_nat_ipv6 nf_conncount veth xt_statistic xt_nat xt_NFLOG xt_physdev nfnetlink_log ip_set_hash_ip xt_set ip_set ipt_REJECT nf_reject_ipv4 xt_mark xt_comment ipt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c overlay kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ttm aesni_intel drm_kms_helper crypto_simd cryptd drm glue_helper syscopyarea sysfillrect joydev sysimgblt virtio_balloon pcspkr i2c_piix4 input_leds fb_sys_fops ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover failover
[241580.844182]  virtio_blk crc32c_intel ata_piix serio_raw virtio_pci virtio_ring virtio floppy libata sunrpc scsi_transport_iscsi

Conclusion

If the above bug report is considered historic (since the said kernel is old),
please consider this feature request instead:

Provide a qualification test for a new Weave CNI deployment which proves that a couple dozen parallel network streams do not crash the platform. It could be as simple as:
- iperf3 -s & echo 1 2 4 8 16 32 64 128|xargs -n1 iperf3 -c my_iperf_daemonset -P
why: verify that the CNI stack is functional, at least at some level of parallelism
what: add an extra QA checkpoint

fgeorgatos commented 5 years ago

By the way, the code paths vary somewhat from stack trace to stack trace, but the following calls seem to be the common ones (presumably skb related):

      8  ? __kmalloc_node_track_caller+0x190/0x280
      8  ? __pskb_pull_tail+0x81/0x460
      8  ? ovs_ct_update_key+0x9f/0xe0 [openvswitch]
      8  __do_softirq+0xd1/0x287
      8  __netif_receive_skb+0x18/0x60
      8  __netif_receive_skb_core+0x211/0xb30
      8  __netif_receive_skb_one_core+0x3b/0x80
      8  dev_gro_receive+0x65f/0x670
      8  gro_cell_poll+0x5c/0x90
      8  napi_gro_complete+0x73/0x90
      8  napi_gro_receive+0x38/0xf0
      8  net_rx_action+0x289/0x3d0
      8  netdev_frame_hook+0xd9/0x160 [openvswitch]
      8  netif_receive_skb_internal+0x45/0xf0
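
The counts above look like the output of `sort | uniq -c` over the call-trace lines of several saved panics. A minimal sketch of such a pipeline follows; the function name `common_frames`, and the assumption that each panic was saved to its own text file in dmesg format, are illustrative and not taken from the report.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count how many trace files contain each call-trace frame.
common_frames() {
    local f
    for f in "$@"; do
        # Keep the lines from "Call Trace:" up to "Modules linked in",
        # drop those two marker lines, strip the "[timestamp] " prefix,
        # and de-duplicate frames within a single file.
        sed -n '/Call Trace:/,/Modules linked in/p' "$f" \
            | sed -e '/Call Trace:/d' -e '/Modules linked in/d' \
                  -e 's/^[[:space:]]*\[[^]]*\][[:space:]]*//' \
            | sort -u
    done | sort | uniq -c | sort -k1,1nr -k2
}
```

Called as `common_frames panic-*.txt`, a frame whose count equals the number of files appears in every trace. Frames prefixed with `?` are kept separate from the same symbol without the prefix, as in the listing above.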

murali-reddy commented 5 years ago

Thanks @fgeorgatos for reporting this issue.

> if having enabled FASTDP service could be a factor, since I have noticed that if it gets disabled traffic throughput drops and kernel panic ceases

fastdp should run fine with the 4.19 kernel; in fact, https://github.com/weaveworks/weave/pull/3430 fixed a 4.19 compatibility issue.

Either the specific combination (OpenStack/QEMU) or the parallel network streams could be causing this issue.

From the stack trace, the panic is potentially due to the OVS datapath that Weave's fastdp uses.

fgeorgatos commented 5 years ago

@murali-reddy thanks for the feedback.

FYI, the cause/fix must be hidden somewhere in this Linux kernel git diff (<1000 lines up to the known bugfix point): https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/?id=v4.19.1&id2=v4.19&dt=2

However, I have run out of ideas on how to corner it rigorously; @brb, any suggestions? We can reasonably assume that the ipv6, mellanox & eth drivers, smc & sparc diffs are irrelevant; imho, it is possibly this one: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/net/openvswitch/flow_netlink.c?id=v4.19.1&id2=v4.19

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index a70097ecf33c2..865ecef681969 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -3030,7 +3030,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
             * is already present */
            if (mac_proto != MAC_PROTO_NONE)
                return -EINVAL;
-           mac_proto = MAC_PROTO_NONE;
+           mac_proto = MAC_PROTO_ETHERNET;
            break;

        case OVS_ACTION_ATTR_POP_ETH:
@@ -3038,7 +3038,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
                return -EINVAL;
            if (vlan_tci & htons(VLAN_TAG_PRESENT))
                return -EINVAL;
-           mac_proto = MAC_PROTO_ETHERNET;
+           mac_proto = MAC_PROTO_NONE;
            break;

        case OVS_ACTION_ATTR_PUSH_NSH:
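
One way to narrow the range further is to list only the commits between the two tags that touch the subsystems seen in the stack trace. The sketch below assumes a local clone of the stable kernel tree that contains both tags; the function name `candidate_commits` and the example clone path are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list non-merge commits in old..new that touch the given paths.
# Assumes "$repo" is a local git clone containing both revisions.
candidate_commits() {
    local repo="$1" old="$2" new="$3"; shift 3
    git -C "$repo" log --oneline --no-merges "$old..$new" -- "$@"
}

# Example (paths picked from the frames in the stack trace):
# candidate_commits ~/src/linux-stable v4.19 v4.19.1 net/openvswitch net/core
```

If the resulting list is still too long to review by hand, `git bisect start --term-old=broken --term-new=fixed` over the same range is the usual way to find the first fixed commit, at the cost of one kernel build and one reproduction run per step.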
