xebd / accel-ppp

High performance PPTP/L2TP/PPPoE/IPoE server for Linux
GNU General Public License v2.0
296 stars 108 forks source link

ipoe gets completely stuck on ipoe session timeout from one client with connectivity issues #163

Closed cygnusb closed 2 years ago

cygnusb commented 2 years ago

We are currently trying to roll out ipoe instead of pptp (with accel-ppp) setup for our users.

Our ipoe setup gets completely stuck sometimes (mostly within hours) with a handful of first migrated users, due to one client with connectivity issues and ipoe session timeout (due to missing DHCP Request Packet). Setup is using L2 with DHCPv4.

2021-12-31T16:40:36+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: send [DHCPv4 Ack xid=60d751f0 ciaddr=192.168.183.12 yiaddr=192.168.183.12 chaddr=98:da:c4:xx:yy:zz <Message-Type Ack> <Server-ID 192.168.183.254> <Lease-Time 600> <T1 300> <T2 525> <Router 192.168.183.254> <Subnet 255.255.255.0> <DNS a.b.99.33,a.b.99.49> <Option-249 10c0a8c0a8b701> <Relay-Agent {Agent-Circuit-ID _000402130010} {Agent-Remote-ID _0006c0c520846e91} {Subscriber-ID zxxxxx}>]
2021-12-31T16:45:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: recv [DHCPv4 Request xid=60d751f0 ciaddr=192.168.183.12 chaddr=98:da:c4:xx:yy:zz <Message-Type Request> <Max-Message-Size 1024> <Client-ID 0198dac4f83beb> <Host-Name TL-WR841N> <Vendor-Class 4d53465420352e30> <Request-List Subnet,Router,DNS,Domain-Name,Vendor-Specific,44,46,47,Route,Classless-Route,249> <Relay-Agent {Agent-Circuit-ID _000402130010} {Agent-Remote-ID _0006c0c520846e91} {Subscriber-ID zxxxxx}>]
2021-12-31T16:45:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: send [DHCPv4 Ack xid=60d751f0 ciaddr=192.168.183.12 yiaddr=192.168.183.12 chaddr=98:da:c4:xx:yy:zz <Message-Type Ack> <Server-ID 192.168.183.254> <Lease-Time 600> <T1 300> <T2 525> <Router 192.168.183.254> <Subnet 255.255.255.0> <DNS a.b.99.33,a.b.99.49> <Option-249 10c0a8c0a8b701> <Relay-Agent {Agent-Circuit-ID _000402130010} {Agent-Remote-ID _0006c0c520846e91} {Subscriber-ID zxxxxx}>]
2021-12-31T16:50:29+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: send [RADIUS(2) Accounting-Request id=3 <User-Name "zxxxxx"> <NAS-Identifier "ipoe-a.xxxxxxx.de"> <NAS-IP-Address 192.168.3.201> <NAS-Port 1372> <NAS-Port-Id "ipoe4"> <NAS-Port-Type Ethernet> <Calling-Station-Id "98:da:c4:xx:yy:zz"> <Called-Station-Id "enp8s0.531"> <Acct-Status-Type Alive> <Acct-Authentic RADIUS> <Acct-Session-Id "4dc9cbe0b5561fa6"> <Acct-Session-Time 2400> <Acct-Input-Octets 24804027> <Acct-Output-Octets 138533192> <Acct-Input-Packets 62052> <Acct-Output-Packets 103299> <Acct-Input-Gigawords 0> <Acct-Output-Gigawords 0> <Framed-IP-Address 192.168.183.12> <Framed-Interface-Id 9ada:c4xxx> <Framed-IPv6-Prefix 2a00:5ba0:xxxx:a004::/64> <Delegated-IPv6-Prefix 2a00:5ba0:xxxx:b004::/64>]
2021-12-31T17:00:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: ipoe: session timed out
2021-12-31T17:00:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: send [RADIUS(2) Accounting-Request id=4 <User-Name "zxxxxx"> <NAS-Identifier "ipoe-a.xxxxxxx.de"> <NAS-IP-Address 192.168.3.201> <NAS-Port 1372> <NAS-Port-Id "ipoe4"> <NAS-Port-Type Ethernet> <Calling-Station-Id "98:da:c4:xx:yy:zz"> <Called-Station-Id "enp8s0.531"> <Acct-Status-Type Stop> <Acct-Authentic RADIUS> <Acct-Session-Id "4dc9cbe0b5561fa6"> <Acct-Session-Time 3008> <Acct-Input-Octets 43914623> <Acct-Output-Octets 150632821> <Acct-Input-Packets 86310> <Acct-Output-Packets 117397> <Acct-Input-Gigawords 0> <Acct-Output-Gigawords 0> <Framed-IP-Address 192.168.183.12> <Framed-Interface-Id 9ada:c4xxx> <Framed-IPv6-Prefix 2a00:5ba0:xxxx:a004::/64> <Delegated-IPv6-Prefix 2a00:5ba0:xxxx:b004::/64> <Acct-Terminate-Cause Lost-Carrier>]
2021-12-31T17:00:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: pppd_compat: ip-down started (pid 1011287)
2021-12-31T17:00:37+01:00 ipoe-a cygnus.ipoetools.base_script[101 base_script.py:64/script_init###script_init done for user zxxxxx int_ip 192.168.183.12 mac 98:da:c4:xx:yy:zz vlan 531 ip6 2a00:5ba0:xxxx:a004:9ada:c4xxx/64 ip6_dp 2a00:5ba0:xxxx:b004::/64
2021-12-31T17:00:37+01:00 ipoe-a cygnus.ipoetools_logout[1011287] ipoe_logout.py:64/logout###client zxxxxx ip 192.168.183.12 is connected with public ip - delete dummy ip and public ip set
2021-12-31T17:00:37+01:00 ipoe-a cygnus.ipoetools_logout[1011287] ipoe_logout.py:30/del_dummy_ip###deleting nat ip 212.201.70.35 on dummy0 for zxxxxx on vlan ipoe-a.xxxxxxx.de internal ip 192.168.183.12
2021-12-31T17:00:37+01:00 ipoe-a accel-pppd[894027]: ipoe4:zxxxxx: pppd_compat: ip-down finished (0)

After this we do not see any more log output and accel-ppp is completely stuck. Calling accel-cmd show sessions is not producing any output and not returning to terminal. I do see some connectivity issues with this particular client in our switch logs. So as one can also see from the log, the next DHCP request gets lost due to client connectivity outage.

I must say that we have some custom changes to our accel-ppp (which I will submit later on as a Pull request), but I am quite sure that these small changes are not the cause of this. These include extracting the DHCP Option 82 Suboption Subscriber-ID including logging that Option, adding a Class-B route (Option 249) and setting a ip-pool and l4-redirect-pool on each interface config. Custom build is based on commit 1b8711cf75a7c278d99840112bc7a396398e0205 from Tue Oct 5 16:15:43 2021.

IPV6 Prefix/DP come from Radius. Acct-Interim Interval is set by accel-ppp. Attached is our accel-ppp.conf.

I was not really able to reproduce this with test clients cutting the network connection.

My guess is that this is due to some timing issue, maybe possibly a race condition. Any input is highly appreciated.

accel-ppp.conf.txt

DmitriyEshenko commented 2 years ago

Hello @cygnusb , could you please try to install accel-ppp with debug information and run it in GDB? https://accel-ppp.readthedocs.io/en/latest/debugging/index.html

cygnusb commented 2 years ago

Ok, thanks. I could reproduce this with some use.

It seems four threads go into a deadlock

 gdb -q /usr/sbin/accel-pppd 2212365                                                                                            
Reading symbols from /usr/sbin/accel-pppd...done.
Attaching to program: /usr/sbin/accel-pppd, process 2212365
[New LWP 2212367]
[New LWP 2212368]
[New LWP 2212370]
[New LWP 2212371]
[New LWP 2212372]
[New LWP 2212373]
[New LWP 2212374]
[New LWP 2212375]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fad3a3f414c in sigtimedwait () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-164.el8.x86_64 lua-libs-5.3.4-12.el8.x86_64 openssl-libs-1.1.1k-5.el8_5.x86_64 pcre-8.42-6.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) where
#0  0x00007fad3a3f414c in sigtimedwait () from /lib64/libc.so.6
#1  0x00007fad3b3857ac in sigwait () from /lib64/libpthread.so.0
#2  0x0000000000431488 in main (_argc=5, _argv=0x7ffe82f98398) at /home/valerius/accel-ppp-2021/accel-pppd/main.c:447
(gdb) info threads
  Id   Target Id                                        Frame
* 1    Thread 0x7fad3bbc7240 (LWP 2212365) "accel-pppd" 0x00007fad3a3f414c in sigtimedwait () from /lib64/libc.so.6
  2    Thread 0x7fad38b92700 (LWP 2212367) "accel-pppd" 0x00007fad3a3f414c in sigtimedwait () from /lib64/libc.so.6
  3    Thread 0x7fad37b74700 (LWP 2212368) "accel-pppd" 0x00007fad3b3813fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7fad3bba5700 (LWP 2212370) "accel-pppd" 0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
  5    Thread 0x7fad37373700 (LWP 2212371) "accel-pppd" 0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
  6    Thread 0x7fad37272700 (LWP 2212372) "accel-pppd" 0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
  7    Thread 0x7fad37171700 (LWP 2212373) "accel-pppd" 0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
  8    Thread 0x7fad37070700 (LWP 2212374) "accel-pppd" 0x00007fad3a4b90f7 in epoll_wait () from /lib64/libc.so.6
  9    Thread 0x7fad3686f700 (LWP 2212375) "accel-pppd" 0x00007fad3a4b90f7 in epoll_wait () from /lib64/libc.so.6

(gdb) thread 4
[Switching to thread 4 (Thread 0x7fad3bba5700 (LWP 2212370))]
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fad3b37da79 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fad39d9abba in arp_ctx_read (ah=0x7fad28018cb8) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/arp.c:73
#3  0x00007fad3b79fe7b in ctx_thread (ctx=0x15a5278) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:273
#4  0x00007fad3b79fb5b in triton_thread (thread=0x17750c8) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:192
#5  0x00007fad3b37b17a in start_thread () from /lib64/libpthread.so.0
#6  0x00007fad3a4b8dc3 in clone () from /lib64/libc.so.6

(gdb) thread 5
[Switching to thread 5 (Thread 0x7fad37373700 (LWP 2212371))]
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fad3b37da79 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fad39d8a404 in ipoe_session_finished (s=0x7fad1c027930) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1275
#3  0x0000000000406b8b in ap_session_finished (ses=0x7fad1c027930) at /home/valerius/accel-ppp-2021/accel-pppd/session.c:216
#4  0x00007fad39d8a54d in ipoe_session_terminated (ses=0x7fad1c027848) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1291
#5  0x00007fad39d8a6ad in ipoe_session_terminate (s=0x7fad1c027930, hard=1) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1318
#6  0x0000000000406f36 in ap_session_terminate (ses=0x7fad1c027930, cause=6, hard=1) at /home/valerius/accel-ppp-2021/accel-pppd/session.c:300
#7  0x00007fad39d8738f in ipoe_session_l4_redirect_timeout (t=0x7fad1c0278a0) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:403
#8  0x00007fad3b79fce4 in ctx_thread (ctx=0x7fad1c00c0d8) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:234
#9  0x00007fad3b79fb5b in triton_thread (thread=0x1775308) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:192
#10 0x00007fad3b37b17a in start_thread () from /lib64/libpthread.so.0
#11 0x00007fad3a4b8dc3 in clone () from /lib64/libc.so.6

(gdb) thread 6
[Switching to thread 6 (Thread 0x7fad37272700 (LWP 2212372))]
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fad3b37da79 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fad39d8a404 in ipoe_session_finished (s=0x7fad2c016be0) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1275
#3  0x0000000000406b8b in ap_session_finished (ses=0x7fad2c016be0) at /home/valerius/accel-ppp-2021/accel-pppd/session.c:216
#4  0x00007fad39d8a54d in ipoe_session_terminated (ses=0x7fad2c016af8) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1291
#5  0x00007fad39d8a6ad in ipoe_session_terminate (s=0x7fad2c016be0, hard=1) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:1318
#6  0x0000000000406f36 in ap_session_terminate (ses=0x7fad2c016be0, cause=9, hard=1) at /home/valerius/accel-ppp-2021/accel-pppd/session.c:300
#7  0x00007fad39d87310 in ipoe_session_timeout (t=0x7fad2c016b28) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/ipoe.c:392
#8  0x00007fad3b79fce4 in ctx_thread (ctx=0x7fad2c00b4f8) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:234
#9  0x00007fad3b79fb5b in triton_thread (thread=0x1775548) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:192
#10 0x00007fad3b37b17a in start_thread () from /lib64/libpthread.so.0
#11 0x00007fad3a4b8dc3 in clone () from /lib64/libc.so.6

(gdb) thread 7
[Switching to thread 7 (Thread 0x7fad37171700 (LWP 2212373))]
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007fad3b38475d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fad3b37da79 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fad39d9abba in arp_ctx_read (ah=0x7fad1c0697b8) at /home/valerius/accel-ppp-2021/accel-pppd/ctrl/ipoe/arp.c:73
#3  0x00007fad3b79fe7b in ctx_thread (ctx=0x15a8278) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:273
#4  0x00007fad3b79fb5b in triton_thread (thread=0x1775788) at /home/valerius/accel-ppp-2021/accel-pppd/triton/triton.c:192
#5  0x00007fad3b37b17a in start_thread () from /lib64/libpthread.so.0
#6  0x00007fad3a4b8dc3 in clone () from /lib64/libc.so.6

arp.c:73 is following pthread_mutex_lock:

          ah2.ar_hrd = htons(ARPHRD_ETHER);
          ah2.ar_pro = htons(ETH_P_IP);
          ah2.ar_hln = ETH_ALEN;
          ah2.ar_pln = 4;
          ah2.ar_op = htons(ARPOP_REPLY);

          pthread_mutex_lock(&ipoe->lock);

accel-pppd/ctrl/ipoe/ipoe.c:1275 is the last line of this code

          if (s->ifindex == ses->serv->ifindex && strcmp(s->ifname, ses->serv->ifname)) {
                  int flags;

                  log_info2("ipoe: rename %s to %s\n", s->ifname, ses->serv->ifname);

                  strcpy(ifr.ifr_name, s->ifname);

                  ioctl(sock_fd, SIOCGIFFLAGS, &ifr);
                  flags = ifr.ifr_flags;
                  if (flags & IFF_UP) {
                          ifr.ifr_flags &= ~IFF_UP;
                          ioctl(sock_fd, SIOCSIFFLAGS, &ifr);
                  }

                  strcpy(ifr.ifr_newname, ses->serv->ifname);
                  ioctl(sock_fd, SIOCSIFNAME, &ifr);

                  strcpy(ifr.ifr_name, ses->serv->ifname);
                  ifr.ifr_flags = flags | IFF_UP;
                  ioctl(sock_fd, SIOCSIFFLAGS, &ifr);
          }

          pthread_mutex_lock(&ses->serv->lock);
cygnusb commented 2 years ago

System is Alma Linux 8 (RHEL 8 clone) with Kernel 4.18.0-348.7.1.el8_5.x86_64 using pull request 159

I have in addition setup the secondary system (has this vlan with weight=0) using Debian 11 to check whether this is related to the OS

cygnusb commented 2 years ago

And thread 3:

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fad37b74700 (LWP 2212368))]
#0  0x00007fad3b3813fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007fad3b3813fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000040b429 in uc_thread (unused=0x0) at /home/valerius/accel-ppp-2021/accel-pppd/ppp/ppp.c:351
#2  0x00007fad3b37b17a in start_thread () from /lib64/libpthread.so.0
#3  0x00007fad3a4b8dc3 in clone () from /lib64/libc.so.6
cygnusb commented 2 years ago

Think this has to due with a patch from us in arp.c not releasing the lock. Sorry to bother and thanks for the debugging hints