world-direct / suss

System Update Support Service for kubernetes
Apache License 2.0
0 stars 0 forks source link

release lock if update terminates the suss pod #16

Closed gprossliner closed 8 months ago

gprossliner commented 8 months ago

On testing by using the ./update-example..sh script we notices a bug / feature gap that causes the lock not to be released if the container running suss is terminated by the system update e.g. because it updates the configured CRI tools (docker in this case).

Why it happens:

  1. script executes /synchronize and acquired the lock
  2. /teardown is executed, but not relevant here
  3. script updates the packages incl. docker
  4. the update causes the suss container to be terminated
  5. script checks for reboot (required in this case)
  6. script wants to notify suss for /releasedelayed but this fails because suss in not running

Some notes:

Possible resolutions:

  1. Default to release lock delayed, when suss is not notified of any further action
  2. Handle errors / retry in script and check if suss would restart before the host is restarted and so the release command can be successfully executed by the script from the restarted container.
gprossliner commented 8 months ago

Log (removed some noise not relevant here)

   1: Loaded plugins: fastestmirror, product-id, search-disabled-repos, subscription-
   2:               : manager
   3: Loading mirror speeds from cached hostfile
   4: 
   5: docker-buildx-plugin.x86_64      0.12.1-1.el7          docker-ce-stable         
   6: docker-ce.x86_64                 3:25.0.1-1.el7        docker-ce-stable         
   7: docker-ce-cli.x86_64             1:25.0.1-1.el7        docker-ce-stable         
   8: docker-ce-rootless-extras.x86_64 25.0.1-1.el7          docker-ce-stable         
   9: docker-compose-plugin.x86_64     2.24.2-1.el7          docker-ce-stable         
  10: kernel.x86_64                    3.10.0-1160.108.1.el7 WD_CentOS_7_Linux_x86_64_CentOS_7_Updates_x86_64
  11: kernel-tools.x86_64              3.10.0-1160.108.1.el7 WD_CentOS_7_Linux_x86_64_CentOS_7_Updates_x86_64
  12: kernel-tools-libs.x86_64         3.10.0-1160.108.1.el7 WD_CentOS_7_Linux_x86_64_CentOS_7_Updates_x86_64
  ...
  37: SUSS: synchronize
  38:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  39:                                  Dload  Upload   Total   Spent    Left  Speed
  40: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  41:                                  Dload  Upload   Total   Spent    Left  Speed
  42: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  43: SUSS: teardown
  44:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  45:                                  Dload  Upload   Total   Spent    Left  Speed
  46: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  47: Loaded plugins: fastestmirror, product-id, search-disabled-repos, subscription-
  48:               : manager
  49: Loading mirror speeds from cached hostfile
  50: 
100   137    0   137    0     0    116      0 --:--:--  0:00:01 --:--:--   116Resolving Dependencies
  51: --> Running transaction check
  52: ---> Package docker-buildx-plugin.x86_64 0:0.11.2-1.el7 will be updated
  53: ---> Package docker-buildx-plugin.x86_64 0:0.12.1-1.el7 will be an update
  54: ---> Package docker-ce.x86_64 3:24.0.7-1.el7 will be updated
  ...
 121: 
 122: ================================================================================
 123:  Package                   Arch   Version               Repository         Size
 124: ================================================================================
 125: Installing:  126:  kernel                    x86_64 3.10.0-1160.108.1.el7 WD_CentOS_7_Linux_x86_64_CentOS_7_Updates_x86_64
 127:                                                                            52 M
 128: Updating:  129:  docker-buildx-plugin      x86_64 0.12.1-1.el7          docker-ce-stable   13 M
 ...
 186: Removing:  187:  kernel                    x86_64 3.10.0-1160.92.1.el7  @WD_CentOS_7_Linux_x86_64_CentOS_7_Updates_x86_64
 188:                                                                            66 M
 189: 
 190: Transaction Summary
 191: ================================================================================
 192: Install   1 Package
 193: Upgrade  31 Packages
 194: Remove    1 Package
 195: 
 196: Total download size: 158 M
 197: Downloading packages:  198: Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
...
 210: Total                                               14 MB/s | 158 MB  00:11     
 211: Running transaction check
 212: 
100   137    0   137    0     0      9      0 --:--:--  0:00:15 --:--:--     0Running transaction test
 213: 
100   137    0   137    0     0      8      0 --:--:--  0:00:16 --:--:--     0Transaction test succeeded
 214: Running transaction
 215:   Updating   : libsss_certmap-1.16.5-10.el7_9.16.x86_64                    1/64 
 216:   Updating   : libsss_idmap-1.16.5-10.el7_9.16.x86_64                      2/64 
 217:   Updating   : 1:net-snmp-libs-5.7.2-49.el7_9.4.x86_64                     3/64
100   137    0   137    0     0      7      0 --:--:--  0:00:17 --:--:--     0 
 218:   Updating   : libipa_hbac-1.16.5-10.el7_9.16.x86_64                       4/64 
 219:   Updating   : 1:net-snmp-agent-libs-5.7.2-49.el7_9.4.x86_64               5/64 
 220:   Updating   : libsss_autofs-1.16.5-10.el7_9.16.x86_64                     6/64 
 221:   Updating   : kernel-tools-libs-3.10.0-1160.108.1.el7.x86_64              7/64 
 222:   Updating   : python-sssdconfig-1.16.5-10.el7_9.16.noarch                 8/64 
...
 253:   Installing : kernel-3.10.0-1160.108.1.el7.x86_64                        32/64
...
 259:   Cleanup    : sssd-1.16.5-10.el7_9.15.x86_64                             33/64 
 260:   Cleanup    : sssd-ipa-1.16.5-10.el7_9.15.x86_64                         34/64 
...
 269: curl: (18) transfer closed with outstanding read data remaining
 270: lease successfully aquired by k8s-dev-master1
 271: node k8s-dev-master1 cordoned
 272: check if critical pods have exited
 273: critical pods have exited
 274:  
...
 297: Loaded plugins: fastestmirror, product-id, subscription-manager
 298:   Verifying  : sssd-ldap-1.16.5-10.el7_9.16.x86_64                         1/64 
 ...
 361:   Verifying  : libsss_nss_idmap-1.16.5-10.el7_9.15.x86_64                 64/64 
 362: 
 363: Removed:  364:   kernel.x86_64 0:3.10.0-1160.92.1.el7                                          
 365: 
 366: Installed:  367:   kernel.x86_64 0:3.10.0-1160.108.1.el7                                         
 368: 
 369: Updated:  370:   docker-buildx-plugin.x86_64 0:0.12.1-1.el7                                    
 371:   docker-ce.x86_64 3:25.0.1-1.el7                                               
 372:   docker-ce-cli.x86_64 1:25.0.1-1.el7                                           
 373:   docker-ce-rootless-extras.x86_64 0:25.0.1-1.el7                               
 374:   docker-compose-plugin.x86_64 0:2.24.2-1.el7                                   
 375:   kernel-tools.x86_64 0:3.10.0-1160.108.1.el7                                   
 376:   kernel-tools-libs.x86_64 0:3.10.0-1160.108.1.el7                              
 377:   libipa_hbac.x86_64 0:1.16.5-10.el7_9.16                                       
 378:   libsss_autofs.x86_64 0:1.16.5-10.el7_9.16                                     
 379:   libsss_certmap.x86_64 0:1.16.5-10.el7_9.16                                    
 380:   libsss_idmap.x86_64 0:1.16.5-10.el7_9.16                                      
 381:   libsss_nss_idmap.x86_64 0:1.16.5-10.el7_9.16                                  
 382:   libsss_sudo.x86_64 0:1.16.5-10.el7_9.16                                       
 383:   net-snmp.x86_64 1:5.7.2-49.el7_9.4                                            
 384:   net-snmp-agent-libs.x86_64 1:5.7.2-49.el7_9.4                                 
 385:   net-snmp-libs.x86_64 1:5.7.2-49.el7_9.4                                       
 386:   net-snmp-utils.x86_64 1:5.7.2-49.el7_9.4                                      
 387:   python-libipa_hbac.x86_64 0:1.16.5-10.el7_9.16                                
 388:   python-perf.x86_64 0:3.10.0-1160.108.1.el7                                    
 389:   python-sss-murmur.x86_64 0:1.16.5-10.el7_9.16                                 
 390:   python-sssdconfig.noarch 0:1.16.5-10.el7_9.16                                 
 391:   sssd.x86_64 0:1.16.5-10.el7_9.16                                              
 392:   sssd-ad.x86_64 0:1.16.5-10.el7_9.16                                           
 393:   sssd-client.x86_64 0:1.16.5-10.el7_9.16                                       
 394:   sssd-common.x86_64 0:1.16.5-10.el7_9.16                                       
 395:   sssd-common-pac.x86_64 0:1.16.5-10.el7_9.16                                   
 396:   sssd-ipa.x86_64 0:1.16.5-10.el7_9.16                                          
 397:   sssd-krb5.x86_64 0:1.16.5-10.el7_9.16                                         
 398:   sssd-krb5-common.x86_64 0:1.16.5-10.el7_9.16                                  
 399:   sssd-ldap.x86_64 0:1.16.5-10.el7_9.16                                         
 400:   sssd-proxy.x86_64 0:1.16.5-10.el7_9.16                                        
 401: 
 402: Complete!
 403: Core libraries or services have been updated:  404:   kernel -> 3.10.0-1160.108.1.el7
 405: 
 406: Reboot is required to ensure that your system benefits from these updates.
 407: 
 408: More information:  409: https://access.redhat.com/solutions/27943
 410: SUSS: releasedelayed
 411:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 412:                                  Dload  Upload   Total   Spent    Left  Speed
 413: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 414: Shutdown scheduled for Mon 2024-01-29 14:16:16 CET, use 'shutdown -c' to cancel.
 415: stopping logstream
 416: /var/tmp/foreman-ssh-cmd-8abf4999-8825-4dfe-9dca-b289987599d7/script: line 1: kill: (13730) - No such process
 417: Exit status: 0
gprossliner commented 8 months ago

Log Analysis:

In Line 410 the script was obviously able to contact suss. So this log do not show the described behaviour. In last weeks execution the script was not able to complete successfully, and the node owning the lock remained in Cordoned state, so assuming that suss was not restarted (in time). Because of log retention periods, those logs are no longer available.

We will make some more tests if the suss container will successfully restart in this case.

gprossliner commented 8 months ago

We will test if suss will be restarted on a node that has not been updated yet:

[root@k8s-dev-master2 ~]# yum check-update --disablerepo='*' --enablerepo=docker-ce-stable
Failed to set locale, defaulting to C
Loaded plugins: fastestmirror, product-id, search-disabled-repos, subscription-manager
Loading mirror speeds from cached hostfile

docker-buildx-plugin.x86_64                                                                  0.12.1-1.el7                                                                docker-ce-stable
docker-ce.x86_64                                                                             3:25.0.1-1.el7                                                              docker-ce-stable
docker-ce-cli.x86_64                                                                         1:25.0.1-1.el7                                                              docker-ce-stable
docker-ce-rootless-extras.x86_64                                                             25.0.1-1.el7                                                                docker-ce-stable
docker-compose-plugin.x86_64                                                                 2.24.2-1.el7                                                                docker-ce-stable

On the k8s-dev-master2 node, suss in running in pod named suss-l7t9v without a restart. Let's run the update:

yum update --disablerepo='*' --enablerepo=docker-ce-stable

After confirmation and command completion, suss has been restarted, which is also annotated by kubelet:

[root@k8s-dev-master2 ~]# docker inspect d73 |grep restart
                "annotation.io.kubernetes.container.restartCount": "1",

We also see the restart in rancher, and are able to inspect the logs.

Tasks: