xorbit / LiFePO4wered-Pi

Access library, command line tool and daemon for the LiFePO4wered/Pi module
GNU General Public License v2.0
132 stars 31 forks source link

Fix systemd watchdog notifications #36

Closed Jookia closed 4 years ago

Jookia commented 4 years ago

This changes the code to unconditionally send a systemd watchdog notification every second.

xorbit commented 4 years ago

Nice! Thank you, I will try this.

Jookia commented 4 years ago

Just a note: The only commit 'necessary' is the first one (ping the watchdog), but the rest serve to reduce the difference between timing for systemd and non-systemd builds in case that's going to cause issues, and removing uneccessary error checking.

xorbit commented 4 years ago

I just tried this, and it's an improvement but there are still issues.

Fresh install of buster, running ./INSTALL.sh, I get this:

Failed to restart lifepo4wered-daemon.service: Unit lifepo4wered-daemon.service not found.
Synchronizing state of lifepo4wered-daemon.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable lifepo4wered-daemon
Created symlink /etc/systemd/system/sysinit.target.wants/lifepo4wered-daemon.service → /etc/systemd/system/lifepo4wered-daemon.service.

Job for lifepo4wered-daemon.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status lifepo4wered-daemon.service" and "journalctl -xe" for details.
Enabling I2C in device tree configuration
You need to reboot for this to take effect!
Enabling I2C device module
You need to reboot for this to take effect!
Enabling UART in device tree configuration
You need to reboot for this to take effect!

Where the empty line is, there was a long wait time.

On shutdown, there was a minute and a half delay due to a stop job for the lifepo4wered daemon.

Rebooting, the boot log shows this error:

[FAILED] Failed to start daemon for lifepo4wered Pi hat.
See 'systemctl status lifepo4wered-daemon.service' for details.

Running systemctl status lifepo4wered-daemon.service, I get this:

● lifepo4wered-daemon.service - Daemon for lifepo4wered Pi hat
   Loaded: loaded (/etc/systemd/system/lifepo4wered-daemon.service; enabled; vendor preset: enabled)
   Active: deactivating (stop-sigterm) (Result: protocol)
  Process: 603 ExecStart=/usr/local/sbin/lifepo4wered-daemon (code=exited, status=0/SUCCESS)
 Main PID: 603 (code=exited, status=0/SUCCESS)
   Status: "Startup"
    Tasks: 1 (limit: 2068)
   CGroup: /system.slice/lifepo4wered-daemon.service
           └─604 /usr/local/sbin/lifepo4wered-daemon

Aug 13 12:36:06 pi3bplus systemd[1]: lifepo4wered-daemon.service: Got notification message from PID 604, but reception only permitted for main PID which is currently not known
Aug 13 12:36:07 pi3bplus systemd[1]: lifepo4wered-daemon.service: Got notification message from PID 604, but reception only permitted for main PID which is currently not known
Aug 13 12:36:08 pi3bplus systemd[1]: lifepo4wered-daemon.service: Got notification message from PID 604, but reception only permitted for main PID which is currently not known

The LiFePO4wered/Pi+ does seem to get notified when the system has booted (PI_RUNNING = 1). Eventually. It's kind of weird because sometimes it takes a while after the system has been sitting at the login prompt.

Pushing the button does seem to trigger a shutdown. However, shutting down the system by software (such as sudo shutdown -h now) does not seem to clear the PI_RUNNING flag to inform the LiFePO4wered/Pi+ of the changed system state. The stop job thing regularly (but not always) hangs the shutdown process for a minute or so. Since the LiFePO4wered/Pi+ isn't informed it does not turn power off after shutdown is complete.

Why does this never work right? Why is it so fragile? What do I even gain with systemd integration that's worth the bother?

Jookia commented 4 years ago

I think I see the issue. fb52ecc. Reverting it is an option, but it might be better to add an option to make it so the daemon doesn't fork in to the background instead of relying on detecting systemd for this- a lot of init systems like runit/s6 would appreciate this. I'll do a PR for it.

xorbit commented 4 years ago

Thanks. I thought a daemon was supposed to fork. Maybe that was true in the olden days but not anymore? As long as the PR works with and without systemd I'll be thankful if you can make it work the correct way.

Jookia commented 4 years ago

Can confirm reverting the commit fixed the issue. I didn't catch it because my Makefile branch was broken. Oops.

As for daemons forking, yes and no. Classic Unix would have lifepo4wered-daemon fork, set a PID file, then return an exit code from the main process indicating if the daemon started successfully. The forked process would then be orpahned and re-parented to PID 1 (init) and it would stick around forever and chat to SYSLOG. An init script would start the process, and also kill it based on the PID file, and report errors based on the status codes it gets. So this type of init system just starts and stops programs using shell scripts.

But since then there's been a lot of work in supervision systems. Supervision systems start processes as children then log the child's stdin/stdout and restart the child when it dies. This reduces a lot of complexity because the program no longer needs to fork or log to SYSLOG, and because the parent can track its child without a PID file and avoid some race conditions. This type of system is like runit or s6. It replaces a big init script with a one-liner to start the process.

Modern systems like systemd and OpenRC end up supporting both methods. But systemd supports a few more features that help add to robustness to supervision: Readiness notification and watchdogs. This just means that the supervised process reports to systemd on a socket when it's done starting up, and then now and then on the watchdog. This means systemd can detect if a process is hung or failed to start even if the process is still running. s6 supports readiness notification in a different way.

systemd seems to want to supervisor the process if it uses notification, so my change to make both systemd and non-systemd systems use a forking approach broke this. PR #39 fixes this.

xorbit commented 4 years ago

Thanks for the clarification.

With the latest change, I'm seeing this:

pi@pi3bplus:~ $ systemctl status lifepo4wered-daemon.service
● lifepo4wered-daemon.service - Daemon for lifepo4wered Pi hat
   Loaded: loaded (/etc/systemd/system/lifepo4wered-daemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-13 18:53:14 MDT; 24s ago
 Main PID: 341 (lifepo4wered-da)
   Status: "Startup"
    Tasks: 1 (limit: 2068)
   CGroup: /system.slice/lifepo4wered-daemon.service
           └─341 /usr/local/sbin/lifepo4wered-daemon

Aug 13 18:52:40 pi3bplus systemd[1]: Starting Daemon for lifepo4wered Pi hat...
Aug 13 18:52:40 pi3bplus LiFePO4wered[341]: LiFePO4wered daemon started
Aug 13 18:53:14 pi3bplus LiFePO4wered[341]: System time restored from RTC: 1597366394
Aug 13 18:53:14 pi3bplus systemd[1]: Started Daemon for lifepo4wered Pi hat.

I assume that's correct? The only thing I'm concerned about is Status: "Startup", is that normal? The daemon seems to notify the LiFePO4wered/Pi+ that it is shutting down now and the power is turned off.

Jookia commented 4 years ago

STATUS=Startup is user defined, so it might be best to remove it and also add STOPPING=1. I've opened #40 to add support for running in foreground and working on #41 for that.