yo8192 / fcron

fcron is an advanced cron for Linux/Unix systems
http://fcron.free.fr
GNU General Public License v2.0
131 stars 12 forks source link

Crash after suspend/resume and error messages in log #30

Closed AndyM48 closed 1 month ago

AndyM48 commented 3 months ago

I don't understand these log entries:

May 28 18:41:44 acer fcron[2531]: suspend/hibernate detected: we woke up after 33s instead of 60s. The system was suspended for 7s.
May 29 07:50:46 acer fcrondyn[24091]: Couldn't get data from socket during 10 seconds.: Success
May 29 07:51:01 acer fcron[2531]: SIGTERM signal received
May 29 07:51:57 acer fcron[24861]: fcron[24861] 3.3.1 started
May 29 07:51:57 acer fcron[24861]: @reboot jobs will only be run at computer's startup.
May 29 07:51:57 acer fcron[24861]: updating configuration from /var/spool/fcron
May 29 07:51:57 acer fcron[24861]: adding file andrew
May 29 07:51:57 acer fcron[24861]: Error while setting next exe time for job '/home/andrew/tcltk/bbc_feed': cl_nextexe overflowed (case2). now=1716961917, cl_timefreq=900, cl_nextexe=1716914560. D (truncated)
May 29 07:51:57 acer fcron[24861]: Setting cl_nextexe to now+cl_timefreq to prevent an infinite loop.
May 29 07:51:57 acer fcron[24861]: next execution will now be at 1716962817.
May 29 07:51:57 acer fcron[24861]: Error while setting next exe time for job '/home/andrew/tcltk/tp_ctrl.tcl auto': cl_nextexe overflowed (case2). now=1716961917, cl_timefreq=60, cl_nextexe=171691 (truncated)
May 29 07:51:57 acer fcron[24861]: Setting cl_nextexe to now+cl_timefreq to prevent an infinite loop.
May 29 07:51:57 acer fcron[24861]: next execution will now be at 1716961977.
May 29 07:51:57 acer fcron[24861]: adding file systab
May 29 07:51:57 acer fcron[24861]: Error while setting next exe time for job '* * * *  /usr/bin/run-parts /etc/cron.hourly': cl_nextexe overflowed (case2). now=1716961917, cl_timefreq=60, cl_nexte (truncated)
May 29 07:51:57 acer fcron[24861]: Setting cl_nextexe to now+cl_timefreq to prevent an infinite loop.
May 29 07:51:57 acer fcron[24861]: next execution will now be at 1716961977.

I have noticed similar suspend/hibernate entries (see first entry) in the logs before. In this case fcron stopped responding and I had to kill the process and restart the service. This has not happened before.

The rest of the messages I don't understand either?

yo8192 commented 3 months ago

These log lines are from the code: https://github.com/yo8192/fcron/blob/87e8f9536e618b1f8011b7b6e65661b4283296b4/database.c#L1576-L1589.

Less clear as the log lines get truncated... we should probably split these log lines to avoid (or limit) the truncation.

So what happens is fcron reads the (binary) fcrontab, and in there it says the next time the line/job should be run is in the past (after correcting for the sleep duration). In that case fcron takes a conservative view, doesn't run them immediately, but resets when they will be executed next.

This problem could happen if fcron failed to save the (binary) fcrontab when stopping, so it ended up loading an older version. Note that fcron also saves this every --savetime seconds, by default 1800 (half an hour).

To troubleshoot this, you could:

As for fcron no longer responding, that is odd... do you mean it wasn't running jobs anymore, or was there another symptom?

To troubleshoot, if possible, run fcron in debug mode and capture debug logs.

If and when it happens again, I'd suggest:

AndyM48 commented 3 months ago

That all makes sense.

I still don't understand the lines like: suspend/hibernate detected: we woke up after 33s instead of 60s. The system was suspended for 7s. here is another one suspend/hibernate detected: we woke up after 3352s instead of 900s. The system was suspended for 2450s.

As for fcron no longer responding, that is odd... do you mean it wasn't running jobs anymore, or was there another symptom?

Fcron showed a running process, fcrondyn said it could not find the socket after 10 secs (from memory), rc-service said that fcron was started, on restart fcron refused to stop so could not be restarted. Hence my; kill fcron and then start it again.

yo8192 commented 3 months ago

suspend/hibernate detected: we woke up after 3352s instead of 900s. The system was suspended for 2450s.

That means that fcron went to sleep, expecting to be woken up 900s later to do something. Instead it woke up 3352s later, and it detected the system was suspended or hibernated for 2450s in between. It uses that suspend/hibernation duration to adjust when tasks will be executed next (e.g. @-lines).

AndyM48 commented 3 months ago

OK, now I understand it. Thank you