zopefoundation / zdaemon

Python program that wraps commands to make them behave as proper daemons under Unix / Linux / Mac OS X

zdaemon transcript thread dies if disk is full #1

Open mgedmin opened 11 years ago

mgedmin commented 11 years ago

Two days ago I had a disk-full problem. Today my entire web server hung and stopped processing requests.

Long story short, when zdaemon's Transcript thread gets an IOError while writing to a log file, it just dies. zdaemon itself and the child program it is managing remain running. The child's stdout/stderr are now pointing to a pipe that is still open at both ends, but now no process ever reads from it. Things seem to run fine for a couple of days, then the kernel's pipe buffer becomes full and the child blocks in write(). While holding the logging mutex. Fun for the whole family, conveniently delayed from the root cause you think you already fixed.
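The blocking mechanism described above can be demonstrated in isolation. This is a hypothetical sketch (not zdaemon code): once nothing reads from a pipe, writes succeed only until the kernel's pipe buffer fills. The demo uses non-blocking mode so it raises instead of hanging the way the real child process did:

```python
import fcntl
import os

# Create a pipe and keep the read end open but never read from it,
# mimicking the dead Transcript thread.
read_fd, write_fd = os.pipe()

# Make the write end non-blocking so the demo fails fast instead of
# sleeping forever inside write(), as the real child did.
flags = fcntl.fcntl(write_fd, fcntl.F_GETFL)
fcntl.fcntl(write_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    while True:
        written += os.write(write_fd, b"x" * 4096)
except BlockingIOError:
    # The kernel pipe buffer (typically 64 KiB on Linux) is now full.
    # A blocking writer would stop here indefinitely, with no reader
    # ever draining the pipe.
    pass

os.close(read_fd)
os.close(write_fd)
```

With a blocking write end and a mutex held across the call, every other thread that wants the mutex hangs too, which matches the observed server-wide stall.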

tseaver commented 11 years ago

Would changing the code in copy()[1] to use the lock as a context manager make the pain go away? E.g.:

def copy(self):
    lock = self.lock
    i = [self.read_from]
    o = e = []
    while 1:
        ii, oo, ee = select.select(i, o, e)
        with lock:  # lock is released even if self.write() raises
            for fd in ii:
                self.write(os.read(fd, 8192))

[1] https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L602

tseaver commented 11 years ago

Oops, that link should've been:

https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L618

mgedmin commented 11 years ago

It's a good change, but it wouldn't help with this issue. The deadlock was not caused by waiting on a lock, it was caused by the child process blocking on a write to a full pipe with no readers.

What might help would be a try/finally (or a with block) that closes the self.read_from pipe. Then the child process would die with OSError(errno.EPIPE) instead of blocking some undetermined time later.
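The EPIPE behavior can be shown with a small hypothetical demo (not zdaemon code): once the read end of a pipe is closed, a write to the other end fails immediately instead of blocking.

```python
import os

# Close the read end first -- this stands in for what closing
# self.read_from would do when the transcript thread dies.
read_fd, write_fd = os.pipe()
os.close(read_fd)

got_epipe = False
try:
    os.write(write_fd, b"child output")
except BrokenPipeError:  # OSError with errno == errno.EPIPE
    got_epipe = True
os.close(write_fd)
```

One caveat: Python ignores SIGPIPE at startup, so here we observe the error as a `BrokenPipeError`; a typical non-Python child would instead receive SIGPIPE and be killed by it unless it handles the signal.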

To actually make zdaemon survive a temporary disk-full condition, we'd have to add a loop around self.write that retries until the write succeeds. I'm not sure that's even possible using Python file objects (do we get the number of bytes successfully written by the underlying syscall?).
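At the raw os.write() level (unlike buffered file objects) the syscall's return value does tell us how many bytes were accepted, so a resumable retry loop is at least expressible. A hypothetical sketch, with `write_all_with_retry` being an illustrative name, not anything in zdaemon:

```python
import errno
import os
import time

def write_all_with_retry(fd, data, delay=1.0):
    """Keep retrying a raw write until every byte of *data* is written.

    os.write() returns the number of bytes the kernel accepted, so we
    can resume after a partial write, and retry after a transient
    ENOSPC (disk full) instead of letting the writer die.
    """
    view = memoryview(data)
    while view:
        try:
            n = os.write(fd, view)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise            # anything else is still fatal
            time.sleep(delay)    # disk full: wait, then try again
            continue
        view = view[n:]          # drop the bytes already written
```

Whether retrying forever is the right policy is a separate question; the point is only that partial-write information is available once you drop below the file-object layer.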

mgedmin commented 9 years ago

I'm not 100% happy with this: the death of the transcript thread ought to produce a log message somewhere (ha! as if that's possible if the disk is full, but anyway, there might be other causes) and it should cause the daemon manager to die ASAP, instead of waiting for the child process to try and log something to stdout. But I'm out of round tuits for today: I came to fix #8, and ended up spending all day on zdaemon.
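The "log somewhere and die ASAP" idea could be sketched as a guard around the transcript loop. Everything here is hypothetical (`run_transcript`, `copy_func` are illustrative names, not zdaemon API):

```python
import os
import sys
import threading
import traceback

def run_transcript(copy_func):
    """Run the transcript loop in a thread; if it dies, take the
    whole daemon manager down instead of leaving an unread pipe."""
    def guarded():
        try:
            copy_func()
        except BaseException:
            # Best-effort report -- this may itself fail if the disk
            # is full, hence the inner try.
            try:
                traceback.print_exc(file=sys.stderr)
            except Exception:
                pass
            os._exit(1)  # die ASAP; don't wait for the child to block
    t = threading.Thread(target=guarded, daemon=True)
    t.start()
    return t
```

os._exit() rather than sys.exit() is deliberate in this sketch: sys.exit() in a non-main thread only ends that thread, which is exactly the silent-death behavior being complained about.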