xenogenesi / task-spooler

fork of ts (task spooler by Lluís Batlle i Rossell) to add GNU/Autotools support, and some helpers to generate packages
GNU General Public License v2.0

errno 104, "Connection reset by peer" #4

Open kunsjef opened 8 years ago

kunsjef commented 8 years ago

I run icinga2 with checker servers in a cluster; they all run task-spooler to keep the load down during reloads and restarts of icinga2 (there is an open bug that makes the load skyrocket). Most of the time this runs without problems, but every now and then task-spooler starts logging errors to /tmp/socket-ts.108.error. They look like this:

-------------------Warning
 Msg: JobID 206018 quit while running.
 errno 104, "Connection reset by peer"
date Wed May  4 20:14:15 2016
pid 633
type SERVER
New_jobs
  new_job
    jobid 205947
    command "/usr/bin/snmpget -v 2c -r 1 -t 5 -c <password> -Oe -OU <hostname> ciscoEnvMonSupplyState.1"
    state running
    result.errorlevel 0
    output_filename "NULL"
    store_output 0
    pid 16005
    should_keep_finished 0
  new_job
    jobid 205976
    command ....

What follows is a huge list (800+) of new jobs. The first 8 (the size of my queue) have PIDs, while the rest have separate job IDs but no PIDs. After this long list of new jobs, this appears:

New_notifies
New_conns  new_conn
    socket 234
    hasjob "1"
    jobid 205947
  new_conn
    socket 665
    hasjob "1"
    jobid 205976
  new_conn
    socket 7
    hasjob "1"
    jobid 206018
  new_conn
    socket 277
    hasjob "1"
    jobid 206019
  new_conn
    socket 278
    hasjob "1"
    jobid 206021

This is also a long list, and then the whole block repeats. The last time this happened, it repeated 8183 times in about 20 minutes and the log file grew to 2.3 GB. I detected it when free disk space started running low on one of the checkers.

# grep -c "May  4 20:" /tmp/socket-ts.108.error
8183
# ls -la /tmp/socket-ts.108.error
-rw-------  1 nagios nagios 2346969108 May  4 20:23 socket-ts.108.error

Also, when this happens task-spooler can no longer limit the number of jobs it runs simultaneously. I have a limit of 8 jobs, but I can see hundreds of jobs running and hundreds more in the queue. I can reproduce this error by restarting icinga2, which generates a huge number of jobs for ts to handle.

Can these errors be prevented, or is it possible to disable error logging?
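Until the root cause is found, one stopgap (not a fix) would be to cap the error file's growth so it cannot fill the disk, e.g. from a cron job. This is only a sketch; the path is the one from the report above, and the 100 MB cap is an arbitrary choice:

```shell
#!/bin/sh
# Stopgap sketch, NOT a fix: truncate ts's error file once it grows past a
# cap. The path below is the one from this report; adjust it to your socket
# name. Truncating in place (": >") keeps the same inode, so the ts server's
# already-open file descriptor stays valid.
ERRFILE=${ERRFILE:-/tmp/socket-ts.108.error}
MAXBYTES=${MAXBYTES:-104857600}   # 100 MB cap, an arbitrary choice

size=$(stat -c %s "$ERRFILE" 2>/dev/null || echo 0)
if [ "$size" -gt "$MAXBYTES" ]; then
    : > "$ERRFILE"
fi
```

Note that if ts does not open the file in append mode, writes after truncation may land at the old offset and leave a sparse hole rather than actually shrinking the file, so this is worth verifying before relying on it.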

xenogenesi commented 7 years ago

Sorry, I need better alerts for issues on my repositories. I was thinking about using ts again (and merging this repository with upstream) and only just noticed your issue.

Have you already checked with the upstream version 1.0?

The first 8 (the size of my queue) have PIDs, while the rest have separate job IDs

So the first 8 are running jobs, and the rest are just queued... ~800, not bad!

New_notifies / New_conns look like connections to ts's socket, one per job. Maybe there is a limit on the number of connections, or on the number of open files the system allows the process (ulimit?).
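One way to check that hypothesis is to compare the ts server's open-descriptor count against its soft limit. A sketch, assuming a Linux /proc filesystem and that the server's process name is `ts`:

```shell
#!/bin/sh
# Sketch: count the ts server's open file descriptors and read its soft
# "open files" limit from /proc (assumes Linux and a process named "ts").
TS_PID=$(pgrep -u "$(id -un)" -x ts | head -n 1)
if [ -n "$TS_PID" ]; then
    fds=$(ls "/proc/$TS_PID/fd" | wc -l)
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$TS_PID/limits")
    echo "ts pid $TS_PID: $fds open descriptors (soft limit $soft)"
else
    echo "no ts server process found"
fi
```

If the descriptor count sits near the soft limit while the error loop is happening, that would point at file-descriptor exhaustion rather than a per-socket connection limit.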

If ~800 turns out to be a per-socket limit, a workaround could be to use more queues (up to 8, with 1 job per queue), as mentioned on ts's home page:

Have any amount of queues identified by name, writing a simple wrapper script for each (I use ts2, tsio, tsprint, etc)
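That suggestion can be sketched as a wrapper script: each wrapper pins its own socket through the TS_SOCKET environment variable, which gives it an independent ts server, queue, and error file. The wrapper name `ts2` and the socket path are examples, not anything ts mandates:

```shell
#!/bin/sh
# Generate a "ts2" wrapper for a second, independent queue, selected via the
# TS_SOCKET environment variable (each socket gets its own ts server).
# The wrapper name and socket path are example choices.
cat > ./ts2 <<'EOF'
#!/bin/sh
TS_SOCKET=/tmp/socket-ts2.$(id -u) exec ts "$@"
EOF
chmod +x ./ts2
```

Repeating this for tsio, tsprint, etc. and spreading the submitted jobs across the wrappers would divide the connections over several sockets instead of piling ~800 of them onto one.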