webplatform / ops

http://webplatform.github.io/ops/
5 stars 1 forks source link

Setup Monit to flip switches in monitor.webplatform.org #160

Open renoirb opened 9 years ago

renoirb commented 9 years ago

We now have Monit to monitor crucial services on each nodes.

How about we make a services status page to have its switches flipped automatically through it.

Estimated work items

  1. Find how to catch service recovery in Monit when we could send a recovered call (i.e. service tests)
  2. Adjust cachet to have 1:1 mapping of components and monit checks
  3. Make mapping of monit status 1:1 with cachet
  4. Find way to make Monit send variables into update script
  5. Create cachet api update script that’ll be used by Monit
  6. Create API update only account

    Proposal

Cachet’ documentation is not very complete but we could use Monit event handler (see how they’d do it with a 3rd party provider)

Configure Monit to make a trigger

# An example of Salt stack managed Monit template
# refer to salt-states/mysql/files/monit.conf.jinja

check process mysql
  matching "mysql"
  group database
  start = "/usr/sbin/service mysql start"
  stop  = "/usr/sbin/service mysql stop"
  if failed host {{ ip4_interfaces[0]|default('127.0.0.1') }} port 3306
    protocol MYSQL then restart
  if not exist for 3 cycles then restart
  if 3 restarts within 5 cycles then exec /path/to/monit_update_cachet_db.sh

Setup an update script

#!/bin/sh

# /path/to/monit_update_cachet_db.sh

# Make an update to the cachet API
# -u would contain pre-populated cachet update only user
# components/2 would be the component id
# we’d have to figure out how monit tells status and make sure the value at status=3 is the right one
#10.10.10.2:8000 is the internal upstream service we send our update requests

/usr/bin/curl -u user:pass -XPUT \
  -d status=3 \
  10.10.10.2:8000/api/components/2

Example on how to update a component status

Using curl we an update of the database component into partial outage would look like this;

API call

Its using incident status 2, which would mean "partial outage". See also post-parameters section.

curl -u user:pass -XPUT -d status=3 10.10.10.2:8000/api/components/2
{
    "data": {
        "created_at": 1427482793,
        "description": "MariaDB database cluster nodes",
        "id": 2,
        "incident_count": 0,
        "name": "db cluster",
        "status": "Partial Outage",
        "status_id": 3,
        "updated_at": 1430332325
    }
}

How the status is displayed

monitor-dashboard-2