INTRODUCTION

The Eye Of Horus is a monitoring and alerting tool for computers. It's mainly useful for monitoring network services (eg, HTTP or SMTP servers) and the internal status of Unix servers (eg, load, disk usage, process counts).

In that respect, it's a lot like Nagios, but in my opinion it's better. It lacks a few features Nagios has, but it has a very simple architecture to which they can easily be added.

It's a flexible thing made from independent modules with well-defined interfaces, making it easy to customise and extend. Out of the box it'll monitor your servers and produce a nice HTML summary of their status (OK, the looks need a bit of work, but that will come soon). It can optionally integrate with the excellent (and I mean excellent) RRDTool to store logs of statistics (response times, number of packages with known security holes, etc) and link from the status page to nice graphs of the historical behaviour of these statistics.

Also, it's really easy to add new service checks to it.

HOW IT WORKS

The core of the system is horus-check.py, a Python script which reads a configuration file (specified on the command line). The configuration file specifies a list of services - either network services, in which case the host to run the check from and the host to run the check 'at' are specified, or local services, in which case only the host to run the check from need be specified. In either case, if the host to run the check from is not specified, then it defaults to the local host.

Service types are defined in a separate file, which is referenced from the configuration file. For each type, the service definitions file gives a shell command to check the service; this command must output the service status in a defined format, as a single-line YAML list. The list must contain, at least, a single-word status (OK, WARNING, FAILURE, or UNKNOWN), then optionally numeric statistics, then optionally a status message. For example:

  [OK]
  [UNKNOWN]
  [OK, { load: 0.5, users: 3 }]
  [WARNING, { load: 3, users: 30 }]
  [FAILURE, { load: 95, users: 300 }]
  [UNKNOWN, { }, Could not find AWK executable]
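
A service checker written in Python could build such a line with a small helper. This is only a sketch: format_status is a hypothetical name, not part of Horus, and it assumes the message contains no characters (such as commas) that would need YAML quoting.

```python
def format_status(status, params=None, message=None):
    """Build a single-line status list such as "[OK, { load: 0.5 }]"."""
    allowed = ("OK", "WARNING", "FAILURE", "UNKNOWN")
    if status not in allowed:
        raise ValueError("unknown status: %r" % (status,))
    parts = [status]
    # The message is the third element, so a (possibly empty) statistics
    # mapping has to be emitted whenever a message is given.
    if params or message is not None:
        items = ", ".join("%s: %s" % (k, v) for k, v in (params or {}).items())
        parts.append("{ %s }" % items if items else "{ }")
    if message is not None:
        # Assumption: the message is plain text that needs no YAML quoting.
        parts.append(message)
    return "[" + ", ".join(parts) + "]"
```

A checker would then finish with something like print(format_status("OK", {"load": 0.5, "users": 3})).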

When a check is to be performed from a remote host, Horus opens an ssh connection to that host. It is assumed that the user Horus runs as has an ssh key set up so that it can ssh to all such hosts without being asked for a password.
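
As a rough illustration of the idea (not the actual code in horus-check.py), the command line for a check might be assembled like this. build_check_command is a hypothetical helper, and the use of ssh's BatchMode option is an assumption of this sketch, chosen so that a missing key fails quickly rather than prompting.

```python
import socket


def build_check_command(from_host, check_command):
    """Return the argv list that would run check_command on from_host."""
    local_names = (None, "localhost", socket.gethostname())
    if from_host in local_names:
        # Local check: hand the command straight to the shell.
        return ["sh", "-c", check_command]
    # Remote check: BatchMode makes ssh fail instead of asking for a password.
    return ["ssh", "-o", "BatchMode=yes", from_host, check_command]
```

The resulting list could be passed to subprocess.run(argv, capture_output=True, text=True), and the standard output then parsed as the status list.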

Having performed the checks, horus-check.py then:

  1. Reads in the status database named in the configuration file
  2. Updates the status database with the new status of hosts
  3. Computes an overall system status (the worst non-unknown status of any checked service)
  4. Examines the service dependencies, and automatically marks as 'quiet' any service whose state is no worse than might be expected (eg, no worse than the worst state of a service it depends upon)
  5. Computes a list of differences between the old and new status (services added, services removed, services whose status has improved, services whose status has worsened)
  6. If there are any differences, invokes a notification script (named in the configuration file) with them, along with the overall status
  7. Invokes a logging script (named in the configuration file) with the new value of every statistic reported by the service checks; I will soon provide a sample logging script that uses RRDTool to generate nice graphs.
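
Steps 3 and 5 above can be sketched in a few lines of Python. This is an illustration, not the code from horus-check.py: the function names are made up, the inputs are plain dictionaries mapping service names to status words, and ranking UNKNOWN as worse than FAILURE when comparing old and new states is an assumption of the sketch.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}  # UNKNOWN is deliberately absent


def overall_status(statuses):
    """Worst non-UNKNOWN status of any service, or UNKNOWN if none is known."""
    known = [s for s in statuses.values() if s in SEVERITY]
    return max(known, key=SEVERITY.get) if known else "UNKNOWN"


def diff_statuses(old, new):
    """Map each changed service name to NEW, OLD, IMPROVE or WORSEN."""
    diffs = {}
    for name in new:
        if name not in old:
            diffs[name] = "NEW"
        elif old[name] != new[name]:
            # Assumption: UNKNOWN ranks after FAILURE for comparison purposes.
            before = SEVERITY.get(old[name], len(SEVERITY))
            after = SEVERITY.get(new[name], len(SEVERITY))
            diffs[name] = "IMPROVE" if after < before else "WORSEN"
    for name in old:
        if name not in new:
            diffs[name] = "OLD"
    return diffs
```

If diff_statuses returns an empty dictionary, nothing has changed and no notification would be needed.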

The status database (which is written in YAML, so easily accessible to user scripts) can then be used to generate an HTML status report (see status.cgi).

INSTALLATION

Requires PyYAML

Copy and edit example.conf to suit your setup. Perhaps edit types.conf to add extra service types, if required, or change the commands to work on your systems.

Write your own change notification script(s), that accept a human-readable summary of the changes on stdin, and do something useful like email or SMS them on, then reference them in the notify-commands field of the configuration file.

Write your own parameter change notification script(s), that accept command line arguments like the supplied sample log.sh, and do something useful like update an RRDTool log, then reference them in the param-log-commands field of the configuration file.

Write your own scripts that parse the file specified in the status-database field of the configuration and produce funky system status displays. Try status.cgi as a starting point.

Run python horus-check.py <myconfig> at regular intervals, perhaps every five minutes from cron.

Set up status.cgi somewhere Apache will find it (edit it to point to the correct location of your status.db file) and you'll have a status report accessible via the Web. You can give GET parameters on the URL to filter the results:

All the files are in YAML format, and have fairly self-explanatory structures, although I shall document them when they stabilise...

CONFIGURATION

The configuration file is in YAML.

It has two top-level headings:

services:
config:

Under services should be a list of services to check, and under config, the paths to various other files are specified:

  services:
    - type: load
      params-ok: { load1: [0,2] } 
    - type: zombies
  config:
     status-database: status.db
     status-conf:
      status.cgi:
        rrdbase = rrd
     param-log-commands: [./rrdlog.py]
     param-log-conf:
      rrdlog.py:
        rrdbase = rrd
        step = 300
     notify-commands: [./smsnotify.py]
     notify-conf:
      smsnotify.py:
        to = 44555123456
     type-database: types.conf
     log: horus.log

Translated, that says to check the local load, counting the one-minute load average parameter (load1) returned by the load service checker as 'OK' when it is within the range 0..2 inclusive (overriding the default set in the service types file), and to check the local zombie count. Then various file names are specified in the config section, along with configuration for other components of the system: status-conf is copied verbatim into the status database file, to configure tools that process it; param-log-conf is passed to the parameter logging commands; and notify-conf is passed to the notify commands.

Note that parameter ranges may be made open-ended, by using -inf as the minimum or +inf as the maximum.

Every service declaration must specify a type, but all other fields are optional. A full list of fields used by Horus itself is given in this example:

  - type: http
    name: Main web site
    from: server1.example.com
    host: www.example.com
    quiet: False
    params-ok: { time: [0, 0.1], size: [500, 100] }
    params-warn: { time: [0, 1] }
    check-interval: 5m
    notify-conf:
      smsnotify.py:
         notify-list: [bob, james, keith]
    param-log-conf:
       rrdlog.py:
          enabled: no   # Do not make an RRD log
    status-conf:
      status.cgi:
        link: http://www.example.com/

However, any other fields mentioned are passed to the service checker itself.

The status-conf parameter of a service is copied verbatim to the status database file; it should be a list of tools that view the status database, with per-service configuration for each beneath. In this case, we ask the status.cgi Web reporting module to link this service to a specified URL. Likewise, notify-conf is passed to the notification plugins you have enabled, and param-log-conf is passed to the parameter logging plugins.

The check-interval allows us to specify a maximum checking frequency. If horus runs and the check interval has not passed since the service was last checked, then its previous status from the status database will be used. If a check interval is not specified, then one specified for the service type is used. If one is not specified there, then a global default from the configuration file is used. If one is not specified there then the system assumes a default of five minutes. If you do not override this, then horus will refuse to check anything more frequently than once every five minutes!

The check interval is specified with any combination of days, weeks, hours, minutes, and seconds. Eg, 1w1d1h1m1s means one week plus one day plus one hour plus one minute plus one second.
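
A check interval string of this form is easy to turn into seconds. The sketch below shows one way to do it; parse_interval and is_due are hypothetical names, and treating a service as due when exactly one interval has elapsed is an assumption rather than documented behaviour.

```python
import re

UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_interval(text):
    """Convert a string such as "1w1d1h1m1s" or "5m" into seconds."""
    if not re.fullmatch(r"(?:\d+[wdhms])+", text):
        raise ValueError("bad check interval: %r" % (text,))
    pairs = re.findall(r"(\d+)([wdhms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in pairs)


def is_due(last_checked, now, interval_text="5m"):
    """True if at least one interval has passed since the last check.

    Both times are in seconds; the default mirrors the five minute fallback.
    """
    return now - last_checked >= parse_interval(interval_text)
```

A service that is not yet due would simply keep its previous entry from the status database.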

Also, services may have child services. The child services are those that depend on the 'parent'; if a child's status is no worse than the worst status of the services it depends upon, then it is not considered worth notifying about, and is automatically marked as 'quiet'. Eg, a database server might have a dynamic Web site as a child service. If the database enters a WARNING state due to overload, then when the web site goes into WARNING because its response time is worsening, no alert is sent; however, if the web site went to FAILURE while the database was still just WARNING, this would generate a notification, since the web site is in a worse state than would be expected from the state of the database.

This is specified like so:

  - type: pgsql
    host: db.example.com
    user: horus
    pass: fnargle
    children:
     - type: http
       host: www.example.com
       url: /test-db.php
       success-regex: "Database is OK"
     - type: http
       host: intranet.example.com
       error-regex: "Database error"
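
The 'quiet' rule for such a tree can be sketched as a short recursive walk. This is an illustration only: the in-memory layout (each service as a dictionary with a 'status' word added next to its 'children') and the name mark_quiet are assumptions, as is the choice never to silence an UNKNOWN service automatically.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}


def mark_quiet(service, inherited_worst=None):
    """Set 'quiet' on a service and, recursively, on its children.

    inherited_worst is the worst severity among the services this one
    depends upon, or None for a top-level service.
    """
    rank = SEVERITY.get(service["status"])  # None for UNKNOWN
    explained = (
        rank is not None
        and inherited_worst is not None
        and rank <= inherited_worst
    )
    # An explicit quiet flag from the configuration is always kept.
    service["quiet"] = bool(service.get("quiet")) or explained
    known = [r for r in (rank, inherited_worst) if r is not None]
    worst = max(known) if known else None
    for child in service.get("children", []):
        mark_quiet(child, worst)
    return service
```

With the database in WARNING, a child web site in WARNING ends up quiet, while one in FAILURE does not.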

The service types file is much simpler. See the supplied types.conf for an example. Each service type has just four properties:

  zombies:
     command: |
        ps -ax | awk -- "
        BEGIN { count = 0}
        { if (\$3==\"Z\") count = count +1; }
        END { print \"[OK, { zombies: \" count \" }]\" }"
     params-ok: { zombies: [0,5] }
     params-warn: { zombies: [0,20] }
     check-interval: 1m

The command property gives the shell command to run, params-ok lists the range of resulting parameter values which are considered OK, and params-warn lists the range of parameter values which, if not otherwise considered OK, are considered worthy of a warning; any parameter value outside both ranges is considered a FAILURE. check-interval specifies a default check interval for this service type.

Note that the zombies service always inherently reports 'OK', but that this may be overridden by the system if the parameters are out of range. This is in contrast to the system Nagios uses, where each service checker plugin is responsible for having allowed ranges specified to it as command line parameters, and for computing its own resulting status by doing the range checks itself. Horus avoids duplicating effort by keeping the service checker simple, and having the system worry about acceptable ranges.
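
The range logic can be sketched as follows. The names in_range and classify are made up for this example, and two points are assumptions rather than documented behaviour: that range checks can only worsen the status the checker reported, and that an UNKNOWN result is passed through unchanged. Bounds may be numbers or the strings -inf and +inf, both of which Python's float() accepts.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}


def in_range(value, bounds):
    """True if value lies within the inclusive [low, high] range."""
    low, high = (float(bound) for bound in bounds)
    return low <= value <= high


def classify(reported, params, params_ok, params_warn):
    """Combine the checker's own status with the configured ranges."""
    if reported not in SEVERITY:
        return reported  # assumption: UNKNOWN is left alone
    worst = reported
    for name, value in params.items():
        if name in params_ok and in_range(value, params_ok[name]):
            level = "OK"
        elif name in params_warn and in_range(value, params_warn[name]):
            level = "WARNING"
        elif name in params_ok or name in params_warn:
            level = "FAILURE"
        else:
            continue  # no ranges configured for this parameter
        if SEVERITY[level] > SEVERITY[worst]:
            worst = level
    return worst
```

For the zombies type above, a count of 3 stays OK, 12 becomes a WARNING, and 50 becomes a FAILURE.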

WRITING PLUGINS

There are two kinds of plugins that horus-check.py invokes (not including anything that service check commands invoke).

Parameter logging

Service checks may optionally return parameters, which are numerical data reflecting the status of the service they monitor; eg, the exact round trip time on a ping, the number of connections currently being handled by a daemon, etc.

On every check, horus-check.py invokes any scripts listed in the param-log-commands section of the configuration file, once for each service that has parameters.

The arguments to the script are, in order, the name of the host the service check runs on, the name of the remote host the check is run against (if any), the name of the service, and the service type.

Fed into standard input of the script is a YAML structure with the following contents:

params is the params structure, as returned by the service checker; it's a dictionary mapping parameter names to their numeric values.

conf, if present, is a verbatim copy of the param-log-conf section from the configuration file's config section.

service-conf, if present, is a verbatim copy of the param-log-conf section from the service's definition in the configuration file.

By convention, the two conf sections are dictionaries mapping the name of a plugin to configuration for that plugin.

See rrdlog.py for an example of a plugin.
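
For comparison, here is a much smaller sketch of a logging plugin that just prints one text line per parameter. It is not the supplied rrdlog.py; the plugin name textlog.py, the 'enabled' setting and the output format are all made up for illustration, and it assumes that exactly four arguments are passed, with an empty string when there is no remote host.

```python
import sys
import time

PLUGIN = "textlog.py"  # hypothetical plugin name, used as the key in both conf sections


def plugin_settings(section):
    """Pick this plugin's settings out of a conf section, if it has any."""
    value = (section or {}).get(PLUGIN)
    return value if isinstance(value, dict) else {}


def log_lines(args, document, timestamp):
    """Return one text line per parameter, or [] if logging is disabled."""
    from_host, host, service, service_type = args
    # Assumption: per-service settings override the global ones.
    settings = dict(plugin_settings(document.get("conf")))
    settings.update(plugin_settings(document.get("service-conf")))
    if settings.get("enabled", True) is False:
        return []
    target = host or from_host
    return [
        "%d %s %s %s=%s" % (timestamp, target, service, name, value)
        for name, value in sorted(document["params"].items())
    ]


def main():
    import yaml  # PyYAML, which Horus already requires

    document = yaml.safe_load(sys.stdin)
    for line in log_lines(sys.argv[1:5], document, time.time()):
        print(line)

# When saved as an executable plugin script, finish with a call to main().
```

Keeping the formatting in log_lines, separate from the stdin handling in main, makes the plugin easy to try out with a hand-written dictionary.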

Notification

If any services changed status during the check, horus-check.py then runs any scripts listed in the notify-commands section of the configuration file, precisely once.

The only argument to the script is the overall system status (which is the worst status of any service, excepting UNKNOWN services).

However, standard input is fed a YAML document with the following contents:

conf, if present, is a verbatim copy of the notify-conf section from the configuration file's config section.

diffs is a dictionary of the changed services. The key is the name of the service, and the corresponding value has the following contents:

change is the type of change, and is either NEW (for a new service), OLD (for a service that has been removed from the configuration), IMPROVE (for a service whose status has improved), or WORSEN (for a service whose status has worsened).

quiet, a boolean indicating if the service is not to be announced (either explicitly in the configuration file, or because it depends on a service that's in no better state). This field always exists, except for OLD services.

conf, if present, is a verbatim copy of the service's notify-conf entry from the configuration file.

now is the current status of the service, and is present for all except OLD services.

was is the previous status of the service, present only on IMPROVEing or WORSENing services.

since is the time at which the previous state change occurred, present only on IMPROVEing or WORSENing services.

See notify.py for an example of a notification plugin.
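
The core of a notification plugin is turning the diffs structure into something readable. The sketch below is not the supplied notify.py; summarise and state_name are hypothetical names, and because the exact shape of the now and was fields is not spelled out here, it accepts either a bare status word or a status list whose first element is the status word.

```python
def state_name(status):
    """Accept either a status word or a [state, params, message] list."""
    if isinstance(status, (list, tuple)):
        return status[0]
    return status


def summarise(overall, document):
    """Return a human-readable summary of the non-quiet changes, or ""."""
    lines = []
    for name, diff in sorted(document.get("diffs", {}).items()):
        if diff.get("quiet"):
            continue  # silenced explicitly, or explained by a dependency
        change = diff["change"]
        if change == "OLD":
            lines.append("%s: removed from configuration" % name)
        elif change == "NEW":
            lines.append("%s: new service, now %s" % (name, state_name(diff["now"])))
        else:
            lines.append("%s: %s, was %s, now %s" % (
                name, change, state_name(diff["was"]), state_name(diff["now"])))
    if not lines:
        return ""
    return "\n".join(["Overall status: %s" % overall] + lines)
```

A real plugin would load the document from standard input with PyYAML, take the overall status from its first argument, and email or SMS the result if it is not empty.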

ACCESSING THE STATUS DATABASE

The status database is designed for third-party components to access, in order to generate status reports.

It is a YAML document, consisting of:

conf, if present, is a verbatim copy of the configuration file's status-conf entry from the config section.

overall reflects the overall system status. It has three fields: checked is the time of the last check, status is the overall system status (the worst service status, except for UNKNOWN services), and counts has a field for each service status, giving the number of services in that state.

status gives the detailed status of every service. It is a dictionary, with service names as the keys, and the following fields in the value:

checked is the timestamp at which this service was last checked

from is the host from which the check was run

host is the host which the check was run against (which differs from the from host in the case of network tests)

quiet is a boolean which is set if the service is marked as quiet in the configuration file, or if it depends upon a service which is in no better state.

since is the time at which the service status last changed.

type is the type of service.

status is the current service status as returned by the checker, a list of up to three elements consisting of the state name, the parameters (a dictionary mapping parameter names to numeric values), and finally, the status message string.

conf, if present, is a verbatim copy of the service's status-conf section from the configuration file.

See status.cgi for an example of a CGI script that converts the status database into an HTML display.
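
As a simpler starting point than a CGI script, the structure described above can be turned into a plain-text report with a few lines of Python. This is a sketch under the assumption that the database has already been loaded into a dictionary (for instance with PyYAML's yaml.safe_load); text_report is a hypothetical name and the layout of the report is arbitrary.

```python
def text_report(database):
    """Render the loaded status database as a plain-text summary."""
    overall = database["overall"]
    lines = ["Overall: %s (checked %s)" % (overall["status"], overall["checked"])]
    for name, entry in sorted(database["status"].items()):
        status = entry["status"]  # [state, params, message], up to three items
        state = status[0]
        message = status[2] if len(status) > 2 else ""
        flag = " (quiet)" if entry.get("quiet") else ""
        lines.append(("%-8s %s%s %s" % (state, name, flag, message)).rstrip())
    return "\n".join(lines)
```

The same loop over database["status"] is where a script would pick up per-service settings from each entry's conf field.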