Thursday, February 3, 2011

Do you have a creative use for nagios?

I'm looking for inspiration on nonstandard uses of monitoring systems like nagios, which is normally used to check whether HTTP is responding and so on. I'm curious how people have taken the simple nagios framework and run with it in unexpected ways, so I can steal (er, borrow) their ideas.

  • To get the ball rolling, one example I've heard of is a guy who set up nagios checks to monitor his forum for unhealthy activity, such as a large number of unreplied-to threads or a long mean time between posts.

    drewrockshard : Ah, nope, not really something I've looked into. Writing scripts for nagios, though, is very easy once you know how it interprets the code and arguments/status codes fed to it.
    From jldugger
  • What exactly do you mean? I've written a few scripts that monitor things other than HTTP. I've even created a very basic "URL Content" monitor of sorts that just checks for a certain chunk of text: if it finds fewer than 1 (0) instances of the text, it reports as "down", and if more than 1, it reports as up.

    Writing nagios scripts can be done in pretty much any language.

    James Lawrie : what does it do if there's exactly 1 instance of the text?
    drewrockshard : It sends out an email reporting that it failed the check, and it shows in nagios as "red" (down as nagios puts it).
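A minimal sketch of such a "URL Content" check, for illustration only and not drewrockshard's actual script. The names `evaluate`, `main` and the `min_count` threshold are assumptions; the thread leaves the "exactly 1" case open, so the threshold is a parameter here. The exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import sys
import urllib.request

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def evaluate(body, needle, min_count=1):
    """Return (exit_code, message); min_count is an assumed threshold."""
    count = body.count(needle)
    if count >= min_count:
        return OK, f"OK - found {count} instance(s) of {needle!r}"
    return CRITICAL, (
        f"CRITICAL - found {count} instance(s) of {needle!r}, "
        f"need at least {min_count}"
    )


def main(argv):
    """Fetch argv[1] and look for argv[2]; returns the plugin exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_url_content.py URL TEXT")
        return UNKNOWN
    url, needle = argv[1], argv[2]
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are OSError; a malformed URL is ValueError
        print(f"CRITICAL - could not fetch {url}: {exc}")
        return CRITICAL
    code, message = evaluate(body, needle)
    print(message)
    return code
```

To run it as a plugin, the script would end with `sys.exit(main(sys.argv))`, so that nagios sees the first line of output and the exit status.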
  • I collect performance data into RRD files. So I made some checks that read several data points from recent checks and look for changes in trends -- these scripts can be useful. It is basically an automated way to read graphs.

    sinping : I would be very interested in this. I don't suppose you would mind sharing? It would be nice to do the same thing with Cacti graphs.
    Kyle Brandt : @sinping: Here is one of them .. I think :-) Kind of from a while ago now... http://www.kbrandt.com/files/nagiosMailGrowthOverTime
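One way to "read a graph" automatically is to fit a straight line through the most recent samples and alert on its slope. The sketch below is an assumption about how such a check could look, not what the linked script does. It presumes the data points have already been pulled out of the RRD file, and the names `slope`, `check_trend`, `warn` and `crit` are hypothetical.

```python
def slope(values):
    """Least-squares slope of the values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def check_trend(values, warn, crit):
    """Map the growth rate onto Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL)."""
    s = slope(values)
    perfdata = f"slope={s:.2f}"
    if s >= crit:
        return 2, f"CRITICAL - growing by {s:.2f} per sample | {perfdata}"
    if s >= warn:
        return 1, f"WARNING - growing by {s:.2f} per sample | {perfdata}"
    return 0, f"OK - trend is {s:.2f} per sample | {perfdata}"
```

For example, `check_trend([0, 2, 4, 6], warn=1, crit=2)` returns exit code 2, while a flat series such as `[5, 5, 5, 5]` returns 0.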
  • I run an SMS gateway here that uses some USB modems. Of course, I monitor the modems and the gateway itself. Because each of our SIM cards has an allowance of 1000 free SMS per month, I also monitor the number of SMS already sent via the normal web interface of our mobile network operator (a small Perl script using WWW::Mechanize). If a SIM has no free SMS left, its modem gets deactivated by nagios; once the operator's web interface tells nagios that there are 1000 free SMS available again, the modem gets reactivated. In conjunction with nagios-grapher I have nice statistics too ...

    From m.sr
  • Perhaps another thing people might be interested in:

    I make backups of the whole infrastructure here with dirvish. After the dirvish backup finishes, I check the backup results with a small script and send them from the backup machine to the nagios machine.

    On the nagios server a passive check is defined for this. Perhaps the most interesting thing here: I defined freshness_threshold as 93600 (= 26h) and check_command as check_dummy_args!2!'Last backup cycle too long ago' (and of course check_freshness as 1). This way I get notified automatically, without polling, if a backup takes too long or didn't run.

    From m.sr
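For the "send the results to the nagios machine" half of that setup, a passive result ends up as a PROCESS_SERVICE_CHECK_RESULT external command. The sketch below only builds and writes that line. How it travels from the backup machine to the nagios server (NSCA, ssh, or something else) is not stated in the answer and is left out, and the function names and the command file path are assumptions.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    timestamp = int(time.time() if now is None else now)
    # The command file is line-oriented, so keep the output on a single line.
    clean = " ".join(output.splitlines())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}\n"
    )


def submit(line, command_file):
    """Write the line to the Nagios external command file (a named pipe).

    The location of command_file depends on the installation; pass in
    whatever command_file is set to in nagios.cfg.
    """
    with open(command_file, "w") as fh:
        fh.write(line)
```

With check_freshness enabled, nagios only runs the check_command when no such result has arrived within freshness_threshold seconds, which is what turns a missing backup report into an alert.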
  • I use nagios to monitor a high-performance-computing Linux cluster of 1100 nodes. Nagios is used to check the sshd process, hard drive SMART status, Infiniband network status, shared filesystem and disk usage. If any of these tests fail, the node is automatically taken out of the job scheduler's production pool so it can be serviced. So far, this has worked fairly well. Before Nagios was implemented on the cluster, we had many complaints that programs would fail to start or would crash immediately. After it was implemented, we had almost no complaints.

    I also use Nagios to monitor some Xen dom-U instances. If a dom-U VM crashes, Nagios automatically reboots it.

    From ryanlim
  • I have numerous passive-only services for file freshness status, and one active service for generating a report of file status. The active service executes a script which runs a report and dumps the results into the command file. That way (1) I get a notification if the report failed to run, and (2) I get a broken-out result for each of the files it ran queries against. The check runs once every 5 minutes, and file statuses update once every 5 minutes. It works very, very well.

    I use this same concept to determine which files to pull from external sources (http, ftp, etc.). Stick a script with the necessary repeat interval into Nagios that traverses directories on remote resources looking for files we need to pull. If it finds nothing, it alerts; if it finds something, it exits OK and does the work to put the pull request on our queue.

    And aside from all this, I also have numerous "how old is this file" or "how old is this directory" checks that are dumb, and I loathe them a lot.

    From VxJasonxV
  • Besides all the common and boring stuff, I have a monitor that checks whether it's SysAdmin Day and sends an alert to all my users.

    I also have plans to implement a sound alert with festival for really dangerous failures, and plans to monitor the presence of the boss in the headquarters. But they don't like to pay me for implementing pranks.

    From theist
  • I put a couple of ideas up on my blog:

    Checking that backup files are valid

    Checking that web content is up to date

  • We use Nagios to check on database load, cache hit ratios, etc. We also wrote a custom script to check the length of our job queues to alert us when they get too large.

    From Andrew C
  • Not mine, but this is the most creative use of nagios I have ever heard of. Hats off to this guy!

  • We had both Nagios and Solarwinds as our primary monitoring systems at the last place I was a NOC guy. Solarwinds was great for monitoring the Windows systems, but it was kind of flaky, so we did a lot of monitoring between the two systems to make them monitor each other. Lots of python scripts running SQL queries on the Solarwinds database to make sure that it didn't contain stale data.

    You can also exploit a Nagios "check script" to trigger a software update on a machine at regular intervals, to make sure it is using the current version of whatever you want.

    On our NFS servers, there was no specific set of mounts that was permanently "correct," so the file server check scripts were set up to always issue an alert whenever the list of exported filesystems changed. That way, the guys running those machines always got notified when something was added or removed. If they were working on the machine at the time, they would ignore the alert. If they weren't, they would fix it. The "alert on delta" instead of "alert on state" concept helped reduce some of our communications overhead for that sort of stuff.

    We had 24 hour NOC monkeys to watch everything, so we also had a periodic "email is working" message that they would get according to schedule, and they would manually panic if none of the automated monitoring had noticed email was broken. That sort of thing is easy to set up as a "check script" even if an OK return value from the script doesn't tell you for sure that everything is okay. If you don't have the spare bodies to check this manually, you can also have a "send email" check script and a "check email" check script that work in unison, with the check email script alerting on high delivery latencies. It's not as complete a guarantee that the system is working end-to-end as having somebody actually reading it on their Blackberry and Outlook, but it covers the majority of possible problems.

    A lot of Nagios stuff is really going to be site-specific "see an itch, scratch an itch" kind of stuff. You just have to be a bit of a practical dreamer.

    From wrosecrans
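The "alert on delta" idea for the NFS exports can be sketched by remembering the previous list and comparing against it on each run. This is an illustration under assumptions, not wrosecrans's scripts: how the current export list is obtained is left out, and `diff_exports`, `check_delta` and the JSON state file are hypothetical.

```python
import json
from pathlib import Path


def diff_exports(previous, current):
    """Return (added, removed) between two lists of exported filesystems."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed


def check_delta(current, state_file):
    """Alert when the list differs from the last run, then store the new list."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else None
    # Overwrite the state on every run so each change alerts exactly once.
    path.write_text(json.dumps(sorted(current)))
    if previous is None:
        return 0, f"OK - baseline recorded with {len(current)} export(s)"
    added, removed = diff_exports(previous, current)
    if added or removed:
        return 1, f"WARNING - exports changed: added={added} removed={removed}"
    return 0, f"OK - {len(current)} export(s), unchanged since last check"
```

Because the state file is rewritten on every run, the check returns to OK on the next pass even if nobody touches the machine, which matches the "ignore it if you were working on the box" workflow described above.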
