Lately, I’ve been spending more time back in the performance testing world.  The problem is that the testing we’re doing is taking days and in some cases, weeks to run.  No matter how diligent you are, you’re not going to be staring at your CPU / memory graphs the whole time.  The question is, what Open-Source tool should I be using to collect my metrics?

Previously, I’ve always used nmon and nmon analyser to collect and inspect ( respectively ) my metrics.  There are a few issues with it, however, the most glaring of which is that the analyser tool is an Excel macro ( gross, out comes the Windows VM ).  More recently I’ve been using cacti, which is a great tool for collecting system metrics, but the rrdtool defaults are a bit weak on data retention.  Basically, you end up losing your full-resolution data after 24 hours as it gets averaged away.  Now, granted, I can modify cacti to increase the number of data points stored, but there are a few issues:

  1. Data Collection is kludgy
  2. The server has a LOT to do
  3. The interface is beginning to age
  4. Adding a new host is kind of like pulling out your wisdom teeth

So, dentistry aside, I found Ganglia.  Under the covers, ganglia is really using the same database technology as cacti ( rrdtool ), but, the defaults are changed in one simple place.  In 30 seconds, I had reconfigured ALL rrd databases and metrics to store 3 years of full data points.  Pretty simple and powerful.
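That "one simple place" is the `RRAs` directive in gmetad.conf.  As a sketch ( the row count here assumes a 15-second polling interval; adjust the math for your own step size ):

```
# gmetad.conf -- hypothetical retention tuning
# 3 years of full-resolution points at a 15s step:
# 3 * 365 * 24 * 3600 / 15 = 6,307,200 rows
RRAs "RRA:AVERAGE:0.5:1:6307200"
```

Note this only applies to RRD files created after the change; existing databases keep their old archives unless you recreate them.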

The big win for me though was provisioning.  The environment I’m working in has a new machine showing up each day ( or an old machine re-purposed ), so setup needs to be quick.  With Ganglia, there are two methods for doing this:

  1. Multicast ( The default)

It is what it sounds like.  You turn on the data collector on a host and before you even know it… your host is in the web interface.  This is really great when dealing with large clusters ( http://coen.boisestate.edu/ece/raspberry-pi/ ) in a lab where boxes come in and out before you know it.
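In multicast mode, every gmond both sends and listens on the same multicast group, so every node has the full cluster state and any of them can answer the web front-end.  A minimal gmond.conf sketch using the stock multicast group:

```
/* gmond.conf -- multicast mode ( the shipped defaults ) */
udp_send_channel {
  mcast_join = 239.2.11.71   /* default Ganglia multicast group */
  port       = 8649
  ttl        = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port       = 8649
  bind       = 239.2.11.71
}
tcp_accept_channel {
  port = 8649                /* gmetad polls cluster state here */
}
```

Because this is the default, a brand-new host with an untouched config joins the cluster the moment gmond starts.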

  2. Unicast ( The reality )

Multicast doesn’t work in EC2, or in most corporate networks for that matter.  Your production environment is 4 firewalls and 9 routers from where your graphing node is.  The configuration for this mode is a bit more up-front work, but once you get it set up, you just start the collector daemon and it connects to the mothership and does the rest ( provisioning, initial graphing, etc… ).
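In unicast mode you point every gmond at a designated aggregator host, and gmetad polls that one host for the whole cluster.  A sketch, where the hostname `ganglia.example.com` and cluster name `prod` are placeholders for your own environment:

```
/* gmond.conf on each monitored host -- unicast mode */
cluster {
  name = "prod"
}
udp_send_channel {
  host = ganglia.example.com   /* your aggregator node */
  port = 8649
}
/* the aggregator's gmond also needs:
udp_recv_channel { port = 8649 }
tcp_accept_channel { port = 8649 }
*/
```

```
# gmetad.conf on the graphing node
data_source "prod" ganglia.example.com:8649
```

The per-host config is identical across the fleet, which is what makes provisioning a new machine a matter of installing the package, dropping in the file, and starting the daemon.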


If you’re looking for a monitoring solution that keeps all your data points, is easy to provision, and is open-source… gotta go Ganglia!