How I Learned to Love Network Monitoring

I used to be surprised at how much resistance I’d get from decision makers when I’d bring up the subject of infrastructure monitoring.

This was hard for me to understand at first.  To a systems architect, having dialed-in monitoring systems in place to collect data, report on trends and alert when things are going wrong sounds like IT nirvana: a Distant Early Warning defense against whatever might menace a sysadmin’s weekend plans.

To most others, it probably sounds about as exciting as a pre-flight safety drill.  And that’s understandable: most of the time the network hums along, users are happy, and the plane doesn’t crash.

It’s when things get a little bumpy, when the floor drops beneath your feet and “Connected to…” on the status bar changes to “Trying to connect…”, that we start to want answers.  We start to demand those answers once the plane lands and we’ve spent the day without email while walking flash drives desk-to-desk.  And we’d sure like some answers before we start spending money on improvements that are simply best guesses as to what we might actually need.

At this point in my career I’ve presided over a few implementations and watched as many more met with varying degrees of success and failure.  I certainly now understand where a lot of the resistance comes from: data is often not meaningful and alerts are so numerous that the important ones get lost in a sea of noise.

Still, everyone eventually comes around to the realization that having some kind of monitoring going on is a good idea.  Once you get it right, it becomes easier to look back at what the decision-making process should have looked like.

Ultimately, the most important characteristic of the monitoring system you need is that it understands what you’re monitoring and knows what information will be relevant as a result.  It should understand this by default, because it ships with useful templates that rules apply during discovery, not because you spend an eternity telling it which things you want to pay attention to.  At the opposite end of the spectrum is over-simplified monitoring: simply pinging something to see whether it’s alive is a bit sad by modern standards.  Tools like Zenoss and Nagios can be useful when correctly deployed, but they are especially vulnerable to both of these criticisms.
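To make the idea of templates applied by rules during discovery a little more concrete, here is a minimal Python sketch.  It is not how any of the products named here actually work: the device fields, the rule list and the template names (such as "windows-server-wmi") are all hypothetical, invented purely for illustration.  The point is only that a discovered device picks up sensible monitoring by matching rules, and falls back to a bare ping check when nothing is known about it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Device:
    """A discovered device. All fields are illustrative assumptions."""
    hostname: str
    os: str    # e.g. "windows", "linux", "ios"
    role: str  # e.g. "server", "switch"


@dataclass(frozen=True)
class Rule:
    """Pairs a (hypothetical) template name with a match predicate."""
    template: str
    matches: Callable[[Device], bool]


# Hypothetical rule set: template names are made up for this sketch.
RULES: List[Rule] = [
    Rule("windows-server-wmi", lambda d: d.os == "windows" and d.role == "server"),
    Rule("linux-server-snmp", lambda d: d.os == "linux" and d.role == "server"),
    Rule("network-switch-snmp", lambda d: d.role == "switch"),
]

# Used only when no rule recognizes the device.
FALLBACK_TEMPLATE = "ping-only"


def assign_templates(device: Device) -> List[str]:
    """Return every template whose rule matches, or the ping-only fallback."""
    matched = [rule.template for rule in RULES if rule.matches(device)]
    return matched or [FALLBACK_TEMPLATE]


if __name__ == "__main__":
    for dev in (
        Device("dc01", "windows", "server"),
        Device("core-sw1", "ios", "switch"),
        Device("mystery-box", "unknown", "unknown"),
    ):
        print(dev.hostname, assign_templates(dev))
```

In this sketch the unknown device is the only one that ends up with ping-only monitoring, which is the situation you want to be the exception rather than the rule.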

By extension, how the system monitors is important as well.  SolarWinds just shelled out a huge chunk of money for N-able’s N-central platform, even though it already had a fairly mature SNMP monitoring solution in its Orion product.  Why?  I can only speculate, but my guess is that N-central isn’t limited to SNMP, and you don’t learn what you need to know about Windows systems by SNMP alone.  That takes the Windows Management Instrumentation (WMI) interface, and N-central is really good at using it.

Finally, what you get out of the system in reporting is also important: it makes the good data you collected useful.  If Cisco’s free Network Assistant were better at reporting, we wouldn’t turn to other tools when we needed to identify performance issues on a network.

When asked to choose a monitoring system, engineers will typically spend so much time worrying about collecting data that they neglect to choose a system capable of reporting on it in useful ways.  Management, meanwhile, will not appreciate the inherent shortcomings of a system that purports to deliver useful data while monitoring in a way that does not expose it (you’re going to monitor Windows servers with an SNMP crawl?  What do you mean by “uptime” exactly?).
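The “uptime” question is easier to see with numbers.  The Python sketch below is a made-up illustration, not output from any real monitoring tool: the Sample fields and the ten polling results are assumptions.  It computes availability twice over the same samples, once counting a host as up if it answers a ping and once requiring that the service users care about also responds.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One polling result. Both fields are illustrative assumptions."""
    ping_ok: bool     # the host answered a ping
    service_ok: bool  # the service users depend on answered a check


def availability(samples: Sequence[Sample], is_up: Callable[[Sample], bool]) -> float:
    """Percentage of samples that count as 'up' under the given definition."""
    if not samples:
        raise ValueError("no samples to evaluate")
    up = sum(1 for s in samples if is_up(s))
    return 100.0 * up / len(samples)


# Hypothetical data: the host always answers ping, but the service
# was down for two of the ten polls.
samples = [Sample(True, True)] * 8 + [Sample(True, False)] * 2

host_uptime = availability(samples, lambda s: s.ping_ok)
service_uptime = availability(samples, lambda s: s.ping_ok and s.service_ok)

if __name__ == "__main__":
    print(f"uptime by ping:    {host_uptime:.1f}%")
    print(f"uptime by service: {service_uptime:.1f}%")
```

With this invented data the ping-based figure reports a perfect record while the service-based figure shows that users were affected in two polls out of ten, which is exactly the gap a report needs to make visible.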

In the end, Echo wound up looking at Kaseya, GFI, Zenoss, Nagios, PRTG and N-central.  While all of these systems are capable of being deployed correctly and can deliver the goods, we ultimately settled on two, given the size of our organization and client base and the needs of one client in particular.  For that particular Linux-heavy client, we deployed Zenoss for its capacity for customization in that environment.  The rest of our client base is well served by our installation of SolarWinds N-central, which we like quite a bit: alerts are meaningful and noticed, clients have the ability to log in and see dashboards, engineers can produce simple to very detailed reports on inventory and performance metrics, and we’re able to identify the need for upgrades and changes.

We think we have the right shoe on the right foot.  Life is good, including life on weekends, for the sysadmins and CIOs alike.