Alarms are meant to be noticed. They are supposed to alert you about something. Usually about something alarming. Most alarms don’t work as intended, however, and there are various reasons for this. Some alarms are too loud, and if they’re too loud you may not be able to think and respond properly. I experienced this years ago when we lived in a small apartment and one night the fire alarm went off unexpectedly. The sound in our apartment must have been 110 decibels, ringing so loud we woke up and couldn’t think. Other alarms can be too sensitive and ring when something only marginally hazardous happens, something you could schedule for later instead of running around immediately trying to find a fire to put out. Other alarms may tell you there’s a problem but not what the nature of the problem is or how to deal with it.
There are lots of other alarms that can fail to function properly when they’re needed or fail to produce the intended result when they’re triggered. Monitoring enterprise networks is one area where alarms are often used but just as often fail in their intended purpose. Alarms can alert administrators to problems with network reachability, performance, and integrity. There are lots of very good network monitoring solutions available from different vendors, and recently I interviewed the product manager of one of these solutions, one that I often recommend to my colleagues. That got me thinking afterward about what it means to design, implement, and use alarms and alerts properly for monitoring network problems. After discussing this with some experienced colleagues in our IT profession, we came up with three things you should think about when you plan on using alarms to alert you to problems happening on your network. You can call them criteria or best practices for utilizing network monitoring alarms. These are just our own combined takes on this subject, so if you have any other suggestions or recommendations as you read them, feel free to add them using the comments section at the end of this article. Share your expertise with our TechGenix IT professional community!
Network monitoring: Design for actionability
When you design or configure an alarm or alert to identify a condition of your network or a change in its condition, the alert should indicate something actionable. In other words, the representation of the alert (which may be an icon, a dialog box, a text message, a flashing UI element, an audible alarm, or some combination of these) should also indicate in some way what action should be taken when the alert appears or the alarm sounds. A small blinking red light beside a line of icons on your screen may look serious, but what does it mean to the network operator watching it? Realizing only that “something’s wrong” can create panic and lead nowhere instead of leading to a quick resolution of the problem. An alarm should include some description of the kind of problem and an explanation, or a link to an explanation, of how best to approach and hopefully resolve the situation. If you create an alarm for a condition you have no idea how to resolve, you’re just creating more problems for the operators of your network.
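To make this concrete, here is a minimal sketch in Python of an alert record that cannot be created without a problem description and either a first step or a link to an explanation. Everything in it is an assumption made for illustration: the `Alert` class, its field names, the device name, and the runbook URL are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A hypothetical alert record; the field names are illustrative only."""
    source: str             # device or service that raised the alert
    problem: str            # short description of what is wrong
    first_step: str = ""    # what the operator should try first
    runbook_url: str = ""   # optional link to a longer explanation

    def __post_init__(self):
        # Refuse to create an alert that tells the operator nothing useful.
        if not self.problem.strip():
            raise ValueError("alert needs a description of the problem")
        if not self.first_step.strip() and not self.runbook_url.strip():
            raise ValueError("alert needs a first step or a runbook link")

    def render(self) -> str:
        """Format the alert as the text an operator would see."""
        lines = [f"[{self.source}] {self.problem}"]
        if self.first_step:
            lines.append(f"First step: {self.first_step}")
        if self.runbook_url:
            lines.append(f"Runbook: {self.runbook_url}")
        return "\n".join(lines)


# Hypothetical example values, made up for this sketch.
example = Alert(
    source="core-switch-1",
    problem="Uplink port down",
    first_step="Check the cable and the port status on the upstream router",
    runbook_url="https://wiki.example.com/runbooks/uplink-down",
)
```

Calling `example.render()` gives the operator the problem, a first step, and a link in one message. Rejecting alerts with no guidance at creation time is one possible design choice; a real system might log a warning instead.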
Avoid too much aggregation
No one likes to be aggravated. That fire alarm in our apartment that I opened this article with was a truly aggravating experience for us! But I’m not really interested in talking about feelings here. I’m referring to aggregation, not aggravation! And it’s easy to confuse these two words, isn’t it? Or should I have said its easy, not “it’s” easy? I always get that wrong.
My point, however, is that one big alarm is usually not better than many smaller alarms. Problems often come in multiples, with many things going wrong at once. They may have started when one thing went wrong: one part failed, or one individual blinked and didn’t watch where they were going. But the result is that after a few seconds there may be lots of things going wrong on your network: traffic getting routed to nowhere, rising latencies causing applications to fail and services to time out, users calling the support desk to open tickets, customers phoning 800 numbers to complain about orders not being processed. If you aggregate several alarms so the big red flashing icon only appears when several of these conditions reach a certain threshold, then you may have already missed the first trigger that led to all the others. This is why it’s usually best to keep alarms granular. Not too granular, since you don’t want to get an alert about every single packet failure or failed logon happening on your network. But you want to know at least when and where the failure started, and what first step you should take to try to figure out what went wrong and how to fix it, which brings us back to designing for actionability.
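The cost of aggregation can be illustrated with a small Python sketch. It compares the time of the first granular event with the time at which an aggregated alarm would fire if it waited for failures at several distinct locations. The `Event` class, the helper functions, the device names, and the timings are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One granular alarm event; all fields are hypothetical."""
    time_s: float   # seconds since the start of the incident window
    where: str      # device or service reporting the failure
    what: str       # condition that fired


def first_trigger(events):
    """Return the earliest event, the likely start of the cascade."""
    return min(events, key=lambda e: e.time_s) if events else None


def aggregate_fires_at(events, threshold):
    """Return the time at which an aggregated alarm would fire.

    The aggregated alarm fires once `threshold` distinct locations
    have reported a failure; returns None if that never happens.
    """
    seen = set()
    for event in sorted(events, key=lambda e: e.time_s):
        seen.add(event.where)
        if len(seen) >= threshold:
            return event.time_s
    return None


# Made-up events, deliberately listed out of order.
events = [
    Event(4.0, "order-service", "request timeouts"),
    Event(0.5, "edge-router-2", "route withdrawn"),
    Event(2.0, "app-gateway", "latency above limit"),
]

start = first_trigger(events)
# Seconds lost by waiting for three locations to fail before alerting.
delay = aggregate_fires_at(events, threshold=3) - start.time_s
```

With these invented numbers, the granular alarm points at `edge-router-2` at 0.5 seconds, while the aggregated alarm stays silent until 4.0 seconds, by which point the original trigger is buried under its consequences.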
Don’t automate too early
Automation is the opiate of the system administration crowd. One of my all-time favorite blogs is The Lazy Administrator, which shows sysadmins how to keep their systems and networks running with the least amount of effort. Automation can provide many benefits to IT and to the other services of an organization. But when it comes to network monitoring, too much automation can backfire, especially if you add all your automation right at the beginning, either in the design of the monitoring product itself and its default configuration, or when you deploy and configure the product on your network.
That’s because networks change. They grow and evolve as new devices, technologies, and purposes are introduced, just as your business itself changes over time. Automation also can’t anticipate every possible situation, and when an unexpected event occurs an operator should be able to configure a one-off alarm or alert to deal with possible repeats of the situation in the future. Operators also like to customize monitoring systems to suit their own personalities, biases, and expertise, so shared monitoring consoles may have different automation requirements for the different operators who use them. And when one operator finds something that another operator has tuned in a way they don’t like, they may change the configuration without updating the documentation properly. This can lead to confusion that slows down response time when something serious happens on your network.
Sounding the alarm over network monitoring alarms
So, don’t be hasty when using alarms to monitor your network. Make sure your alarms are actionable and alert you to real problems, the key ones. Avoid anything too loud or overwhelmingly aggregated (or aggravating!), and don’t tie everything together too early hoping you can automate the solution instead of just the alert.
Featured image: Shutterstock