Network monitoring in a hybrid world: Talking with SolarWinds’ Chris O’Brien

Network monitoring and troubleshooting have evolved quite a bit since the days when I used to scroll through endless events in log files or slap a Fluke on a cable. Today’s business networks often take a hybrid approach, combining on-premises servers with applications and services provisioned from the cloud. And with rapidly changing business requirements and the need for businesses to become agile, software-defined networking (SDN) has been transforming enterprise networking even further by separating the physical network from its logical overlay. How can one best monitor today’s business networks in order to maintain optimal performance and resolve problems when they occur? And where is network monitoring headed as IT infrastructure continues to evolve? I recently talked about these things with Chris O’Brien, the Product Manager for SolarWinds Network Performance Monitor (NPM). Chris spent most of his career as a network engineer. He joined SolarWinds in 2014 to help build the future of network monitoring.

MITCH: Thanks, Chris, for agreeing to let me interview you about the network management product line at SolarWinds.

CHRIS: Sure thing. I love talking about network monitoring!

MITCH: Let’s start off with something general. There are a lot of companies in the network monitoring space. Could you give us a brief introduction to SolarWinds and what makes your company different from the others?

CHRIS: SolarWinds was founded in 1999 by two network engineers who were looking for better tools to do their jobs, so they built them. They focused on building simple, powerful tools that just worked. Turns out, that’s a good formula. Lots of people wanted the tools, so they created SolarWinds.

The founders’ engineering spirit runs deeply through the company. SolarWinds is, in many ways, the antithesis of traditional enterprise software. Instead of having to talk with a salesperson to get a glimpse of a tool, you can use the online demo or download a fully functional trial. Instead of having to go through a protracted budgeting, quoting, and pitching process that often involves golf, airplanes, and the CTO, most of our tools can be bought immediately with budget the engineer has discretion over. Instead of paying for extensive professional services to get the tool up and running, the tools are built so engineers can do it themselves. This mantra of easy to try, easy to buy, and easy to use is at the heart of SolarWinds.

MITCH: From my conversations with IT pros who work both in enterprise environments and for cloud services providers, it sounds like software-defined networking (SDN) is becoming more and more popular these days. What’s the best way to monitor SDN environments?

CHRIS: SDN is definitely getting more popular, which is super exciting. I get asked about SDN at just about every event I go to, and especially at Cisco Live! As a result, I get to talk to a lot of folks about how their SDN implementation is going, what works and what doesn’t, how they’re monitoring today, and how they’d like to monitor tomorrow. You can think of SDN monitoring in two layers. The first layer is the physical layer: things like ports, CPUs, RAM, power supplies, and network cables. It’s not glamorous, but your frames and packets still flow on these pieces of hardware, and that hardware still has to work. The second layer is the logical layer: the SDN overlay. SDN organizes connectivity into logical components that define what on this physical network is logically connected. In Cisco ACI parlance, these are things like tenants, EPGs, fabrics, and contracts.

Both of these layers are super important. If either one fails, connectivity fails. You want to make sure you’re monitoring each one. We’ve had NPM customers covered for the first layer for quite some time now. In our latest release, NPM 12.4, we added support for the second layer with Cisco ACI. We’ll query the SDN controller, which Cisco calls the APIC, via API to discover and monitor the logical layer. Regardless of what monitoring solution you have, make sure both layers are covered!
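As a rough illustration of what querying the controller for the logical layer can look like, the Python sketch below builds a class-query URL and pulls tenant names out of a reply. It assumes the APIC REST convention of `/api/class/<class>.json` and a reply that wraps objects in an `imdata` list. The host name and the sample payload are made up, authentication is left out, and none of this is taken from how NPM itself polls the APIC.

```python
import json


def class_query_url(apic_host, class_name):
    """Build a class-level query URL (assumed APIC REST convention)."""
    return f"https://{apic_host}/api/class/{class_name}.json"


def tenant_names(response_body):
    """Return the tenant names found in an fvTenant class-query reply."""
    payload = json.loads(response_body)
    names = []
    for item in payload.get("imdata", []):
        attrs = item.get("fvTenant", {}).get("attributes", {})
        if "name" in attrs:
            names.append(attrs["name"])
    return names


# Hand-written sample shaped like a controller reply; not captured from a real APIC.
SAMPLE = json.dumps({
    "totalCount": "2",
    "imdata": [
        {"fvTenant": {"attributes": {"name": "common", "dn": "uni/tn-common"}}},
        {"fvTenant": {"attributes": {"name": "prod", "dn": "uni/tn-prod"}}},
    ],
})

if __name__ == "__main__":
    print(class_query_url("apic.example.com", "fvTenant"))
    print(tenant_names(SAMPLE))
```

The same pattern would apply to other logical objects such as EPGs or contracts by swapping the class name, while ports, CPUs, and power supplies stay with whatever already covers the physical layer.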

MITCH: Utilizing public cloud services and implementing hybrid cloud infrastructures has brought many changes in the way most businesses and organizations “do IT.” How does using the cloud change how network monitoring is done?

CHRIS: Yeah, most IT shops today have to deal with both on-prem and cloud infrastructure. By nature, you have less access and control over cloud infrastructure. This is both good and bad. To the extent that the infrastructure runs well without your constant attention, it’s great. When it doesn’t, you’re a bit stuck. The customers I talk to tell me they’re responsible for their IT services regardless of where they are. This makes the lack of control and even simple visibility pretty painful during an outage.

We’ve been thinking a lot about this over the past couple of years. Both Amazon and Azure have APIs to query monitoring information about that infrastructure. SolarWinds Server & Application Monitor (SAM) supports both. Agents can still run on VM-based IaaS, which can be combined with the API data for a more complete picture. This is a big change vs. predominately WMI-based monitoring of Windows machines and SNMP for Linux/Unix.

The network side presents a different challenge. Cloud environments offer near zero visibility into their network infrastructure. This is true for SaaS apps, IaaS, and the service providers that are your transit to them. Historically, traceroute was the go-to tool to investigate these sorts of problems, but it isn’t allowed through most firewalls and doesn’t work with multipath, which is most of the internet today. To try to solve this problem, we built our own implementation that uses a packet driver to create packets and listen to responses. That’s what powers NetPath, a feature in NPM that discovers the network path from your source to any network destination, local or remote, your gear or someone else’s, along with hop-by-hop performance. We’ll have to keep coming up with new technologies like this as the infrastructure changes.
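To make the multipath point concrete, here is a minimal Python sketch of why a single trace is not enough: it merges several traced paths into one graph and keeps a median round-trip time per hop. The data layout, the node names, and the `merge_paths` helper are all hypothetical, and this is not a description of how NetPath works internally.

```python
from collections import defaultdict
from statistics import median


def merge_paths(traces):
    """Merge several traced paths into one graph.

    traces: list of paths, each an ordered list of (node, rtt_ms) tuples
    from source to destination.
    Returns (edges, latency), where edges maps (from_node, to_node) to the
    number of traces that used that link, and latency maps each node to
    its median round-trip time.
    """
    edges = defaultdict(int)
    samples = defaultdict(list)
    for path in traces:
        for node, rtt in path:
            samples[node].append(rtt)
        # Pair each hop with the next one to record the link between them.
        for (a, _), (b, _) in zip(path, path[1:]):
            edges[(a, b)] += 1
    latency = {node: median(rtts) for node, rtts in samples.items()}
    return dict(edges), latency


# Made-up probe results: two traces went via isp-a, one via isp-b.
TRACES = [
    [("gw", 1.0), ("isp-a", 9.0), ("app", 20.0)],
    [("gw", 1.2), ("isp-b", 14.0), ("app", 26.0)],
    [("gw", 0.8), ("isp-a", 11.0), ("app", 22.0)],
]

if __name__ == "__main__":
    edges, latency = merge_paths(TRACES)
    print(edges)
    print(latency)
```

With the sample data, the merged graph shows both branches between `gw` and `app`, which a single traceroute run would have hidden.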

MITCH: Are there any other trends you see happening in network monitoring?

CHRIS: Yes, two: user focus and API polling.

It’s becoming clear that our industry has been too focused on the infrastructure and not focused enough on the user. It’s natural, since we’re all geeks. I like watching the lights on a 300-pound chassis switch as much as anyone. If I’m honest, part of why I became an engineer is that I would rather interact with computers than with people. Still, the purpose of the network is connecting users to apps, and that has to become a bigger part of how we judge whether the network is providing good service or not. There are a lot of ways to do this. You don’t have to buy a product to do it. It’s mostly about mindset. What do your users care about? What does good performance look like for them? What does bad performance look like? How can you measure it? The S in SLA is service, not infrastructure. Check out NetPath for a good example, but again, this is a mindset shift more than a tooling switch.

SNMP has provided a ton of visibility into systems for a long time but is getting long in the tooth. SNMP is not particularly reliable, isn’t good at sending bulk amounts of data, and supports very limited interaction. NETCONF is great, but I’m just not seeing the adoption amongst manufacturers for it to be as useful as it could be. API is starting to take hold. In the last couple of years, we’ve spent more time building API-based monitoring than SNMP-based monitoring. On the downside, API is less consistently implemented and tends to be complex. In the end, we get the data wherever we can, while thinking through performance, scale, security, and effort required from the user. It’s becoming more often the case that API is the right way to get the data.

MITCH: I’ve noticed that SolarWinds talks a lot about network insight. What is that exactly?

CHRIS: The networks of 10 or 15 years ago were predominately switches and routers. If your switches and routers were running well, you were doing your job as a network engineer. Today, that is not the case. Advanced network appliances like firewalls, load balancers, WAN optimizers, and web proxies are often run by the network team and provide absolutely critical network services. Unfortunately, most tools, including SolarWinds tools some years ago, only know how to do a good job at monitoring routers, switches, and more recently, wireless gear. The data you need to understand the health and performance of a router or a switch is not the same data you need to understand the health and performance of a firewall or a load balancer. Adding one or two metrics won’t fix that. You have to look at the role the device plays in the network and ask how you can measure the device’s performance in that role. Network Insight strives to do that deep-dive, from-the-ground-up monitoring for these underrepresented devices. It’s time-consuming, but we’ve so far released Network Insight for F5 LTM and GTM, Network Insight for Cisco ASA, and Network Insight for Cisco Nexus. We have a lot more work to do!

MITCH: Many network and system administrators still struggle a lot with alert fatigue. What can be done to help alleviate this condition?

CHRIS: The first thing to do is step back and realize this is a human problem. It’s not just your team’s problem or just an IT problem. We see alert fatigue in hospitals, for instance. Even when human lives are at risk, noisy alerts can cause alert fatigue, which will cause humans to ignore alarms. If humans can’t bring themselves to always pay attention to noisy alerts when a human life is at risk, you can bet they can’t do it for network infrastructure!

You have to level up the quality of your alerts. There has been a lot of great material written and presented on this subject, so I’d suggest doing some online research. However, I can provide a framework to help you break down the problem. Alert fatigue is caused by too many alerts and alerts that are hard to consume.

Too many alerts can be fixed by only sending actionable alerts (no informational alerts!), reducing systems that produce a disproportionate number of the alerts, and introducing redundancy that makes urgent alerts less likely. Alerts that are hard to consume can be fixed by adding contextual information to the alert, automating remediation steps, and making sure alerts provide a clear explanation of the problem. There’s a lot more to be said here, but as you break the problem into smaller pieces, it becomes easier to come up with ideas on how to improve.
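Three pieces of that framework lend themselves to a short Python sketch: dropping informational alerts, spotting the systems that produce a disproportionate share of the noise, and attaching context before an alert reaches a human. The alert fields, severity labels, and inventory lookup are assumptions made up for the example, not the schema of any particular monitoring product.

```python
from collections import Counter


def actionable(alerts):
    """Keep only alerts someone is expected to act on (drop informational ones)."""
    return [a for a in alerts if a["severity"] != "info"]


def noisiest_sources(alerts, top=1):
    """Return the sources that emit the most alerts, as (source, count) pairs."""
    return Counter(a["source"] for a in alerts).most_common(top)


def with_context(alert, inventory):
    """Return a copy of the alert enriched with what is known about its source."""
    return {**alert, **inventory.get(alert["source"], {})}


# Made-up alerts and inventory records for illustration only.
ALERTS = [
    {"source": "sw-core-1", "severity": "critical", "message": "Uplink down"},
    {"source": "ap-lobby", "severity": "info", "message": "Client joined"},
    {"source": "fw-edge", "severity": "warning", "message": "CPU at 91%"},
    {"source": "ap-lobby", "severity": "info", "message": "Client left"},
    {"source": "ap-lobby", "severity": "warning", "message": "Radio reset"},
]
INVENTORY = {"sw-core-1": {"site": "HQ", "owner": "network-team"}}

if __name__ == "__main__":
    print(len(actionable(ALERTS)), "of", len(ALERTS), "alerts are actionable")
    print(noisiest_sources(ALERTS))
    print(with_context(ALERTS[0], INVENTORY))
```

In this sample, the lobby access point accounts for three of the five alerts, which makes it the first candidate for tuning.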

MITCH: Just one more question for you. If you could give one piece of advice to people on how they can improve their network monitoring strategy, what would it be?

CHRIS: It took me years and years of being a network engineer to realize that there is no perfect network architecture. Every design has strengths and weaknesses. The same holds true in monitoring. The best performing monitoring environments I see use a combination of polling, synthetic probing, real user monitoring, and events. For example, real user monitoring gets you data that is more reflective of the user experience, but it does not have the consistency of synthetic probing. Synthetic probing tells you when performance degradation occurs, but not where the root cause is. Events give you the most timely information, but not a lot of context. Use each of these technologies to achieve what that technology is best at.
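The consistency of synthetic probing is what makes a simple baseline comparison workable. The Python sketch below flags degradation when the median of recent probe timings exceeds the baseline median by a chosen factor. The `degraded` helper, the 1.5 factor, and the sample timings are illustrative assumptions, and as noted above, a check like this says when performance dropped, not where the root cause is.

```python
from statistics import median


def degraded(baseline_ms, recent_ms, factor=1.5):
    """Return True when recent probe timings are clearly worse than baseline.

    The factor is an arbitrary illustrative threshold, not a recommendation.
    """
    return median(recent_ms) > factor * median(baseline_ms)


# Made-up response times, in milliseconds, from a repeated synthetic probe.
BASELINE = [20, 22, 21, 19, 20]

if __name__ == "__main__":
    print(degraded(BASELINE, [31, 35, 29]))
    print(degraded(BASELINE, [22, 24, 21]))
```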

MITCH: Chris, thanks very much for giving us some of your valuable time!

CHRIS: Happy to! Thanks for having me.
