With modern distributed environments comprising containers, microservices, cloud-native components, and more, they say it’s not really a question of if something will break, but rather when. In other words, good operational ability is no longer defined by avoiding incidents but by bouncing back from them swiftly and effectively. Today, with DevOps, site reliability engineering (SRE), CD, shift-left, and an overall “blameless” culture, you can’t really point fingers anymore, and operational responsibility is shared by everyone. Additionally, we now live in a world that’s always on, and user expectations are at an all-time high as far as acceptable service levels go. Where incident management was once about debugging code, it’s now about keeping your services running and being able to respond to and recover from any unforeseen circumstances. This, along with other factors like the pandemic, the recession, and the fact that this is an election year, has led to an unprecedented number of incidents in 2020.
Here’s a look at some startups putting incident management front and center.
The number of startups popping up in this sector is encouraging. We’re going to take a closer look at what they bring to incident management, starting with California-based startup StackRox. Founded in 2014, StackRox earlier this month announced that it had raised $26.5 million in its Series B-1 round of funding and achieved a 240 percent increase in revenue in the first six months of this year. StackRox is a Kubernetes-native container security platform that addresses incident management and vulnerability discovery, risk profiling, compliance, visibility, and more.
In May, StackRox added several new incident management features to its platform, all of which, needless to say, stem from its Kubernetes-native architecture. The new features include timeline views that provide users with a chronological view of runtime events, federal benchmark checks that help with compliance, analyst notes for adding annotations to metadata, and advanced policies that use Boolean operators for more flexibility. What’s interesting is that all these new features have been developed in collaboration with existing customers, both from the enterprise and the federal government.
StackRox also features an automated incident response system that allows you to preset a range of responses, from a simple alert to actually terminating an affected pod or container. The cause of the incident is determined using anomaly detection and forensic capabilities that allow StackRox to get to the root of each incident with ease. In addition to integrating natively with PagerDuty, Splunk, Sumo Logic, and Google Cloud Security Command Center, StackRox also uses Istio to visualize networks and enforce network policies.
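To make the idea of preset responses concrete, here is a minimal sketch of a severity-based escalation table. It is not StackRox’s API: the severity levels, the action names, and the choose_response helper are all hypothetical and exist only to illustrate how a range of responses, from an alert to terminating a pod, could be mapped to incidents ahead of time.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a real platform defines its own."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Incident:
    pod: str
    severity: Severity


# Hypothetical presets, strongest first: (minimum severity, action name).
RESPONSE_PRESETS = [
    (Severity.CRITICAL, "terminate_pod"),
    (Severity.HIGH, "notify_on_call"),
    (Severity.LOW, "alert"),
]


def choose_response(incident: Incident) -> str:
    """Return the strongest preset action whose threshold the incident meets."""
    for threshold, action in RESPONSE_PRESETS:
        if incident.severity >= threshold:
            return action
    return "ignore"


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, "->", choose_response(Incident("web-1", sev)))
```

Keeping the presets in a plain table rather than in nested conditionals means the escalation ladder can be reviewed and changed without touching the matching logic.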
Another organization that doesn’t believe in wasting time trying to prevent an attack, stating the next attack is “inevitable,” is Israel-based startup Mitiga. In July, Mitiga emerged from stealth mode to raise $7 million in seed funding toward its incident response solutions and services. As opposed to organizations spending a lot of time and energy on protection, Mitiga specializes in managing post-hack environments, helping users navigate through incidents, and accelerating the return to business as usual.
Mitiga isn’t just about its managed services or incident response technology customized to each client’s requirements. While a lot of knowledge and expertise is indeed baked into its tech stack, an emergency operation center or “red” team is always standing by to deal with active incidents like breaches and malware attacks. This team comprises cybersecurity specialists who carry out assessments, penetration testing, and forensic investigations, and even prep PR teams to deal with the aftermath of a breach.
Now when we said Mitiga doesn’t believe in prevention or protection, this doesn’t mean they’re sitting around waiting for an attack to occur. Mitiga’s brand of incident response is termed offensive readiness. Much like Netflix with its Chaos Monkey, Mitiga believes that modern environments require an aggressive approach and, as such, brings a blend of enterprise services and military-grade security to the world of incident management. The military background here comes from CEO Ariel Parnes, who was not only a colonel in Israel’s elite 8200 cyber unit but also the commander of the Cyber Special Ops force.
Next on our list is another startup taking a page out of the Netflix Chaos Monkey program and using rather aggressive tactics to prepare against hacks and breaches. Gremlin, based out of San Jose, Calif., announced in September 2018 that it had raised $18 million in its Series B round of funding. Chaos engineering is basically the process of attacking your own system and “breaking things on purpose” to desensitize and acclimatize teams to malfunctions, as well as high-stress situations.
The Chaos Monkey tool that randomly terminates instances, along with the Simian Army, was Netflix’s take on Chaos engineering. Lorne Kligerman, director of product at Gremlin, was quoted comparing Chaos engineering to a vaccine that “injects controlled harm to build immunity,” and of course, resilience. While hardly anyone has the kind of resources at their disposal that Netflix does, Gremlin announced in February 2019 a free Chaos-Monkey-as-a-Service for all organizations building resilient web applications. In April 2019, Gremlin announced integration with Spinnaker CD, which was followed by announcements of Windows and Kubernetes compatibility later that same year.
While the concept of breaking things on purpose might sound easy, the focus here is on inducing “controlled” failure to build resilience. Gremlin achieves this by providing users with a controlled environment where they can slowly stack up errors like a big house of cards until something breaks. What’s equally important, and what Gremlin also provides users with, is the ability to gradually dial back those errors when something does break so that observations, assessments, and appropriate adjustments can be made. Gremlin also organizes a Chaos Conference in October every year that can be joined for free.
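The stack-up-then-dial-back loop described above can be sketched in a few lines. This is a toy simulation, not Gremlin’s product or API: the run_ramped_experiment function, the fault levels, and the is_healthy probe are all assumptions made up for illustration.

```python
from typing import Callable, Dict, List, Optional


def run_ramped_experiment(
    levels: List[int],
    is_healthy: Callable[[int], bool],
) -> Dict[str, Optional[int]]:
    """Raise the injected fault level step by step and stop when something breaks.

    levels: increasing fault intensities (for example, percent of requests delayed).
    is_healthy: hypothetical probe reporting whether the system copes at a level.
    """
    last_safe_level = 0  # 0 means no fault is being injected
    for level in levels:
        if not is_healthy(level):
            # Something broke: stop escalating and report the level to dial back to.
            return {"broke_at": level, "last_safe_level": last_safe_level}
        last_safe_level = level
    return {"broke_at": None, "last_safe_level": last_safe_level}


# Simulated system that tolerates up to 30 percent of requests being delayed.
result = run_ramped_experiment([10, 20, 30, 40, 50], lambda level: level <= 30)
print(result)  # {'broke_at': 40, 'last_safe_level': 30}
```

The point of recording the last safe level is the “controlled” part: when the experiment finds a breaking point, the team knows exactly how far to dial back before observing and adjusting.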
Last on our list, we have a Palo Alto, Calif.-based startup with a name that you can’t help but love. Blameless Inc., founded in 2017, provides users with the first end-to-end SRE platform that uses AI to autodetect and resolve incidents. Like Mitiga, Blameless was also developed in stealth mode while catering to a select group of early users. However, it emerged with an SRE platform that was ready to ship, as well as $20 million in funding.
While the focus of DevOps teams is often on quick and frequent releases, as we mentioned at the beginning of this post, it’s not a particular release or update that’s the product now, but rather a running service. This is why SRE teams focus on strengthening operational resilience, and as opposed to service level agreements (SLAs), they have service level objectives (SLOs) with consequences. What this means is an update or new feature can actually be stopped from being shipped if an SLO isn’t met. Similar to how Chaos engineering was only being used by Amazon and Netflix before Gremlin, SRE was only being used by very large organizations before Blameless was launched.
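One common way SRE teams give an SLO consequences is an error budget: the SLO target implies how many failures are tolerable, and releases pause once that allowance is spent. The sketch below assumes that approach, which this post doesn’t attribute to Blameless specifically; the function names and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target: required success ratio, for example 0.999 for "three nines".
    total: requests served in the measurement window.
    failed: requests that failed in the same window.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        # A 100 percent target (or an empty window) leaves no budget to spend.
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def can_ship(slo_target: float, total: int, failed: int) -> bool:
    """Allow a release only while some error budget is left."""
    return error_budget_remaining(slo_target, total, failed) > 0


# With a 99.9 percent SLO over 1,000,000 requests, about 1,000 failures are tolerable.
print(can_ship(0.999, 1_000_000, 400))    # True: budget left, release may go out
print(can_ship(0.999, 1_000_000, 1_500))  # False: budget overspent, release is held
```

In practice the gate would read real request counts from monitoring and run as a step in the delivery pipeline, but the decision itself is no more complicated than this comparison.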
Incident management startups: Different but with a common thread
We’ve looked at four startups and their unique approaches to incident management, ranging from Kubernetes-native to AI-assisted, Chaos-engineered, and even military-grade. While they may all be a bit different from each other, the common thread here is that modern ephemeral, distributed, hybrid environments with attack surfaces that are virtually limitless require new and “out-of-the-box” approaches to incident management. Thankfully, that’s what we’re beginning to see from startups in this sector, and with incidents at an all-time high, we won’t be surprised to see a few more emerge from stealth mode before the year is through.
Featured image: PickPik