Moving from reactive to proactive
To guide the journey toward self-healing, we needed a framework: a structured, data-driven approach that would help us shift as many issues as we could from a reactive, human response to a proactive, automated response. It’s a practical framework for AIOps that classifies IT issues into three categories:
1. Respond only. In this category, issues are submitted by people. These issues usually get routed to the IT Service Desk, which assesses the extent of the impact and calculates the priority. Even though this scenario is reactive in nature, I believe that we can be intelligent about the actual impact and priority and assign the issue to the most qualified operational team to accelerate resolution.
The information and data on the Now Platform® enable us to be intelligent about estimating the impact. For example, if Finance notes that an ERP system is down during month-end close, the issue automatically becomes a P1 priority. Another recent example is Customer Support. After we mobilized our Customer Support folks to work from home, any voice issue reported by a support engineer became a P1. You can correlate many different data points, such as persona, time, location, service, and application, to better understand the impact. This approach is better than asking an employee about the impact, because the answer is usually subjective.
After the issue is resolved, we look at the root cause, again in a data-driven way. If the issue is a systemic one, we trigger a process or technology improvement to capture the missing signal, bring that data into ServiceNow Event Management, and push the issue into the next category in the framework: prepare and respond.
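The dynamic prioritization described in this category can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how the Now Platform implements it: the `Issue` fields, the two rules, the default of P3, and the definition of month-end close as the last three days of the month are all hypothetical.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class Issue:
    # Hypothetical fields standing in for the data points mentioned above.
    persona: str       # e.g. "finance", "customer_support"
    service: str       # e.g. "erp", "voice"
    reported_on: date


def is_month_end_close(day: date, window_days: int = 3) -> bool:
    """Assumption: month-end close is the last `window_days` days of the month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day > last_day - window_days


def estimate_priority(issue: Issue) -> str:
    """Derive a priority from context instead of asking the reporter."""
    if (issue.persona == "finance" and issue.service == "erp"
            and is_month_end_close(issue.reported_on)):
        return "P1"
    if issue.persona == "customer_support" and issue.service == "voice":
        return "P1"
    return "P3"  # illustrative default for everything else
```

The point of the sketch is that the priority comes from correlated context (who, what, when) rather than from the reporter's own estimate of the impact.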
2. Prepare and respond. In this category, we use ServiceNow® ITOM to first reduce monitoring noise by almost 99%. Then we generate real, actionable incidents by using event correlation, pattern recognition, and anomaly detection. The ultimate outcome of AIOps, in my opinion, is our ability to understand the exact impact of an infrastructure-related issue on a critical service, application, or end user. Compared with the previous category, IT is better prepared to respond; our teams can react quickly and minimize the impact on end users. We perform the same impact analysis and dynamic prioritization as described above, but the resolution is still manual.
Many of our use cases are in this category. Take, for example, critical third-party SaaS applications. We can’t prevent apps like video conferencing from going down, but we can be smart about triggering workflows, such as failover processes or even proactively ordering new hardware if it is an edge issue. This approach helps us quickly mobilize operational teams and focus them on the right thing.
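To make the idea of noise reduction through event correlation concrete, here is a minimal sketch. It is not the ServiceNow ITOM algorithm: it assumes each raw event is a dictionary with hypothetical `service` and `ts` (seconds) keys, and it simply folds events for the same service into one incident when they arrive within a fixed time window of that incident's first event.

```python
def correlate_events(events, window_seconds=300):
    """Group raw monitoring events into incidents.

    Events for the same service that occur within `window_seconds` of the
    first event of the currently open incident are attached to that incident;
    otherwise a new incident is opened for the service.
    """
    incidents = []
    open_incident = {}  # service name -> most recent incident for that service

    for event in sorted(events, key=lambda e: e["ts"]):
        incident = open_incident.get(event["service"])
        if incident is not None and event["ts"] - incident["first_ts"] <= window_seconds:
            incident["events"].append(event)
        else:
            incident = {
                "service": event["service"],
                "first_ts": event["ts"],
                "events": [event],
            }
            open_incident[event["service"]] = incident
            incidents.append(incident)

    return incidents
```

A production system would correlate on far richer signals (topology, patterns, anomalies), but even this simple grouping shows how many raw events can collapse into a handful of actionable incidents.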
3. Predict and prevent (self-healing). In this category, a full-cycle AIOps process comes into play. IT can both predict and prevent issues by using machine learning to identify anomalies and then proactively taking a fully automated action. There is zero impact on end users and zero touch by the Ops teams. Our operations are much more efficient because we’ve removed the human factor. One of our most complex use cases in this category was also one of the first we could resolve proactively: our VPN service. By identifying abnormalities and correlating them with endpoint device data, we were able to automate the restoration of VPN services. Another use case was wireless network connectivity. We reduced the number of Wi-Fi-related issues by almost 70% in one year while our company size increased by 30%. Needless to say, by proactively remediating these issues, we bring operational costs down and employee productivity up.
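The detect-then-remediate loop in this category can be illustrated with a small sketch. It swaps in a simple z-score check for the machine learning models mentioned above, and the function names, the threshold of 3.0, and the `remediate` callback are hypothetical placeholders for whatever automated action (for example, restoring a VPN service) is appropriate.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple stand-in for an ML model)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def monitor_and_heal(history, value, remediate):
    """Run the automated action when the latest reading is anomalous.

    Returns True if remediation was triggered, False otherwise.
    """
    if is_anomalous(history, value):
        remediate()
        return True
    return False
```

In this sketch no person is in the loop: the anomaly check decides, and the callback acts, which is the zero-touch behavior this category aims for.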