Anomaly Detection for Monitoring
As companies adopt agile development practices and deploy business-critical applications to the cloud, monitoring and analytics are necessary elements for a DevOps workflow. Here’s what you need to know about anomaly detection and how it adds value to cloud monitoring for DevOps teams.
What is Anomaly Detection?
Anomaly detection is the process of identifying observations or patterns of observations in a data set that do not conform to expected behavior. “One of these things is not like the other” – sounds easy, right? Of course, when you’re working with tens of thousands of system and application metrics that change from minute to minute, the game becomes far more difficult. At Metricly, we tend to characterize this as “humanly impossible.”
Understanding the Four Kinds of Anomalies
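When talking about anomaly detection, there are four specific types of results: true positives, true negatives, false positives, and false negatives. A true positive is a real anomaly that raises an alarm. A true negative is normal behavior that raises no alarm. A false positive is an alarm raised for normal behavior, and a false negative is a real anomaly that goes undetected.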
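To make the four result types concrete, here is a minimal sketch that tallies them from two lists: whether each observation really was an anomaly, and whether the detector flagged it. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def classify_results(actual, flagged):
    """Count the four outcome types for paired ground truth and detector flags."""
    counts = Counter()
    for is_anomaly, was_flagged in zip(actual, flagged):
        if is_anomaly and was_flagged:
            counts["true_positive"] += 1
        elif not is_anomaly and not was_flagged:
            counts["true_negative"] += 1
        elif was_flagged:
            counts["false_positive"] += 1   # alarm, but nothing was wrong
        else:
            counts["false_negative"] += 1   # real problem, but no alarm
    return counts

# Hypothetical data: six observations with known ground truth.
actual = [False, False, True, True, False, True]
flagged = [False, True, True, False, False, True]
results = classify_results(actual, flagged)
print(dict(results))
```

The two "false" counts are the bad alarms that the rest of this article is about reducing.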
3 Types of Anomaly Detection Monitoring Tools
Smart DevOps teams typically evolve through three levels of anomaly detection tools. They start with simple dashboards to track basic metrics, then add increasingly sophisticated analytics. A common progression for analytics is to start with static thresholds, then add simple data transformations, and finally introduce machine learning and other models and algorithms designed to increase alarm quality. For example, static thresholds are the most common “starter” analytics. Static thresholds automatically flag simple anomalies in a collection of point observations. Some analytics tools use data transformation functions to make it easier to detect outliers. Advanced analytics tools screen out unwanted noise and enhance anomaly detection, thereby reducing the frequency of bad alarms, namely false positives and false negatives.
Humans are experts at pattern-matching and anomaly detection. Most monitoring tools use dashboards to display graphs of ever-changing system and application performance metrics. The innate human ability to quickly detect patterns, combined with a developer or system administrator’s learned domain experience, makes reviewing dashboards a very easy way to quickly gauge the overall health of an application or cloud infrastructure in a simple environment.
However, as the team adds more applications, and more metrics to track their status, the complexity quickly exceeds the human capacity for easy visual anomaly detection. Increased automation is needed.
Transformations are an additional data analysis option for anomaly detection. Formerly hidden anomalies can sometimes be uncovered by applying transformation functions to change the value of an observed metric prior to applying criteria such as static thresholds. One very common transformation for monotonically increasing counter metrics is to take the difference between successive observations and then compare that difference to a threshold. Another useful transformation is to turn a set of historical observations into a frequency histogram.
The more commonly observed values will be represented by “tall” bars because they have been seen relatively often. The more rarely observed values are represented by very short bars, thereby identifying the potential values for thresholds. So, histogram transformations can be used to automate the discovery of reasonable threshold settings. While this technique works well in many settings, it fails miserably when the observations follow a seasonal pattern.
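Both transformations can be sketched in a few lines. The function names, bin width, cutoff fraction, and sample data below are assumptions chosen for illustration, not settings from any particular tool.

```python
from collections import Counter

def counter_deltas(samples):
    """Differences between successive readings of a monotonically increasing counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]

def rare_value_bins(observations, bin_width, min_fraction):
    """Build a frequency histogram and return the bins seen less often than min_fraction."""
    histogram = Counter((value // bin_width) * bin_width for value in observations)
    total = len(observations)
    return sorted(b for b, count in histogram.items() if count / total < min_fraction)

# Hypothetical cumulative error counter, sampled once a minute.
error_counter = [100, 102, 103, 105, 160, 161]
deltas = counter_deltas(error_counter)
spikes = [d for d in deltas if d > 10]   # threshold applied to the delta, not the raw counter

# Hypothetical latency readings in milliseconds.
latencies = [21, 23, 25, 22, 27, 24, 26, 28, 22, 95]
rare = rare_value_bins(latencies, bin_width=10, min_fraction=0.2)

print(deltas, spikes, rare)
```

The raw counter only ever goes up, so a threshold on it says little. The jump stands out only after the delta transformation. Likewise, the short histogram bar at the 90 ms bin suggests that a threshold somewhere below it would separate common values from rare ones.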
Thresholds and Baselines
Adding static upper and lower thresholds for observed values easily automates anomaly detection for data points that fall significantly above or below the expected range. The thresholds themselves are fixed constants. Whenever an observation crosses a threshold, static threshold analytics tools generate an alarm.
Setting thresholds works very well for metrics that typically hover in a narrow band of predictable values. Unfortunately, when levels vary significantly at different times of day or due to fluctuations in other usage patterns, finding the right threshold levels is tough. Set them too narrow and you’ll be overloaded with false alarms (“crying wolf syndrome”). Set them too wide and you can completely miss critical service degradations that could damage your business (“asleep at the switch syndrome”).
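A minimal sketch of a static threshold check shows both failure modes. The metric values and the three threshold bands are invented for the example.

```python
def check_static_thresholds(values, lower, upper):
    """Return (index, value) pairs for observations outside the band [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical CPU utilization samples (percent).
cpu_percent = [42, 45, 47, 44, 91, 43, 3, 46]

narrow = check_static_thresholds(cpu_percent, lower=43, upper=46)      # too tight
wide = check_static_thresholds(cpu_percent, lower=0, upper=100)        # too loose
reasonable = check_static_thresholds(cpu_percent, lower=10, upper=80)  # hand-tuned

print(len(narrow), len(wide), reasonable)
```

The narrow band alarms on ordinary jitter, the wide band never alarms at all, and only the hand-tuned band isolates the two readings that look genuinely unusual. The catch is that the hand-tuned values only stay right while the metric keeps behaving the way it did when you picked them.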
Advanced analytics leverage many models and algorithms, both qualitative and quantitative. Some quantitative techniques include statistical analysis and machine learning. Qualitative techniques include incorporating a priori knowledge (human input) and semantic contextual models.
A common application of statistical analysis in DevOps monitoring tools measures how variables behave in relation to each other. This is relatively straightforward to track in real time. If two metrics are highly correlated (have a high correlation coefficient) and one goes crazy, the real-time correlation coefficient will change significantly and can be used in deciding whether or not to trigger an anomaly alert. However, if both metrics go crazy because they are both similarly affected by the same root cause, the correlation coefficient may not change much and the system will fail to alarm properly. An additional layer of analytics is needed.
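Here is a small sketch of that idea using the Pearson correlation coefficient over a recent window. The metric names, window contents, and the `max_drop` cutoff are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

def correlation_dropped(recent_x, recent_y, baseline, max_drop=0.5):
    """Alarm when the recent correlation falls well below the historical baseline."""
    return (baseline - pearson(recent_x, recent_y)) > max_drop

# Hypothetical history: CPU tracks request rate closely.
requests = [100, 110, 120, 130, 140, 150, 160, 170]
cpu = [20, 22, 24, 26, 28, 30, 32, 34]
baseline = pearson(requests, cpu)

# Case 1: only CPU misbehaves, so the relationship breaks.
one_metric_off = correlation_dropped([180, 190, 200, 210], [36, 90, 15, 80], baseline)

# Case 2: both metrics swing wildly together, so the relationship holds.
both_metrics_off = correlation_dropped([180, 400, 90, 350], [36, 80, 18, 70], baseline)

print(one_metric_off, both_metrics_off)
```

The second case is the blind spot described above: both metrics are behaving abnormally, yet the correlation stays intact and no alarm fires.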
Statistical Machine Learning
This technique typically has at least two phases: learning and operational. Both phases leverage heavy-duty mathematics and proprietary algorithms. During the learning phase the algorithm establishes norms and other parameters that describe expected behavior. In the operational phase, as new observations are made, the algorithm applies what it has “learned” to distinguish between normal and abnormal values.
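The two phases can be illustrated with a deliberately simple stand-in: learn a mean and standard deviation, then flag values that fall too many standard deviations away. Real products use far more sophisticated (and often proprietary) models; the function names and the cutoff of three standard deviations here are assumptions.

```python
import statistics

def learn_norms(training_values):
    """Learning phase: estimate the mean and standard deviation of normal behavior."""
    return statistics.mean(training_values), statistics.stdev(training_values)

def is_abnormal(value, mean, stdev, max_z=3.0):
    """Operational phase: flag values more than max_z standard deviations from the mean."""
    return abs(value - mean) > max_z * stdev

# Hypothetical training data gathered while the service was healthy.
training = [50, 52, 48, 51, 49, 50, 53, 47]
mean, stdev = learn_norms(training)

print(mean, stdev)
print(is_abnormal(54, mean, stdev))   # within the learned band
print(is_abnormal(70, mean, stdev))   # far outside it
```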
Some operational phases include an adaptive learning capability that continues to make adjustments to the parameters that are used to identify normal behavior. Applying corrections to these parameters allows the machine learning model to adapt to changing circumstances. This is particularly valuable in elastic compute environments that change frequently. Adaptive learning, while much more difficult to implement in online contexts, can achieve much greater accuracy, thereby reducing false positives and false negatives.
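One simple way to sketch adaptive learning is an exponentially weighted mean and variance that keep updating as normal observations arrive. This is an illustrative technique, not the method used by any specific product; the class name, `alpha`, and `max_z` are assumptions.

```python
import math

class AdaptiveBaseline:
    """Baseline whose mean and variance drift with the data (exponential weighting)."""

    def __init__(self, mean, variance, alpha=0.3, max_z=3.0):
        self.mean = mean
        self.variance = variance
        self.alpha = alpha      # how quickly the baseline adapts
        self.max_z = max_z      # alarm cutoff in standard deviations

    def observe(self, value):
        """Return True if value is abnormal; otherwise fold it into the baseline."""
        deviation = value - self.mean
        abnormal = abs(deviation) > self.max_z * math.sqrt(self.variance)
        if not abnormal:
            # Adapt only on normal values so anomalies do not drag the baseline.
            self.mean += self.alpha * deviation
            self.variance = (1 - self.alpha) * (self.variance + self.alpha * deviation ** 2)
        return abnormal

# Hypothetical workload that drifts upward as the environment scales.
baseline = AdaptiveBaseline(mean=50.0, variance=4.0)
drift_flags = [baseline.observe(v) for v in [52, 54, 56, 58, 60]]
spike_flag = baseline.observe(120)

# A fixed baseline (mean 50, standard deviation 2) would have alarmed on 60.
static_flag = abs(60 - 50.0) > 3.0 * 2.0

print(drift_flags, spike_flag, static_flag)
```

The adaptive baseline follows the gradual drift without alarming, still catches the sudden spike, and avoids the false positive that the fixed baseline would have raised.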
Anomaly Detection and Advanced Analytics within the Context of a DevOps Workflow
Ideally, anomaly detection is not simply an isolated monitoring step or the only factor in deciding whether or not to issue an alarm or take some action. For the most accurate results, advanced analytics should be applied within a more comprehensive monitoring workflow that:
- Captures infrastructure and application metrics in real time
- Applies multiple types of analytics to the observations
- Discovers deviations in the observed data
- Applies structural knowledge such as relationships between components to refine raw analytic results
- Assesses the results within the contexts of environmental semantics and other human knowledge (what we call a “policy”)
Using analytics together within a workflow such as the one shown below, DevOps staff can achieve highly accurate results – namely minimizing false positives and false negatives.