Anomaly Detection for Monitoring


As companies adopt agile development practices and deploy business-critical applications to the cloud, monitoring and analytics are necessary elements for a DevOps workflow. Here’s what you need to know about anomaly detection and how it adds value to cloud monitoring for DevOps teams.

What is Anomaly Detection?


Anomaly detection is the process of identifying observations, or patterns of observations, in a data set that do not conform to expected behavior. “One of these things is not like the other” – sounds easy, right? Of course, when you’re working with tens of thousands of system and application metrics that change from minute to minute, the game becomes exponentially more difficult. At Metricly, we tend to characterize this as “humanly impossible.”

Understanding the Four Kinds of Anomalies


When talking about anomaly detection, there are four possible results: true positives, true negatives, false positives, and false negatives.

You have a problem and get an alarm

True Positive – This is the ideal scenario and exactly how anomaly detection is supposed to work. Unfortunately, it’s not always that simple.

You don’t have a problem and don’t get an alarm

True Negative – Congrats! Your anomaly detection method wasn’t fooled into a false alarm – and you weren’t woken up at 3 a.m. for a problem that doesn’t exist.

You don’t have a problem but you do get an alarm

False Positive – This is sometimes called “crying wolf”.  The alarms are false alarms.  They waste time and undermine confidence in the monitoring system.  This is bad, but not the worst outcome.

 

You do have a problem and don’t get an alarm

False Negative – This is the worst. A problem is occurring that could lead to a serious outage, and your team is blissfully ignorant because your monitoring system is “asleep at the switch.” Adding insult to injury, these missed alarms are often caught by impacted users (or your boss!).
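To make the four outcomes concrete, here is a minimal Python sketch that labels each pairing of "was there a problem?" and "did an alarm fire?" and tallies the results. The function names and the sample history are hypothetical, chosen only for illustration.

```python
from collections import Counter


def classify_outcome(problem_occurred: bool, alarm_raised: bool) -> str:
    """Map one (reality, alarm) pair to one of the four outcomes."""
    if problem_occurred and alarm_raised:
        return "true_positive"
    if not problem_occurred and not alarm_raised:
        return "true_negative"
    if alarm_raised:
        return "false_positive"  # alarm without a problem: "crying wolf"
    return "false_negative"      # problem without an alarm: "asleep at the switch"


def summarize(events):
    """Count outcomes over an iterable of (problem_occurred, alarm_raised) pairs."""
    return Counter(classify_outcome(problem, alarm) for problem, alarm in events)


# Hypothetical history of (problem_occurred, alarm_raised) pairs.
history = [(True, True), (False, False), (False, True), (True, False), (False, False)]
counts = summarize(history)
```

Tracking these counts over time is one way to see whether a change to your alerting rules is trading false positives for false negatives, or the other way around.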

3 Types of Anomaly Detection Monitoring Tools


Smart DevOps teams typically evolve through three levels of anomaly detection tools. They start with simple dashboards to track basic metrics, then add increasingly sophisticated analytics. A common progression for analytics is to start with static thresholds, then add simple data transformations, and finally introduce machine learning and other models and algorithms designed to increase alarm quality. Static thresholds are the most common “starter” analytics: they automatically flag simple anomalies in a collection of point observations. Some analytics tools use data transformation functions to make it easier to detect outliers. Advanced analytics tools screen out unwanted noise and enhance anomaly detection, thereby reducing the frequency of bad alarms, namely false positives and false negatives.

 

Dashboard

Humans are experts at pattern-matching and anomaly detection.  Most monitoring tools use dashboards to display graphs of ever-changing system and application performance metrics. The innate human ability to quickly detect patterns, combined with a developer or system administrator’s learned domain experience, makes reviewing dashboards a very easy way to quickly gauge the overall health of an application or cloud infrastructure in a simple environment.

However, as the team adds more applications, and more metrics to track their status, the complexity quickly exceeds human capacity for easy visual anomaly detection. Increased automation is needed.

 

Transformations

Transformations are an additional data analysis option for anomaly detection. Formerly hidden anomalies can sometimes be uncovered by applying transformation functions to change the value of an observed metric prior to applying criteria such as static thresholds. One very common transformation for monotonically increasing counter metrics is to take the difference between successive observations and then compare that difference to a threshold. Another useful transformation is to turn a set of historical observations into a frequency histogram.

The more commonly observed values will be represented by “tall” bars because they have been seen relatively often. The more rarely observed values are represented by very short bars, thereby identifying the potential values for thresholds. So, histogram transformations can be used to automate the discovery of reasonable threshold settings. While this technique works well in many settings, it fails miserably when the observations follow a seasonal pattern.
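Both transformations can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a production recipe: the function names, the bin width, and the `min_share` cutoff that separates "tall" bars from "short" ones are all hypothetical choices.

```python
from collections import Counter


def successive_differences(counter_values):
    """Turn a monotonically increasing counter into per-interval deltas."""
    return [later - earlier for earlier, later in zip(counter_values, counter_values[1:])]


def common_value_bounds(observations, bin_width, min_share):
    """Bucket observations into a histogram and return (low, high) bounds
    spanning every bin that holds at least `min_share` of all observations."""
    bins = Counter(int(value // bin_width) for value in observations)
    total = len(observations)
    tall_bins = [b for b, count in bins.items() if count / total >= min_share]
    return min(tall_bins) * bin_width, (max(tall_bins) + 1) * bin_width


# Hypothetical request counter sampled at fixed intervals.
counter = [100, 110, 121, 130, 141, 150, 400, 410]
deltas = successive_differences(counter)

# Derive thresholds from the histogram, then flag deltas outside them.
low, high = common_value_bounds(deltas, bin_width=5, min_share=0.2)
anomalies = [d for d in deltas if not low <= d < high]
```

Note that the bounds are derived from the full history with no notion of time of day, which is exactly why this approach breaks down on seasonal data.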

 

Thresholds and Baselines

Adding static upper and lower thresholds, which are fixed constants, easily automates anomaly detection for data points that fall significantly above or below expected values. Whenever an observation crosses a threshold, static threshold analytics tools generate an alarm.

Setting thresholds works very well for metrics that typically hover in a narrow band of predictable values. Unfortunately, when levels vary significantly at different times of day or due to fluctuations in other usage patterns, finding the right threshold levels is tough. Set them too narrow and you’ll be overloaded with false alarms (“crying wolf syndrome”). Set them too wide and you can completely miss critical service degradations that could damage your business (“asleep at the switch syndrome”).
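A static threshold check is about as simple as analytics gets. The sketch below assumes a hypothetical CPU utilization metric and hand-picked limits of 10 and 85 percent; none of these values come from a real system.

```python
def check_static_thresholds(value, lower, upper):
    """Return an alarm message if value falls outside [lower, upper], else None."""
    if value < lower:
        return f"below lower threshold ({value} < {lower})"
    if value > upper:
        return f"above upper threshold ({value} > {upper})"
    return None


# Hypothetical CPU utilization samples, in percent.
samples = [42, 47, 45, 93, 44, 3]

alarms = []
for index, value in enumerate(samples):
    message = check_static_thresholds(value, lower=10, upper=85)
    if message is not None:
        alarms.append((index, message))
```

Narrowing the band (say, 40 to 50) would flag more of these samples, while widening it (0 to 100) would flag none of them, which is the trade-off described above.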

Advanced Analytics


Advanced analytics leverage many models and algorithms, both qualitative and quantitative. Some quantitative techniques include statistical analysis and machine learning. Qualitative techniques include incorporating a priori knowledge (human input) and semantic contextual models.

 

Multivariate Correlation

Multivariate correlation, a common application of statistical analysis in DevOps monitoring tools, measures how variables behave in relation to each other. This is relatively straightforward to track in real time. If two metrics are highly correlated (have a high correlation coefficient) and one goes crazy, the real-time correlation coefficient will change significantly and can be used in deciding whether or not to trigger an anomaly alert. However, if both metrics go crazy because they are similarly affected by the same root cause, the correlation coefficient may not change much and the system will fail to alarm properly. An additional layer of analytics is needed.
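The sketch below illustrates both behaviors with a sliding-window Pearson correlation. It is a simplified illustration: the window size, the baseline correlation of 1.0, the allowed drop of 0.5, and the sample series are all assumptions picked for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant series; treated here as "no correlation"
    return cov / math.sqrt(var_x * var_y)


def correlation_breaks(xs, ys, window, baseline, max_drop):
    """Return the end index of each window whose correlation falls
    more than `max_drop` below the expected `baseline`."""
    flagged = []
    for end in range(window, len(xs) + 1):
        r = pearson(xs[end - window:end], ys[end - window:end])
        if baseline - r > max_drop:
            flagged.append(end - 1)
    return flagged


requests = [10, 20, 30, 40, 50, 60, 70, 80]

# Case 1: only the second metric misbehaves, so the correlation breaks.
cpu_alone = [12, 22, 31, 42, 51, 61, 20, 95]
one_sided = correlation_breaks(requests, cpu_alone, window=4, baseline=1.0, max_drop=0.5)

# Case 2: both metrics spike together, so the correlation stays high.
requests_spike = [10, 20, 30, 40, 50, 60, 700, 80]
cpu_spike = [12, 22, 31, 42, 51, 61, 702, 82]
shared_cause = correlation_breaks(requests_spike, cpu_spike, window=4, baseline=1.0, max_drop=0.5)
```

In the first case the last two windows are flagged. In the second, nothing is flagged even though both metrics clearly misbehaved, which is the blind spot described above.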

 

Statistical Machine Learning

This technique typically has at least two phases:  learning and operational.  Both phases leverage heavy-duty mathematics and proprietary algorithms.  During the learning phase the algorithm establishes norms and other parameters that describe expected behavior.  In the operational phase, as new observations are made, the algorithm applies what it has “learned” to distinguish between normal and abnormal values.

Some operational phases include an adaptive learning capability that continues to make adjustments to the parameters that are used to identify normal behavior. Applying corrections to these parameters allows the machine learning model to adapt to changing circumstances. This is particularly valuable in elastic compute environments that change frequently. Adaptive learning, while much more difficult to implement in online contexts, can achieve much greater accuracy, thereby reducing false positives and false negatives.
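As a toy illustration of the two phases, the sketch below learns a mean and variance from historical data, then scores new observations and keeps adapting. It is a deliberately simple stand-in (exponentially weighted statistics with a z-score test), not a description of any proprietary algorithm; the class name, `alpha`, and `z_limit` are assumptions.

```python
import math


class AdaptiveBaseline:
    """Learns what "normal" looks like, then flags values far from it."""

    def __init__(self, alpha=0.1, z_limit=3.0):
        self.alpha = alpha        # how quickly the baseline adapts
        self.z_limit = z_limit    # how many standard deviations count as abnormal
        self.mean = 0.0
        self.variance = 0.0

    def learn(self, training_values):
        """Learning phase: establish norms from historical observations."""
        n = len(training_values)
        self.mean = sum(training_values) / n
        self.variance = sum((v - self.mean) ** 2 for v in training_values) / n

    def observe(self, value):
        """Operational phase: classify a value, then adapt if it looks normal."""
        std = math.sqrt(self.variance)
        is_anomaly = std > 0 and abs(value - self.mean) / std > self.z_limit
        if not is_anomaly:
            # Adaptive learning: nudge the parameters toward recent behavior.
            delta = value - self.mean
            self.mean += self.alpha * delta
            self.variance = (1 - self.alpha) * (self.variance + self.alpha * delta ** 2)
        return is_anomaly


baseline = AdaptiveBaseline()
baseline.learn([50, 52, 48, 51, 49, 50, 53, 47])  # hypothetical latency samples
results = [baseline.observe(51), baseline.observe(90)]
```

Skipping the update for anomalous values keeps a single spike from dragging the baseline upward, at the cost of adapting more slowly to a genuine, lasting shift in behavior.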

Anomaly Detection and Advanced Analytics within the Context of a DevOps Workflow


 

Ideally, anomaly detection is not simply an isolated monitoring step or the only factor in deciding whether or not to issue an alarm or take some action. For the most accurate results, advanced analytics should be applied within a more comprehensive monitoring workflow that:

  1. Captures infrastructure and application metrics in real time
  2. Applies multiple types of analytics to the observations
  3. Discovers deviations in the observed data
  4. Applies structural knowledge such as relationships between components to refine raw analytic results
  5. Assesses the results within the contexts of environmental semantics and other human knowledge (what we call a “policy”)

Using analytics together within a workflow such as this one, DevOps staff can achieve highly accurate results – namely, minimizing false positives and false negatives.
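The shape of such a workflow can be sketched as a small function. Everything here is hypothetical: the detector, the dependency map, and the policy are placeholders showing one possible reading of steps 2 through 5, with "structural knowledge" interpreted as suppressing a deviation when an upstream dependency is also deviating.

```python
def evaluate(observations, detectors, depends_on, policy):
    """Run one pass of the workflow over {component: value} observations."""
    # Steps 2-3: apply every detector and collect the components that deviate.
    deviating = {
        component
        for component, value in observations.items()
        if any(detect(component, value) for detect in detectors)
    }
    # Step 4: use component relationships to keep only likely root causes,
    # dropping components whose upstream dependency is also deviating.
    root_causes = {
        component
        for component in deviating
        if not any(upstream in deviating for upstream in depends_on.get(component, ()))
    }
    # Step 5: let the policy (human knowledge) decide what is worth an alarm.
    return sorted(component for component in root_causes if policy(component))


# Step 1 stand-in: hypothetical latency readings, in milliseconds.
observations = {"db": 950, "api": 870, "web": 120}
detectors = [lambda component, value: value > 500]
depends_on = {"api": ["db"], "web": ["api"]}
in_maintenance = {"web"}

alerts = evaluate(observations, detectors, depends_on,
                  policy=lambda component: component not in in_maintenance)
```

Here both `db` and `api` deviate, but only `db` is reported because `api` depends on it.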

Look no further. Start monitoring with Metricly today.

