Metricly uses advanced analytics to monitor your environment and proactively notify you when problems occur. There are 3 basic data types that Metricly uses to do so:
- Raw data: Data collected from a third-party integration that has not been interpreted or aggregated by Metricly. Because it has not been aggregated, raw data does not contribute to event creation.
- Aggregate data: Data collected from an integration that has been interpreted by Metricly. To generate aggregate data, Metricly averages the raw data collected from an integration at 5-minute intervals. Aggregate data is used to represent a metric’s actual value.
- Sparse data: Data generated by Metricly when no data is collected from an integration for a given period of time. The value of sparse data is always 0.
Metricly determines when a metric’s behavior is abnormal through the use of static thresholds, Baseline bands, and Contextual bands. In each case, the value of a metric is based on aggregate or sparse data. Raw data cannot be used to generate events.
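As a rough illustration of how the three data types relate, the sketch below averages raw samples into 5-minute buckets and emits a zero-valued sparse point for any bucket with no samples. The function name `aggregate`, the tuple layout, and the use of integer timestamps in seconds are assumptions made for this example; they are not Metricly's actual implementation or API.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60  # 5-minute aggregation interval described above


def aggregate(raw_samples, start, end):
    """Average raw (timestamp, value) samples into 5-minute buckets.

    Returns a list of (bucket_start, value, kind) tuples. kind is
    "aggregate" when at least one raw sample fell into the bucket and
    "sparse" (value 0) when no sample was collected for it.
    """
    buckets = defaultdict(list)
    for ts, value in raw_samples:
        if start <= ts < end:
            bucket_start = ts - (ts - start) % BUCKET_SECONDS
            buckets[bucket_start].append(value)

    points = []
    for bucket_start in range(start, end, BUCKET_SECONDS):
        values = buckets.get(bucket_start)
        if values:
            points.append((bucket_start, sum(values) / len(values), "aggregate"))
        else:
            points.append((bucket_start, 0, "sparse"))
    return points


# Hypothetical raw samples: three in the first bucket, none in the second,
# one in the third.
raw = [(0, 40.0), (60, 50.0), (120, 60.0), (610, 80.0)]
points = aggregate(raw, start=0, end=900)
```

Here the first bucket averages to 50.0, the second bucket becomes a sparse point with value 0, and the third bucket carries the single sample 80.0. Only these aggregate and sparse points, not the raw samples, would feed into event evaluation.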
Data points are color-coded as follows:

|Data point color|Meaning|
|---|---|
|Black|Aggregate data: data that has been interpreted by Metricly. To generate it, Metricly averages the data collected from a given integration at 5-minute intervals.|
|Gray|Aggregate data that Metricly averages at 1-hour intervals.|
|Green|Sparse data: data generated by Metricly when no data is collected from an integration within a given period of time. The metric value for sparse data is always 0.|
|Red|A deviation: data that falls outside the learned Contextual or Baseline bands.|
|Blue|Raw data: data that has not been interpreted by Metricly. Because raw data is not interpreted by Metricly, it does not produce deviations.|
Baseline Bands represent the normal operating range of a metric. Metricly determines the normal operating range of a metric based on weekly patterns in the behavior of that metric.
The image below provides an example of Baseline bands in green surrounding the actual, current value for a CPU utilization metric in black. The green band demonstrates how Metricly learns the expected behavior of the metric based on patterns in the behavior of that metric.
To leverage the behavior learning capabilities of Baseline bands, use Upper or Lower Baseline Deviation condition tests in a policy. For more information about Baseline Deviation tests, see Conditions.
A policy with a Static Threshold condition of greater than 95% is applied to the CPU utilization metric for a server element. The actual value of that metric routinely exceeds 95% at 2:00 PM on Mondays. Because the policy uses a Static Threshold test, it generates a Critical event every Monday at 2:00 PM when the value of the metric exceeds 95%. After the Static Threshold test is changed to an Upper Baseline Deviation test, events are no longer generated when the CPU utilization metric exceeds 95% on Mondays at 2:00 PM, because Metricly uses Baseline bands to learn this pattern. The spike in CPU utilization that occurs on Monday afternoons is no longer perceived as abnormal behavior, resulting in fewer false alarms.
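The difference between the two tests in this example can be sketched in a few lines. The code below is only an illustration of the idea of a weekly band: it assumes a band of mean plus or minus three standard deviations per (weekday, hour) slot, which is a stand-in technique and not Metricly's learning algorithm. All function names and the `width` parameter are hypothetical.

```python
from statistics import mean, stdev


def learn_baseline(history, width=3.0):
    """Learn a (lower, upper) band for each (weekday, hour) slot.

    history: iterable of (weekday, hour, value), with weekday 0 = Monday.
    width: assumed band half-width in standard deviations.
    """
    slots = {}
    for weekday, hour, value in history:
        slots.setdefault((weekday, hour), []).append(value)

    bands = {}
    for slot, values in slots.items():
        centre = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        bands[slot] = (centre - width * spread, centre + width * spread)
    return bands


def static_threshold_fires(value, level=95.0):
    """Static test: fires whenever the value exceeds the fixed level."""
    return value > level


def upper_baseline_deviation_fires(bands, weekday, hour, value):
    """Baseline test: fires only when the value exceeds the learned upper band."""
    _, upper = bands[(weekday, hour)]
    return value > upper


# Four weeks of made-up history: CPU is high every Monday at 14:00
# and low every Tuesday at 14:00.
history = [(0, 14, v) for v in (96.0, 97.0, 95.5, 96.5)]
history += [(1, 14, v) for v in (40.0, 42.0, 38.0, 41.0)]
bands = learn_baseline(history)

static_fires = static_threshold_fires(96.8)
monday_fires = upper_baseline_deviation_fires(bands, 0, 14, 96.8)
tuesday_fires = upper_baseline_deviation_fires(bands, 1, 14, 96.8)
```

With this sample history, the static test fires for a value of 96.8, the baseline test does not fire for the same value on Monday at 14:00, and it does fire on Tuesday at 14:00, where such a value has never been seen.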
Contextual Bands represent the range of current expected values for a metric based on other correlated metrics in the learned model. In contrast to Baseline bands which look for patterns in a metric in isolation, Contextual bands take into account how the value of one metric may impact another.
The image below provides an example of a Contextual band in purple surrounding the actual, current value for a metric called Bytes In Per Second. The purple band demonstrates how Metricly is able to use other correlated metrics to learn the expected value of the Bytes In Per Second metric.
Contextual Deviation conditions allow you to use Contextual bands to monitor the elements in your environment. For more information about creating policies using Contextual Deviation conditions, see Conditions.
A policy with a Static Threshold condition of greater than 90% is applied to the CPU utilization metric for a server element. Often, when network activity increases for this element, so does CPU utilization. As a result, each time network activity spikes and CPU utilization exceeds 90%, a Critical event is generated. Changing the Static Threshold condition to an Upper Contextual Deviation condition avoids these false alarms. If Metricly learns that a CPU utilization metric and a network activity metric belonging to the same server element are positively correlated, it can determine the expected value for CPU utilization at any time, given the value of network activity. So, if CPU utilization suddenly spikes to 96% while network activity shows a similar increase, Metricly concludes that the rise in CPU utilization is not cause for an event, because CPU utilization commonly rises along with network activity.
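To make the idea of a correlated expectation concrete, the sketch below fits a simple least-squares line that predicts CPU utilization from network activity and builds a band around the prediction. This is an assumed, simplified stand-in: the source does not describe how Metricly's learned model works, and the function names, the `width` parameter, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def learn_context(network, cpu, width=3.0):
    """Fit cpu = slope * network + intercept and measure the residual spread.

    Returns (slope, intercept, half_width), where half_width is the assumed
    band half-width: width times the standard deviation of the residuals.
    """
    n_mean, c_mean = mean(network), mean(cpu)
    sxx = sum((n - n_mean) ** 2 for n in network)
    sxy = sum((n - n_mean) * (c - c_mean) for n, c in zip(network, cpu))
    slope = sxy / sxx
    intercept = c_mean - slope * n_mean
    residuals = [c - (slope * n + intercept) for n, c in zip(network, cpu)]
    return slope, intercept, width * stdev(residuals)


def contextual_band(model, network_now):
    """Expected (lower, upper) CPU range given the current network activity."""
    slope, intercept, half_width = model
    expected = slope * network_now + intercept
    return expected - half_width, expected + half_width


def upper_contextual_deviation_fires(model, network_now, cpu_now):
    """Fires only when CPU exceeds what the network activity would explain."""
    return cpu_now > contextual_band(model, network_now)[1]


# Made-up history in which CPU utilization (%) rises with network activity (MB/s).
network_history = [10, 20, 30, 40, 50, 60, 70, 80]
cpu_history = [22, 31, 39, 52, 60, 69, 81, 90]
model = learn_context(network_history, cpu_history)

busy_network_fires = upper_contextual_deviation_fires(model, 85, 96.0)
quiet_network_fires = upper_contextual_deviation_fires(model, 20, 96.0)
```

With this sample data, a CPU value of 96% does not fire while network activity is at 85 MB/s, because the high CPU is explained by the correlated metric. The same 96% does fire when network activity is only 20 MB/s. A static threshold of greater than 90% would fire in both cases.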
Static Thresholds are unchanging levels that are compared against a metric’s current value. To monitor the elements in your environment with static thresholds, create Static Threshold conditions in policies. Each condition pairs a level with an operator: greater than, less than, greater than or equal to, less than or equal to, equal to, or not equal to. An event is generated when the comparison between the metric’s value and the specified level is true for the selected operator.
For more information about using Static Threshold conditions, see Conditions.
If a Static Threshold level of greater than 80% is applied to a CPU utilization metric for a server element, then an event will be generated when the value of that metric is greater than 80%.
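A static threshold check amounts to a single comparison. The minimal sketch below maps the six operators named above onto Python's standard `operator` functions. The operator symbols used as dictionary keys and the function name `static_threshold_fires` are assumptions for illustration, not Metricly's policy syntax.

```python
import operator

# Assumed symbols for the six comparison operators listed in the text.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def static_threshold_fires(value, op, level):
    """Return True when `value <op> level` holds, meaning an event would be generated."""
    return OPERATORS[op](value, level)


# The example above: a level of greater than 80% on CPU utilization.
fires_at_85 = static_threshold_fires(85.0, ">", 80.0)
fires_at_80 = static_threshold_fires(80.0, ">", 80.0)
```

A value of 85% generates an event under this condition, while a value of exactly 80% does not, because the operator is strictly greater than.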
Sudden Change Detection
Sudden Change is a time-series metric used to analyze trends in historical data and make predictions based on past behavior. This process finds an expected average of activity by generating data points every 5 minutes in a sliding, one-hour window of time. It then uses this data to predict the next interval. When the next data point arrives and falls outside the expected prediction range, that interval is defined as a sudden change event. You can configure this metric’s prediction range by adjusting the acceptable percentage change for the next interval; doing so increases or decreases the sensitivity of your policy and affects the frequency of any associated alerts.
Sudden Change is useful for trend analysis on hourly min/max rollup data, like disk space usage or number of page hits. A best practice when setting up a Sudden Change condition is to understand the average behavior of a metric beforehand. If the behavior is naturally prone to fluctuate, use a higher percentage. This widens the prediction range, avoiding unnecessary policy activity and alerts.
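The effect of the percentage setting can be sketched as follows. This example assumes that the prediction range is the average of the previous twelve 5-minute points plus or minus the configured percentage of that average; the source does not give the exact formula, so treat this as one plausible reading. The function name and sample values are hypothetical, and the duration setting of a policy is not modeled here.

```python
from collections import deque

WINDOW_POINTS = 12  # one hour of 5-minute data points


def sudden_change_flags(values, percent):
    """Flag each point that falls outside the assumed prediction range.

    The range is the average of the previous hour of points, plus or minus
    `percent` percent of that average. Points that arrive before a full
    hour of history exists are never flagged.
    """
    window = deque(maxlen=WINDOW_POINTS)
    flags = []
    for value in values:
        if len(window) == WINDOW_POINTS:
            expected = sum(window) / WINDOW_POINTS
            allowed = abs(expected) * percent / 100.0
            flags.append(abs(value - expected) > allowed)
        else:
            flags.append(False)
        window.append(value)  # the oldest point drops out automatically
    return flags


# One steady hour at 100, then a 20% rise followed by a larger jump.
values = [100.0] * 12 + [120.0, 160.0]
strict = sudden_change_flags(values, percent=10)
relaxed = sudden_change_flags(values, percent=50)
```

With a 10% setting, both of the last two points are flagged. With a 50% setting, the 20% rise is tolerated and only the jump to 160 is flagged, which shows how a higher percentage suits metrics that naturally fluctuate.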
You are monitoring the response time for a transaction. The policy’s conditions look for increases in response time of more than 50%, with a duration of 2 hours. An event is generated once the response time has exceeded the average by more than 50% for the full 2 hours.