Default policies are created by Metricly and intended to provide recommendations for ways to monitor the behavior of the elements in your environment. Default policies can be found on the Policies page and are marked as Metricly in the Created By column. You can edit default policies as needed to suit the behavior of your environment. When new default policies are provisioned to your account, Metricly will not overwrite any changes you made to existing default policies. Furthermore, any new default policies added to your account will be disabled by default.

Before reading about default policies, you first should understand the concepts of scope, conditions, duration, notifications, and event categories.

AWS

ASG

Policy names are prefixed with AWS ASG – .

Policy name Duration Condition 1 (and) Condition 2 (and) Condition 3 Cat. Description
Elevated CPU Activity (Normal Network Activity) 30 min aws.ec2.cpuutilization has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.bytesinpersec does not have an upper baseline deviation + does not have an upper contextual deviation metricly.aws.ec2.bytesoutpersec does not have an upper baseline deviation + does not have an upper contextual deviation INFO This policy is designed to catch cases where CPU activity is higher than normal and cannot be explained by a corresponding increase in network traffic. Operates on the average CPU and network utilization across all EC2s in the ASG.
Elevated Network Activity 30 min metricly.aws.ec2.bytesinpersec has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.bytesoutpersec has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in network activity above what is considered to be normal. Operates on the average network utilization across all EC2s in the ASG.
Elevated Ephemeral Disk Activity 30 min metricly.aws.ec2.diskreadopspersec has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.diskwriteopspersec has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in disk activity above what is considered to be normal. Operates on the average disk utilization across all EC2s in the ASG.

 

DynamoDB

Policy names are prefixed with AWS DynamoDB – .

Policy Name Duration Condition 1 Cat. Description
Elevated Read Capacity Utilization 30 min metricly.aws.dynamodb.readcapacityutilization has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 50 WARNING Read Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time.
Elevated Write Capacity Utilization 30 min metricly.aws.dynamodb.writecapacityutilization has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 50 WARNING Write Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time.

 

EBS

Before reading about the EBS default policy, it is important to understand the following Metricly computed metrics. For more information about computed metrics, see Computed metrics.

  • Average Latency: the average amount of time it takes for a disk operation to complete.
  • Queue Length Differential: Queue Length Differential measures the difference between the actual disk queue length and the “ideal” disk queue length. The ideal queue length is based on Amazon’s rule of thumb that for every 200 IOPS you should have a queue length of 1. In theory, a well-optimized volume should have a queue length differential that tends to hover around 0. In practice, we have seen volumes with extremely low latency (< 0.0001) have queue length differentials that are higher than 0; presumably this is because the latency is much lower than Amazon assumes in its rule of thumb. Even in these cases, the differential is a pretty steady number. (A small sketch of this computation follows this list.)
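
To make the rule of thumb concrete, here is a small sketch (in Python) of how a queue length differential could be derived from observed IOPS and the measured queue length. The function name and formula are illustrative assumptions; Metricly's actual computed metric may be calculated differently.

    # Hypothetical sketch of a queue length differential calculation, based on
    # Amazon's rule of thumb of 1 queue slot for every 200 IOPS.
    def queue_length_differential(read_ops_per_sec, write_ops_per_sec, actual_queue_length):
        total_iops = read_ops_per_sec + write_ops_per_sec
        ideal_queue_length = total_iops / 200.0  # 1 queue slot per 200 IOPS
        return actual_queue_length - ideal_queue_length

    # Example: 300 read + 100 write IOPS gives an ideal queue length of 2.0,
    # so a measured queue length of 2.5 yields a differential of 0.5.
    print(queue_length_differential(300, 100, 2.5))  # 0.5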
Policy name Duration Condition 1 (and) Condition 2 Cat. Description
Elevated Queue Length Differential with Elevated Latency 30 min metricly.aws.ebs.queuelengthdifferential has an upper baseline deviation + a static threshold > 1 metricly.aws.ebs.averagelatency has an upper baseline deviation CRITICAL The first condition of the policy looks for an upper deviation as the first indication that the disk may be getting more traffic than it can keep up with. It also checks for the differential to be greater than 1 in order to avoid false alarms in cases where the differential is very low. The second condition is added because an elevated queue differential by itself is not necessarily a bad thing. We only want to alarm if your differential is higher than normal AND your latency is higher than normal.
Elevated IOPS Utilization with Low Burst Balance 5 min netuitive.aws.ebs.iopsutilization has a static threshold ≥ 90 aws.ebs.burstbalance has a static threshold ≤ 10 This policy looks at two metrics: IOPS Utilization and EBS Burst Balance. High IOPS Utilization (a Metricly computed metric) indicates the disk is highly utilized, and a low EBS Burst Balance indicates the disk is so highly utilized that most of the burst available to the disk is depleted. Once the burst balance is fully depleted, available disk IOPS will fall, causing slowdowns in I/O and deteriorated performance of the application using the volume. Check this volume and the application using it to see if the I/O profile has changed. Consider using Provisioned IOPS to increase disk performance if this new profile is normal.

 

EC2

Policy names are prefixed with AWS EC2 –

Policy name Duration Condition 1 (and) Condition 2 (and) Condition 3 Cat. Description
Elevated CPU Activity (Normal Network Activity) 30 min aws.ec2.cpuutilization has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.bytesinpersec does not have an upper baseline deviation + does not have an upper contextual deviation metricly.aws.ec2.bytesoutpersec does not have an upper baseline deviation + does not have an upper contextual deviation INFO Increases in CPU activity are not uncommon when there is a rise in network activity. Increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and said behavior cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about.
Elevated Network Activity 30 min metricly.aws.ec2.bytesinpersec has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.bytesoutpersec has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in network activity above what is considered to be normal.
Elevated Ephemeral Disk Activity 30 min metricly.aws.ec2.diskreadopspersec has an upper baseline deviation + an upper contextual deviation metricly.aws.ec2.diskwriteopspersec has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in disk activity above what is considered to be normal.
CPU Threshold Exceeded 30 min aws.ec2.cpuutilization has a static threshold > 95% WARNING The CPU on the EC2 instance has exceeded 95% for at least 15 minutes.

 

Elasticache

Policy names are prefixed with AWS Elasticache –

Policy name Duration Condition 1 (and) Condition 2 Category Description
Memcached – CPU Threshold Exceeded 5 min aws.elasticache.cpuutilization has a static threshold > 90% CRITICAL The Memcached Node has exceeded the CPU threshold of 90%. The cache cluster may need to be scaled, either by using a larger node type or by adding more nodes.
Memcached – Elevated CPU Utilization 30 min aws.elasticache.cpuutilization has an upper baseline deviation + a static threshold > 50% WARNING CPU utilization for the Memcached Node has been higher than expected for at least 30 minutes.
Memcached – Elevated Swap Usage 5 min aws.elasticache.swapusage has a static threshold > 53428800 CRITICAL Swap usage on the Memcached Node has exceeded 50 MB. It is recommended that you increase the value of the ConnectionOverhead parameter.
Redis – Elevated Command Executions 30 min aws.elasticache.*cmds has an upper baseline deviation + an upper contextual deviation WARNING One or more command types on the Redis node have been experiencing a higher than expected number of executions for at least 30 minutes.
Redis – Elevated CPU Utilization 30 min aws.elasticache.cpuutilization has an upper baseline deviation + a static threshold > 30% WARNING CPU utilization for the Redis Node has been higher than expected for at least 30 minutes.
Redis – Elevated Network Activity 30 min aws.elasticache.networkbytesin has an upper baseline deviation + an upper contextual deviation aws.elasticache.networkbytesout has an upper baseline deviation + an upper contextual deviation WARNING Network activity to/from the Redis node has been higher than expected for at least 30 minutes.
Redis – Elevated Number of New Connections 30 min aws.elasticache.newconnections has an upper baseline deviation + an upper contextual deviation WARNING The number of new connections being opened to the Redis node has been higher than expected for at least 30 minutes.

Redis – Elevated Replication Lag 30 min aws.elasticache.replicationlag has an upper baseline deviation WARNING Replication lag for the Redis node has been higher than expected for at least 30 minutes.

Redis – Elevated Swap Usage 30 min aws.elasticache.swapusage has an upper baseline deviation WARNING Swap usage on the Redis Node has been higher than expected for at least 30 minutes. Extended swapping indicates a low physical memory condition and can lead to performance degradation.

Redis – Extended Period of Evictions 30 min aws.elasticache.evictions has a static threshold > 0 WARNING Evictions for the Redis node have been greater than 0 for at least 30 minutes. This could indicate a low memory condition, and may impact performance.

Redis – Low Cache Hit Rate 30 min aws.elasticache.cachehitrate has a lower baseline deviation + a lower contextual deviation WARNING The cache hit rate for the Redis node has been lower than expected for at least 30 minutes.

 

ELB

Policy names are prefixed with AWS ELB –

Policy name Duration Condition 1 (and) Condition 2 Category Description
Elevated Backend Error Rate (Low Volume) 15 min metricly.aws.elb.httpcodebackenderrorpercent has an upper baseline deviation + an upper contextual deviation metricly.aws.elb.requestcount has a static threshold < 1,000 WARNING This is the first of three policies that look at elevated backend error rates. This policy looks specifically at low traffic volume cases. When traffic volumes are low, elevated error rates tend to be less important. For example, a 50% error rate is pretty significant if the total number of requests is 1 million; it is less so if the total number of requests is 10. Thus, this policy will generate a Warning if error rates are higher than normal and traffic volumes are low. By default, “low” is defined as less than 1,000 requests; you may wish to tune this for your own environment.
Elevated Backend Error Rate (High Volume, Low Error Rate) 15 min metricly.aws.elb.httpcodebackenderrorpercent has an upper baseline deviation + an upper contextual deviation + a static threshold < 2% metricly.aws.elb.requestcount has a static threshold ≥ 1,000 WARNING This is the second of three policies that look at elevated backend error rates. For many customers, an error rate that is low enough is not cause for concern even if it is higher than normal. For example, if the normal error rate is between 0.25% and 0.75%, an observed error rate of 1.1% is higher than expected, but may not be worth more than a Warning. Thus, this policy looks for those cases where the error rate is higher than expected but is under 2%. It also looks for traffic volumes to not be low, since the low traffic scenario is covered by the “Elevated Backend Error Rate (Low Volume)” policy. You may wish to tune either the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.
Elevated Backend Error Rate (High Volume, High Error Rate) 15 min metricly.aws.elb.httpcodebackenderrorpercent has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 2% metricly.aws.elb.requestcount has a static threshold ≥ 1,000 CRITICAL This is the third of three policies that look at elevated backend error rates. In this case, we are looking for both high traffic volumes (> 1,000) as well as error rates that are not just higher than normal, but are above the 2% threshold. In those cases, a Critical event will be generated. You may wish to tune either the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.
Elevated Latency 30 min aws.elb.latency has an upper baseline deviation + an upper contextual deviation metricly.aws.elb.requestcount has a static threshold ≥ 1,000 CRITICAL This policy will generate a Critical event when average latency is higher than normal for half an hour or longer. Note that there must also be a minimum number of requests for this policy to trigger; this is because with too few requests, the average can tend to be skewed by outliers. The default request threshold is 1,000; you may wish to tune this for your environment.
Surge Queue Utilization Greater Than 5% 15 min metricly.aws.elb.surgequeueutilization has a static threshold > 5% WARNING The ELB surge queue holds requests until they can be forwarded to the backend servers. The surge queue can hold a maximum of 1,024 requests, after which it will be full and will start rejecting requests. Metricly’s Surge Queue Utilization metric reflects as a percentage how full the surge queue currently is (see the short sketch after this table). If the surge queue is more than 5% full for 15 minutes or longer, a Warning event is generated.
Surge Queue Utilization Greater Than 50% 15 min metricly.aws.elb.surgequeueutilization has a static threshold > 50% CRITICAL The ELB surge queue holds requests until they can be forwarded to the backend servers. The surge queue can hold a maximum of 1,024 requests, after which it will be full and will start rejecting requests. Metricly’s Surge Queue Utilization metric reflects as a percentage how full the surge queue currently is. If the surge queue is more than 50% full for 15 minutes or longer, a Critical event is generated.
 Unhealthy Host Percent Above 50% 15 min metricly.aws.elb.unhealthyhostpercent has a static threshold ≥ 50% + a static threshold < 75% WARNING More than half (50%) of the hosts associated with this ELB are in an unhealthy state.
 Unhealthy Host Percent Above 75% 5 min metricly.aws.elb.unhealthyhostpercent has a static threshold ≥ 75% CRITICAL More than three quarters (75%) of the hosts associated with this ELB are in an unhealthy state.
Elevated ELB Error Rate 15 min metricly.aws.elb.httpcodeelberrorpercent has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 2% aws.elb.requestcount has a static threshold ≥ 1000 CRITICAL This is another error rate policy, but rather than looking at backend error rates, it is looking at errors from the ELB itself. In this case, we look for both high traffic volumes (> 1000) as well as error rates that are not just higher than normal, but are above a 2% threshold. In those cases, a Critical event will be generated. You may wish to tune either the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.
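
One plausible way to read the Surge Queue Utilization descriptions above is as the raw surge queue length expressed as a percentage of the 1,024-request maximum. The sketch below is an illustration of that reading, not a statement of how Metricly computes the metric internally.

    # Illustrative only: surge queue utilization as a percentage of the
    # ELB surge queue's 1,024-request capacity.
    MAX_SURGE_QUEUE_LENGTH = 1024

    def surge_queue_utilization_percent(surge_queue_length):
        return (surge_queue_length / MAX_SURGE_QUEUE_LENGTH) * 100.0

    # Example: a queue length of 256 is 25% utilization, which would trigger the
    # "Greater Than 5%" policy but not the "Greater Than 50%" policy.
    print(surge_queue_utilization_percent(256))  # 25.0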

 

Lambda

Policy names are prefixed with AWS Lambda –

Policy name Duration Conditions Category Description
Elevated Invocation Count 30 min aws.lambda.invocations has an upper baseline deviation + an upper contextual deviation WARNING The number of calls to the function (invocations) has been greater than expected for at least the last 30 minutes.
Depressed Invocation Count 10 min aws.lambda.invocations has a lower baseline deviation + a lower contextual deviation WARNING The number of calls to the function (invocations) has been lower than expected for at least the last 10 minutes.
Elevated Latency 30 min aws.lambda.duration has an upper baseline deviation + an upper contextual deviation WARNING The average duration per function call (latency) has been higher than expected for at least the past 30 minutes.

 

RDS

 

Policy name Duration Condition 1 (and) Condition 2 (and) Condition 3 Cat. Description
Elevated RDS CPU Activity (Normal Network Activity) 30 min metricly.aws.rds.cpuutilization has an upper baseline deviation + an upper contextual deviation + a static threshold > 20 metricly.aws.rds.networkreceivethroughput does not have an upper baseline deviation + does not have an upper contextual deviation metricly.aws.rds.networktransmitthroughput does not have an upper baseline deviation + does not have an upper contextual deviation INFO Increases in CPU activity are not uncommon when there is a rise in network activity. Increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and said behavior cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about.
Elevated RDS Network Activity 30 min metricly.aws.rds.networkreceivethroughput has an upper baseline deviation + an upper contextual deviation metricly.aws.rds.networktransmitthroughput has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in network activity above what is considered to be normal.
Elevated RDS Disk Activity 30 min metricly.aws.rds.readiops has an upper baseline deviation + an upper contextual deviation metricly.aws.rds.writeiops has an upper baseline deviation + an upper contextual deviation INFO Indicates an increase in disk activity above what is considered to be normal.
Elevated RDS Latency 30 min metricly.aws.rds.readlatency has an upper baseline deviation + an upper contextual deviation metricly.aws.rds.writelatency has an upper baseline deviation + an upper contextual deviation metricly.aws.rds.totalthroughput has a static threshold ≥ 1,000 CRITICAL This policy will generate a Critical event when both read and write latency are higher than normal for half an hour or longer. Note that there must also be a minimum number of requests for this policy to trigger; this is because with too few requests, the average can tend to be skewed by outliers. The default request threshold is 1,000; you may wish to tune this for your environment.
AWS RDS – Elevated Number of Connections 15 min metricly.aws.rds.databaseconnections has an upper baseline deviation + an upper contextual deviation WARNING The number of database connections open on the RDS instance is higher than expected.
AWS RDS – Elevated Read IOPS 15 min metricly.aws.rds.readiops has an upper baseline deviation + an upper contextual deviation WARNING Read activity on the RDS instance is greater than expected.
AWS RDS – Elevated Write IOPS 15 min metricly.aws.rds.writeiops has an upper baseline deviation + an upper contextual deviation WARNING Write activity on the RDS instance is greater than expected.

 

SQS

 

Policy name Duration Conditions Category Description
AWS SQS – Queue Falling Behind 2 hours metricly.aws.sqs.arrivalrate has a metric threshold > metricly.aws.sqs.completionrate CRITICAL The arrival rate for the queue has been greater than the completion rate for at least 2 hours. This is an indication that processing of the queue is falling behind.

 

Others

Apache HTTPD

 

Policy name Duration Conditions Category Description
Linux Apache HTTPD – Depressed Traffic Volume 30 min httpd.<host>.ReqPerSec has a lower baseline deviation WARNING The number of requests per second has been lower than expected for at least the past 30 minutes.
Linux Apache HTTPD – Elevated Traffic Volume 30 min httpd.<host>.ReqPerSec has an upper baseline deviation WARNING The number of requests per second has been higher than expected for at least the past 30 minutes.

 

Cassandra

All Policy names are prefixed with Cassandra –

Policy name Duration Conditions Category Description
Depressed Key Cache Hit Rate 30 min cassandra.Cache.KeyCache.HitRate has a lower baseline deviation + a static threshold ≤ 0.85 WARNING The hit rate for the key cache is lower than expected and is less than 85%. This condition has been persisting for at least the past 30 minutes.
Elevated Node Read Latency 30 min cassandra.Keyspace.ReadLatency.OneMinuteRate has an upper baseline deviation WARNING The overall keyspace read latency on this Cassandra node has been higher than expected for at least 30 minutes.
Elevated Node Write Latency 30 min cassandra.Keyspace.WriteLatency.OneMinuteRate has an upper baseline deviation WARNING The overall keyspace write latency on this Cassandra node has been higher than expected for at least 30 minutes.
Elevated Number of Pending Compaction Tasks 15 min cassandra.Compaction.PendingTasks has an upper baseline deviation WARNING The number of pending compaction tasks has been higher than expected for at least the past 15 minutes. This could indicate that the node is falling behind on compaction tasks.
Elevated Number of Pending Thread Pool Tasks 15 min cassandra.ThreadPools.*.PendingTasks has an upper baseline deviation WARNING For at least the past 15 minutes, the number of pending tasks for one or more thread pools has been higher than expected. This could indicate that the pools are falling behind on their tasks.
Unavailable Exceptions Greater Than Zero 5 min cassandra.*Unavailables.OneMinuteRate has a static threshold > 0 CRITICAL The required number of nodes were unavailable for one or more requests.

 

Collectd
Policy name Duration Conditions Category Description
Elevated Memory Usage (Collectd) 30 min metricly.collectd.memory.utilizationpercent has an upper baseline deviation INFO Indicates an increase in memory usage above what is considered to be normal.
Elevated Process Count 30 min metricly.collectd.processes.total has an upper baseline deviation INFO Indicates that the total number of processes has increased above what is considered to be normal.
Elevated Percentage of Blocked Processes 30 min metricly.collectd.processes.blockedpercent has an upper baseline deviation WARNING Indicates a higher-than-normal percentage of blocked processes.
Elevated Percentage of Zombie Processes 30 min metricly.collectd.processes.zombiepercent has an upper baseline deviation WARNING Indicates a higher-than-normal percentage of zombie processes.

 

Diamond/Linux

Before reading about these default policies, note that both the Elevated User CPU and Elevated System CPU policies assume that the CPU Collector is configured to collect aggregate CPU metrics rather than per-core metrics, and that the metrics are being normalized. This is done by setting percore to FALSE (it is TRUE by default) and normalize to TRUE (it is FALSE by default) in your configuration file, as sketched below. After adjusting these settings, save the configuration file and restart the agent to apply the changes. See the Linux or Diamond agent documentation for more information.
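
For reference, the snippet below shows roughly what those settings look like in a Diamond-style CPU collector configuration block. The file name and section layout depend on your installation, so treat this as an illustrative sketch rather than a literal excerpt of your agent's configuration file.

    # CPU collector settings (illustrative sketch; adjust to your agent's config layout)
    [[CPUCollector]]
    enabled = True
    percore = False      # report aggregate CPU metrics rather than per-core metrics
    normalize = True     # normalize the aggregate CPU metrics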

Policy name Duration Condition 1 (and) Condition 2 Category Description
Linux – CPU Threshold Exceeded 15 min cpu.total.utilization.percent has a static threshold >95% CRITICAL The CPU on the SERVER instance has exceeded 95% for at least 15 minutes.
Linux – Elevated System CPU 30 min metricly.linux.cpu.total.system.normalized has an upper baseline deviation + a static threshold ≥ 30% INFO This policy will generate an Informational event when CPU usage by system processes is higher than normal, but only if the actual value is also above 30%. Customers typically don’t want to be informed of deviations in CPU behavior when the actual values are too low; you may want to tune the 30% threshold for your environment.
Linux – Elevated User CPU 30 min metricly.linux.cpu.total.user.normalized has an upper baseline deviation + a static threshold ≥ 50% INFO This policy will generate an Informational event when CPU usage by user processes is higher than normal, but only if the actual value is also above 50%. Customers typically don’t want to be informed of deviations in CPU behavior when the actual values are too low; you may want to tune the 50% threshold for your environment.
Linux – Heavy CPU Load 15 min metricly.linux.cpu.total.user.normalized has an upper baseline deviation + an upper contextual deviation metricly.linux.loadavg.05.normalized has a static threshold > 2 CRITICAL This is a CRITICAL event indicating that the server’s CPU is under heavy load, based upon upper deviations on CPU utilization percent and the normalized loadavg.05 metric being greater than 2. Rule of thumb is that the run queue size (represented by the loadavg) should not be greater than 2x the number of CPUs (see the sketch after this table).
Linux – Disk Utilization Threshold Exceeded 15 min metricly.linux.diskspace.*.byte_percentused has a static threshold >95% CRITICAL The consumed disk space on the SERVER instance has exceeded 95% for at least 15 minutes.
Linux – Heavy Disk Load 15 min iostat.*.average_queue_length has an upper baseline deviation + an upper contextual deviation WARNING This is a WARNING which indicates that the disk is experiencing heavy load, but performance has not yet been impacted.
Linux – Heavy Disk Load with Slow Performance 15 min iostat.*.await has an upper baseline deviation + an upper contextual deviation iostat.*.average_queue_length has an upper baseline deviation + an upper contextual deviation CRITICAL This is a CRITICAL event which indicates that the disk is not only experiencing heavy load, but performance is suffering.
Linux – Agent Appears to be Down 15 min metricly.metrics.heartbeat has a static threshold <1 WARNING A heartbeat has not been received for a Metricly Agent for at least the past 15 minutes; the Agent may be down.
Linux – Memory Utilization Threshold Exceeded 15 min metricly.linux.memory.utilization.percent has a static threshold > 95% CRITICAL This is a CRITICAL event which is raised when memory utilization exceeds 95%.
Elevated Memory Usage 30 min metricly.linux.memory.utilizationpercent has an upper baseline deviation + a static threshold > 50% INFO This policy will generate an Informational event when memory usage is higher than normal, but only if the actual value is also above 50%. Customers typically don’t want to be informed of deviations in memory usage when the actual values are too low; you may want to tune the 50% threshold for your environment.
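
To make the load average rule of thumb behind the Heavy CPU Load policy concrete, the sketch below normalizes the 5-minute load average by the CPU count and compares it to the policy's threshold of 2. The normalization shown is an assumption about how a metric like metricly.linux.loadavg.05.normalized could be derived, not a statement of Metricly's exact implementation.

    # Illustrative only: normalize the 5-minute load average by the CPU count
    # and compare it against the rule-of-thumb threshold of 2.
    import os

    def normalized_loadavg_05():
        cpu_count = os.cpu_count() or 1
        load_1, load_5, load_15 = os.getloadavg()  # 1-, 5-, and 15-minute averages
        return load_5 / cpu_count

    # Example: an 8-CPU server with a 5-minute load average of 20 has a
    # normalized value of 20 / 8 = 2.5, which exceeds the threshold of 2.
    print(normalized_loadavg_05() > 2)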

 

Docker
Policy name Duration Conditions Category Description
Docker Container – CPU Throttling 15 min metricly.docker.cpu.container_throttling_percent has a static threshold >0 WARNING The Docker container has had its CPU usage throttled for at least the past 15 minutes.
Docker Container – Elevated CPU Utilization 30 min metricly.docker.cpu.container_cpu_percent has an upper baseline deviation + an upper contextual deviation INFO CPU usage on the Docker container has been higher than expected for 30 minutes or longer.
Docker Container – Elevated Memory Utilization 30 min metricly.docker.cpu.container_memory_percent has an upper baseline deviation + an upper contextual deviation INFO Memory usage on the Docker container has been higher than expected for 30 minutes or longer.
Docker Container – Extensive CPU Throttling 1 hour 5 min metricly.docker.cpu.container_throttling_percent has a static threshold >0 CRITICAL The Docker container has had its CPU usage throttled for over an hour.

 

Elastic Search
Policy name Duration Conditions Category Description
Cluster Health Degraded to Red 15 min elasticsearch.cluster_health.status has a static threshold < 1 CRITICAL The cluster health status is red, which means that one or more primary shards and their replicas are missing.
Cluster Health Degraded to Yellow 15 min elasticsearch.cluster_health.status is between 1 and 1.8 WARNING The cluster health status is yellow, which means that one or more shard replicas are missing.
Elevated JVM Heap Usage 15 min elasticsearch.jvm.mem.heap_used_percent has an upper baseline deviation WARNING This policy will generate a warning event when the Elastic Search JVM’s heap usage is above 80%.
Disk space is more than 75% used on data node netuitive.linux.diskspace.avg_byte_percentused has a static threshold > 75 WARNING The average utilization across your Elastic Search data node storage devices is more than 75%.
Elevated Fetch Time 30 min netuitive.linux.elasticsearch.indices._all.search.fetch_avg_time_in_millis has an upper baseline deviation WARNING This policy generates a warning event if the elasticsearch.indices._all.search.fetch_time_in_millis metric deviates above the baseline for 15 minutes or more.
Elevated Flush Time 30 min netuitive.linux.elasticsearch.indices._all.flush.avg_time_in_millis has an upper baseline deviation WARNING This policy generates a warning event if the elasticsearch.indices._all.flush.total_time_in_millis metric deviates above the baseline for 15 minutes or more.
Elevated Indexing Time 30 min netuitive.linux.elasticsearch.indices._all.indexing.index_avg_time_in_millis has an upper baseline deviation WARNING This policy generates a warning event if the elasticsearch.indices._all.indexing.index_time_in_millis metric deviates above the baseline for 15 minutes or more.
Reject Count Greater Than Zero 5 min elasticsearch.thread_pool.*.rejected has a static threshold > 0 WARNING This policy generates a warning if any of the Elastic Search thread pools has a “rejected” count greater than 0.

 

Java

 

Policy name Duration Conditions Category Description
Elevated JVM CPU Activity 15 min cpu.used.percent has an upper baseline deviation + an upper contextual deviation + a static threshold > 50% WARNING This policy will generate a WARNING event when the JVM’s CPU activity is higher than expected. Additionally, the CPU usage is above 50%.
Elevated JVM Heap Usage 15 min metricly.jvm.heap.utilizationpercent has an upper baseline deviation + an upper contextual deviation WARNING This policy will generate a WARNING event when the JVM’s heap usage is higher than expected.
Elevated JVM System Threads 15 min system.threads has an upper baseline deviation + an upper contextual deviation WARNING This policy will generate a WARNING event when the number of system threads used by the JVM is higher than expected.

 

Kafka

Policy names are prefixed with Kafka –

Policy name Duration Condition 1 (and) Condition 2 Category Description
Depressed Number of Zookeeper Connections 30 min kafka.zookeeper.zk_num_alive_connections has a lower baseline deviation WARNING The number of active connections to Zookeeper has been lower than expected for at least the past 30 minutes.
Elevated Consumer Lag 15 min kafka.zookeeper.consumer_groups.*.consumer_lag has an upper baseline deviation WARNING Consumer lag has been higher than expected for at least 15 minutes.
Elevated Consumer Purgatory Size 15 min kafka.server.DelayedOperationPurgatory.Fetch.PurgatorySize has an upper baseline deviation WARNING The purgatory size for consumer fetch requests is higher than expected. This may be causing increases in consumer request latency.
Elevated Consumer Servicing Time 15 min kafka.network.RequestMetrics.FetchConsumer.TotalTimeMs.Mean has an upper baseline deviation WARNING The broker is taking longer than usual to service consumer requests.
Elevated Number of Outstanding Zookeeper Requests 15 min kafka.zookeeper.zk_outstanding_requests has an upper baseline deviation WARNING The number of outstanding Zookeeper requests has been higher than expected for at least the past 15 minutes. This could be resulting in performance issues.
Elevated Producer Purgatory Size 15 min kafka.server.DelayedOperationPurgatory.Produce.PurgatorySize has an upper baseline deviation WARNING The purgatory size for producer requests is higher than expected. This may be causing increases in producer request latency.
Elevated Producer Servicing Time 15 min kafka.network.RequestMetrics.Produce.TotalTimeMs.Mean has an upper baseline deviation WARNING The broker is taking longer than usual to service producer requests.
Elevated Topic Activity 30 min BrokerTopicMetrics._all.BytesInPerSec.Count has an upper baseline deviation BrokerTopicMetrics._all.BytesOutPerSec.Count has an upper baseline deviation WARNING Topic activity has been higher than expected for at least the past 30 minutes.
Elevated Zookeeper Latency 15 min kafka.zookeeper.zk_avg_latency has an upper baseline deviation WARNING The average latency for Zookeeper requests has been higher than expected for at least the past 15 minutes.
Extended Period of Consumer Lag 1 hour and 15 min kafka.zookeeper.consumer_groups.*.consumer_lag has an upper baseline deviation CRITICAL Consumer lag has been higher than expected for over an hour.
No Active Controllers 5 min kafka.controller.ActiveControllerCount has a static threshold < 1 CRITICAL There are no active controllers in the Kafka cluster.
Unclean Leader Election Rate Greater Than 0 5 min kafka.controller.UncleanLeaderElectionsPerSec.Count has a static threshold > 0 CRITICAL An out-of-sync replica was chosen as leader because none of the available replicas were in sync. Some data loss has occurred as a result.
Under Replicated Partition Count Greater Than 0 30 min kafka.server.ReplicaManager.UnderReplicatedPartitions has a static threshold > 0 CRITICAL The number of partitions which are under-replicated has been greater than 0 for at least 30 minutes.

 

Microsoft Azure

All Policy names are prefixed with Azure VM – 

Some of the policies below require that you enable basic metric collection on your virtual machine. To learn about how to enable basic metrics, check out the Azure integration help page.
Policy name Metrics Required Duration Condition 1 (and) Condition 2 (and) Condition 3 Cat. Description
CPU Threshold Exceeded Boot Diagnostics 15 min Processor.PercentProcessorTime has a static threshold > 50% WARNING The CPU on the Azure Virtual Machine has exceeded 95% for at least 15 minutes.
Elevated CPU Activity (Normal Network Activity) Boot Diagnostics 30 min Processor.PercentProcessorTime has an upper baseline deviation + an upper contextual deviation + a static threshold > 20% NetworkInterface.BytesReceived has no deviation NetworkInterface.BytesTransmitted has no deviation INFO Increases in CPU activity are not uncommon when there is a rise in network activity. Increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and said behavior cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about. Note, however, that this policy will not fire if CPU utilization is less than 20%.
Elevated Disk Activity Boot Diagnostics 30 min PhysicalDisk.ReadsPerSecond has an upper baseline deviation + an upper contextual deviation PhysicalDisk.WritesPerSecond has an upper baseline deviation + an upper contextual deviation INFO Disk activity has been higher than expected for at least 30 minutes.
Elevated Memory Utilization Basic Metrics 15 min Memory.PercentUsedMemory has an upper baseline deviation + an upper contextual deviation WARNING The memory utilization on the Azure Virtual Machine is higher than expected.
Elevated Network Activity Boot Diagnostics 30 min NetworkInterface.BytesReceived has an upper baseline deviation + an upper contextual deviation NetworkInterface.BytesTransmitted has an upper baseline deviation + an upper contextual deviation INFO Network activity has been higher than expected for at least 30 minutes.
Heavy Disk Load Basic Metrics 5 min PhysicalDisk.AverageDiskQueueLength has an upper baseline deviation + an upper contextual deviation WARNING Average disk queue length is greater than expected, which could indicate a problem with heavy disk load.
MongoDB

Policy names are prefixed with MongoDB –

Policy name Duration Condition 1 (and) Condition 2 Category Description
Connections in Use Threshold Exceeded 5 min metricly.linux.mongo.connections.utilizationpercent has a static threshold > 90% CRITICAL More than 90% of the total connections to MongoDB are in use. You may need to scale your servers to handle the load.
Elevated Number of Queued Read Requests 30 min mongo.globalLock.currentQueue.readers has an upper baseline deviation WARNING The number of read requests waiting in the queue has been higher than expected for at least the past 30 minutes.
Elevated Number of Queued Write Requests 30 min mongo.globalLock.currentQueue.writers has an upper baseline deviation WARNING The number of write requests waiting in the queue has been higher than expected for at least the past 30 minutes.
Elevated Percentage of Connections in Use 30 min metricly.linux.mongo.connections.utilizationpercent has an upper baseline deviation WARNING The percentage of client connections in use has been higher than expected for at least the past 30 minutes.
Suspicious Read Activity 15 min metricly.linux.mongo.opcounters.totalreads has an upper baseline deviation mongo.globalLock.activeClients.readers has no deviation WARNING The total number of reads (query and getmore requests) has been higher than expected for at least the past 15 minutes. During this time, the number of active readers has remained within the expected range. Since the increase in read activity cannot be explained by a corresponding increase in the number of readers, the increase is deemed to be suspicious.
Suspicious Write Activity 15 min metricly.linux.mongo.opcounters.totalwrites has an upper baseline deviation mongo.globalLock.activeClients.writers has no deviation WARNING The total number of writes (insert, update, and delete requests) has been higher than expected for at least the past 15 minutes. During this time, the number of active writers has remained within the expected range. Since the increase in write activity cannot be explained by a corresponding increase in the number of writers, the increase is deemed to be suspicious.

 

RabbitMQ

Policy names are prefixed with RabbitMQ –

Policy name Duration Conditions Category Description
Depressed Message Count 30 min rabbitmq.queue_totals.messages has a lower baseline deviation WARNING The number of messages across all queues has been lower than expected for at least the past 30 minutes.
Elevated Memory Usage 30 min rabbitmq.health.mem_used has an upper baseline deviation WARNING Memory usage has been higher than expected for at least the past 30 minutes.
Elevated Message Count 30 min rabbitmq.queue_totals.messages has an upper baseline deviation WARNING The number of messages across all queues has been higher than expected for at least the past 30 minutes.
Exceeded Disk Free Limit 5 min rabbitmq.health.disk_free has a metric threshold < rabbitmq.health.disk_free_limit CRITICAL Free disk space has dropped below the configured disk free space limit.
Exceeded Memory Limit 5 min rabbitmq.health.mem_used has a metric threshold > rabbitmq.health.mem_limit CRITICAL Memory utilization has exceeded the configured memory limit.
Memory Usage Approaching Limit 5 min metricly.aws.elb.unhealthyhostpercent has a static threshold = 90% WARNING Memory utilization has reached 90% of the configured limit.
Unexpectedly Low Free Disk Space 30 min rabbitmq.health.disk_free has a lower baseline deviation WARNING Free disk space on the RabbitMQ node has been lower than expected for at least the past 30 minutes.

 

Windows

Policy names are prefixed with Windows –

Policy name Duration Condition 1 (and) Condition 2 (and) Condition 3 Cat. Description
Elevated Disk Latency 15 min physical_disk._Total.avg_sec_per_read has an upper baseline deviation physical_disk._Total.avg_sec_per_write has an upper baseline deviation WARNING This policy will generate a WARNING event when both disk read and write times are higher than their expected baselines
Elevated Memory Utilization 10 min metricly.winsrv.memory.utilizationpercent has an upper baseline deviation + an upper contextual deviation WARNING This policy will generate a WARNING event when memory utilization on the Windows server is higher than expected.
Heavy CPU Load 15 min metricly.winsrv.system.processor_queue_length_normalized has a static threshold > 2 processor._Total.percent_processor_time has an upper baseline deviation + an upper contextual deviation  system.context_switches_per_sec has an upper baseline deviation + an upper contextual deviation CRITICAL High CPU values by themselves are not always a good indicator of server being under heavy load. This policy looks for upper deviations not only in CPU, but in run queue size (system.processor_queue_length) and context switches as well. Taken together, upper deviations in all three of these key metrics are a good indication of an overloaded server.
Heavy Disk Load 15 min physical_disk._Total.avg_queue_length has an upper baseline deviation + an upper contextual deviation WARNING This policy will generate a WARNING event if the average disk queue length for the server is higher than expected, indicating a potential problem with heavy disk load.

 
