Default policies are created by Metricly to provide recommended ways to monitor the behavior of the elements in your environment. Default policies can be found on the Policies page and are marked as Metricly in the Created By column. You can edit default policies as needed to suit the behavior of your environment. When new default policies are provisioned to your account, Metricly will not overwrite any changes you have made to existing default policies. Furthermore, any new default policies added to your account will be disabled by default.
Before reading about default policies, you should first understand the concepts of scope, conditions, duration, notifications, and event categories.
AWS
ASG
Policy names are prefixed with AWS ASG – .
DynamoDB
Policy names are prefixed with AWS DynamoDB – .
Policy Name | Duration | Condition 1 | Cat. | Description |
---|---|---|---|---|
Elevated Read Capacity Utilization | 30 Min | metricly.aws.dynamodb.readcapacityutilization has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 50. | WARNING | Read Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time. |
Elevated Write Capacity Utilization | 30 Min | metricly.aws.dynamodb.writecapacityutilization has an upper baseline deviation + an upper contextual deviation + a static threshold ≥ 50. | WARNING | Write Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time. |
EBS
Before reading about the EBS default policy, it is important to understand the following Metricly computed metrics. For more information about computed metrics, see Computed metrics.
- Average Latency: Average Latency represents the average amount of time it takes for a disk operation to complete.
- Queue Length Differential: Queue Length Differential measures the difference between the actual disk queue length and the “ideal” disk queue length. The ideal queue length is based on Amazon’s rule of thumb that for every 200 IOPS you should have a queue length of 1. In theory, a well-optimized volume should have a queue length differential that hovers around 0. In practice, we have seen volumes with extremely low latency (< 0.0001) have queue length differentials that are higher than 0; presumably this is because the latency is much lower than Amazon assumes for its rule of thumb. Even in these cases, the differential is a fairly steady number. (See the sketch after this list.)
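The following is a minimal sketch of the Queue Length Differential calculation described above. The function and variable names are illustrative only; Metricly computes this metric for you.

```python
# Illustrative sketch of the Queue Length Differential computed metric.
# Names are hypothetical; this is not Metricly's internal implementation.
def queue_length_differential(actual_queue_length, iops):
    """Actual queue length minus the 'ideal' queue length, where the ideal
    follows Amazon's rule of thumb of a queue length of 1 per 200 IOPS."""
    ideal_queue_length = iops / 200.0
    return actual_queue_length - ideal_queue_length

# Example: a volume averaging 400 IOPS with an actual queue length of 5
print(queue_length_differential(5, 400))  # 3.0 (deeper than the ideal of 2)
```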
EC2
Policy names are prefixed with AWS EC2 –
EFS
Policy names are prefixed with AWS EFS –
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
AWS EFS – Depleted Burst Credit Balance | 15 minutes | aws.efs.burstcreditbalance = 0 | CRITICAL | There are no burst credits left; the number of burst credits that the file system has is zero. |
AWS EFS – IO Percentage Critical | 15 minutes | aws.efs.percentiolimit = 95% | CRITICAL | The file system has almost reached the I/O limit of the General Purpose performance mode. If this metric is at 100% more often than not, consider moving your application to a file system using the Max I/O performance mode. |
Elasticache
Policy names are prefixed with AWS Elasticache –
ELB
Policy names are prefixed with AWS ELB –
Lambda
Policy names are prefixed with AWS Lambda –
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
Elevated Invocation Count | 30 min | aws.lambda.invocations has an upper baseline deviation + an upper contextual deviation | WARNING | The number of calls to the function (invocations) has been greater than expected for at least the last 30 minutes. |
Depressed Invocation Count | 10 min | aws.lambda.invocations has a lower baseline deviation + a lower contextual deviation | WARNING | The number of calls to the function (invocations) has been lower than expected for at least the last 10 minutes. |
Elevated Latency | 30 min | aws.lambda.duration has an upper baseline deviation + an upper contextual deviation | WARNING | The average duration per function call (latency) has been higher than expected for at least the past 30 minutes. |
RDS
SQS
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
AWS SQS – Queue Falling Behind | 2 hours | metricly.aws.sqs.arrivalrate has a metric threshold > metricly.aws.sqs.completionrate | CRITICAL | The arrival rate for the queue has been greater than the completion rate for at least 2 hours. This is an indication that processing of the queue is falling behind. |
Others
Apache HTTPD
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
Linux Apache HTTPD – Depressed Traffic Volume | 30 min | httpd.<host>.ReqPerSec has a lower baseline deviation | WARNING | The number of requests per second has been lower than expected for at least the past 30 minutes. |
Linux Apache HTTPD – Elevated Traffic Volume | 30 min | httpd.<host>.ReqPerSec has an upper baseline deviation | WARNING | The number of requests per second has been higher than expected for at least the past 30 minutes. |
Cassandra
All policy names are prefixed with Cassandra –
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
Depressed Key Cache Hit Rate | 30 min | cassandra.Cache.KeyCache.HitRate has a lower baseline deviation + a static threshold ≤ 0.85 | WARNING | The hit rate for the key cache is lower than expected and is less than 85%. This condition has been persisting for at least the past 30 minutes. |
Elevated Node Read Latency | 30 min | cassandra.Keyspace.ReadLatency.OneMinuteRate has an upper baseline deviation | WARNING | The overall keyspace read latency on this Cassandra node has been higher than expected for at least 30 minutes. |
Elevated Node Write Latency | 30 min | cassandra.Keyspace.WriteLatency.OneMinuteRate has an upper baseline deviation | WARNING | The overall keyspace write latency on this Cassandra node has been higher than expected for at least 30 minutes. |
Elevated Number of Pending Compaction Tasks | 15 min | cassandra.Compaction.PendingTasks has an upper baseline deviation | WARNING | The number of pending compaction tasks has been higher than expected for at least the past 15 minutes. This could indicate that the node is falling behind on compaction tasks. |
Elevated Number of Pending Thread Pool Tasks | 15 min | cassandra.ThreadPools.*.PendingTasks has an upper baseline deviation | WARNING | For at least the past 15 minutes, the number of pending tasks for one or more thread pools has been higher than expected. This could indicate that the pools are falling behind on their tasks. |
Unavailable Exceptions Greater Than Zero | 5 min | cassandra.*Unavailables.OneMinuteRate has a static threshold > 0 | CRITICAL | The required number of nodes were unavailable for one or more requests. |
Collectd
Diamond/Linux
Before reading about these default policies, note that both the Elevated User CPU and Elevated System CPU policies assume that the CPU Collector is configured to collect aggregate CPU metrics rather than per-core metrics, and that the metrics are being normalized. To do this, set percore to FALSE (it is TRUE by default) and normalize to TRUE (it is FALSE by default) in your configuration file. After adjusting these settings, save the configuration file and restart the agent to apply the changes; a sketch of the relevant settings is shown below. See the Linux or Diamond agent documentation for more information.
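For example, with the Diamond agent these options live in the CPU collector's configuration file. The file name and path below are illustrative; check your agent's documentation for the exact location.

```
# CPUCollector.conf (illustrative path; the location depends on your installation)
# Collect aggregate CPU metrics rather than per-core metrics
percore = False
# Normalize the CPU metrics
normalize = True
```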
Docker
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
Docker Container – CPU Throttling | 15 min | metricly.docker.cpu.container_throttling_percent has a static threshold >0 | WARNING | The Docker container has had its CPU usage throttled for at least the past 15 minutes. |
Docker Container – Elevated CPU Utilization | 30 min | metricly.docker.cpu.container_cpu_percent has an upper baseline deviation + an upper contextual deviation | INFO | CPU usage on the Docker container has been higher than expected for 30 minutes or longer. |
Docker Container – Elevated Memory Utilization | 30 min | metricly.docker.cpu.container_memory_percent has an upper baseline deviation + an upper contextual deviation | INFO | Memory usage on the Docker container has been higher than expected for 30 minutes or longer. |
Docker Container – Extensive CPU Throttling | 1 hour 5 min | metricly.docker.cpu.container_throttling_percent has a static threshold >0 | CRITICAL | The Docker container has had its CPU usage throttled for over an hour. |
Elasticsearch
Policy name | Duration | Conditions | Category | Description |
---|---|---|---|---|
Cluster Health Degraded to Red | 15 min | elasticsearch.cluster_health.status has a static threshold < 1 | CRITICAL | The cluster health status is red, which means that one or more primary shards and their replicas are missing. |
Cluster Health Degraded to Yellow | 15 min | elasticsearch.cluster_health.status is between 1 and 1.8 | WARNING | The cluster health status is yellow, which means that one or more shard replicas are missing. |
Elevated JVM Heap Usage | 15 min | elasticsearch.jvm.mem.heap_used_percent has an upper baseline deviation | WARNING | This policy generates a warning event when the Elasticsearch JVM’s heap usage is above 80%. |
Disk space is more than 75% used on data node | | netuitive.linux.diskspace.avg_byte_percentused has a static threshold >75 | WARNING | The average utilization across your Elasticsearch data node storage devices is more than 75%. |
Elevated Fetch Time | 30 min | netuitive.linux.elasticsearch.indices._all.search.fetch_avg_time_in_millis has an upper baseline deviation | WARNING | This policy generates a warning event if the elasticsearch.indices._all.search.fetch_time_in_millis metric deviates above the baseline for 15 minutes or more. |
Elevated Flush Time | 30 min | netuitive.linux.elasticsearch.indices._all.flush.avg_time_in_millis has an upper baseline deviation | WARNING | This policy generates a warning event if the elasticsearch.indices._all.flush.total_time_in_millis metric deviates above the baseline for 15 minutes or more. |
Elevated Indexing Time | 30 min | netuitive.linux.elasticsearch.indices._all.indexing.index_avg_time_in_millis has an upper baseline deviation | WARNING | This policy generates a warning event if the elasticsearch.indices._all.indexing.index_time_in_millis metric deviates above the baseline for 15 minutes or more. |
Reject Count Greater Than Zero | 5 min | elasticsearch.thread_pool.*.rejected has a static threshold >0 | WARNING | This policy generates a warning if any of the Elasticsearch thread pools has a “rejected” count greater than 0. |
Java
Kafka
Policy names are prefixed with Kafka –
Policy name | Duration | Condition 1 | (and) Condition 2 | Category | Description |
---|---|---|---|---|---|
Depressed Number of Zookeeper Connections | 30 min | kafka.zookeeper.zk_num_alive_connections has a lower baseline deviation | | WARNING | The number of active connections to Zookeeper has been lower than expected for at least the past 30 minutes. |
Elevated Consumer Lag | 15 min | kafka.zookeeper.consumer_groups.*.consumer_lag has an upper baseline deviation | | WARNING | Consumer lag has been higher than expected for at least 15 minutes. |
Elevated Consumer Purgatory Size | 15 min | kafka.server.DelayedOperationPurgatory.Fetch.PurgatorySize has an upper baseline deviation | | WARNING | The purgatory size for consumer fetch requests is higher than expected. This may be causing increases in consumer request latency. |
Elevated Consumer Servicing Time | 15 min | kafka.network.RequestMetrics.FetchConsumer.TotalTimeMs.Mean has an upper baseline deviation | | WARNING | The broker is taking longer than usual to service consumer requests. |
Elevated Number of Outstanding Zookeeper Requests | 15 min | kafka.zookeeper.zk_outstanding_requests has an upper baseline deviation | | WARNING | The number of outstanding Zookeeper requests has been higher than expected for at least the past 15 minutes. This could be resulting in performance issues. |
Elevated Producer Purgatory Size | 15 min | kafka.server.DelayedOperationPurgatory.Produce.PurgatorySize has an upper baseline deviation | | WARNING | The purgatory size for producer requests is higher than expected. This may be causing increases in producer request latency. |
Elevated Producer Servicing Time | 15 min | kafka.network.RequestMetrics.Produce.TotalTimeMs.Mean has an upper baseline deviation | | WARNING | The broker is taking longer than usual to service producer requests. |
Elevated Topic Activity | 30 min | BrokerTopicMetrics._all.BytesInPerSec.Count has an upper baseline deviation | BrokerTopicMetrics._all.BytesOutPerSec.Count has an upper baseline deviation | WARNING | Topic activity has been higher than expected for at least the past 30 minutes. |
Elevated Zookeeper Latency | 15 min | kafka.zookeeper.zk_avg_latency has an upper baseline deviation | | WARNING | The average latency for Zookeeper requests has been higher than expected for at least the past 15 minutes. |
Extended Period of Consumer Lag | 1 hour and 15 min | kafka.zookeeper.consumer_groups.*.consumer_lag has an upper baseline deviation | | CRITICAL | Consumer lag has been higher than expected for over an hour. |
No Active Controllers | 5 min | kafka.controller.ActiveControllerCount has a static threshold < 1 | | CRITICAL | There are no active controllers in the Kafka cluster. |
Unclean Leader Election Rate Greater Than 0 | 5 min | kafka.controller.UncleanLeaderElectionsPerSec.Count has a static threshold > 0 | | CRITICAL | An out-of-sync replica was chosen as leader because none of the available replicas were in sync. Some data loss has occurred as a result. |
Under Replicated Partition Count Greater Than 0 | 30 min | kafka.server.ReplicaManager.UnderReplicatedPartitions has a static threshold > 0 | | CRITICAL | The number of partitions which are under-replicated has been greater than 0 for at least 30 minutes. |
Microsoft Azure
All policy names are prefixed with Azure VM –
Policy name | Metrics Required | Duration | Condition 1 | (and) Condition 2 | (and) Condition 3 | Cat. | Description |
---|---|---|---|---|---|---|---|
CPU Threshold Exceeded | Boot Diagnostics | 15 min | Processor.PercentProcessorTime has a static threshold > 50% | | | WARNING | The CPU on the Azure Virtual Machine has exceeded 95% for at least 15 minutes. |
Elevated CPU Activity (Normal Network Activity) | Boot Diagnostics | 30 min | Processor.PercentProcessorTime has an upper baseline deviation + an upper contextual deviation + a static threshold > 20% | NetworkInterface.BytesReceived has no deviation | NetworkInterface.BytesTransmitted has no deviation | INFO | Increases in CPU activity are not uncommon when there is a rise in network activity. Increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and the behavior cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about. Note that this policy will not fire if CPU utilization is less than 20%. |
Elevated Disk Activity | Boot Diagnostics | 30 min | PhysicalDisk.ReadsPerSecond has an upper baseline deviation + an upper contextual deviation | PhysicalDisk.WritesPerSecond has an upper baseline deviation + an upper contextual deviation | | INFO | Disk activity has been higher than expected for at least 30 minutes. |
Elevated Memory Utilization | Basic Metrics | 15 min | Memory.PercentUsedMemory has an upper baseline deviation + an upper contextual deviation | | | WARNING | The memory utilization on the Azure Virtual Machine is higher than expected. |
Elevated Network Activity | Boot Diagnostics | 30 min | NetworkInterface.BytesReceived has an upper baseline deviation + an upper contextual deviation | NetworkInterface.BytesTransmitted has an upper baseline deviation + an upper contextual deviation | | INFO | Network activity has been higher than expected for at least 30 minutes. |
Heavy Disk Load | Basic Metrics | 5 min | PhysicalDisk.AverageDiskQueueLength has an upper baseline deviation + an upper contextual deviation | | | WARNING | Average disk queue length is greater than expected, which could indicate a problem with heavy disk load. |
MongoDB
Policy names are prefixed with MongoDB –
Policy name | Duration | Condition 1 | (and) Condition 2 | Category | Description |
---|---|---|---|---|---|
Connections in Use Threshold Exceeded | 5 min | metricly.linux.mongo.connections.utilizationpercent has a static threshold > 90% | | CRITICAL | More than 90% of the total connections to MongoDB are in use. You may need to scale your servers to handle the load. |
Elevated Number of Queued Read Requests | 30 min | mongo.globalLock.currentQueue.readers has an upper baseline deviation | | WARNING | The number of read requests waiting in the queue has been higher than expected for at least the past 30 minutes. |
Elevated Number of Queued Write Requests | 30 min | mongo.globalLock.currentQueue.writers has an upper baseline deviation | | WARNING | The number of write requests waiting in the queue has been higher than expected for at least the past 30 minutes. |
Elevated Percentage of Connections in Use | 30 min | metricly.linux.mongo.connections.utilizationpercent has an upper baseline deviation | | WARNING | The percentage of client connections in use has been higher than expected for at least the past 30 minutes. |
Suspicious Read Activity | 15 min | metricly.linux.mongo.opcounters.totalreads has an upper baseline deviation | mongo.globalLock.activeClients.readers has no deviation | WARNING | The total number of reads (query and getmore requests) has been higher than expected for at least the past 15 minutes. During this time, the number of active readers has remained within the expected range. Since the increase in read activity cannot be explained by a corresponding increase in the number of readers, the increase is deemed to be suspicious. |
Suspicious Write Activity | 15 min | metricly.linux.mongo.opcounters.totalwrites has an upper baseline deviation | mongo.globalLock.activeClients.writers has no deviation | WARNING | The total number of writes (insert, update, and delete requests) has been higher than expected for at least the past 15 minutes. During this time, the number of active writers has remained within the expected range. Since the increase in write activity cannot be explained by a corresponding increase in the number of writers, the increase is deemed to be suspicious. |
RabbitMQ
Policy names are prefixed with RabbitMQ –
Windows
Policy names are prefixed with Windows –