Amazon OpenSearch Service Metrics and CloudWatch Statistics
Level: intermediate · ~15 min read · Intent: informational
Audience: AWS platform engineers, SRE teams, backend engineers, cloud architects, DevOps engineers
Prerequisites
- basic familiarity with Amazon OpenSearch Service
- basic CloudWatch metrics and alarm experience
- some exposure to search, indexing, or log analytics workloads
Key takeaways
- For provisioned Amazon OpenSearch Service domains, most operational metrics still live in the AWS/ES CloudWatch namespace, even though the service name is now OpenSearch.
- The most useful production dashboard starts with cluster health, storage headroom, JVM pressure, CPU, search latency, indexing latency, request errors, node count, and thread pool rejection metrics.
- Use Maximum for node-risk metrics such as CPUUtilization and JVMMemoryPressure, Minimum for low-headroom metrics such as FreeStorageSpace and Nodes, and Sum for request counts, errors, and rejection counters.
- Serverless OpenSearch has a separate AWS/AOSS namespace and different metrics, so do not copy provisioned-domain alarms directly into serverless collections.
References
- AWS Docs: Monitoring OpenSearch cluster metrics with CloudWatch
- AWS Docs: Recommended CloudWatch alarms for Amazon OpenSearch Service
- AWS Docs: Monitoring Amazon OpenSearch Service domains
- AWS Docs: Monitoring OpenSearch Serverless with CloudWatch
- AWS Docs: Amazon OpenSearch Service rename summary
- AWS Docs: Cluster Insights for Amazon OpenSearch Service
- AWS CLI: cloudwatch get-metric-statistics
FAQ
- Where are Amazon OpenSearch Service metrics in CloudWatch?
  For provisioned OpenSearch Service domains, CloudWatch metrics are under the AWS/ES namespace. Serverless collections use AWS/AOSS instead.
- Why do AWS Elasticsearch Service statistics searches still matter?
  Amazon Elasticsearch Service was renamed to Amazon OpenSearch Service, but many teams still use the old name in dashboards, runbooks, billing reports, and search queries. The underlying operational intent is usually OpenSearch Service CloudWatch metrics.
- Which OpenSearch CloudWatch alarms should I create first?
  Start with ClusterStatus.red, ClusterStatus.yellow, FreeStorageSpace, ClusterIndexWritesBlocked, Nodes, AutomatedSnapshotFailure, CPUUtilization, JVMMemoryPressure, OldGenJVMMemoryPressure, MasterCPUUtilization, 5xx errors, and thread pool queues or rejections.
- Should I use Average or Maximum for OpenSearch CPU and JVM alarms?
  Use Maximum when a single hot node can hurt the cluster, especially for CPUUtilization and JVMMemoryPressure. Average is useful for trend dashboards but can hide one overloaded node.
- Do OpenSearch Serverless metrics use the same alarms as provisioned domains?
  No. OpenSearch Serverless uses the AWS/AOSS namespace and exposes collection-oriented metrics such as SearchRequestLatency, IngestionRequestLatency, SearchOCU, IndexingOCU, and ingestion or search errors.
Amazon OpenSearch Service metrics are easy to find and surprisingly easy to misunderstand.
Part of the confusion is historical.
Many teams still say "AWS Elasticsearch Service statistics" or "Amazon Elasticsearch Service stats" even though Amazon Elasticsearch Service was renamed to Amazon OpenSearch Service. AWS still carries some legacy naming in places that matter operationally. For provisioned domains, the CloudWatch namespace is still AWS/ES, and at least one request metric is explicitly documented as OpenSearchRequests, previously ElasticsearchRequests.
So if you are trying to debug an old dashboard, rebuild alarms after a domain upgrade, or understand why search latency spiked after an index rollout, do not start with a giant list of every metric.
Start with the operational questions:
- Is the cluster healthy?
- Are writes blocked?
- Is one node out of disk?
- Is JVM pressure close to failure?
- Are searches slow because the workload is heavy, the shards are skewed, or the disks are throttled?
- Are indexing requests backing up or being rejected?
- Did a deployment, blue/green change, or node replacement reset cumulative counters?
This guide focuses on the OpenSearch Service metrics that answer those questions, how to choose CloudWatch statistics for each one, and how to build a dashboard that helps during incidents instead of burying you in charts.
For broader platform selection and architecture, read Amazon OpenSearch: A Practical Guide for Fast, Scalable Search. For the product-line comparison, read OpenSearch vs Elasticsearch: When to Choose Each.
Executive Summary
For provisioned Amazon OpenSearch Service domains:
| Area | Metrics to start with | Statistic to prefer | What it tells you |
|---|---|---|---|
| Cluster health | ClusterStatus.red, ClusterStatus.yellow, ClusterStatus.green | Maximum | Whether shard allocation is healthy |
| Storage | FreeStorageSpace, ClusterUsedSpace | Minimum for free space, Maximum for used space | Whether a single node is close to write-blocking |
| Writes | ClusterIndexWritesBlocked, IndexingRate, IndexingLatency, ThreadpoolWriteQueue, ThreadpoolWriteRejected | Maximum for blocks and latency, Sum for rejections | Whether ingestion is keeping up |
| Search | SearchRate, SearchLatency, ThreadpoolSearchQueue, ThreadpoolSearchRejected, 5xx | Maximum for latency and queues, Sum for errors | Whether query traffic is saturating the cluster |
| JVM and CPU | JVMMemoryPressure, OldGenJVMMemoryPressure, CPUUtilization | Maximum | Whether one node is near failure |
| Masters | MasterCPUUtilization, MasterJVMMemoryPressure, MasterReachableFromNode | Maximum, and Minimum for reachability | Whether cluster control plane stability is at risk |
| EBS | ReadLatency, WriteLatency, DiskQueueDepth, BurstBalance, IopsThrottle, ThroughputThrottle | Maximum for throttles and latency, Minimum for credits | Whether storage is the bottleneck |
| Requests | OpenSearchRequests, 2xx, 3xx, 4xx, 5xx, InvalidHostHeaderRequests, TLSNegotiationError | Sum | Whether client or endpoint behavior changed |
The most important habit is simple:
Use Maximum for "one bad node can hurt us" metrics, Minimum for low-headroom metrics, and Sum for counters.
That single rule prevents many quiet monitoring mistakes.
AWS Elasticsearch Service Statistics vs OpenSearch Metrics
If someone on your team asks for AWS Elasticsearch Service statistics, they probably mean one of three things:
- CloudWatch metrics for a managed OpenSearch Service domain.
- Old dashboards created before the service rename.
- Historical billing, cost, or alarm language that still uses Elasticsearch wording.
AWS renamed the service to Amazon OpenSearch Service on September 8, 2021. The rename changed service names, API names, instance type naming, Dashboards terminology, and some CloudWatch metric names.
But not everything became visually clean overnight.
The provisioned-domain CloudWatch namespace is still:
AWS/ES
That means this command is still a normal starting point for provisioned domains:
aws cloudwatch list-metrics --namespace "AWS/ES"
Serverless is different. Amazon OpenSearch Serverless reports to:
AWS/AOSS
Do not mix these two namespaces in one runbook without labeling them clearly. A provisioned cluster alarm and a serverless collection alarm may use similar words, but they are watching different operating models.
The Metrics That Should Be On Your First Dashboard
A useful OpenSearch dashboard is not a museum of every metric. It is a triage surface.
When the page is slow, ingestion is delayed, or users are seeing errors, the dashboard should quickly answer:
- Is this a health issue?
- Is this a capacity issue?
- Is this a storage issue?
- Is this a search workload issue?
- Is this an indexing workload issue?
- Is this a client/API issue?
1. Cluster health
Start with:
- ClusterStatus.green
- ClusterStatus.yellow
- ClusterStatus.red
- Shards.active
- Shards.unassigned
- Nodes
Use Maximum for cluster status values because these metrics are binary health signals. A value of 1 means the condition is true.
Red means at least one primary shard is not allocated. That is an incident.
Yellow means primary shards are allocated but at least one replica shard is not. That might be expected on a single-node test domain, but it is not something to ignore in production.
The Nodes metric deserves a separate alarm. If the minimum node count drops below the number you expect, at least one node was unreachable during the evaluation window.
2. Storage and write blocking
Watch:
- FreeStorageSpace
- ClusterUsedSpace
- ClusterIndexWritesBlocked
- IopsThrottle
- ThroughputThrottle
FreeStorageSpace is one of the easiest metrics to configure badly.
Use Minimum, not only Average.
Average free storage can look fine while one node is running out of disk. The node with the least free space is the one that can push the cluster toward write failures. AWS documents FreeStorageSpace in MiB in CloudWatch, while the console displays it in GiB, so avoid copying a GiB threshold into a MiB alarm by accident.
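The unit mismatch is easy to guard against with a small conversion step. The sketch below is illustrative only: the helper name `free_storage_threshold_mib` and the 25% default are assumptions chosen for this example, not an AWS API. It turns per-node storage expressed in GiB into a MiB threshold suitable for a `FreeStorageSpace` alarm.

```python
MIB_PER_GIB = 1024


def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """Return a FreeStorageSpace alarm threshold in MiB.

    node_storage_gib: storage per data node in GiB (the unit the console shows).
    fraction: share of node storage that should stay free (hypothetical default).
    """
    if node_storage_gib <= 0:
        raise ValueError("node_storage_gib must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    # CloudWatch reports FreeStorageSpace in MiB, so convert before comparing.
    return int(node_storage_gib * MIB_PER_GIB * fraction)


# A 100 GiB node with a 25% free-space floor.
print(free_storage_threshold_mib(100))  # 25600
```

Keeping the conversion in one place means the alarm threshold and the capacity plan cannot silently drift into different units.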
ClusterIndexWritesBlocked should be treated as urgent. It means the cluster is blocking incoming write requests. Low free space and high JVM pressure are common contributors, but the alarm should send engineers straight into storage, shard distribution, and heap pressure checks.
A simple rule:
| Signal | Interpretation |
|---|---|
| FreeStorageSpace falling | Capacity or skew problem |
| ClusterIndexWritesBlocked = 1 | Writes are already blocked |
| IopsThrottle = 1 | I/O limit is being hit |
| ThroughputThrottle = 1 | Disk throughput limit is being hit |
| DiskQueueDepth rising | Storage work is backing up |
If storage is consistently part of your incidents, the fix is rarely "add one alarm." Look at shard count, index lifecycle policy, rollover size, retention, hot/warm tiering, and whether the domain is sized around the real workload.
CPU, JVM, and the Single-Hot-Node Problem
For OpenSearch, the average can lie politely.
One hot data node can make the cluster feel unstable while average CPU looks acceptable. That is why Maximum is usually the safer statistic for:
- CPUUtilization
- JVMMemoryPressure
- OldGenJVMMemoryPressure
- JVMGCOldCollectionCount
- JVMGCOldCollectionTime
AWS recommends alarming when CPUUtilization or WarmCPUUtilization stays at or above 80% for a sustained window. For JVM pressure, AWS recommended alarms include JVMMemoryPressure at 95% and OldGenJVMMemoryPressure at 80%.
Those are not magic numbers for every workload, but they are good default guardrails.
Use dashboard panels differently from alarms:
- Dashboard: show Average and Maximum together so you can see cluster-wide load and hot-node risk.
- Alarm: use Maximum when one node can create a user-visible failure.
- Investigation: drill into node-level dimensions when the cluster-level maximum spikes.
SysMemoryUtilization is worth displaying, but do not treat it as your main heap-risk indicator. In OpenSearch, high system memory usage can be normal. JVM memory pressure is usually a more relevant stability signal.
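One way to put Average and Maximum on the same panel is to generate the dashboard widget definition in code. This is a minimal sketch, assuming the standard CloudWatch dashboard body JSON format; the function name `avg_and_max_widget`, the widget size, and the domain, account, and region values are placeholders for illustration.

```python
import json


def avg_and_max_widget(metric_name, domain_name, account_id, region, period=300):
    """Build one dashboard widget that plots Average and Maximum together."""
    # Per-domain dimensions for a provisioned domain, in dashboard array form.
    base = ["AWS/ES", metric_name, "DomainName", domain_name, "ClientId", account_id]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"{metric_name}: Average vs Maximum",
            "region": region,
            "period": period,
            "view": "timeSeries",
            "metrics": [
                base + [{"stat": "Average", "label": "Average (cluster-wide load)"}],
                base + [{"stat": "Maximum", "label": "Maximum (hottest node)"}],
            ],
        },
    }


# Placeholder domain, account, and region values.
widgets = [
    avg_and_max_widget(name, "my-domain", "123456789012", "us-east-1")
    for name in ("CPUUtilization", "JVMMemoryPressure")
]
dashboard_body = json.dumps({"widgets": widgets})
```

The resulting `dashboard_body` string is the kind of value that boto3's `put_dashboard(DashboardName=..., DashboardBody=dashboard_body)` expects. A widening gap between the two lines on a panel is the visual cue for hot-node risk.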
Search Metrics: Latency, Rate, Queues, and Errors
Search problems usually have four parts:
- The amount of search work arriving.
- The latency of that work.
- Whether queues are filling.
- Whether requests are failing or being rejected.
Start with:
- SearchRate
- SearchLatency
- ThreadpoolSearchQueue
- ThreadpoolSearchRejected
- 5xx
- OpenSearchRequests
SearchRate is not the same as user-facing request rate. AWS documents it as search requests per minute for all shards on a data node. One _search request can touch many shards, so shard layout affects the metric.
That is useful.
If application request volume is flat but SearchRate rises, the problem may be shard fan-out, broader queries, index expansion, or a change in query routing.
Use SearchLatency with Maximum when troubleshooting incidents. Average latency can be fine while one node or shard group is dragging the user experience down.
Use ThreadpoolSearchQueue and ThreadpoolSearchRejected to tell the difference between "queries are slower" and "queries are piling up or being dropped." Rejections deserve attention even when the total number is low, because they indicate the cluster has crossed from latency into failed work.
Indexing Metrics: When Writes Fall Behind
Indexing incidents usually show up as delayed logs, stale search results, bulk ingestion failures, or a sudden write block.
Start with:
- IndexingRate
- IndexingLatency
- ThreadpoolWriteQueue
- ThreadpoolWriteRejected
- ClusterIndexWritesBlocked
- FreeStorageSpace
- JVMMemoryPressure
IndexingRate counts indexing operations, not just API calls. A single bulk request can represent many operations, and the work can be spread across nodes.
That means you should compare:
- application bulk request volume,
- OpenSearch indexing rate,
- indexing latency,
- write queue depth,
- write rejections,
- and storage or JVM pressure.
If indexing latency rises but write rejections stay low, the cluster may still be absorbing the workload. If queue depth and rejections rise together, ingestion concurrency or cluster capacity is likely beyond a healthy level.
If ClusterIndexWritesBlocked flips to 1, do not treat it as a normal performance event. It is a write-availability event.
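The indexing reasoning above can be written down as a small decision function, which helps keep runbooks and alert routing consistent. This is only a sketch: the function name, the boolean inputs, and the returned labels are hypothetical, and deciding what counts as "rising" is left to the caller.

```python
def classify_indexing_state(writes_blocked, latency_rising, queue_rising, rejections_rising):
    """Map indexing signals to a coarse triage label (labels are illustrative)."""
    # A write block is an availability event, regardless of other signals.
    if writes_blocked:
        return "write-availability incident"
    # Queue depth and rejections rising together: load is beyond a healthy level.
    if queue_rising and rejections_rising:
        return "beyond healthy capacity"
    # Latency up without rejections: the cluster may still be absorbing the work.
    if latency_rising and not rejections_rising:
        return "slower but still absorbing"
    # Any other single signal is worth a look but is not classified here.
    if latency_rising or queue_rising or rejections_rising:
        return "needs investigation"
    return "steady"


print(classify_indexing_state(False, True, False, False))  # slower but still absorbing
print(classify_indexing_state(False, True, True, True))    # beyond healthy capacity
```

Checking the write block first mirrors the priority in the text: blocked writes outrank every performance signal.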
Master Node Metrics Are Stability Metrics
Dedicated master nodes are not where your application queries should land, but they are still part of the cluster's ability to stay coherent.
Watch:
- MasterCPUUtilization
- MasterJVMMemoryPressure
- MasterOldGenJVMMemoryPressure
- MasterReachableFromNode
- Nodes
AWS recommended alarms keep master CPU thresholds lower than data node CPU thresholds because masters are responsible for cluster stability and configuration changes.
That is the right mental model.
Master node pressure can show up during:
- index creation storms,
- shard churn,
- too many small indices,
- blue/green deployments,
- node replacement,
- mapping explosions,
- or cluster state growth.
If data-node metrics look fine but the cluster feels unstable during deployments or index lifecycle activity, check the master metrics and shard counts before chasing application code.
EBS Metrics: When Search Is Waiting On Storage
Provisioned domains with EBS storage need storage-level visibility.
Watch:
- ReadLatency
- WriteLatency
- ReadIOPS
- WriteIOPS
- ReadThroughput
- WriteThroughput
- DiskQueueDepth
- BurstBalance
- IopsThrottle
- ThroughputThrottle
- VolumeStalledIOcheck
EBS problems often masquerade as search problems.
For example:
- search latency rises,
- CPU is not maxed,
- JVM is not near the danger zone,
- but disk queue depth rises and throttling appears.
That points away from query syntax and toward storage throughput, IOPS, shard placement, or index lifecycle design.
If BurstBalance matters for your volume type (burst credits apply to gp2 volumes, not gp3), display it with Minimum. When one node exhausts its burst credits, that node becomes the bottleneck, so the weakest node matters more than the cluster average.
Request Metrics: Separate User Errors From Service Pain
Track:
- OpenSearchRequests
- 2xx
- 3xx
- 4xx
- 5xx
- InvalidHostHeaderRequests
- TLSNegotiationError
Use Sum for these.
The useful view is often a ratio:
5xx / OpenSearchRequests
AWS recommends considering alarms when 5xx responses reach a meaningful percentage of OpenSearch requests. The exact threshold depends on the workload, but a sustained rise in 5xx responses usually means users or ingestion clients are experiencing real failures.
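The ratio can be expressed as CloudWatch metric math. The sketch below builds a `MetricDataQueries` list in the shape boto3's `get_metric_data` accepts. The function names, query IDs, and placeholder domain values are assumptions for illustration, and `error_percent` is a local helper for checking the arithmetic, not part of any AWS SDK.

```python
def five_xx_ratio_queries(domain_name, account_id, period=300):
    """Build metric-math queries for 5xx as a percentage of OpenSearchRequests."""
    dimensions = [
        {"Name": "DomainName", "Value": domain_name},
        {"Name": "ClientId", "Value": account_id},
    ]

    def summed(query_id, metric_name):
        # Request and error metrics are counters, so Sum is the statistic.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ES",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the ratio is returned
        }

    return [
        summed("errors", "5xx"),
        summed("requests", "OpenSearchRequests"),
        {
            "Id": "ratio",
            "Expression": "100 * errors / requests",
            "Label": "5xx percent of requests",
            "ReturnData": True,
        },
    ]


def error_percent(errors, requests):
    """Same arithmetic locally; returns None when there was no traffic."""
    if requests <= 0:
        return None
    return 100.0 * errors / requests


print(error_percent(5, 1000))  # 0.5
```

A call such as `get_metric_data(MetricDataQueries=five_xx_ratio_queries(...), StartTime=..., EndTime=...)` would then return only the ratio series. How you treat periods with zero requests is a design choice worth making explicitly.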
4xx is different. A spike in 4xx responses often points to client behavior, permissions, request shape, auth, missing indexes, or rejected invalid requests. Still alert on it when it is abnormal for your application, but route it differently from a cluster-health alarm.
InvalidHostHeaderRequests and TLSNegotiationError are useful security and integration signals. They can catch clients using the wrong endpoint, broken TLS settings, scans against a public domain, or misconfigured proxies.
Which CloudWatch Statistic Should You Use?
CloudWatch lets you view statistics such as Average, Maximum, Minimum, and Sum.
For OpenSearch Service, the statistic is not a cosmetic choice.
Use this decision table:
| Metric type | Best statistic | Why |
|---|---|---|
| Binary health flags | Maximum | You want to know if the bad state happened |
| Low free headroom | Minimum | The weakest node matters |
| CPU and JVM pressure | Maximum | One hot node can hurt the cluster |
| Latency | Maximum for alarms, Average plus Maximum for dashboards | Average hides tail pain |
| Request counts | Sum | Counts need total volume |
| Error counts | Sum | Errors need total volume |
| Rejection counters | Sum or metric math difference | Rejections are cumulative-style signals |
| Node count | Minimum | You want to detect missing nodes |
| Burst credits | Minimum | One depleted node can bottleneck |
Also remember that AWS notes cumulative metrics can reset during node drops, node bounces, node replacements, and blue/green deployments. That makes raw cumulative counters less useful than trends, differences, and "did this increase during the last period?" alarms.
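To make the "did this increase during the last period?" idea concrete, here is a minimal sketch of reset-aware differencing over a series of cumulative counter samples. It assumes that a counter restarts from zero after a reset, so a lower reading is treated as the amount accrued since the restart; the function name is hypothetical.

```python
def period_increases(samples):
    """Turn cumulative counter samples into per-period increases.

    A drop between two samples is treated as a counter reset (for example
    after a node replacement), assuming the counter restarted from zero.
    """
    increases = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increases.append(current - previous)
        else:
            # Reset detected: count only what accumulated after the restart.
            increases.append(current)
    return increases


# The counter climbs to 12, resets during a deployment, then climbs again.
print(period_increases([10, 12, 12, 3, 7]))  # [2, 0, 3, 4]
```

Alarming on these per-period increases rather than on the raw counter avoids both false alarms from old totals and blind spots right after a reset.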
Example: Pull One Metric With AWS CLI
For a provisioned domain, list metrics first:
aws cloudwatch list-metrics \
--namespace "AWS/ES"
Then pull a specific metric. This example asks for maximum JVM pressure over five-minute periods.
aws cloudwatch get-metric-statistics \
--namespace "AWS/ES" \
--metric-name "JVMMemoryPressure" \
--dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
--start-time 2026-06-04T00:00:00Z \
--end-time 2026-06-04T01:00:00Z \
--period 300 \
--statistics Maximum
For per-node investigation, use dimensions that include the node identifier where available. For domain-level dashboards, use the per-domain dimensions so the chart stays readable.
The exact dimensions available depend on the metric family. AWS documents domain, node, shard-role, and serverless collection dimensions separately, so avoid assuming every metric has the same dimension set.
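The same request can be prepared from Python. This sketch builds keyword arguments in the shape boto3's `get_metric_statistics` accepts and picks the worst datapoint from a response-shaped dictionary. The function names are hypothetical, the domain and account values are placeholders, and the sample response is made up purely to show the shape.

```python
from datetime import datetime, timedelta, timezone


def jvm_pressure_request(domain_name, account_id, end, hours=1, period=300):
    """Keyword arguments for a maximum-JVM-pressure query on a provisioned domain."""
    return {
        "Namespace": "AWS/ES",
        "MetricName": "JVMMemoryPressure",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Maximum"],
    }


def worst_datapoint(response, statistic="Maximum"):
    """Return the datapoint with the highest value, or None if there are none."""
    # Datapoints are not guaranteed to arrive in time order, so scan them all.
    return max(response.get("Datapoints", []), key=lambda p: p[statistic], default=None)


request = jvm_pressure_request(
    "my-domain", "123456789012", datetime(2026, 6, 4, 1, 0, tzinfo=timezone.utc)
)

# Made-up response, shaped like a get_metric_statistics result.
sample_response = {
    "Datapoints": [
        {"Timestamp": datetime(2026, 6, 4, 0, 5, tzinfo=timezone.utc), "Maximum": 71.0},
        {"Timestamp": datetime(2026, 6, 4, 0, 0, tzinfo=timezone.utc), "Maximum": 91.5},
    ]
}
print(worst_datapoint(sample_response)["Maximum"])  # 91.5
```

With credentials configured, the request would be sent as `boto3.client("cloudwatch").get_metric_statistics(**request)`.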
A Practical Alarm Set For Provisioned Domains
Use this as a starting point, then tune it to your workload.
| Alarm | Suggested starting rule | Route to |
|---|---|---|
| Red cluster | ClusterStatus.red Maximum >= 1 | Immediate incident |
| Yellow cluster | ClusterStatus.yellow Maximum >= 1 for several periods | Platform review or incident depending on workload |
| Low free storage | FreeStorageSpace Minimum <= 25% of node storage | Capacity/on-call |
| Writes blocked | ClusterIndexWritesBlocked Maximum >= 1 | Immediate incident |
| Node missing | Nodes Minimum < expected node count | Immediate investigation |
| Snapshot failure | AutomatedSnapshotFailure Maximum >= 1 | Reliability review |
| High data CPU | CPUUtilization Maximum >= 80% sustained | Capacity/performance |
| High JVM | JVMMemoryPressure Maximum >= 95% | Immediate investigation |
| Old-gen pressure | OldGenJVMMemoryPressure Maximum >= 80% | Heap and shard review |
| Master CPU | MasterCPUUtilization Maximum >= 50% sustained | Cluster stability review |
| Search queue | ThreadpoolSearchQueue Maximum above baseline | Search workload review |
| Write queue | ThreadpoolWriteQueue Average or Maximum above baseline | Ingestion review |
| Search rejected | increase in ThreadpoolSearchRejected | User-facing performance incident |
| Write rejected | increase in ThreadpoolWriteRejected | Ingestion incident |
| 5xx ratio | 5xx / OpenSearchRequests above baseline | Application and cluster triage |
| TLS errors | TLSNegotiationError Sum above baseline | Client or security review |
Do not copy these into production blindly.
A logging cluster, a product-search cluster, and a RAG retrieval cluster have different traffic patterns. The right alarm window for one may be noisy or too slow for another.
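Keeping the alarm set as data makes it easier to tune per workload. The sketch below encodes the fixed-threshold rows of the table and expands each into keyword arguments in the shape boto3's `put_metric_alarm` accepts. The `AlarmSpec` class, the alarm names, the periods, and the evaluation counts are all assumptions for illustration; the baseline-relative rows are left out because they need workload-specific thresholds.

```python
from dataclasses import dataclass

GE = "GreaterThanOrEqualToThreshold"
LT = "LessThanThreshold"


@dataclass(frozen=True)
class AlarmSpec:
    name: str
    metric: str
    statistic: str
    comparison: str
    threshold: float
    period: int = 60             # seconds per datapoint (placeholder)
    evaluation_periods: int = 1  # consecutive breaching periods (placeholder)


def starting_specs(expected_nodes):
    """Fixed-threshold rows from the table; windows are placeholders to tune."""
    return [
        AlarmSpec("cluster-red", "ClusterStatus.red", "Maximum", GE, 1),
        AlarmSpec("writes-blocked", "ClusterIndexWritesBlocked", "Maximum", GE, 1),
        AlarmSpec("node-missing", "Nodes", "Minimum", LT, expected_nodes),
        AlarmSpec("snapshot-failure", "AutomatedSnapshotFailure", "Maximum", GE, 1),
        AlarmSpec("data-cpu-high", "CPUUtilization", "Maximum", GE, 80, 900, 3),
        AlarmSpec("jvm-high", "JVMMemoryPressure", "Maximum", GE, 95, 60, 3),
        AlarmSpec("old-gen-high", "OldGenJVMMemoryPressure", "Maximum", GE, 80, 60, 3),
        AlarmSpec("master-cpu-high", "MasterCPUUtilization", "Maximum", GE, 50, 900, 3),
    ]


def to_alarm_kwargs(spec, domain_name, account_id, action_arn):
    """Expand one spec into put_metric_alarm-style keyword arguments."""
    return {
        "AlarmName": f"{domain_name}-{spec.name}",
        "Namespace": "AWS/ES",
        "MetricName": spec.metric,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": spec.statistic,
        "Period": spec.period,
        "EvaluationPeriods": spec.evaluation_periods,
        "Threshold": spec.threshold,
        "ComparisonOperator": spec.comparison,
        "AlarmActions": [action_arn],
    }
```

Each dictionary could then be passed as `put_metric_alarm(**kwargs)`. Because the specs are plain data, a logging cluster and a product-search cluster can share the code while carrying different windows and thresholds.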
Serverless OpenSearch Metrics Are A Different Model
Amazon OpenSearch Serverless has its own CloudWatch namespace:
AWS/AOSS
The metrics are collection-oriented rather than node-oriented. Start with:
- ActiveCollection
- SearchRequestLatency
- SearchRequestErrors
- SearchRequestRate
- SearchOCU
- IngestionRequestLatency
- IngestionRequestErrors
- IngestionDocumentErrors
- IngestionDocumentRate
- IndexingOCU
- SearchableDocuments
The operational questions change:
- Is the collection active?
- Are search requests getting slower?
- Are ingestion requests failing?
- Are document-level ingestion errors rising?
- Are OCUs scaling with workload as expected?
- Is cost rising because search or indexing OCUs increased?
Do not try to port JVMMemoryPressure, Nodes, or FreeStorageSpace alarms from provisioned domains to serverless collections. They are not the same operating surface.
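A lightweight guard in your alarm tooling can catch this kind of mix-up before an alarm is created. The sketch below is illustrative only: the function name is hypothetical, and the two metric sets contain just the names mentioned in this guide, not a complete list for either namespace.

```python
PROVISIONED_NAMESPACE = "AWS/ES"
SERVERLESS_NAMESPACE = "AWS/AOSS"

# Partial lists, limited to metrics named in this guide.
PROVISIONED_ONLY = frozenset(
    {"JVMMemoryPressure", "Nodes", "FreeStorageSpace", "CPUUtilization", "MasterCPUUtilization"}
)
SERVERLESS_ONLY = frozenset(
    {"ActiveCollection", "SearchRequestLatency", "IngestionRequestLatency", "SearchOCU", "IndexingOCU"}
)


def namespace_problem(namespace, metric_name):
    """Return a warning string if the metric looks misplaced, otherwise None."""
    if namespace == SERVERLESS_NAMESPACE and metric_name in PROVISIONED_ONLY:
        return f"{metric_name} is a provisioned-domain metric; use {PROVISIONED_NAMESPACE}"
    if namespace == PROVISIONED_NAMESPACE and metric_name in SERVERLESS_ONLY:
        return f"{metric_name} is a serverless collection metric; use {SERVERLESS_NAMESPACE}"
    return None


print(namespace_problem("AWS/AOSS", "JVMMemoryPressure"))
print(namespace_problem("AWS/ES", "JVMMemoryPressure"))  # None
```

A misplaced alarm usually fails quietly: it sees no data rather than raising an error, so catching it at definition time is cheaper than discovering it during an incident.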
How Cluster Insights Fits In
CloudWatch is still the monitoring backbone, but Cluster Insights can make OpenSearch-specific diagnosis faster.
Cluster Insights surfaces cluster health, shard count, node count, index count, document statistics, indexing and search rates, latencies, JVM pressure, CPU utilization, and query-level information in the OpenSearch UI.
Use it as an investigation layer:
- CloudWatch alarm fires.
- Dashboard tells you the affected metric family.
- Cluster Insights helps identify the domain, node, index, shard, or query pattern involved.
That is especially useful for problems like:
- large shards,
- node or shard skew,
- high-latency queries,
- hot shards,
- resource-intensive query shapes,
- and best-practice drift.
CloudWatch should still own alerting. Cluster Insights is strongest when the human is already investigating and needs OpenSearch-native context.
A Simple Runbook For OpenSearch Metric Spikes
When an alarm fires, move in this order.
Step 1: Check cluster health
Look at:
- ClusterStatus.red
- ClusterStatus.yellow
- Nodes
- Shards.unassigned
If the cluster is red or missing nodes, handle that before tuning queries.
Step 2: Check write availability
Look at:
- ClusterIndexWritesBlocked
- FreeStorageSpace
- JVMMemoryPressure
- ThreadpoolWriteRejected
If writes are blocked, treat it as an availability incident. Free storage, shard distribution, and heap pressure are common first checks.
Step 3: Split search from indexing
If users report slow search, check:
- SearchLatency
- SearchRate
- ThreadpoolSearchQueue
- ThreadpoolSearchRejected
- 5xx
If data is stale or ingestion is delayed, check:
- IndexingLatency
- IndexingRate
- ThreadpoolWriteQueue
- ThreadpoolWriteRejected
Do not assume search and indexing are separate. Heavy ingestion can affect search performance, and broad searches can compete for the same shared resources.
Step 4: Look for hot-node behavior
Compare Average and Maximum for:
- CPUUtilization
- JVMMemoryPressure
- SearchLatency
- IndexingLatency
- EBS latency metrics
If maximum is much worse than average, investigate node-level dimensions, shard allocation, and uneven index traffic.
Step 5: Check the storage path
Look at:
- ReadLatency
- WriteLatency
- DiskQueueDepth
- BurstBalance
- IopsThrottle
- ThroughputThrottle
If storage is saturated, scaling CPU may not help. You may need different volume settings, larger nodes, better shard sizing, less fan-out, or lifecycle changes.
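The ordering of the runbook can be captured in a short function that returns the first step whose signals are firing. This is a sketch under stated assumptions: the snapshot keys (such as `CPUUtilization.Average`), the 1.5 hot-node ratio, and the returned labels are hypothetical, and the rejection values are expected to be increases over the window rather than raw cumulative counters.

```python
def first_runbook_step(snapshot, expected_nodes, hot_node_ratio=1.5):
    """Return the first runbook step that the current metric snapshot triggers."""
    # Step 1: red status or missing nodes comes before anything else.
    if snapshot.get("ClusterStatus.red", 0) >= 1 or snapshot.get("Nodes", expected_nodes) < expected_nodes:
        return "step 1: cluster health"
    # Step 2: blocked writes are an availability incident.
    if snapshot.get("ClusterIndexWritesBlocked", 0) >= 1:
        return "step 2: write availability"
    # Step 3: rejected work on either side means search or indexing pressure.
    if snapshot.get("ThreadpoolSearchRejected", 0) > 0 or snapshot.get("ThreadpoolWriteRejected", 0) > 0:
        return "step 3: search or indexing pressure"
    # Step 4: a maximum far above the average suggests one hot node.
    average = snapshot.get("CPUUtilization.Average")
    maximum = snapshot.get("CPUUtilization.Maximum")
    if average and maximum and maximum / average >= hot_node_ratio:
        return "step 4: hot node"
    # Step 5: throttling points at the storage path.
    if snapshot.get("IopsThrottle", 0) >= 1 or snapshot.get("ThroughputThrottle", 0) >= 1:
        return "step 5: storage path"
    return "no runbook step triggered"


print(first_runbook_step({"Nodes": 2}, expected_nodes=3))  # step 1: cluster health
print(first_runbook_step({"CPUUtilization.Average": 40, "CPUUtilization.Maximum": 90}, 3))  # step 4: hot node
```

Returning only the first match keeps the responder on the earliest unresolved step instead of chasing a downstream symptom.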
Common Monitoring Mistakes
Mistake 1: Using Average everywhere
Average hides exactly the kind of single-node problems OpenSearch clusters often have.
Use Maximum for CPU, JVM, queue depth, and latency alarms.
Mistake 2: Alerting on every metric AWS exposes
More alarms do not create better operations.
Start with health, writes, storage, JVM, CPU, latency, queue, rejection, master, and 5xx signals. Add specialized metrics only when your workload needs them.
Mistake 3: Ignoring the rename boundary
Legacy Elasticsearch naming still appears in old dashboards, old runbooks, and old team vocabulary.
Document the mapping once:
- Amazon Elasticsearch Service is now Amazon OpenSearch Service.
- Provisioned-domain CloudWatch namespace is AWS/ES.
- Serverless namespace is AWS/AOSS.
- Some old metrics were renamed during OpenSearch upgrades.
- Billing and historical reports may still need old and new service filters.
Mistake 4: Treating serverless like provisioned
Provisioned domains expose nodes, JVM, EBS, shard, and master metrics.
Serverless collections expose collection, ingestion, search, and OCU-oriented metrics.
The dashboards should look different.
Mistake 5: No dashboard for deploy windows
OpenSearch metrics can reset during node replacements and blue/green deployments. If your release process changes mappings, index settings, cluster configuration, or ingestion volume, keep a deploy-window dashboard that shows health, latency, JVM, storage, queues, requests, and errors at the same time.
Final Checklist
For a production provisioned OpenSearch Service domain, build this first:
- Cluster health panel with green, yellow, red, nodes, and unassigned shards.
- Storage panel with free space minimum, used space, write blocks, and storage throttles.
- JVM and CPU panel with average and maximum values.
- Search panel with rate, latency, queue, rejections, and request errors.
- Indexing panel with rate, latency, queue, rejections, and write blocks.
- Master node panel with CPU, JVM, and reachability.
- EBS panel with latency, IOPS, throughput, queue depth, burst balance, and throttles.
- Request panel with OpenSearch requests, 4xx, 5xx, invalid host headers, and TLS negotiation errors.
- Alarm set for red/yellow health, low storage, writes blocked, node loss, snapshot failure, high CPU, high JVM, master pressure, thread pool rejections, and 5xx ratio.
- Separate serverless dashboard and alarms if you run OpenSearch Serverless.
OpenSearch monitoring is not about memorizing every metric name.
It is about keeping the failure path visible:
health -> storage -> JVM/CPU -> search/indexing pressure -> queues/rejections -> client errors -> shard and node diagnosis.
If your dashboard follows that path, the next incident will be a lot less mysterious.
About the author
Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.