Where are Amazon OpenSearch Service metrics in CloudWatch?

For provisioned OpenSearch Service domains, CloudWatch metrics are under the AWS/ES namespace. Serverless collections use AWS/AOSS instead.

Why do AWS Elasticsearch Service statistics searches still matter?

Amazon Elasticsearch Service was renamed to Amazon OpenSearch Service, but many teams still use the old name in dashboards, runbooks, billing reports, and search queries. The underlying operational intent is usually OpenSearch Service CloudWatch metrics.

Which OpenSearch CloudWatch alarms should I create first?

Start with ClusterStatus.red, ClusterStatus.yellow, FreeStorageSpace, ClusterIndexWritesBlocked, Nodes, AutomatedSnapshotFailure, CPUUtilization, JVMMemoryPressure, OldGenJVMMemoryPressure, MasterCPUUtilization, 5xx errors, and thread pool queues or rejections.

Should I use Average or Maximum for OpenSearch CPU and JVM alarms?

Use Maximum when a single hot node can hurt the cluster, especially for CPUUtilization and JVMMemoryPressure. Average is useful for trend dashboards but can hide one overloaded node.

Do OpenSearch Serverless metrics use the same alarms as provisioned domains?

No. OpenSearch Serverless uses the AWS/AOSS namespace and exposes collection-oriented metrics such as SearchRequestLatency, IngestionRequestLatency, SearchOCU, IndexingOCU, and ingestion or search errors.

Amazon OpenSearch Service Metrics and CloudWatch Statistics

Cloud, API & Security

Jun 4, 2026 · By Elysiate · Updated Jun 4, 2026

Tags: aws, opensearch, elasticsearch, cloudwatch, observability, search

Level: intermediate · ~15 min read · Intent: informational

Audience: AWS platform engineers, SRE teams, backend engineers, cloud architects, DevOps engineers

Prerequisites

basic familiarity with Amazon OpenSearch Service
basic CloudWatch metrics and alarm experience
some exposure to search, indexing, or log analytics workloads

Key takeaways

For provisioned Amazon OpenSearch Service domains, most operational metrics still live in the AWS/ES CloudWatch namespace, even though the service name is now OpenSearch.
The most useful production dashboard starts with cluster health, storage headroom, JVM pressure, CPU, search latency, indexing latency, request errors, node count, and thread pool rejection metrics.
Use Maximum for node-risk metrics such as CPUUtilization and JVMMemoryPressure, Minimum for low-headroom metrics such as FreeStorageSpace and Nodes, and Sum for request counts, errors, and rejection counters.
Serverless OpenSearch has a separate AWS/AOSS namespace and different metrics, so do not copy provisioned-domain alarms directly into serverless collections.

Amazon OpenSearch Service metrics are easy to find and surprisingly easy to misunderstand.

Part of the confusion is historical.

Many teams still say "AWS Elasticsearch Service statistics" or "Amazon Elasticsearch Service stats" even though Amazon Elasticsearch Service was renamed to Amazon OpenSearch Service. AWS still carries some legacy naming in places that matter operationally. For provisioned domains, the CloudWatch namespace is still AWS/ES, and at least one request metric is explicitly documented as OpenSearchRequests, previously ElasticsearchRequests.

So if you are trying to debug an old dashboard, rebuild alarms after a domain upgrade, or understand why search latency spiked after an index rollout, do not start with a giant list of every metric.

Start with the operational questions:

Is the cluster healthy?
Are writes blocked?
Is one node out of disk?
Is JVM pressure close to failure?
Are searches slow because the workload is heavy, the shards are skewed, or the disks are throttled?
Are indexing requests backing up or being rejected?
Did a deployment, blue/green change, or node replacement reset cumulative counters?

This guide focuses on the OpenSearch Service metrics that answer those questions, how to choose CloudWatch statistics for each one, and how to build a dashboard that helps during incidents instead of burying you in charts.

For broader platform selection and architecture, read Amazon OpenSearch: A Practical Guide for Fast, Scalable Search. For the product-line comparison, read OpenSearch vs Elasticsearch: When to Choose Each.

Executive Summary

For provisioned Amazon OpenSearch Service domains:

Area	Metrics to start with	Statistic to prefer	What it tells you
Cluster health	`ClusterStatus.red`, `ClusterStatus.yellow`, `ClusterStatus.green`	Maximum	Whether shard allocation is healthy
Storage	`FreeStorageSpace`, `ClusterUsedSpace`	Minimum for free space, Maximum for used space	Whether a single node is close to write-blocking
Writes	`ClusterIndexWritesBlocked`, `IndexingRate`, `IndexingLatency`, `ThreadpoolWriteQueue`, `ThreadpoolWriteRejected`	Maximum for blocks and latency, Sum for rejections	Whether ingestion is keeping up
Search	`SearchRate`, `SearchLatency`, `ThreadpoolSearchQueue`, `ThreadpoolSearchRejected`, `5xx`	Maximum for latency and queues, Sum for errors	Whether query traffic is saturating the cluster
JVM and CPU	`JVMMemoryPressure`, `OldGenJVMMemoryPressure`, `CPUUtilization`	Maximum	Whether one node is near failure
Masters	`MasterCPUUtilization`, `MasterJVMMemoryPressure`, `MasterReachableFromNode`	Maximum, and Minimum for reachability	Whether cluster control plane stability is at risk
EBS	`ReadLatency`, `WriteLatency`, `DiskQueueDepth`, `BurstBalance`, `IopsThrottle`, `ThroughputThrottle`	Maximum for throttles and latency, Minimum for credits	Whether storage is the bottleneck
Requests	`OpenSearchRequests`, `2xx`, `3xx`, `4xx`, `5xx`, `InvalidHostHeaderRequests`, `TLSNegotiationError`	Sum	Whether client or endpoint behavior changed

The most important habit is simple:

Use Maximum for "one bad node can hurt us" metrics, Minimum for low-headroom metrics, and Sum for counters.

That single rule prevents many quiet monitoring mistakes.

AWS Elasticsearch Service Statistics vs OpenSearch Metrics

If someone on your team asks for AWS Elasticsearch Service statistics, they probably mean one of three things:

CloudWatch metrics for a managed OpenSearch Service domain.
Old dashboards created before the service rename.
Historical billing, cost, or alarm language that still uses Elasticsearch wording.

AWS renamed the service to Amazon OpenSearch Service on September 8, 2021. The rename changed service names, API names, instance type naming, Dashboards terminology, and some CloudWatch metric names.

But not everything became visually clean overnight.

The provisioned-domain CloudWatch namespace is still:

AWS/ES

That means this command is still a normal starting point for provisioned domains:

aws cloudwatch list-metrics --namespace "AWS/ES"

Serverless is different. Amazon OpenSearch Serverless reports to:

AWS/AOSS

Do not mix these two namespaces in one runbook without labeling them clearly. A provisioned cluster alarm and a serverless collection alarm may use similar words, but they are watching different operating models.

The Metrics That Should Be On Your First Dashboard

A useful OpenSearch dashboard is not a museum of every metric. It is a triage surface.

When the page is slow, ingestion is delayed, or users are seeing errors, the dashboard should quickly answer:

Is this a health issue?
Is this a capacity issue?
Is this a storage issue?
Is this a search workload issue?
Is this an indexing workload issue?
Is this a client/API issue?

1. Cluster health

Start with:

ClusterStatus.green
ClusterStatus.yellow
ClusterStatus.red
Shards.active
Shards.unassigned
Nodes

Use Maximum for cluster status values because these metrics are binary health signals. A value of 1 means the condition is true.

Red means at least one primary shard is not allocated. That is an incident.

Yellow means primary shards are allocated but at least one replica shard is not. That might be expected on a single-node test domain, but it is not something to ignore in production.

The Nodes metric deserves a separate alarm. If the minimum node count drops below the number you expect, at least one node was unreachable during the evaluation window.
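As a rough sketch of how the two health alarms above could be expressed, the helper below builds dictionaries shaped like the keyword arguments of boto3's `put_metric_alarm`. The function name, alarm names, domain name, account ID, and the expected node count of 3 are all placeholders, not values from AWS.

```python
def build_domain_alarm(name, metric, statistic, operator, threshold,
                       domain, account_id, period=60, evaluation_periods=1):
    """Return keyword arguments shaped for CloudWatch PutMetricAlarm.

    Nothing is sent to AWS here; pass the result to
    boto3.client("cloudwatch").put_metric_alarm(**params) when ready.
    """
    return {
        "AlarmName": name,
        "Namespace": "AWS/ES",  # provisioned domains still report here
        "MetricName": metric,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": operator,
    }


# Red cluster: Maximum >= 1 means the red state occurred during the period.
red_alarm = build_domain_alarm(
    "my-domain-cluster-red", "ClusterStatus.red", "Maximum",
    "GreaterThanOrEqualToThreshold", 1, "my-domain", "123456789012",
)

# Missing node: Minimum below the expected count (3 is a placeholder).
node_alarm = build_domain_alarm(
    "my-domain-node-missing", "Nodes", "Minimum",
    "LessThanThreshold", 3, "my-domain", "123456789012",
)
```

Keeping the builder free of network calls makes it easy to review or unit-test alarm definitions before they reach CloudWatch.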

2. Storage and write blocking

Watch:

FreeStorageSpace
ClusterUsedSpace
ClusterIndexWritesBlocked
IopsThrottle
ThroughputThrottle

FreeStorageSpace is one of the easiest metrics to configure badly.

Use Minimum, not only Average.

Average free storage can look fine while one node is running out of disk. The node with the least free space is the one that can push the cluster toward write failures. AWS documents FreeStorageSpace in MiB in CloudWatch, while the console displays it in GiB, so avoid copying a GiB threshold into a MiB alarm by accident.
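The unit mismatch is easy to get wrong, so here is a small illustrative helper that converts a per-node storage size in GiB into a MiB alarm threshold. The function name and the 25% default fraction are assumptions for this sketch; pick a fraction that matches your own headroom policy.

```python
MIB_PER_GIB = 1024


def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """Convert a per-node storage size in GiB into a MiB alarm threshold.

    CloudWatch reports FreeStorageSpace in MiB, so the threshold must be
    in MiB even if capacity planning was done in GiB.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return node_storage_gib * MIB_PER_GIB * fraction


# A node with 100 GiB of storage and a 25% floor gives 25600.0 MiB.
threshold_mib = free_storage_threshold_mib(100)
```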

ClusterIndexWritesBlocked should be treated as urgent. It means the cluster is blocking incoming write requests. Low free space and high JVM pressure are common contributors, but the alarm should send engineers straight into storage, shard distribution, and heap pressure checks.

A simple rule:

Signal	Interpretation
`FreeStorageSpace` falling	Capacity or skew problem
`ClusterIndexWritesBlocked = 1`	Writes are already blocked
`IopsThrottle = 1`	I/O limit is being hit
`ThroughputThrottle = 1`	Disk throughput limit is being hit
`DiskQueueDepth` rising	Storage work is backing up

If storage is consistently part of your incidents, the fix is rarely "add one alarm." Look at shard count, index lifecycle policy, rollover size, retention, hot/warm tiering, and whether the domain is sized around the real workload.

CPU, JVM, and the Single-Hot-Node Problem

For OpenSearch, the average can lie politely.

One hot data node can make the cluster feel unstable while average CPU looks acceptable. That is why Maximum is usually the safer statistic for:

CPUUtilization
JVMMemoryPressure
OldGenJVMMemoryPressure
JVMGCOldCollectionCount
JVMGCOldCollectionTime

AWS recommends alarming when CPUUtilization or WarmCPUUtilization stays at or above 80% for a sustained window. For JVM pressure, AWS's recommended alarms include JVMMemoryPressure at 95% and OldGenJVMMemoryPressure at 80%.

Those are not magic numbers for every workload, but they are good default guardrails.

Use dashboard panels differently from alarms:

Dashboard: show Average and Maximum together so you can see cluster-wide load and hot-node risk.
Alarm: use Maximum when one node can create a user-visible failure.
Investigation: drill into node-level dimensions when the cluster-level maximum spikes.

SysMemoryUtilization is worth displaying, but do not treat it as your main heap-risk indicator. In OpenSearch, high system memory usage can be normal. JVM memory pressure is usually a more relevant stability signal.
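One way to show Average and Maximum together is a dashboard widget with the same metric plotted twice. The sketch below builds the widget as plain data in the CloudWatch dashboard body format; the function name, widget size, region, domain name, and account ID are placeholders.

```python
import json


def avg_max_widget(metric, domain, account_id, region, period=300):
    """Build one dashboard widget that plots Average and Maximum together."""
    dims = ["DomainName", domain, "ClientId", account_id]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"{metric}: Average vs Maximum",
            "region": region,
            "period": period,
            "view": "timeSeries",
            "metrics": [
                # Same metric twice; only the statistic differs.
                ["AWS/ES", metric, *dims, {"stat": "Average", "label": "Average"}],
                ["AWS/ES", metric, *dims, {"stat": "Maximum", "label": "Maximum"}],
            ],
        },
    }


# The JSON string can be passed as DashboardBody to boto3's put_dashboard.
dashboard_body = json.dumps({
    "widgets": [
        avg_max_widget("CPUUtilization", "my-domain", "123456789012", "us-east-1"),
        avg_max_widget("JVMMemoryPressure", "my-domain", "123456789012", "us-east-1"),
    ]
})
```

A widening gap between the two lines on the same chart is the visual cue for a hot node.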

Search Metrics: Latency, Rate, Queues, and Errors

Search problems usually have four parts:

The amount of search work arriving.
The latency of that work.
Whether queues are filling.
Whether requests are failing or being rejected.

Start with:

SearchRate
SearchLatency
ThreadpoolSearchQueue
ThreadpoolSearchRejected
5xx
OpenSearchRequests

SearchRate is not the same as user-facing request rate. AWS documents it as search requests per minute for all shards on a data node. One _search request can touch many shards, so shard layout affects the metric.

That is useful.

If application request volume is flat but SearchRate rises, the problem may be shard fan-out, broader queries, index expansion, or a change in query routing.

Use SearchLatency with Maximum when troubleshooting incidents. Average latency can be fine while one node or shard group is dragging the user experience down.

Use ThreadpoolSearchQueue and ThreadpoolSearchRejected to tell the difference between "queries are slower" and "queries are piling up or being dropped." Rejections deserve attention even when the total number is low, because they indicate the cluster has crossed from latency into failed work.

Indexing Metrics: When Writes Fall Behind

Indexing incidents usually show up as delayed logs, stale search results, bulk ingestion failures, or a sudden write block.

Start with:

IndexingRate
IndexingLatency
ThreadpoolWriteQueue
ThreadpoolWriteRejected
ClusterIndexWritesBlocked
FreeStorageSpace
JVMMemoryPressure

IndexingRate counts indexing operations, not just API calls. A single bulk request can represent many operations, and the work can be spread across nodes.

That means you should compare:

application bulk request volume,
OpenSearch indexing rate,
indexing latency,
write queue depth,
write rejections,
and storage or JVM pressure.

If indexing latency rises but write rejections stay low, the cluster may still be absorbing the workload. If queue depth and rejections rise together, ingestion concurrency or cluster capacity is likely beyond a healthy level.

If ClusterIndexWritesBlocked flips to 1, do not treat it as a normal performance event. It is a write-availability event.
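The interpretation rules in this section can be sketched as a small decision function. It is only an illustration: the function name, the boolean inputs, and the returned labels are made up for this example, and deciding what counts as "rising" is left to your own baselines.

```python
def classify_indexing(latency_rising, queue_rising, rejections_rising,
                      writes_blocked):
    """Map indexing signals to a rough interpretation, most severe first."""
    if writes_blocked:
        # ClusterIndexWritesBlocked = 1 is an availability problem.
        return "write-availability event"
    if queue_rising and rejections_rising:
        return "ingestion beyond healthy capacity"
    if rejections_rising:
        # Even a few rejections mean some work has already failed.
        return "writes being rejected"
    if latency_rising:
        return "slower, but still absorbing the workload"
    return "no indexing pressure detected"


status = classify_indexing(
    latency_rising=True, queue_rising=False,
    rejections_rising=False, writes_blocked=False,
)
```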

Master Node Metrics Are Stability Metrics

Dedicated master nodes are not where your application queries should land, but they are still part of the cluster's ability to stay coherent.

Watch:

MasterCPUUtilization
MasterJVMMemoryPressure
MasterOldGenJVMMemoryPressure
MasterReachableFromNode
Nodes

AWS's recommended alarms keep the master CPU threshold lower than the data node CPU threshold (50% versus 80% in the alarm set later in this guide) because masters are responsible for cluster stability and configuration changes.

That is the right mental model.

Master node pressure can show up during:

index creation storms,
shard churn,
too many small indices,
blue/green deployments,
node replacement,
mapping explosions,
or cluster state growth.

If data-node metrics look fine but the cluster feels unstable during deployments or index lifecycle activity, check the master metrics and shard counts before chasing application code.

EBS Metrics: When Search Is Waiting On Storage

Provisioned domains with EBS storage need storage-level visibility.

Watch:

ReadLatency
WriteLatency
ReadIOPS
WriteIOPS
ReadThroughput
WriteThroughput
DiskQueueDepth
BurstBalance
IopsThrottle
ThroughputThrottle
VolumeStalledIOcheck

EBS problems often masquerade as search problems.

For example:

search latency rises,
CPU is not maxed,
JVM is not near the danger zone,
but disk queue depth rises and throttling appears.

That points away from query syntax and toward storage throughput, IOPS, shard placement, or index lifecycle design.

If BurstBalance matters for your volume type, display it with Minimum. If a node exhausts burst credits, the weakest node matters more than the cluster average.

Request Metrics: Separate User Errors From Service Pain

Track:

OpenSearchRequests
2xx
3xx
4xx
5xx
InvalidHostHeaderRequests
TLSNegotiationError

Use Sum for these.

The useful view is often a ratio:

5xx / OpenSearchRequests

AWS's recommended alarms suggest alerting when 5xx responses reach 10% or more of OpenSearchRequests. The exact threshold depends on the workload, but a sustained rise in 5xx responses usually means users or ingestion clients are experiencing real failures.

4xx is different. A spike in 4xx responses often points to client behavior, permissions, request shape, auth, missing indexes, or rejected invalid requests. Still alert on it when it is abnormal for your application, but route it differently from a cluster-health alarm.

InvalidHostHeaderRequests and TLSNegotiationError are useful security and integration signals. They can catch clients using the wrong endpoint, broken TLS settings, scans against a public domain, or misconfigured proxies.
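The 5xx ratio is not a single metric, so an alarm on it needs CloudWatch metric math. The sketch below builds the `Metrics` structure for boto3's `put_metric_alarm`. The function name, alarm name, three evaluation periods, and the 10 percent default are assumptions to tune; the `IF` guard avoids dividing by zero when there is no traffic.

```python
def five_xx_ratio_alarm(domain, account_id, threshold_percent=10.0, period=300):
    """Build PutMetricAlarm arguments for a 5xx-to-requests percentage."""
    dims = [
        {"Name": "DomainName", "Value": domain},
        {"Name": "ClientId", "Value": account_id},
    ]

    def summed(metric_id, metric_name):
        # Input series: counts use Sum and are not plotted or alarmed directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ES",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{domain}-5xx-ratio",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 3,
        "Threshold": threshold_percent,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            summed("errors", "5xx"),
            summed("requests", "OpenSearchRequests"),
            {
                "Id": "ratio",
                "Expression": "IF(requests > 0, 100 * errors / requests, 0)",
                "Label": "5xx percent of requests",
                "ReturnData": True,  # the alarm evaluates this series only
            },
        ],
    }


ratio_alarm = five_xx_ratio_alarm("my-domain", "123456789012")
```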

Which CloudWatch Statistic Should You Use?

CloudWatch lets you view statistics such as Average, Maximum, Minimum, and Sum.

For OpenSearch Service, the statistic is not a cosmetic choice.

Use this decision table:

Metric type	Best statistic	Why
Binary health flags	Maximum	You want to know if the bad state happened
Low free headroom	Minimum	The weakest node matters
CPU and JVM pressure	Maximum	One hot node can hurt the cluster
Latency	Maximum for alarms, Average plus Maximum for dashboards	Average hides tail pain
Request counts	Sum	Counts need total volume
Error counts	Sum	Errors need total volume
Rejection counters	Sum or metric math difference	Counters are cumulative, so the increase per period is the signal
Node count	Minimum	You want to detect missing nodes
Burst credits	Minimum	One depleted node can bottleneck

Also remember that AWS notes cumulative metrics can reset during node drops, node bounces, node replacements, and blue/green deployments. That makes raw cumulative counters less useful than trends, differences, and "did this increase during the last period?" alarms.
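To make the reset behavior concrete, here is a minimal sketch that turns a series of cumulative counter samples into per-period increases. Treating any drop as a restart from zero is an assumption of this example, not documented AWS behavior, and the function name is hypothetical.

```python
def per_period_increases(samples):
    """Convert cumulative counter samples into per-period increases.

    A drop between two samples is treated as a counter reset (for example
    after a node replacement), so the new value is counted as the increase
    since the restart instead of producing a negative number.
    """
    increases = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increases.append(current - previous)
        else:
            increases.append(current)  # assumed restart from zero
    return increases


# The counter resets between the third and fourth samples.
deltas = per_period_increases([10, 12, 12, 3, 7])  # [2, 0, 3, 4]
```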

Example: Pull One Metric With AWS CLI

For a provisioned domain, list metrics first:

aws cloudwatch list-metrics \
  --namespace "AWS/ES"

Then pull a specific metric. This example asks for maximum JVM pressure over five-minute periods.

aws cloudwatch get-metric-statistics \
  --namespace "AWS/ES" \
  --metric-name "JVMMemoryPressure" \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
  --start-time 2026-06-04T00:00:00Z \
  --end-time 2026-06-04T01:00:00Z \
  --period 300 \
  --statistics Maximum

For per-node investigation, use dimensions that include the node identifier where available. For domain-level dashboards, use the per-domain dimensions so the chart stays readable.

The exact dimensions available depend on the metric family. AWS documents domain, node, shard-role, and serverless collection dimensions separately, so avoid assuming every metric has the same dimension set.
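The same query can be prepared in Python. This sketch builds the arguments for boto3's `get_metric_statistics` and summarizes the returned datapoints; the function names are hypothetical, and the sample datapoints at the end are invented only to show the shape of the response.

```python
from datetime import datetime, timedelta, timezone


def jvm_pressure_query(domain, account_id, end=None, hours=1, period=300):
    """Build GetMetricStatistics arguments for maximum JVM pressure."""
    end = end or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": "JVMMemoryPressure",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Maximum"],
    }


def peak_maximum(datapoints):
    """Return the highest Maximum value, or None when nothing was returned."""
    values = [point["Maximum"] for point in datapoints]
    return max(values) if values else None


query = jvm_pressure_query(
    "my-domain", "123456789012",
    end=datetime(2026, 6, 4, 1, 0, tzinfo=timezone.utc),
)
# With boto3 installed and credentials configured:
#   response = boto3.client("cloudwatch").get_metric_statistics(**query)
#   peak = peak_maximum(response["Datapoints"])
sample_datapoints = [{"Maximum": 71.0}, {"Maximum": 88.5}, {"Maximum": 64.2}]
peak = peak_maximum(sample_datapoints)
```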

A Practical Alarm Set For Provisioned Domains

Use this as a starting point, then tune it to your workload.

Alarm	Suggested starting rule	Route to
Red cluster	`ClusterStatus.red Maximum >= 1`	Immediate incident
Yellow cluster	`ClusterStatus.yellow Maximum >= 1` for several periods	Platform review or incident depending on workload
Low free storage	`FreeStorageSpace Minimum <= 25% of node storage`	Capacity/on-call
Writes blocked	`ClusterIndexWritesBlocked Maximum >= 1`	Immediate incident
Node missing	`Nodes Minimum < expected node count`	Immediate investigation
Snapshot failure	`AutomatedSnapshotFailure Maximum >= 1`	Reliability review
High data CPU	`CPUUtilization Maximum >= 80%` sustained	Capacity/performance
High JVM	`JVMMemoryPressure Maximum >= 95%`	Immediate investigation
Old-gen pressure	`OldGenJVMMemoryPressure Maximum >= 80%`	Heap and shard review
Master CPU	`MasterCPUUtilization Maximum >= 50%` sustained	Cluster stability review
Search queue	`ThreadpoolSearchQueue Maximum` above baseline	Search workload review
Write queue	`ThreadpoolWriteQueue Average or Maximum` above baseline	Ingestion review
Search rejected	increase in `ThreadpoolSearchRejected`	User-facing performance incident
Write rejected	increase in `ThreadpoolWriteRejected`	Ingestion incident
5xx ratio	`5xx / OpenSearchRequests` above baseline	Application and cluster triage
TLS errors	`TLSNegotiationError Sum` above baseline	Client or security review

Do not copy these into production blindly.

A logging cluster, a product-search cluster, and a RAG retrieval cluster have different traffic patterns. The right alarm window for one may be noisy or too slow for another.
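The "increase in" rules for rejection counters need a little more than a plain threshold. One possible shape, sketched here as arguments for boto3's `put_metric_alarm`, uses the CloudWatch metric math function `DIFF` on the summed counter. The function name, alarm name, one-minute period, and threshold of 1 are starting assumptions, not tuned values.

```python
def rejection_increase_alarm(metric_name, domain, account_id, period=60):
    """Build PutMetricAlarm arguments that fire when a rejection counter grows."""
    return {
        "AlarmName": f"{domain}-{metric_name}-increase",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "rejected",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ES",
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": "DomainName", "Value": domain},
                            {"Name": "ClientId", "Value": account_id},
                        ],
                    },
                    "Period": period,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                # DIFF returns each value minus the preceding value.
                "Id": "increase",
                "Expression": "DIFF(rejected)",
                "Label": f"{metric_name} increase",
                "ReturnData": True,
            },
        ],
    }


search_rejected_alarm = rejection_increase_alarm(
    "ThreadpoolSearchRejected", "my-domain", "123456789012"
)
```

A counter reset produces a negative difference, which stays below the threshold, so a node replacement alone should not trip this alarm.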

Serverless OpenSearch Metrics Are A Different Model

Amazon OpenSearch Serverless has its own CloudWatch namespace:

AWS/AOSS

The metrics are collection-oriented rather than node-oriented. Start with:

ActiveCollection
SearchRequestLatency
SearchRequestErrors
SearchRequestRate
SearchOCU
IngestionRequestLatency
IngestionRequestErrors
IngestionDocumentErrors
IngestionDocumentRate
IndexingOCU
SearchableDocuments

The operational questions change:

Is the collection active?
Are search requests getting slower?
Are ingestion requests failing?
Are document-level ingestion errors rising?
Are OCUs scaling with workload as expected?
Is cost rising because search or indexing OCUs increased?

Do not try to port JVMMemoryPressure, Nodes, or FreeStorageSpace alarms from provisioned domains to serverless collections. They are not the same operating surface.
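A simple guard can catch alarms that were copied across operating models. The sketch below uses only metric names mentioned in this guide, so the two sets are deliberately incomplete, and the function names and the alarm tuple layout are hypothetical.

```python
# Metric names taken from this guide; neither set is exhaustive.
PROVISIONED_METRICS = {
    "ClusterStatus.red", "ClusterStatus.yellow", "Nodes", "FreeStorageSpace",
    "ClusterIndexWritesBlocked", "CPUUtilization", "JVMMemoryPressure",
    "OldGenJVMMemoryPressure", "MasterCPUUtilization",
}
SERVERLESS_METRICS = {
    "ActiveCollection", "SearchRequestLatency", "SearchRequestErrors",
    "SearchRequestRate", "SearchOCU", "IngestionRequestLatency",
    "IngestionRequestErrors", "IngestionDocumentErrors",
    "IngestionDocumentRate", "IndexingOCU", "SearchableDocuments",
}


def expected_namespace(metric_name):
    """Return the namespace a known metric belongs to, or None if unknown."""
    if metric_name in SERVERLESS_METRICS:
        return "AWS/AOSS"
    if metric_name in PROVISIONED_METRICS:
        return "AWS/ES"
    return None


def misplaced_alarms(alarms):
    """Return names of alarms whose metric sits in the wrong namespace.

    `alarms` is an iterable of (alarm_name, namespace, metric_name) tuples.
    Unknown metrics are skipped instead of being flagged.
    """
    return [
        name for name, namespace, metric in alarms
        if expected_namespace(metric) not in (None, namespace)
    ]


flagged = misplaced_alarms([
    ("copied-jvm-alarm", "AWS/AOSS", "JVMMemoryPressure"),
    ("search-ocu-alarm", "AWS/AOSS", "SearchOCU"),
    ("node-count-alarm", "AWS/ES", "Nodes"),
])
```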

How Cluster Insights Fits In

CloudWatch is still the monitoring backbone, but Cluster Insights can make OpenSearch-specific diagnosis faster.

Cluster Insights surfaces cluster health, shard count, node count, index count, document statistics, indexing and search rates, latencies, JVM pressure, CPU utilization, and query-level information in the OpenSearch UI.

Use it as an investigation layer:

CloudWatch alarm fires.
Dashboard tells you the affected metric family.
Cluster Insights helps identify the domain, node, index, shard, or query pattern involved.

That is especially useful for problems like:

large shards,
node or shard skew,
high-latency queries,
hot shards,
resource-intensive query shapes,
and best-practice drift.

CloudWatch should still own alerting. Cluster Insights is strongest when the human is already investigating and needs OpenSearch-native context.

A Simple Runbook For OpenSearch Metric Spikes

When an alarm fires, move in this order.

Step 1: Check cluster health

Look at:

ClusterStatus.red
ClusterStatus.yellow
Nodes
Shards.unassigned

If the cluster is red or missing nodes, handle that before tuning queries.

Step 2: Check write availability

Look at:

ClusterIndexWritesBlocked
FreeStorageSpace
JVMMemoryPressure
ThreadpoolWriteRejected

If writes are blocked, treat it as an availability incident. Free storage, shard distribution, and heap pressure are common first checks.

Step 3: Split search from indexing

If users report slow search, check:

SearchLatency
SearchRate
ThreadpoolSearchQueue
ThreadpoolSearchRejected
5xx

If data is stale or ingestion is delayed, check:

IndexingLatency
IndexingRate
ThreadpoolWriteQueue
ThreadpoolWriteRejected

Do not assume search and indexing are separate. Heavy ingestion can affect search performance, and broad searches can compete for the same shared resources.

Step 4: Look for hot-node behavior

Compare Average and Maximum for:

CPUUtilization
JVMMemoryPressure
SearchLatency
IndexingLatency
EBS latency metrics

If maximum is much worse than average, investigate node-level dimensions, shard allocation, and uneven index traffic.
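"Much worse" needs a working definition before it can drive a check. One possible heuristic is sketched below; the function name, the 1.5 ratio, and the floor of 50 are arbitrary illustrative defaults, not AWS guidance.

```python
def hot_node_suspected(average, maximum, ratio=1.5, floor=50.0):
    """Flag a likely hot node when Maximum is far above Average.

    `floor` ignores skew on an otherwise idle cluster, where a maximum of
    10% against an average of 4% is not worth investigating.
    """
    if maximum < floor:
        return False
    if average <= 0:
        return True  # load exists on at least one node but averages to zero
    return maximum / average >= ratio


skewed = hot_node_suspected(average=40.0, maximum=92.0)    # True
balanced = hot_node_suspected(average=70.0, maximum=78.0)  # False
```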

Step 5: Check the storage path

Look at:

ReadLatency
WriteLatency
DiskQueueDepth
BurstBalance
IopsThrottle
ThroughputThrottle

If storage is saturated, scaling CPU may not help. You may need different volume settings, larger nodes, better shard sizing, less fan-out, or lifecycle changes.
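The ordering of the five steps can be sketched as a function that returns the first step whose signals look unhealthy. The snapshot keys are hypothetical names for values you would collect yourself (the rejection entries are increases during the window, not raw counters), and the 1.5 skew ratio is an arbitrary placeholder.

```python
def first_runbook_step(snapshot, expected_nodes, skew_ratio=1.5):
    """Return the first runbook step that needs attention, or None."""

    def value(name, default=0):
        return snapshot.get(name, default)

    cpu_avg = value("CPUUtilization.Average")
    cpu_max = value("CPUUtilization.Maximum")
    checks = [
        ("Step 1: cluster health",
         value("ClusterStatus.red") >= 1
         or value("Nodes", expected_nodes) < expected_nodes),
        ("Step 2: write availability",
         value("ClusterIndexWritesBlocked") >= 1),
        ("Step 3: search or indexing pressure",
         value("ThreadpoolSearchRejected") > 0
         or value("ThreadpoolWriteRejected") > 0),
        ("Step 4: hot-node behavior",
         cpu_avg > 0 and cpu_max / cpu_avg >= skew_ratio),
        ("Step 5: storage path",
         value("IopsThrottle") >= 1 or value("ThroughputThrottle") >= 1),
    ]
    # Earlier steps win, mirroring the order of the runbook.
    for label, failing in checks:
        if failing:
            return label
    return None


step = first_runbook_step(
    {"ClusterIndexWritesBlocked": 1, "IopsThrottle": 1}, expected_nodes=3
)
```

In this call both write blocking and I/O throttling are present, and the function points at write availability first because it comes earlier in the runbook.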

Common Monitoring Mistakes

Mistake 1: Using Average everywhere

Average hides exactly the kind of single-node problems OpenSearch clusters often have.

Use Maximum for CPU, JVM, queue depth, and latency alarms.

Mistake 2: Alerting on every metric AWS exposes

More alarms do not create better operations.

Start with health, writes, storage, JVM, CPU, latency, queue, rejection, master, and 5xx signals. Add specialized metrics only when your workload needs them.

Mistake 3: Ignoring the rename boundary

Legacy Elasticsearch naming still appears in old dashboards, old runbooks, and old team vocabulary.

Document the mapping once:

Amazon Elasticsearch Service is now Amazon OpenSearch Service.
Provisioned-domain CloudWatch namespace is AWS/ES.
Serverless namespace is AWS/AOSS.
Some metrics were renamed along with the service; for example, ElasticsearchRequests is now OpenSearchRequests.
Billing and historical reports may still need old and new service filters.

Mistake 4: Treating serverless like provisioned

Provisioned domains expose nodes, JVM, EBS, shard, and master metrics.

Serverless collections expose collection, ingestion, search, and OCU-oriented metrics.

The dashboards should look different.

Mistake 5: No dashboard for deploy windows

OpenSearch metrics can reset during node replacements and blue/green deployments. If your release process changes mappings, index settings, cluster configuration, or ingestion volume, keep a deploy-window dashboard that shows health, latency, JVM, storage, queues, requests, and errors at the same time.

Final Checklist

For a production provisioned OpenSearch Service domain, build this first:

Cluster health panel with green, yellow, red, nodes, and unassigned shards.
Storage panel with free space minimum, used space, write blocks, and storage throttles.
JVM and CPU panel with average and maximum values.
Search panel with rate, latency, queue, rejections, and request errors.
Indexing panel with rate, latency, queue, rejections, and write blocks.
Master node panel with CPU, JVM, and reachability.
EBS panel with latency, IOPS, throughput, queue depth, burst balance, and throttles.
Request panel with OpenSearch requests, 4xx, 5xx, invalid host headers, and TLS negotiation errors.
Alarm set for red/yellow health, low storage, writes blocked, node loss, snapshot failure, high CPU, high JVM, master pressure, thread pool rejections, and 5xx ratio.
Separate serverless dashboard and alarms if you run OpenSearch Serverless.

OpenSearch monitoring is not about memorizing every metric name.

It is about keeping the failure path visible:

health -> storage -> JVM/CPU -> search/indexing pressure -> queues/rejections -> client errors -> shard and node diagnosis.

If your dashboard follows that path, the next incident will be a lot less mysterious.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
