Amazon OpenSearch for Data Lakes: S3 Direct Query Guide

By Elysiate · Updated Jun 4, 2026
Tags: aws, opensearch, elasticsearch, data lakes, amazon s3, aws glue

Level: intermediate · ~16 min read · Intent: informational

Audience: AWS platform engineers, data engineers, SRE teams, cloud architects, security engineers

Prerequisites

  • basic familiarity with Amazon S3 and data lake storage
  • basic understanding of Amazon OpenSearch Service
  • some experience with log analytics, search, or cloud data pipelines

Key takeaways

  • Amazon OpenSearch Service can work with data lake data in two different ways: direct query against data in place, or ingestion into OpenSearch indexes for fast repeated search.
  • S3 Direct Query is best for exploratory analytics, incident investigation, and selective acceleration, not for replacing every lakehouse, warehouse, or high-throughput search index.
  • S3 direct queries require OpenSearch 2.13 or later, AWS Glue Data Catalog access, Query Workbench-created Spark tables, and a checkpoint bucket for indexed views.
  • If users expect low-latency product search, app search, or high-QPS API search, ingest and index the selected dataset instead of scanning the lake on every request.

FAQ

Can Amazon OpenSearch Service search data in an S3 data lake?
Yes. Amazon OpenSearch Service can directly query data in Amazon S3 through Direct Query, using AWS Glue Data Catalog tables and OpenSearch Dashboards. You can also ingest selected S3 data into OpenSearch indexes when you need faster repeated search.

Is AWS Elasticsearch still the right name for this?
The current service name is Amazon OpenSearch Service. Many older dashboards, runbooks, and searches still say AWS Elasticsearch Service, but new data lake guidance should use OpenSearch terminology.

Should I direct-query S3 or ingest data into OpenSearch?
Use S3 Direct Query for exploratory analysis, security/log investigation, and selective indexed views. Ingest and index when you need low-latency repeated search, product search, dashboards with strict response targets, or application-facing query APIs.

What formats can OpenSearch S3 Direct Query use?
AWS documents supported S3 Direct Query data types as Parquet, CSV, and JSON. For production data lakes, Parquet with clear partitioning is usually the safest default.

Does OpenSearch S3 Direct Query replace Athena or a data warehouse?
No. It is better understood as a search and operational analytics surface over selected lake data. Athena, Spark, Redshift, and lakehouse engines still matter for broad SQL analytics, transformations, and warehouse-style reporting.

People still search for "AWS Elasticsearch search engine data lakes" because the old name stuck.

The current answer is Amazon OpenSearch Service.

But the real question is more interesting than the name:

Can OpenSearch act as a search and analytics layer over data in Amazon S3, without forcing every lake object into an index first?

In 2026, the answer is yes, with important caveats.

Amazon OpenSearch Service now supports Direct Query integrations that can analyze data in Amazon S3, CloudWatch Logs, Amazon Security Lake, and Amazon Managed Service for Prometheus without building a traditional ingestion pipeline first. For S3 data lakes, that means OpenSearch can query data in place, use AWS Glue Data Catalog tables, and optionally accelerate repeat workloads with indexed views.

That does not mean OpenSearch replaces every warehouse, lakehouse, or query engine.

It means OpenSearch has become a more flexible operational analytics surface:

  • query selected data where it already lives
  • index only the parts that need fast repeated search
  • use OpenSearch Dashboards for investigation and visualization
  • connect log, security, and S3 lake workflows more directly
  • avoid building ingestion pipelines before you know which questions matter

This guide explains when Amazon OpenSearch Service is useful for data lakes, when Direct Query makes sense, when ingestion is still the better choice, and how to think about S3, AWS Glue, checkpoints, indexed views, cost, and security.

For broader cluster sizing, mappings, vector search, and production operations, read Amazon OpenSearch: A Practical Guide for Fast, Scalable Search. For operational alarms, see Amazon OpenSearch Service Metrics and CloudWatch Statistics.

Executive Summary

Use Amazon OpenSearch Service with a data lake in three main patterns:

Pattern | Best for | Data movement | User expectation
S3 Direct Query | Exploration, incident investigation, ad hoc log analytics, security lake review | Query data in place | Interactive analytics, not ultra-low latency app search
Indexed views over S3 data | Repeat dashboards, common investigative queries, accelerated slices | Selectively indexes query results or views | Faster repeat analysis
OpenSearch Ingestion from S3 | Search APIs, high-QPS dashboards, near-real-time log search, product search | Pulls data from S3 into OpenSearch indexes | Fast repeated search

The short version:

  • If the data is large, cold, and queried occasionally, start with S3 Direct Query.
  • If the same slice is queried repeatedly, add indexed views or acceleration.
  • If the workload is user-facing, low-latency, or high-QPS, ingest and index selected data.
  • If the workload is broad SQL analytics, keep Athena, Spark, Redshift, or a lakehouse engine in the design.
  • If the team still says "AWS Elasticsearch," map the old language to Amazon OpenSearch Service in dashboards and runbooks.

OpenSearch is strongest when the lake data needs search, filtering, investigation, dashboards, and operational context. It is weaker as a general-purpose data warehouse replacement.

The Naming Problem: AWS Elasticsearch vs OpenSearch

Amazon Elasticsearch Service was renamed to Amazon OpenSearch Service in 2021.

That rename matters because many engineers still use old phrases:

  • AWS Elasticsearch
  • Amazon Elasticsearch Service
  • AWS Elasticsearch search
  • Elasticsearch Service data lake
  • Elasticsearch Service statistics

In new architecture documents, use the current name: Amazon OpenSearch Service.

But when you inherit old AWS accounts, old Terraform, old dashboards, or old incident notes, expect mixed vocabulary. Some CloudWatch names also carry legacy naming: metrics for provisioned domains are still published under the AWS/ES namespace. That is why a search like "AWS Elasticsearch search engine data lakes" often points to an OpenSearch Service problem rather than a self-managed Elasticsearch problem.

The practical mapping is:

Old phrase | Current meaning in most AWS conversations
AWS Elasticsearch | Amazon OpenSearch Service
Elasticsearch Service domain | OpenSearch Service domain
Elasticsearch dashboard | OpenSearch Dashboards
Elasticsearch data lake search | OpenSearch over S3, Security Lake, CloudWatch Logs, or indexed data
AWS/ES metrics | CloudWatch metrics for provisioned OpenSearch Service domains

When publishing runbooks, include both names once, then use OpenSearch consistently.

What "OpenSearch For A Data Lake" Actually Means

A data lake is usually built around object storage, not around a search index.

On AWS, that commonly means:

  • Amazon S3 as the durable storage layer
  • AWS Glue Data Catalog for table metadata
  • partitioned data by year, month, day, hour, account, region, service, tenant, or event type
  • formats such as Parquet, JSON, and CSV
  • query engines such as Athena, Spark, Redshift Spectrum, or other lakehouse tools
  • governance through IAM, Lake Formation, bucket policies, and encryption controls

OpenSearch enters the picture when the team needs a search and investigation experience over some of that data.

That can mean:

  • analysts want to search historical logs
  • security engineers want to investigate events in Security Lake
  • SREs want to correlate logs, metrics, and traces
  • platform teams want dashboards over selected S3 log datasets
  • product teams want search over a curated subset of records
  • data engineers want to avoid indexing every object before they know the useful slices

The design question is not "Should my data lake be OpenSearch?"

The better question is:

Which lake data should be queried in place, which should be indexed, and which should stay in the warehouse/lakehouse path?

Pattern 1: Query S3 In Place With OpenSearch Direct Query

OpenSearch Service Direct Query lets you analyze data in Amazon S3 without building a traditional ingestion pipeline first.

For S3, the core pieces are:

  • an OpenSearch Service domain running OpenSearch 2.13 or later
  • one or more S3 buckets with the data you want to query
  • a checkpoint S3 bucket for indexed view state
  • AWS Glue Data Catalog access
  • Spark tables created from OpenSearch Query Workbench
  • access control mapping in OpenSearch Dashboards

AWS documentation is specific about one important detail: S3 Direct Query uses Spark tables in the AWS Glue Data Catalog, and tables that already exist in Glue or were created through Athena are not supported for the Spark streaming that maintains indexed views. For S3 Direct Query, create the tables from Query Workbench.

That distinction matters during migration.

If your lake already has a mature Glue catalog and Athena workflows, do not assume OpenSearch can immediately use every table without setup. Treat OpenSearch Direct Query as a connected investigative layer that needs its own table preparation and access model.

Pattern 2: Accelerate Repeated Queries With Indexed Views

Direct Query is useful, but scanning lake data every time is not always what you want.

For repeated dashboards and common investigative queries, OpenSearch can build accelerations such as skipping indexes, materialized views, or covering indexes, depending on the data source and workflow. This guide refers to them collectively as indexed views.

The mental model is:

  • raw data remains in S3
  • OpenSearch queries the lake directly for exploration
  • repeated or expensive slices can be indexed to improve follow-up analysis
  • storage and refresh behavior become part of the operational plan

This is especially useful for:

  • VPC Flow Logs by account, region, and hour
  • CloudTrail event investigations
  • AWS WAF logs
  • security events by source, action, and severity
  • high-cardinality fields where Bloom-filter-style acceleration is useful
  • numeric and time ranges where min/max-style acceleration can reduce scan work

The risk is building too many accelerations too early.

If every saved view becomes a mini data mart, you are no longer simplifying the lake workflow. You are creating another storage and lifecycle surface that needs ownership.

Use indexed views where a query pattern has proven value.
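
One way to find out which query patterns have "proven value" is to count how often the same query shape appears in a query history. The sketch below is a minimal illustration, assuming you can export past SQL statements as plain strings. The function names and the `min_runs` threshold are hypothetical, not part of any OpenSearch API.

```python
import re
from collections import Counter


def normalize_query(sql: str) -> str:
    """Reduce a SQL statement to its shape so similar queries group together."""
    text = sql.strip().lower()
    text = re.sub(r"'[^']*'", "?", text)   # mask quoted string literals
    text = re.sub(r"\b\d+\b", "?", text)   # mask bare numeric literals
    return re.sub(r"\s+", " ", text)       # collapse whitespace


def acceleration_candidates(queries: list[str], min_runs: int = 20) -> list[tuple[str, int]]:
    """Return (query shape, run count) pairs that ran at least min_runs times."""
    counts = Counter(normalize_query(q) for q in queries)
    return [(shape, n) for shape, n in counts.most_common() if n >= min_runs]
```

Shapes that clear the threshold are candidates for an indexed view. Everything below it can stay on plain Direct Query until usage says otherwise.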

Pattern 3: Ingest From S3 Into OpenSearch Indexes

Sometimes direct query is not enough.

If the application needs fast repeated search, ingest the selected data into OpenSearch indexes.

OpenSearch Ingestion supports Amazon S3 as a source. For S3 sources, AWS describes two common processing approaches:

  • S3-SQS processing for near-real-time scanning after objects land in S3
  • scheduled scans for one-time migration or recurring batch processing

This is the better pattern when:

  • users expect search results in milliseconds or low seconds
  • the same queries run all day
  • dashboards refresh frequently
  • APIs need predictable latency
  • relevance tuning, analyzers, highlighting, aggregations, or vector search are central
  • S3 is the source of truth, but OpenSearch is the serving index

Think of this as a lake-to-search serving layer.

The data lake remains the durable system of record. OpenSearch becomes the optimized serving path for search-heavy access.

Direct Query vs Ingestion: The Decision Table

Situation | Better starting point
Investigating rare historical logs | S3 Direct Query
Exploring a new lake dataset | S3 Direct Query
Building a repeated operational dashboard | Direct Query plus indexed view
Searching product catalog records in an app | Ingest and index
Running high-QPS application search | Ingest and index
Querying broad analytical joins over many tables | Warehouse, Athena, Spark, or lakehouse engine
Reviewing security logs in Security Lake | Direct Query or Security Lake integration
Searching the last 15 minutes of logs with strict latency | Ingest and index
Keeping old data searchable at low cost | S3 plus selective direct query or lifecycle-tiered OpenSearch storage
Performing transformations and data quality pipelines | Data pipeline or lakehouse tooling first
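
The decision table can be condensed into a small helper for design reviews. This is a rough sketch, not an official rule set: the `Workload` fields and the order in which they are checked are assumptions made for illustration, and real workloads often need more than three yes/no questions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of how a dataset will be queried."""
    user_facing: bool          # application or API search with latency targets
    repeated: bool             # the same slice is queried again and again
    broad_sql_analytics: bool  # wide joins, transformations, BI reporting


def starting_pattern(w: Workload) -> str:
    """Suggest a starting point; checks run from most to least restrictive."""
    if w.broad_sql_analytics:
        return "Warehouse, Athena, Spark, or lakehouse engine"
    if w.user_facing:
        return "Ingest and index"
    if w.repeated:
        return "Direct Query plus indexed view"
    return "S3 Direct Query"
```

For example, `starting_pattern(Workload(user_facing=False, repeated=True, broad_sql_analytics=False))` suggests Direct Query plus an indexed view, which matches the repeated operational dashboard row.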

The mistake is treating "search data lake" as one architecture.

It is usually a combination:

S3 data lake
  -> direct query for exploration
  -> indexed views for repeat investigation
  -> ingestion pipelines for high-value search surfaces
  -> warehouse/lakehouse tools for broad analytics

S3 Direct Query Requirements That Matter

Before you build around OpenSearch and S3, check these details.

OpenSearch version

S3 Direct Query requires OpenSearch Service domains running OpenSearch 2.13 or later.

If you are still running an older domain, plan an upgrade before designing around S3 Direct Query.
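
A version gate is easy to get wrong with string comparison, because "2.9" sorts after "2.13" as text. The sketch below assumes the engine version arrives as a string shaped like `OpenSearch_2.13` or `Elasticsearch_7.10`; the function name is hypothetical, and how you obtain the string (console, API, or infrastructure code) is up to you.

```python
def supports_s3_direct_query(engine_version: str) -> bool:
    """Return True when the engine string is OpenSearch 2.13 or later.

    Assumes a string such as "OpenSearch_2.13" or "Elasticsearch_7.10".
    """
    engine, _, number = engine_version.partition("_")
    if engine != "OpenSearch":
        return False  # legacy Elasticsearch engines need an upgrade first
    try:
        major, minor = (int(part) for part in number.split(".")[:2])
    except ValueError:
        return False  # missing or non-numeric version component
    # Compare as integers so that 2.9 is correctly treated as older than 2.13.
    return (major, minor) >= (2, 13)
```

Comparing `(major, minor)` tuples keeps the check correct across minor versions with different digit counts.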

Glue and table setup

S3 Direct Query requires AWS Glue Data Catalog access.

For Amazon S3, you create tables using SQL in Query Workbench. AWS notes that existing Glue Data Catalog or Athena-created tables are not supported for the Spark streaming behavior needed to maintain indexed views.

This does not mean your existing catalog is useless. It means your OpenSearch direct-query setup needs deliberate table creation and validation.

Checkpoint bucket

Direct Query for S3 requires a checkpoint S3 bucket.

That bucket stores state for indexed views, including the last refresh time and the most recently ingested data.

Treat it as infrastructure, not a throwaway bucket:

  • restrict access
  • encrypt it
  • use clear naming
  • monitor write failures
  • include it in environment teardown and recovery plans

Region and account boundaries

AWS documents that the OpenSearch domain and AWS Glue Data Catalog must be in the same AWS account. The S3 bucket can be in a different account with the right IAM policy condition, but it must be in the same AWS Region as the domain.

That matters for multi-account data lakes.

If your central lake lives in one account and OpenSearch domains live in application accounts, confirm the cross-account and same-region design before promising a direct query workflow.
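
The account and Region rules above can be checked before anyone promises a workflow. The following is a minimal pre-flight sketch: the `Placement` type, the function name, and the "BLOCKER" and "REVIEW" labels are all hypothetical, and the checks only encode the constraints described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a resource lives (illustrative values only)."""
    account_id: str
    region: str


def boundary_findings(domain: Placement, glue_account_id: str, bucket: Placement) -> list[str]:
    """List placement issues for a planned S3 Direct Query setup."""
    findings = []
    if glue_account_id != domain.account_id:
        findings.append("BLOCKER: Glue Data Catalog must be in the domain's account")
    if bucket.region != domain.region:
        findings.append("BLOCKER: S3 bucket must be in the domain's Region")
    if bucket.account_id != domain.account_id:
        findings.append("REVIEW: cross-account bucket needs the right IAM policy condition")
    return findings
```

An empty list means the placement passes these particular checks; it says nothing about IAM, encryption, or table setup.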

Supported data types

AWS documents supported S3 Direct Query data types as:

  • Parquet
  • CSV
  • JSON

For production lake analytics, Parquet is usually the best default because it is columnar, compressed, and friendly to partitioned analytics.

CSV and JSON can still work, especially for early integration or exported logs, but they need stricter validation and schema discipline. If you are dealing with CSV lake feeds, validate row structure and column contracts before turning them into recurring query surfaces.

Example S3 Data Lake Layout

A practical S3 layout for logs usually looks more like this:

s3://company-log-lake/AWSLogs/
  account_id=111122223333/
    service=vpc-flow-logs/
      region=us-east-1/
        year=2026/
          month=06/
            day=04/
              hour=13/
                part-0000.parquet
                part-0001.parquet

That layout gives direct query and batch engines useful pruning paths.

For OpenSearch Direct Query, AWS recommends partition formats such as year, month, day, and hour to speed up S3 queries. Add business dimensions such as account, service, region, tenant, environment, or event type when they match real query patterns.

Do not partition by everything.

Good partitions reduce scan cost. Bad partitions create tiny files, messy metadata, and expensive maintenance.

Use the query patterns to decide:

  • What time range will users filter first?
  • Do they usually search by account or tenant?
  • Is region a frequent boundary?
  • Is service or log type always known?
  • Are files large enough to avoid tiny-file overhead?
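
To keep writers and readers agreed on the layout, it can help to generate prefixes from one function instead of hand-building strings. The sketch below follows the example layout in this section and assumes timestamps are partitioned in UTC; the function name and that UTC choice are assumptions, not something AWS prescribes.

```python
from datetime import datetime, timezone


def hourly_prefix(base: str, account_id: str, service: str, region: str, when: datetime) -> str:
    """Build the key=value prefix for one hour of data.

    `when` must be timezone-aware; it is converted to UTC before formatting.
    """
    utc = when.astimezone(timezone.utc)
    parts = [
        f"account_id={account_id}",
        f"service={service}",
        f"region={region}",
        f"year={utc:%Y}",
        f"month={utc:%m}",   # zero-padded, e.g. 06
        f"day={utc:%d}",
        f"hour={utc:%H}",
    ]
    return base.rstrip("/") + "/" + "/".join(parts) + "/"
```

Zero-padded month, day, and hour values keep prefixes sortable and match string filters such as `month = '06'`.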

Example Table Shape For VPC Flow Logs

In Query Workbench, you define Spark tables over S3 data.

A simplified table shape for VPC Flow Logs might include:

CREATE TABLE security_lake.vpc_flow_logs (
  version INT,
  account_id STRING,
  interface_id STRING,
  srcaddr STRING,
  dstaddr STRING,
  srcport INT,
  dstport INT,
  protocol INT,
  packets BIGINT,
  bytes BIGINT,
  action STRING,
  log_status STRING,
  -- partition columns: names and order must match the key=value folders in S3
  service STRING,
  region STRING,
  year STRING,
  month STRING,
  day STRING,
  hour STRING
)
USING parquet
PARTITIONED BY (account_id, service, region, year, month, day, hour)
LOCATION "s3://company-log-lake/AWSLogs";

Then repair or refresh metadata as needed so the table sees the partitions:

MSCK REPAIR TABLE security_lake.vpc_flow_logs;

In practice, your table definition should match the exact log format, partition layout, and access model you operate. Treat this as a shape example, not a copy-paste production schema.

Querying S3 Data From OpenSearch

After the data source and tables are configured, OpenSearch Dashboards can query the S3 data source from Discover, Query Workbench, or supported plugins depending on the query mode.

If you have no indexed view, use SQL or PPL.

For example:

SELECT srcaddr, dstaddr, action, SUM(bytes) AS total_bytes
FROM security_lake.vpc_flow_logs
WHERE year = '2026'
  AND month = '06'
  AND day = '04'
  AND action = 'REJECT'
GROUP BY srcaddr, dstaddr, action
ORDER BY total_bytes DESC
LIMIT 100;

That is a good investigative query. It asks a bounded question over a partitioned slice.

A bad first query looks like this:

SELECT *
FROM security_lake.vpc_flow_logs;

Wide, unbounded queries make direct query expensive and slow. Always start with time, partition, and result limits.
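
A lightweight lint step can catch the worst unbounded queries before they run. The sketch below is a heuristic based on regular expressions, not a SQL parser, so it will miss some cases and flag others incorrectly. The function name and the default partition column list are assumptions taken from the example table in this guide.

```python
import re

# Assumed time partition keys, matching the example table in this guide.
PARTITION_COLUMNS = ("year", "month", "day", "hour")


def query_problems(sql: str, partition_columns=PARTITION_COLUMNS) -> list[str]:
    """Return a list of reasons a query looks unbounded (empty list if none)."""
    text = " ".join(sql.lower().split())
    problems = []
    if re.search(r"\bselect\s+\*", text):
        problems.append("selects every column")
    # Everything after the first WHERE keyword, or an empty string if absent.
    pieces = re.split(r"\bwhere\b", text, maxsplit=1)
    where = pieces[1] if len(pieces) > 1 else ""
    has_partition_filter = any(
        re.search(rf"\b{re.escape(col)}\b\s*(=|<|>|between\b|in\b)", where)
        for col in partition_columns
    )
    if not has_partition_filter:
        problems.append("no partition filter")
    if not re.search(r"\blimit\s+\d+", text):
        problems.append("no LIMIT")
    return problems
```

Run against the two queries above, the bounded query returns an empty list, while the `SELECT *` query is flagged on all three checks.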

Cost Considerations

S3 Direct Query is often attractive because you do not need to ingest everything first.

But it still has costs.

AWS documents OpenSearch Compute Unit pricing for S3 direct queries. You incur DirectQuery OCU usage as queries run, plus the separate S3 storage costs you already pay. Interactive queries and indexed view queries have different behavior. For S3, new queries from Discover can start a session that remains active for a minimum period so follow-up queries can run faster.

That changes how you should monitor usage.

Watch:

  • DirectQuery OCU usage
  • S3 request and data scan patterns
  • expensive dashboard refreshes
  • unbounded exploratory queries
  • indexed view refresh frequency
  • checkpoint bucket behavior
  • query failures and latency

AWS also documents direct query metrics for data sources and notes that OCU usage can be monitored through Cost Explorer and AWS Budgets, though budget data may lag.

The practical rule:

Direct Query saves ingestion work, not all compute cost.

If a dashboard runs constantly, it may be cheaper and faster to create a targeted indexed view or ingest the slice into OpenSearch.

Security And Access Control

OpenSearch over a data lake creates a cross-service access path.

That path should be designed explicitly.

You need to think about:

  • IAM role assumptions for OpenSearch direct query
  • S3 read access
  • checkpoint bucket write access
  • Glue Data Catalog access
  • OpenSearch Dashboards access control
  • fine-grained access control inside OpenSearch
  • query-result index access
  • encryption and KMS policy boundaries
  • cross-account bucket access

One subtle risk: query infrastructure can expose more than users realize.

AWS notes that OpenSearch uses request and result indexes to run queries against a direct-query data source, and that users with read access to those indexes can read the query requests or results associated with it. That means access control is not only about the raw S3 bucket. It is also about OpenSearch-side request and result storage.

For production:

  • map roles deliberately
  • separate admin and analyst access
  • keep checkpoint buckets locked down
  • avoid broad s3:* policies
  • log access and configuration changes
  • review result-index access
  • validate cross-account conditions
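
As an illustration of avoiding broad `s3:*` policies, the sketch below builds separate S3 statements for the data bucket and the checkpoint bucket and includes a simple wildcard check. The bucket names, function names, and action lists are assumptions chosen for illustration. This is not the complete policy AWS documents for a direct query role, and it omits the Glue and domain permissions entirely.

```python
def s3_statements(data_bucket: str, checkpoint_bucket: str) -> list[dict]:
    """Build scoped S3 statements: read-only on data, read/write on checkpoints."""
    def arns(bucket: str) -> list[str]:
        return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    return [
        {
            "Sid": "ReadLakeData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": arns(data_bucket),
        },
        {
            "Sid": "ReadWriteCheckpoints",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
            "Resource": arns(checkpoint_bucket),
        },
    ]


def has_wildcard_actions(statements: list[dict]) -> bool:
    """Return True if any statement uses a wildcard action such as s3:*."""
    for statement in statements:
        actions = statement["Action"]
        if isinstance(actions, str):  # IAM allows a single string or a list
            actions = [actions]
        if any(action.endswith("*") for action in actions):
            return True
    return False
```

A check like `has_wildcard_actions` can run in a policy review or CI step so that broad grants are caught before they reach production.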

Data Lake Use Cases That Fit OpenSearch

Security investigation

Security teams often need to search across CloudTrail, VPC Flow Logs, WAF logs, and Security Lake data.

OpenSearch is useful because the workflow is investigative:

  • search by IP address
  • filter by account and region
  • group by action
  • pivot from logs into dashboards
  • accelerate repeat detections

Use Direct Query for ad hoc investigation and indexed views for common security slices.

Operational log analytics

Platform teams often have logs in S3 because S3 is durable and cheaper than keeping everything hot.

OpenSearch can make that data easier to explore during incidents:

  • rejected network flows
  • WAF blocks
  • service access logs
  • application export logs
  • account-level audit trails

Use partitions aggressively. Incident queries usually start with time.

Searchable archive

If an old dataset is rarely queried but still needs occasional search, S3 plus Direct Query can be cleaner than keeping every record in a hot OpenSearch index.

Good examples:

  • archived logs
  • historical operational events
  • compliance review data
  • low-frequency investigative records

If the archive becomes a daily dashboard source, reconsider indexing the useful slice.

Application search

This is where OpenSearch usually should not directly scan S3 on every user action.

For application search:

  • ingest selected documents
  • normalize fields
  • define mappings
  • tune analyzers
  • use relevance tests
  • maintain an index update pipeline
  • keep S3 as the source of truth or archive

The data lake is the durable backplane. OpenSearch is the serving index.

When OpenSearch Is The Wrong Data Lake Tool

OpenSearch is not the best answer when:

  • the work is broad analytical SQL across many lake tables
  • transformations are the main job
  • the team needs warehouse-grade semantic models
  • joins, batch ETL, and slowly changing dimensions dominate
  • the workload is cost-sensitive full-lake scanning
  • the data needs heavy cleansing before search
  • the primary output is finance, BI, or governed reporting

Use Athena, Spark, Redshift, dbt, Glue, or lakehouse tooling for those jobs.

OpenSearch can still sit next to them.

The best architectures do not force one tool to be the whole data platform.

Migration Checklist For Old AWS Elasticsearch Data Lake Workflows

If you inherited an old "AWS Elasticsearch data lake search" setup, use this checklist.

  1. Identify whether the setup runs an OpenSearch engine on Amazon OpenSearch Service, a legacy Elasticsearch OSS engine on OpenSearch Service, or self-managed Elasticsearch.
  2. Document the current domain version and whether an upgrade to OpenSearch 2.13 or later is available.
  3. Separate existing indexed datasets from lake-backed datasets.
  4. List the S3 buckets, Glue databases, Athena tables, and log sources involved.
  5. Identify which queries are ad hoc and which are repeated.
  6. Move repeated hot queries toward indexed views or indexed OpenSearch data.
  7. Keep rare historical search on S3 Direct Query where it fits.
  8. Review IAM roles, S3 bucket policies, checkpoint buckets, and result-index access.
  9. Update dashboards and runbooks to use OpenSearch naming.
  10. Add CloudWatch metrics and budget controls for Direct Query usage.

The biggest migration mistake is renaming everything without changing the architecture.

The rename is only vocabulary. The real work is deciding which data should be queried in place, accelerated, indexed, or left to the lakehouse path.

Practical Architecture

A healthy OpenSearch and data lake design often looks like this:

S3 data lake
  -> Glue Data Catalog
  -> OpenSearch Direct Query for exploration
  -> indexed views for repeated operational slices
  -> OpenSearch Ingestion for hot search indexes
  -> Athena/Spark/Redshift for broad analytics and transformation
  -> CloudWatch/Budgets for usage and cost monitoring

Ownership should be just as clear:

Layer | Owner
Raw S3 data | Data platform or producing team
Glue table contracts | Data engineering
OpenSearch domain | Platform/SRE
Direct query data source | Platform plus data owner
Indexed views | Dashboard or investigation owner
Ingestion pipeline | Search/platform owner
Cost alarms | FinOps/platform
Access control | Security plus platform

That ownership map prevents the classic failure mode: everyone can query the lake, but nobody owns the tables, checkpoints, indexed views, or cost profile.

Final Recommendation

Amazon OpenSearch Service is useful for data lakes when you treat it as a search and operational analytics layer, not as a universal lakehouse replacement.

Use it like this:

  • Start with S3 Direct Query when you need to explore lake data before building a pipeline.
  • Create Query Workbench tables and partition S3 data around real investigation paths.
  • Use indexed views only when a repeated query pattern proves it deserves acceleration.
  • Ingest selected data into OpenSearch indexes when users need fast repeated search.
  • Keep warehouse and lakehouse engines for broad SQL analytics, transformations, and BI.
  • Update old AWS Elasticsearch language so teams do not confuse legacy naming with current architecture.

If the question is "Can AWS Elasticsearch search a data lake?", the modern answer is:

Amazon OpenSearch Service can query and search selected data lake datasets through S3 Direct Query, indexed views, and ingestion pipelines, but the right pattern depends on latency, query frequency, cost, and ownership.

That is the decision that matters.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
