aws-reference-architectures/datalake

Query S3 Data lake using Athena and Glue Catalog

Last updated 5 years ago

Overview

Amazon Athena is a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena integrates with the AWS Glue Data Catalog out of the box, so you can start running queries against your data lake quickly. This is one of the simplest data lake architectures, because Athena reads S3 data directly, using the table definitions held in the AWS Glue Catalog. Glue Crawlers can optionally be used to create and maintain the data catalog.
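As a rough illustration of what "running a query" looks like from code, the sketch below submits SQL to Athena and polls until the query finishes. It assumes the `boto3` SDK and an Athena client created by the caller; the helper names (`build_query_request`, `run_query`), the database name, and the S3 paths are hypothetical and not part of this architecture.

```python
import time

# States after which Athena will not change the query status again.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def build_query_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution API.

    `database` is a Glue Catalog database; `output_location` is an S3 prefix
    where Athena writes result files. Both are supplied by the caller.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena, request, poll_seconds=1.0):
    """Submit a query and block until it reaches a terminal state.

    `athena` is expected to behave like boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)


# Example usage (requires AWS credentials; all names are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   request = build_query_request(
#       "SELECT COUNT(*) FROM events",
#       database="datalake_db",
#       output_location="s3://example-bucket/athena-results/",
#   )
#   query_id, state = run_query(athena, request)
```

Keeping the request construction separate from the polling loop makes the request easy to inspect or log before anything is sent to AWS. A production version would likely also add a timeout.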

Architecture Component Walkthrough

  1. AWS Glue Catalog stores the schema and partition metadata of the datasets residing in the S3 data lake.

  2. An AWS Glue Crawler can optionally be used to create and update the catalog tables on a schedule. If you already know the schema of your data, you can instead use Athena to define tables directly in the Glue Catalog using Hive DDL syntax.

  3. By default, Athena reads schema definitions from the Glue Data Catalog and uses them to parse and query the data in S3. Wherever possible, partition the data, compress it, and store it in a columnar format (such as Parquet or ORC) for better query performance, since all three reduce the amount of data each query has to scan.
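The steps above can be sketched in code. The example below builds (but does not send) a Glue crawler request for step 2, the Hive DDL alternative for a partitioned Parquet table, and a query whose `WHERE` clause filters on the partition column so that Athena only scans the matching S3 prefix. Every name here (the helper functions, `datalake_db`, `events`, the `dt` partition column, the bucket, and the IAM role) is a hypothetical placeholder, and the crawler dictionary assumes the keyword arguments of the `boto3` Glue client's `create_crawler` call.

```python
def build_crawler_request(name, role, database, s3_path, schedule=None):
    """Keyword arguments for glue.create_crawler (boto3); nothing is sent here."""
    request = {
        "Name": name,
        "Role": role,  # IAM role the crawler assumes (placeholder)
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
    return request


def create_table_ddl(database, table, columns, partitions, location):
    """Hive DDL for an external, partitioned Parquet table stored in S3."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def add_partition_ddl(database, table, partition_values, location):
    """Hive DDL that registers one partition and its S3 location in the catalog."""
    spec = ", ".join(f"{key} = '{value}'" for key, value in partition_values.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical dataset: click events partitioned by day.
ddl = create_table_ddl(
    "datalake_db",
    "events",
    columns=[("user_id", "string"), ("event_type", "string"), ("ts", "timestamp")],
    partitions=[("dt", "string")],
    location="s3://example-bucket/curated/events/",
)
partition = add_partition_ddl(
    "datalake_db",
    "events",
    {"dt": "2020-01-01"},
    "s3://example-bucket/curated/events/dt=2020-01-01/",
)
# Filtering on the partition column lets Athena skip every other partition.
pruned_query = (
    "SELECT event_type, COUNT(*) AS n "
    "FROM datalake_db.events "
    "WHERE dt = '2020-01-01' "
    "GROUP BY event_type"
)
```

The generated statements can be pasted into the Athena console or submitted through the Athena API. Choosing between the crawler and hand-written DDL mostly comes down to whether the schema is known up front.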

References

  • Athena Best Practices

Have suggestions? Join our Slack channel to share feedback.