Query Data lake using EMR and Glue Catalog

Last updated 5 years ago

Overview

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. EMR Notebooks, based on the popular Jupyter Notebook, provide a development and collaboration environment for ad hoc querying and exploratory analysis.

In a data lake environment, it is essential to have a central schema repository for the datasets available in S3. The AWS Glue Data Catalog is a fully managed service for indexing and managing the schema of data stored in S3. Compute engines such as EMR, Athena, and Redshift can execute analytics workloads against your S3 data lake using the Glue Data Catalog as their shared metastore.
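To make the idea of a central schema repository concrete, the sketch below builds a dictionary shaped like the `TableInput` structure that the Glue Data Catalog stores for a table. It is only an illustration: the helper name `build_table_input`, the table name, the bucket path, and the column list are hypothetical, and the dataset is assumed to be Parquet. Nothing here calls AWS; it only shows which pieces of metadata (location, columns, partition keys, format classes) a compute engine reads from the catalog.

```python
import json


def build_table_input(name, s3_location, columns, partition_keys):
    """Return a dict shaped like a Glue `TableInput` for a Parquet dataset.

    `columns` and `partition_keys` are lists of (name, type) tuples.
    All names and paths passed in are illustrative assumptions.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition columns are kept separate from the data columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": s3_location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


if __name__ == "__main__":
    table = build_table_input(
        name="orders",                              # hypothetical table
        s3_location="s3://example-bucket/orders/",  # hypothetical bucket
        columns=[("order_id", "bigint"), ("amount", "double")],
        partition_keys=[("dt", "string")],
    )
    print(json.dumps(table, indent=2))
```

In practice a Glue Crawler typically produces entries like this automatically; writing one by hand is mainly useful for understanding what the catalog holds.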

In this architecture, we show how to leverage the AWS Glue Data Catalog to execute queries against an S3 data lake by using multiple EMR clusters in a virtual private cloud (VPC).

Architecture Walkthrough

  1. The S3 data lake is populated by one or more data ingestion mechanisms.

  2. Glue Crawlers are used to discover datasets in S3 and create and maintain the schema definitions in the Glue Data Catalog.

  3. Multiple EMR clusters can be deployed with access to the Glue Data Catalog. The clusters execute queries against S3 through an Internet Gateway or an S3 Endpoint in the VPC.
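For the EMR clusters to share the Glue Data Catalog, each cluster has to be launched with a configuration that points its Hive metastore client at Glue. The following is a minimal sketch, not the authors' deployment code: it builds the EMR `Configurations` list using the `hive-site` and `spark-hive-site` classifications and the Glue client factory class. The helper name `glue_catalog_configurations` is hypothetical, and the sketch assumes the clusters run Hive and optionally Spark.

```python
import json

# Hive metastore client factory class that routes metastore calls to the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_spark=True):
    """Build an EMR `Configurations` list that uses Glue as the metastore.

    The same list can be reused for every cluster so that all of them
    resolve table definitions from the one shared catalog.
    """
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = [
        {"Classification": "hive-site", "Properties": dict(properties)}
    ]
    if include_spark:
        # Spark SQL reads its Hive settings from a separate classification.
        configurations.append(
            {"Classification": "spark-hive-site", "Properties": dict(properties)}
        )
    return configurations


if __name__ == "__main__":
    # Emit JSON suitable for saving to a file and supplying at cluster creation.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The resulting JSON can be saved and supplied when creating each cluster, for example through the `--configurations` option of `aws emr create-cluster` or the `Configurations` parameter of the EMR API. The cluster's instance role also needs permission to read the Glue Data Catalog, which this sketch does not cover.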
