Query Data lake using EMR and Glue Catalog

Overview

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. EMR Notebooks, based on the popular Jupyter Notebook, provide a development and collaboration environment for ad hoc querying and exploratory analysis.

In a datalake environment, it is essential to have a central schema repository of the datasets available in S3. AWS Glue Data Catalog provides a fully managed service for indexing and managing the schema of data stored in S3. Compute engines like EMR, Athena, Redshift etc can execute analytics workloads against your S3 datalake using the Glue Data Catalog by default.

In this architecture, we show how to leverage AWS Glue Data Catalog to execute queries against S3 datalake by using multiple EMR clusters in virtual private cloud (VPC).

Architecture Walkthrough

  1. S3 datalake is populated by one or many data ingestion mechanism.

  2. Glue Crawlers are used to discover datasets in S3 and create and maintain the schema definitions in the Glue Data Catalog.

  3. Multiple EMR clusters can be deployed with access to Glue Catalog. EMR clusters execute queries against S3 through an Internet Gateway or S3 Endpoint in the VPC.

References

Have suggestions? Join our Slack channel to share feedback.

Last updated