Query Data lake using EMR and External Hive Metastore in VPC

Overview

Amazon EMR is a managed Hadoop framework in AWS. Hive is a data infrastructure tool to process structured/semistructured data in Hadoop using SQL like query language. Hive stores and manages schema metadata using a 'metastore' service backed by a relational database. In a datalake environment, it is essential to have a centralized schema repository which translates storage locations on S3 or HDFS into a model of Databases, Tables, and Partitions that can be used in SQL. Most AWS customers leverage AWS Glue as an external catalog due to ease of use. However, customers may want to set up their own self-managed Data Catalog due to reasons outlined here.

In this architecture, we will provide a walkthrough of how to set up a centralized schema repository using EMR with Amazon RDS Aurora. Once created, multiple EMR clusters can execute queries against the same schema metadata. To avoid accidental schema metadata loss/corruption, it is recommended that you provide database write access to one EMR cluster only.

Architecture Walkthrough

  1. RDS database is used to store metadata information in a VPC.

  2. Single EMR Cluster is set up with Hive metastore on RDS database in same VPC but preferably in a different subnet and security group. This EMR cluster has write access on the the database. The permissions can be managed using database users.

  3. Multiple EMR clusters can be deployed with read-only access to schema metadata on the database. These EMR clusters can execute queries against the S3 using an Internet Gateway or S3 Endpoint on the VPC.

References

Have suggestions? Join our Slack channel to share feedback.

Last updated