Query Data lake using EMR and External Hive Metastore in VPC


Last updated 5 years ago


Overview

Amazon EMR is a managed Hadoop framework on AWS. Hive is a data warehouse infrastructure tool for processing structured and semi-structured data in Hadoop using a SQL-like query language. Hive stores and manages schema metadata using a 'metastore' service backed by a relational database. In a data lake environment, it is essential to have a centralized schema repository that translates storage locations on S3 or HDFS into a model of Databases, Tables, and Partitions that can be queried with SQL. Most AWS customers leverage AWS Glue as an external catalog because of its ease of use. However, customers may want to set up their own self-managed Data Catalog for the reasons outlined here.
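To make the metastore's role concrete, the sketch below shows the kind of HiveQL that registers an S3 location as a database, table, and partition in the metastore. The bucket name, database, table, and columns are all hypothetical:

```shell
# Sketch only: HiveQL mapping a (hypothetical) S3 location onto the
# metastore's model of databases, tables, and partitions.
cat > create_orders.hql <<'HQL'
CREATE DATABASE IF NOT EXISTS sales_db;

-- An external table: the metastore records the schema and S3 location;
-- the data itself stays in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
  order_id STRING,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://example-datalake-bucket/curated/orders/';

-- Register one partition; it resolves to .../orders/dt=2020-01-01/ under
-- the table location.
ALTER TABLE sales_db.orders ADD IF NOT EXISTS PARTITION (dt='2020-01-01');
HQL

# On an EMR master node this could be executed with, e.g.:
#   hive -f create_orders.hql
```

Any cluster pointed at the same metastore then sees `sales_db.orders` and its partitions without re-declaring them.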

In this architecture, we provide a walkthrough of how to set up a centralized schema repository using EMR with Amazon RDS Aurora. Once created, multiple EMR clusters can execute queries against the same schema metadata. To avoid accidental loss or corruption of schema metadata, it is recommended that you grant database write access to only one EMR cluster.
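Pointing EMR's Hive at an external metastore is done with a `hive-site` configuration classification supplied at cluster launch. A minimal sketch, assuming an Aurora MySQL-compatible endpoint; the hostname, database name, and credentials are placeholders:

```shell
# Sketch only: hive-site classification directing Hive to an external
# metastore database (placeholder host, schema, and credentials).
cat > hiveConfiguration.json <<'JSON'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mariadb://metastore-host:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive_user",
      "javax.jdo.option.ConnectionPassword": "change_me"
    }
  }
]
JSON

# The file is then passed when launching the cluster, e.g.:
#   aws emr create-cluster --release-label emr-5.30.0 \
#     --applications Name=Hive \
#     --configurations file://hiveConfiguration.json \
#     --instance-type m5.xlarge --instance-count 3 --use-default-roles
```

In practice, credentials would come from a secrets store rather than a plain-text file, and the read-only clusters would use a database user without write privileges.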

Architecture Walkthrough

  1. An RDS database in a VPC is used to store the Hive metastore's metadata.
  2. A single EMR cluster is set up with its Hive metastore on the RDS database, in the same VPC but preferably in a different subnet and security group. This EMR cluster has write access to the database; the permissions can be managed using database users.
  3. Multiple EMR clusters can be deployed with read-only access to the schema metadata on the database. These EMR clusters can execute queries against the S3 data lake using an Internet Gateway or an S3 Endpoint on the VPC.

References

  • Setting up an external database as metastore on EMR
  • Query Data lake using EMR and External Hive Metastore

Have suggestions? Join our Slack channel to share feedback.
