# Query Data lake using EMR and External Hive Metastore in VPC

## Overview

[Amazon EMR](https://aws.amazon.com/emr/) is a managed Hadoop framework in AWS. Hive is a data infrastructure tool to process structured/semistructured data in Hadoop using SQL like query language. Hive stores and manages schema metadata using a 'metastore' service backed by a relational database. In a datalake environment, it is essential to have a centralized schema repository which translates storage locations on S3 or HDFS into a model of Databases, Tables, and Partitions that can be used in SQL. Most AWS customers [leverage AWS Glue](https://aws-reference-architectures.gitbook.io/datalake/data-analytics/multi-emr-on-glue-catalog) as an external catalog due to ease of use. However, customers may want to set up their own self-managed Data Catalog due to reasons outlined [here](https://aws-reference-architectures.gitbook.io/datalake/master).

In this architecture, we will provide a walkthrough of how to set up a centralized schema repository using EMR with [Amazon RDS Aurora](https://aws.amazon.com/rds/aurora). Once created, multiple EMR clusters can execute queries against the same schema metadata. To avoid accidental schema metadata loss/corruption, it is recommended that you provide database write access to one EMR cluster only.

![Query Data lake using EMR and External Hive Metastore](https://2553439727-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LXQF3JgpYb-IDUgkC6e%2F-LXUCd_m6SPY3a3y3Qyr%2F-LXUCellbFB9RoiOpHVv%2Fanalytics-emr-hive-metastore.png?generation=1548859388063510\&alt=media)

## Architecture  Walkthrough

1. RDS database is used to store metadata information in a VPC.
2. Single EMR Cluster is set up with Hive metastore on RDS database in same VPC but preferably in a different subnet and [security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html). This EMR cluster has write access on the the database. The permissions can be managed using database users.
3. Multiple EMR clusters can be deployed with read-only access to schema metadata on the database. These EMR clusters can execute queries against the S3 using an [Internet Gateway](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html) or [S3 Endpoint](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html) on the VPC.

## References

* [Setting up an external database as metastore on EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html)

## Have suggestions? Join our [Slack channel](https://join.slack.com/t/cat-cwp4274/shared_invite/zt-e2ztjpgw-Bugw46iXsLbZ~V54AljWsA) to  share feedback.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://aws-reference-architectures.gitbook.io/datalake/data-analytics/multi-emr-on-hive-metastore.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
