Data Ingestion using AWS Glue

Overview

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It can extract data from heterogeneous data sources such as RDBMS (Amazon RDS, Aurora), Amazon Redshift, or Amazon S3 and ingest it into a data lake. Under the hood, AWS Glue uses an Apache Spark processing engine and supports the Spark APIs to transform data in memory.

In this architecture, we use AWS Glue to extract data from relational data sources in a VPC and ingest it into a data lake backed by Amazon S3.

Architecture Component Walkthrough


You create a relational database on Amazon RDS and/or Aurora within a VPC.
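
If you are scripting the setup, the source database can be provisioned with the AWS SDK. A minimal boto3 sketch, assuming a pre-created DB subnet group and security group in the VPC (all identifiers below are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Launch a MySQL instance inside the VPC; the DB subnet group
# (created beforehand) pins the instance to your VPC subnets.
rds.create_db_instance(
    DBInstanceIdentifier="datalake-source-db",    # placeholder name
    DBInstanceClass="db.t3.medium",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="CHANGE_ME",               # use Secrets Manager in practice
    AllocatedStorage=100,
    DBSubnetGroupName="datalake-vpc-subnets",     # pre-created subnet group in the VPC
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],
    PubliclyAccessible=False,
)
```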

You create a connection to your RDBMS in the AWS Glue service.
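
The connection stores the JDBC URL, credentials, and VPC placement that Glue needs to reach the database. A sketch using boto3's create_connection (the endpoint, subnet, and security group values are placeholders):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "datalake-source-connection",    # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://datalake-source-db.xxxx.us-east-1.rds.amazonaws.com:3306/sales",
            "USERNAME": "admin",
            "PASSWORD": "CHANGE_ME",             # prefer Secrets Manager
        },
        # Tells Glue where to place the network interface it uses
        # to reach the database inside the VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```

The PhysicalConnectionRequirements section drives the ENI placement described in the networking step below.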

You configure an IAM role for AWS Glue that has write access to S3.
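
One way to wire this up with boto3 (the bucket name is a placeholder): the role trusts the Glue service, attaches the AWS-managed AWSGlueServiceRole policy for baseline job permissions, and adds an inline policy granting write access to the lake bucket:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Glue service assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="GlueDataLakeRole", AssumeRolePolicyDocument=json.dumps(trust))

# Baseline permissions Glue needs to run jobs.
iam.attach_role_policy(
    RoleName="GlueDataLakeRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Write access to the data lake bucket (placeholder bucket name).
s3_write = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-datalake-bucket",
            "arn:aws:s3:::my-datalake-bucket/*",
        ],
    }],
}
iam.put_role_policy(
    RoleName="GlueDataLakeRole",
    PolicyName="DataLakeS3Write",
    PolicyDocument=json.dumps(s3_write),
)
```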

AWS Glue connects to the databases using JDBC through an Elastic Network Interface (ENI) in the same VPC.

Data is extracted from your RDBMS by AWS Glue and stored in Amazon S3. It is recommended to write structured data to S3 in a compressed columnar format like Parquet or ORC for better query performance. Data in a row-based format like CSV can be converted to a compressed columnar format with PySpark or Scala using the Spark APIs in the Glue ETL job.
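
A minimal Glue ETL script in PySpark along these lines; the database, table, bucket, and partition column names are illustrative, and the catalog entries are assumed to come from a crawler run against the JDBC connection:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table through the Glue Data Catalog
# (placeholder database/table names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
)

# Write to the data lake as columnar Parquet, partitioned by a
# column that downstream queries commonly filter on.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://my-datalake-bucket/curated/orders/",
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()
```

Parquet written by Glue is Snappy-compressed by default, and partitioning on a commonly filtered column reduces the data scanned by downstream query engines.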

References

How to extract, transform, and load data for analytic processing using AWS Glue

Have suggestions? Join our Slack channel to share feedback.
