Data Ingestion using AWS Glue

Overview

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It can extract data from heterogeneous data sources such as relational databases (Amazon RDS, Amazon Aurora), Amazon Redshift, or Amazon S3, and ingest it into a data lake. AWS Glue uses an Apache Spark processing engine under the hood and supports Spark APIs to transform data in memory.

In this architecture, we use AWS Glue to extract data from relational data sources in a VPC and ingest it into a data lake backed by Amazon S3.
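As a rough illustration of this flow, the following is a minimal sketch of a Glue ETL script that reads one table over JDBC and writes it to S3 as Parquet. The job parameter names (jdbc_url, db_table, db_user, db_password, target_bucket), the "raw/" prefix, and the choice of a MySQL source are assumptions made for the example and do not come from this architecture. Credentials are passed as job parameters only to keep the sketch short; a Glue connection or a secrets store is a better home for them.

```python
def jdbc_source_options(url, table, user, password):
    """Build the connection options for a JDBC read in a Glue job."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def s3_sink_options(bucket, prefix):
    """Build the connection options for an S3 write in a Glue job."""
    return {"path": f"s3://{bucket}/{prefix.strip('/')}/"}


def run_job(argv):
    # awsglue and pyspark are provided by the Glue job runtime, so they are
    # imported here instead of at module load time.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Hypothetical job parameters, passed as --jdbc_url, --db_table, and so on.
    args = getResolvedOptions(
        argv,
        ["JOB_NAME", "jdbc_url", "db_table", "db_user", "db_password", "target_bucket"],
    )

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table over JDBC (MySQL is assumed here).
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options=jdbc_source_options(
            args["jdbc_url"], args["db_table"], args["db_user"], args["db_password"]
        ),
    )

    # Load: write the rows to the data lake bucket in Parquet format.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options=s3_sink_options(args["target_bucket"], "raw/" + args["db_table"]),
        format="parquet",
    )
    job.commit()


# Inside a Glue job script, the entry point would be: run_job(sys.argv)
```

The two small helper functions keep the option dictionaries separate from the Glue runtime calls, so they can be checked locally without a Glue environment.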

Figure: Data ingestion with AWS Glue

Architecture Component Walkthrough

  1. You create a relational database on Amazon RDS and/or Amazon Aurora within a VPC.

  2. You create a Connection to your RDBMS in the AWS Glue service.

  3. AWS Glue connects to the databases using JDBC through an elastic network interface (ENI) in the same VPC.

  4. AWS Glue extracts data from your RDBMS and stores it in Amazon S3. We recommend writing structured data to S3 in a compressed columnar format such as Parquet or ORC for better query performance. Data in row-oriented structured formats such as CSV can be converted into a compressed columnar format with PySpark or Scala using Spark APIs in the Glue ETL job.
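Steps 2 and 3 above can be sketched with the AWS SDK for Python (boto3). This is a minimal example rather than a recommended setup: the connection name, JDBC URL, credentials, subnet, security group, Availability Zone, and region shown in the usage note are hypothetical placeholders. The subnet and security group settings are what allow Glue to place its network interface inside your VPC.

```python
def build_connection_input(name, jdbc_url, username, password,
                           subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for a JDBC Glue connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Network placement for the ENI that Glue uses to reach the database.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input, region_name):
    """Register the connection in the Glue Data Catalog."""
    # boto3 is imported lazily so the builder above works without it installed.
    import boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_connection(ConnectionInput=connection_input)
```

A call might look like create_connection(build_connection_input("rds-orders", "jdbc:mysql://example-host:3306/orders", "etl_user", "example-password", "subnet-0example", ["sg-0example"], "us-east-1a"), "us-east-1"), where every value is a placeholder to be replaced with your own.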

References

How to extract, transform, and load data for analytic processing using AWS Glue
