Data Ingestion using AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It can extract data from heterogeneous data sources such as relational databases (Amazon RDS, Aurora), Amazon Redshift, or Amazon S3, and ingest it into a data lake. AWS Glue uses an Apache Spark processing engine under the hood and supports Spark APIs to transform data in memory.
In this architecture, we use AWS Glue to extract data from relational data sources in a VPC and ingest it into a data lake backed by Amazon S3.
You create a relational database on Amazon RDS and/or Aurora within a VPC.
You create a connection to your RDBMS in the AWS Glue service.
You create an IAM role for AWS Glue that has write access to S3.
AWS Glue connects to the databases using JDBC through an elastic network interface (ENI) in the same VPC.
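The connection step above can be sketched with boto3. This is a minimal illustration, not a definitive setup: the helper names (`build_jdbc_connection_input`, `create_connection`) and every value in the example call (connection name, JDBC URL, subnet, security group, Availability Zone) are hypothetical placeholders. The payload follows the shape of the Glue `CreateConnection` API, where `PhysicalConnectionRequirements` tells Glue which subnet and security groups in the VPC to use when reaching the database.

```python
def build_jdbc_connection_input(name, jdbc_url, username, password,
                                subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput payload for Glue's CreateConnection API."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Network placement: Glue reaches the database from this subnet,
        # using these security groups, inside the database's VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input, region_name):
    """Register the connection with AWS Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue", region_name=region_name)
    glue.create_connection(ConnectionInput=connection_input)


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    payload = build_jdbc_connection_input(
        name="example-rds-connection",
        jdbc_url="jdbc:mysql://example-host:3306/exampledb",
        username="example_user",
        password="example_password",
        subnet_id="subnet-00000000",
        security_group_ids=["sg-00000000"],
        availability_zone="us-east-1a",
    )
    print(payload["ConnectionType"])
```

In practice the password would more likely come from a secrets store than from source code; it is inlined here only to keep the sketch short.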
AWS Glue extracts data from your RDBMS and stores it in Amazon S3. For better query performance, write structured data to S3 in a compressed columnar format such as Parquet or ORC. Structured data can be converted into a compressed columnar format with PySpark or Scala, using the Spark APIs in a Glue ETL job.
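The extract-and-convert step can be sketched as a Glue PySpark job. This is a rough illustration under stated assumptions: the source is assumed to be MySQL (hence `connection_type="mysql"`), and the helper names, bucket, prefix, and credentials are hypothetical. The `awsglue` module is only available inside the Glue job runtime, so it is imported inside `run_job`.

```python
def jdbc_source_options(jdbc_url, table, user, password):
    """Connection options for reading one table over JDBC."""
    return {"url": jdbc_url, "dbtable": table, "user": user, "password": password}


def parquet_sink_options(bucket, prefix):
    """Connection options for writing under s3://bucket/prefix/."""
    return {"path": f"s3://{bucket}/{prefix.strip('/')}/"}


def run_job(jdbc_url, table, user, password, bucket, prefix):
    """Read a relational table and write it to S3 as Snappy-compressed Parquet."""
    # These imports resolve only inside the AWS Glue job environment.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Extract: load the table into a DynamicFrame over JDBC (MySQL assumed).
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options=jdbc_source_options(jdbc_url, table, user, password),
    )

    # Load: write the frame to the data lake in a compressed columnar format.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options=parquet_sink_options(bucket, prefix),
        format="parquet",
        format_options={"compression": "snappy"},
    )
```

Any Spark transformations would go between the read and the write, for example by calling `source.toDF()` to work with a regular Spark DataFrame.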