Ingestion Architectures for Data Lakes on AWS

Overview

One of the core values of a data lake is that it is a collection point and repository for all of an organization's data assets, in their native formats. This enables quick ingestion, eliminates data duplication and data sprawl, and centralizes governance and management. After data assets are collected, they need to be transformed into normalized formats so they can be used by a variety of data analytics and processing tools. During this phase, customers typically standardize on a scheme for data compression, encryption, and the layout of information at the prefix level in S3.
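
As one concrete illustration of such a standardization scheme, the sketch below gzip-compresses a raw JSON event, encrypts it at rest with SSE-S3, and lands it under a Hive-style date-partitioned prefix. The bucket, dataset name, and partition layout are hypothetical examples of one common convention, not a prescription.

```python
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical names -- substitute your own bucket and dataset.
BUCKET = "example-datalake-raw"
DATASET = "clickstream"


def write_raw_event(event):
    """Compress a raw event and land it under a date-partitioned prefix."""
    now = datetime.now(timezone.utc)
    # Hive-style layout: dataset/year=YYYY/month=MM/day=DD/<object>.json.gz
    key = (
        f"{DATASET}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{now:%H%M%S%f}.json.gz"
    )
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=gzip.compress(json.dumps(event).encode("utf-8")),
        ServerSideEncryption="AES256",  # SSE-S3 server-side encryption
    )
```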

The key to ‘democratizing’ data, and making it available to the widest range of users of varying skill sets and responsibilities, is to transform data assets into a format that allows for efficient ad hoc SQL queries. As discussed earlier, when a data lake is built on AWS, we recommend transforming log-based data assets into columnar formats such as Apache Parquet or ORC. AWS provides multiple services to achieve this quickly and efficiently.
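
To make the columnar conversion concrete, here is a minimal PySpark sketch that reads newline-delimited JSON logs and rewrites them as Parquet. The S3 paths are placeholders, and it assumes the logs already carry year, month, and day columns to partition on; your schema and partition keys will differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs-to-parquet").getOrCreate()

# Placeholder locations -- substitute your own buckets and prefixes.
raw_path = "s3://example-datalake-raw/clickstream/"
curated_path = "s3://example-datalake-curated/clickstream/"

# Read newline-delimited JSON logs and rewrite them as Parquet,
# partitioned so that SQL engines can prune on the date columns.
logs = spark.read.json(raw_path)
(logs.write
     .mode("overwrite")
     .partitionBy("year", "month", "day")
     .parquet(curated_path))
```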

In this section, we share some of the common ingestion architectural patterns that we see across many of our customers' data lakes.

Reference Architectures for Ingesting Data into a Data Lake

Have suggestions? Join our Slack channel to share feedback.

Ingest event and log data using Kinesis Firehose (a minimal producer sketch follows this list)
Ingest database changes using Database Migration Service
Ingest data from JDBC sources using AWS Glue
Ingest data files using AWS DataSync
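
For the Kinesis Firehose pattern, the producer side can be as simple as the boto3 sketch below, which batches records to a delivery stream already configured with an S3 destination. The stream name is a hypothetical placeholder; PutRecordBatch accepts at most 500 records per call, and a production producer should retry the records reported in FailedPutCount rather than just logging them.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Hypothetical stream -- create a delivery stream with an S3
# destination before running this.
STREAM_NAME = "example-clickstream-to-s3"


def send_events(events):
    """Send events to Firehose in batches of at most 500 records."""
    records = [
        {"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events
    ]
    for i in range(0, len(records), 500):
        response = firehose.put_record_batch(
            DeliveryStreamName=STREAM_NAME,
            Records=records[i:i + 500],
        )
        if response["FailedPutCount"]:
            # A production producer should retry the failed records.
            print(f"{response['FailedPutCount']} records failed")
```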