Data lake Storage Architecture FAQs

What features should you look for when selecting a cloud data lake storage platform?

Selecting a data storage solution is typically driven by data retrieval patterns, scalability, performance, cost, and durability requirements. In the case of a data lake, characteristics such as the data retrieval pattern and performance needs are unclear at the beginning. It is therefore recommended to select a solution that is secure, durable, distributed, and decoupled from the data processing compute infrastructure. Amazon S3 provides all of the above characteristics, with seamless integration with other AWS and open source data analytics services. Amazon S3 also provides secure APIs for programmatic access, so it is easy to build new integrations where required.
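To illustrate that programmatic access, the sketch below uses boto3, the AWS SDK for Python, to write and read an object with server-side encryption. The bucket name, object key, and payload are hypothetical placeholders, and the snippet assumes AWS credentials are already configured.

```python
import boto3

# Assumes credentials come from the environment or an IAM role, and that the
# bucket "my-datalake-raw" (a placeholder name) already exists.
s3 = boto3.client("s3")

# Write a raw record into the data lake, encrypted at rest with SSE-KMS.
s3.put_object(
    Bucket="my-datalake-raw",
    Key="sales/2020/01/15/orders.json",
    Body=b'{"order_id": 42, "amount": 19.99}',
    ServerSideEncryption="aws:kms",
)

# Read the same object back through the API.
response = s3.get_object(Bucket="my-datalake-raw", Key="sales/2020/01/15/orders.json")
print(response["Body"].read().decode("utf-8"))
```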

Why are Hadoop HDFS or data warehouse storage not great choices for a data lake?

There are a few primary reasons why solutions like HDFS storage and data warehouse (DW) storage systems are not suitable for data lakes:

  1. Scalability: Data lakes are expected to store all of an organization's data. As the data volume grows, HDFS or data warehouse storage systems need to be scaled from time to time.

  2. Open data format support and storage-compute coupling: HDFS supports open data formats, but it is tightly coupled with its compute services. Similarly, DW systems store data in proprietary formats that are not accessible to external compute engines for analytics (see the sketch after this list).
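To make the storage-compute decoupling concrete, here is a minimal sketch under assumed names: one library writes Parquet (an open columnar format) to an S3 prefix, and an entirely different engine reads it back without either side knowing about the other. The bucket and prefix are hypothetical, and the example assumes pandas, s3fs, and pyarrow are installed with AWS credentials available.

```python
import pandas as pd
import pyarrow.dataset as ds

# Hypothetical curated-zone location in the data lake.
path = "s3://my-datalake-curated/sales/"

# One engine (pandas) writes the data as Parquet, an open columnar format, to S3.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})
df.to_parquet(path + "orders.parquet", index=False)

# A completely separate engine (PyArrow here; Athena, Spark, or Presto work the
# same way) reads the files directly, because the storage layer is decoupled
# from any single compute service.
table = ds.dataset(path, format="parquet").to_table()
print(table.to_pandas())
```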

Have suggestions? Join our Slack channel to share feedback.
