aws-reference-architectures/datalake
Query Data lake using Redshift Spectrum and Glue Catalog

Last updated 5 years ago

Overview

Redshift Spectrum is a massively parallel query engine that runs queries against your S3 data lake through 'external tables', without loading the data into your Redshift cluster.

Spectrum is integrated with the AWS Glue Data Catalog: the Spectrum external table definitions are stored in the Glue Catalog and made accessible to the Redshift cluster through an 'external schema'. This reference architecture explains how to use Amazon Redshift Spectrum to query S3 data from a Redshift cluster in a VPC.
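As a minimal sketch of the external schema mapping described above (the schema name, Glue database name, and IAM role ARN below are hypothetical placeholders), the DDL can be generated and inspected like this:

```python
def external_schema_ddl(schema_name: str, glue_database: str, iam_role_arn: str) -> str:
    """Build the Redshift DDL that maps a cluster schema onto a Glue Data Catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )

# Hypothetical names -- substitute your own schema, Glue database, and role ARN.
ddl = external_schema_ddl(
    "spectrum_schema",
    "sales_db",
    "arn:aws:iam::123456789012:role/RedshiftSpectrumRole",
)
print(ddl)
```

The generated statement would be run through any Redshift SQL client; the IAM role must grant the cluster read access to both the Glue Catalog and the underlying S3 data.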

Architecture Component Walkthrough

  1. An AWS Glue Data Catalog stores the schema and partition metadata of the datasets residing in your S3 data lake.

  2. An AWS Glue Crawler can optionally be used to create and update the data catalog periodically. If you already know the schema of your data, you can instead use any Redshift client to define the external tables directly in the Glue Catalog.

  3. Create an 'external schema' in your Redshift cluster, which links a schema name in the cluster to an AWS Glue Data Catalog database.

  4. You can then query your data in S3 using Redshift Spectrum through an S3 VPC endpoint in the same VPC. Redshift Spectrum uses the schema and partition definitions stored in the Glue Catalog to query the S3 data. AWS recommends using compressed columnar formats such as ORC and Parquet for better query performance.
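The table-definition and query steps above can be sketched in code. Assuming a hypothetical `sales` table stored as Parquet and partitioned by `sale_date` (all names and the S3 location below are illustrative), the external-table DDL and a partition-pruned Spectrum query could be built like this and submitted through any Redshift client:

```python
def external_table_ddl(schema: str, table: str, columns: dict,
                       partition_col: str, partition_type: str,
                       s3_location: str) -> str:
    """Build DDL for a partitioned, Parquet-backed Spectrum external table."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} {partition_type})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )

def spectrum_count(schema: str, table: str, partition_filter: str) -> str:
    """Filtering on the partition column lets Spectrum prune the S3 objects it scans."""
    return f"SELECT count(*) FROM {schema}.{table} WHERE {partition_filter};"

# Hypothetical table layout and bucket.
table_ddl = external_table_ddl(
    "spectrum_schema", "sales",
    {"sale_id": "bigint", "amount": "decimal(10,2)"},
    "sale_date", "date",
    "s3://example-datalake/sales/",
)
query = spectrum_count("spectrum_schema", "sales", "sale_date = '2020-01-01'")
print(table_ddl)
print(query)
```

Restricting the `WHERE` clause to partition columns is what allows Spectrum to skip irrelevant S3 prefixes instead of scanning the whole dataset.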

Have suggestions? Join our Slack channel to share feedback.
