Data Ingestion From On-Premises NFS Using AWS DataSync

Overview

AWS DataSync is a fully managed data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage systems and AWS storage services over the internet or AWS Direct Connect. In a data lake environment, DataSync can securely and automatically sync files from on-premises storage servers, such as NFS servers, to an S3-based data lake.

In this architecture, we walk you through how to use AWS DataSync and a DataSync agent to migrate data to a data lake in Amazon S3.

Architecture Component Walkthrough

  1. You create a Network File System (NFS) file server inside your data center.

  2. You deploy an AWS DataSync agent as a virtual machine in a VMware ESXi hypervisor environment. The agent needs read access to the NFS server.

  3. You configure AWS DataSync with the locations required for synchronization: a source location for the NFS server and a destination location for the S3 bucket.

  4. You create and then start an AWS DataSync task to synchronize files from NFS to S3.

  5. You use an AWS Glue crawler to catalog the S3 location that receives files from AWS DataSync.
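
Steps 3 to 5 can also be scripted. The following is a minimal sketch using the AWS SDK for Python (boto3), not a definitive implementation. It assumes the agent from step 2 is already activated and that you pass in clients created with boto3.client("datasync") and boto3.client("glue"). The function names, the "/raw" prefix, and the default task name are hypothetical and chosen only for illustration.

```python
# Sketch of steps 3-5. Each function expects an already-constructed boto3
# client, e.g. boto3.client("datasync") or boto3.client("glue").
# All function names and default values here are illustrative assumptions.


def create_locations(datasync, agent_arn, nfs_host, nfs_path, bucket_arn, role_arn):
    """Step 3: register the NFS source and the S3 destination with DataSync."""
    source = datasync.create_location_nfs(
        ServerHostname=nfs_host,                  # NFS server as seen by the agent
        Subdirectory=nfs_path,                    # exported path to read from
        OnPremConfig={"AgentArns": [agent_arn]},  # agent deployed in step 2
    )
    destination = datasync.create_location_s3(
        S3BucketArn=bucket_arn,
        Subdirectory="/raw",                      # hypothetical data lake prefix
        S3Config={"BucketAccessRoleArn": role_arn},
    )
    return source["LocationArn"], destination["LocationArn"]


def run_sync_task(datasync, source_arn, destination_arn, name="nfs-to-datalake"):
    """Step 4: create the task, then start one execution of it."""
    task = datasync.create_task(
        SourceLocationArn=source_arn,
        DestinationLocationArn=destination_arn,
        Name=name,
    )
    execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
    return task["TaskArn"], execution["TaskExecutionArn"]


def catalog_destination(glue, crawler_name, role_arn, database, s3_path):
    """Step 5: create and start a Glue crawler over the destination prefix."""
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,                            # IAM role the crawler assumes
        DatabaseName=database,                    # Glue Data Catalog database
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)
```

In practice you would likely wait for the task execution to finish (for example by polling describe_task_execution until it reports a terminal status) before starting the crawler, so that the catalog reflects the complete transfer.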
