Data Security and Access Control Using IAM
Building a data lake and making it the centralized repository for assets that were previously duplicated across many silos of smaller platforms and user groups requires implementing stringent, fine-grained security and access controls, along with methods to protect and manage the data assets. A data lake solution on AWS, with Amazon S3 as its core, provides a robust set of features and services to secure and protect your data against both internal and external threats, even in large, multi-tenant environments. Additionally, innovative Amazon S3 data management features enable automation and scaling of data lake storage, even when it contains billions of objects and petabytes of data assets.
Securing your data lake begins with implementing fine-grained controls that allow only authorized users to see, access, process, and modify particular assets, and that block unauthorized users from taking any actions that would compromise data confidentiality and security. A complicating factor is that access roles may evolve over the lifecycle of data assets. AWS provides a comprehensive and integrated set of security features to secure an Amazon S3 based data lake.
Please consider reviewing the AWS Well-Architected Framework Security Pillar whitepaper before continuing with these data lake security options.
You can manage access to your Amazon S3 resources using robust access policies. By default, all Amazon S3 resources (buckets, objects, and related subresources) are private: only the resource owner, the AWS account that created them, can access the resources. The resource owner can then grant access permissions to others by writing an access policy. Amazon S3 access policies are broadly categorized as resource-based policies and identity policies. Access policies attached to resources are referred to as resource-based policies; examples include bucket policies and access control lists (ACLs). Access policies attached to users, groups, or roles in an account are called identity policies. Typically, a combination of resource-based and identity policies is used to manage permissions to S3 buckets, objects, and other resources.
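As an illustration, the following sketch uses boto3 (the AWS SDK for Python) to attach one policy of each kind. The bucket name, account ID, and role name are hypothetical placeholders.

```python
import json

import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Resource-based policy: attached directly to the bucket itself.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAnalyticsRoleRead",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/analytics-role"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-datalake-bucket/*",
    }],
}
s3.put_bucket_policy(Bucket="example-datalake-bucket",
                     Policy=json.dumps(bucket_policy))

# Identity policy: attached to the IAM role that data lake users assume.
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-datalake-bucket",
                     "arn:aws:s3:::example-datalake-bucket/*"],
    }],
}
iam.put_role_policy(RoleName="analytics-role",
                    PolicyName="DataLakeRead",
                    PolicyDocument=json.dumps(identity_policy))
```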
For most data lake environments, we recommend using identity policies, so that permissions to access data assets can be dynamically tied to the end user performing data processing and analytics. Identity policies are attached to IAM principals (users, groups, and roles); when users access the data lake through roles, they receive temporary credentials that are valid only for a short period before they must be renewed. In most enterprise data lakes, end users access AWS through identity federation, which means you are linking data lake access policies to existing directory service groups.
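A minimal sketch of how those temporary credentials behave, assuming a hypothetical datalake-analyst role (a real federation setup would instead use assume_role_with_saml or IAM Identity Center, but the expiring-credential pattern is the same):

```python
import boto3

sts = boto3.client("sts")

# Assume the data lake role; the returned credentials expire after
# DurationSeconds and must then be renewed.
resp = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/datalake-analyst",
    RoleSessionName="analyst-session",
    DurationSeconds=3600,
)
creds = resp["Credentials"]

# Build an S3 client scoped to the temporary, role-based permissions.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(creds["Expiration"])  # when these credentials stop working
```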
Although IAM controls who can access data in your data lake, it's also important to ensure that anyone who inadvertently or maliciously manages to gain access to those data assets still can't read or use them. This is accomplished by using encryption keys to encrypt and decrypt data assets. Amazon S3 supports multiple encryption options. Additionally, AWS KMS helps scale and simplify management of encryption keys. AWS KMS gives you centralized control over the encryption keys used to protect your data assets. You can create, import, rotate, disable, delete, define usage policies for, and audit the use of encryption keys used to encrypt your data. AWS KMS is integrated with several other AWS services, making it easy to encrypt the data stored in those services with encryption keys. AWS KMS is also integrated with AWS CloudTrail, which gives you the ability to audit who used which keys, on which resources, and when.
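The key-lifecycle operations described above map directly onto the AWS KMS API. A brief boto3 sketch (the key description is a placeholder; key usage policies and CloudTrail auditing are configured separately):

```python
import boto3

kms = boto3.client("kms")

# Create a customer managed key for encrypting data lake assets.
key = kms.create_key(Description="Data lake asset encryption key")
key_id = key["KeyMetadata"]["KeyId"]

# Turn on automatic annual rotation of the key material.
kms.enable_key_rotation(KeyId=key_id)

# Later lifecycle operations look like:
# kms.disable_key(KeyId=key_id)
# kms.schedule_key_deletion(KeyId=key_id, PendingWindowInDays=30)
```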
Data lakes built on AWS primarily use two types of encryption, described below: server-side encryption (SSE) and client-side encryption (CSE).
Server-side encryption (SSE) provides data-at-rest encryption for data written to Amazon S3. With SSE, Amazon S3 encrypts user data assets at the object level, stores the encrypted objects, and then decrypts them as they are accessed and retrieved. The encryption keys for SSE can be managed by Amazon S3, or alternatively by the AWS Key Management Service (AWS KMS).
With client-side encryption (CSE), data objects are encrypted by the client before they are written to Amazon S3, and then decrypted by the client after they are retrieved; S3 never sees the plaintext or the keys. A brief sketch of both approaches follows.
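The sketch below shows an SSE-S3 and an SSE-KMS upload with boto3; for CSE, the client would encrypt the bytes itself (for example, with the AWS Encryption SDK) before calling put_object. The bucket, object keys, and KMS alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: Amazon S3 manages the encryption keys.
s3.put_object(
    Bucket="example-datalake-bucket",
    Key="raw/events.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="AES256",
)

# SSE-KMS: encryption keys are managed through AWS KMS.
s3.put_object(
    Bucket="example-datalake-bucket",
    Key="raw/events-kms.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/datalake-key",
)

# CSE: encrypt locally first, then upload the ciphertext as an ordinary object.
```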
A vital function of a centralized data lake is long-term asset protection, including protection against corruption, loss, and accidental or malicious overwrites, modifications, or deletions. Amazon S3 provides several features to enable the highest levels of data protection.
The first step in protecting your data is to ensure that it's durable: that is, protected against loss and corruption, independent of operational events. Amazon S3 is designed for 99.999999999% (11 9's) data durability, which is 4 to 6 orders of magnitude greater than most on-premises, single-site storage platforms can provide. Put another way, the durability of Amazon S3 is designed so that 10,000,000 data assets can be reliably stored for 10,000 years.
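That last claim is simple arithmetic on the durability figure; a back-of-the-envelope check:

```python
# Illustrative arithmetic only: treat 11 9's as the annual probability
# that any single object survives.
annual_durability = 0.99999999999   # 11 9's
objects = 10_000_000

expected_losses_per_year = objects * (1 - annual_durability)
print(expected_losses_per_year)      # ~0.0001 objects lost per year
print(1 / expected_losses_per_year)  # ~10,000 years per expected single loss
```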
Amazon S3 achieves this durability in all of its global Regions by using multiple Availability Zones (AZs). An AZ consists of one or more data centers, each with redundant power, networking, and connectivity, housed in separate facilities. AZs make it possible to operate production applications and analytics services that are more highly available, fault tolerant, and scalable than would be possible from a single data center. Data written to Amazon S3 is redundantly stored across a minimum of three AZs, and on multiple devices within each AZ, to achieve 11 9's durability. This means that even if an entire AZ fails, data is not lost and remains available from the remaining AZs.
Another key element of data protection is preventing accidental or malicious deletion or corruption. This is especially important in a large multi-tenant data lake, which will have a large number of users, many applications, and constant ad hoc data processing and application development. Amazon S3 provides object versioning to protect data assets against these scenarios. When enabled, Amazon S3 versioning keeps multiple copies of every data asset: when an asset is updated, prior versions are retained and can be retrieved at any time, and if an asset is deleted, its most recent version can still be retrieved. Versioning can be managed by policies to automate management at large scale, and can be combined with other Amazon S3 capabilities such as lifecycle management for long-term retention of versions on lower-cost storage tiers such as Amazon Glacier. Furthermore, we recommend the use of Multi-Factor Authentication (MFA) Delete, which requires a second layer of authentication via a one-time password before data asset versions can be permanently deleted.
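Versioning and MFA Delete are both enabled through the bucket's versioning configuration. A sketch with boto3, noting that MFA Delete can only be changed by the bucket owner's root credentials, and that the MFA device ARN, token, and bucket name below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning and MFA Delete in one call. The MFA parameter is the
# device serial (or ARN) followed by the current one-time password.
s3.put_bucket_versioning(
    Bucket="example-datalake-bucket",
    MFA="arn:aws:iam::111122223333:mfa/root-account-mfa-device 123456",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)

# Prior versions of an asset remain retrievable at any time:
versions = s3.list_object_versions(Bucket="example-datalake-bucket",
                                   Prefix="raw/events.json")
```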
Even though Amazon S3 is designed for 11 9's data durability within an AWS Region, many enterprise organizations have compliance and risk models that require them to replicate their data assets to a second, geographically distant location and build disaster recovery (DR) architectures there. Amazon S3 cross-region replication (CRR) is a native S3 feature that automatically and asynchronously copies data objects from one AWS Region to a destination Region of your choosing. The objects in the second Region are replicas of the source objects they were copied from, including their names, metadata, versions, and access controls. All data assets are encrypted in transit with SSL/TLS to ensure the highest levels of data security.
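Configuring CRR amounts to pointing a replication rule at a versioned destination bucket and giving S3 a role to replicate with. A minimal sketch, assuming versioning is already enabled on both buckets (bucket names and role ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="example-datalake-bucket",
    ReplicationConfiguration={
        # Role that grants Amazon S3 permission to replicate on your behalf.
        "Role": "arn:aws:iam::111122223333:role/s3-crr-role",
        "Rules": [{
            "ID": "ReplicateEverything",
            "Prefix": "",          # empty prefix = replicate all objects
            "Status": "Enabled",
            "Destination": {
                # Bucket in the second, geographically distant Region.
                "Bucket": "arn:aws:s3:::example-datalake-bucket-dr",
            },
        }],
    },
)
```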
All of these Amazon S3 features and capabilities, when combined with other AWS services like IAM, AWS KMS, Amazon Cognito, and Amazon API Gateway, ensure that a data lake using Amazon S3 as its core storage platform will be able to meet the most stringent data security, compliance, privacy, and protection requirements. Amazon S3 includes a broad range of certifications, including (as of Jan 2019) PCI-DSS, HIPAA/HITECH, FedRAMP, SEC Rule 17-a-4, FISMA, and the EU Data Protection Directive.
Data lake solutions are often multi-tenant, with multiple internal organizations, users, and applications all accessing and processing the same data assets. To manage who owns what data, it becomes very important to tag data assets with cost attribution information. Amazon S3 provides object tagging to assist with categorizing and managing Amazon S3 data assets. An object tag is a key-value pair. Each S3 object can have up to 10 object tags; each tag key can be up to 128 Unicode characters in length, and each tag value can be up to 256 Unicode characters in length. In addition to cost attribution, tagging can be used for data classification: for example, suppose an object contains personally identifiable information (PII). A user, administrator, or application that uses object tags might tag the object with the key-value pair PII=True or Classification=PII. Object tagging enables extended security controls and can be used in conjunction with IAM to enable fine-grained access permissions. For example, a particular data lake user can be granted permission to read only objects with specific tags (via tag-based IAM condition keys such as s3:ExistingObjectTag/<key>). Object tags can also be used to manage data lifecycle policies. Finally, object tags can be combined with Amazon CloudWatch metrics and AWS CloudTrail logs to display monitoring and action audit data filtered by specific data asset tags.
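To make the tagging discussion concrete, the sketch below tags an object as PII and grants a hypothetical analytics role read access only to objects carrying a matching tag, using the s3:ExistingObjectTag condition key. All bucket, key, and role names are placeholders.

```python
import json

import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Classify an object as containing PII.
s3.put_object_tagging(
    Bucket="example-datalake-bucket",
    Key="raw/customers.csv",
    Tagging={"TagSet": [{"Key": "Classification", "Value": "PII"}]},
)

# Identity policy: reads succeed only for objects tagged Department=Analytics.
tag_scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-datalake-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/Department": "Analytics"}
        },
    }],
}
iam.put_role_policy(
    RoleName="analytics-role",
    PolicyName="TagScopedRead",
    PolicyDocument=json.dumps(tag_scoped_policy),
)
```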