AWS Glue - Defining resources in CloudFormation for use with Athena
Introduction
Recently I have been working on a project that required a means of performing ETL operations on an S3 source bucket for the purpose of analysis. I decided to explore using a 'Crawler', provided as a service by AWS Glue, to facilitate this. There are some gaps in the currently available documentation, so hopefully this will prove useful as a reference, not only for myself but also for others who would like to provision these resources using Infrastructure as Code (IaC).
Technology choices
- CloudFormation
- Athena
- Glue
Note: This should follow a similar pattern in Terraform, so it should still be informative for those using Terraform in place of CloudFormation.
Let's create Glue resources needed for Athena
So we want to be able to query some S3 objects using Athena. To facilitate this, we need the following resources in Glue:
- Glue Database
- Glue Table
The Glue table can be created either manually or using a Crawler. We will use a Crawler in this example for the following reasons:
- Easier to manage for long-term applications
- Handles schema changes based on the S3 object data automatically
- Useful configuration options
Note: Manually creating tables has its place, for example to facilitate one-off requests or out-of-process querying. It may also be that you are consuming data from a source outside your control, in which case defining columns as JSON-encoded strings gives you better flexibility with queries; the sketch below illustrates this.
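As an aside, here is a minimal sketch of what a manually defined table might look like in CloudFormation. The resource name, table name and columns are hypothetical (loosely based on the sample data later in this post), and it assumes the 'GlueDatabase' resource defined below:
GlueTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref GlueDatabase
    TableInput:
      Name: "my_manual_table"
      TableType: "EXTERNAL_TABLE"
      StorageDescriptor:
        Location: "s3://<bucket name>/"
        InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
        OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        SerdeInfo:
          SerializationLibrary: "org.openx.data.jsonserde.JsonSerDe"
        Columns:
          - Name: "ticker_symbol"
            Type: "string"
          - Name: "sector"
            Type: "string"
          # A field whose shape you don't control can be declared as a plain
          # string and parsed at query time with Athena's json_extract().
          - Name: "details"
            Type: "string"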
Resource Definitions
The following shows a basic example of the required resources.
Glue database:
GlueDatabase:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Description: "Description of Database"
Name: "my-database"
Nothing complicated to note here. Just set your AWS account id as the 'CatalogId' and give the database a name.
see: CloudFormation - Glue Database
Glue Security Configuration:
GlueSecurityConfiguration:
Type: AWS::Glue::SecurityConfiguration
Properties:
EncryptionConfiguration:
CloudWatchEncryption:
CloudWatchEncryptionMode: 'SSE-KMS'
KmsKeyArn: <ARN of KMS key>
JobBookmarksEncryption:
JobBookmarksEncryptionMode: 'DISABLED'
S3Encryptions:
- S3EncryptionMode: 'DISABLED'
Name: "My Security Configuration"
The example above encrypts any CloudWatch logs generated by your Crawler, with S3 and Job Bookmark encryption set to 'DISABLED'.
S3 encryption allows you to add 'at rest' encryption to any data written to S3 by your Crawler:
S3Encryptions:
- S3EncryptionMode: 'SSE-KMS'
KmsKeyArn: <ARN of KMS key>
The options for 'S3EncryptionMode' are 'DISABLED', 'SSE-KMS' and 'SSE-S3'.
Job bookmark encryption can only use 'CSE-KMS' encryption or be set to 'DISABLED'.
JobBookmarksEncryption:
JobBookmarksEncryptionMode: 'CSE-KMS'
KmsKeyArn: <ARN of KMS key>
Job Bookmarks are used by Glue to track what data has already been processed between runs of ETL jobs.
see: CloudFormation - Glue Security Configuration, Working with Security Configurations, and Job Bookmarks.
Note: 'KmsKeyArn' is only required when setting an encryption mode that uses KMS ('SSE-KMS' or 'CSE-KMS').
Glue Crawler:
GlueCrawler:
Type: AWS::Glue::Crawler
Properties:
Configuration: "{\"Version\": 1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"},\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}}}"
CrawlerSecurityConfiguration: !Ref GlueSecurityConfiguration
DatabaseName: !Ref GlueDatabase
Description: "Crawler for objects stored in my S3 bucket"
Name: "MyCrawler"
RecrawlPolicy:
RecrawlBehavior: "CRAWL_EVERYTHING"
Role: !Ref CrawlerRole
Schedule:
ScheduleExpression: cron(0 3 * * ? *)
SchemaChangePolicy:
DeleteBehavior: "DEPRECATE_IN_DATABASE"
UpdateBehavior: "UPDATE_IN_DATABASE"
TablePrefix: "my_database_"
Targets:
S3Targets:
- Path: !Sub "s3://${AWS::StackName}-${Environment}-<bucket name>/"
Exclusions:
- "**/excluded-folder/**"
This one is slightly more involved, so let's go through the properties that aren't obvious at first glance.
Configuration:
We can use the configuration to change the behaviour of the Crawler with regard to how it handles changes in the data store that affect the schema or the partitions generated between runs.
We can provide JSON to set a number of options when declaring our resource in CloudFormation.
In the above example, we are setting the following options:
- Inherit from table
- Combine compatible schemas
When we set 'AddOrUpdateBehavior' to 'InheritFromTable', all partitions inherit any metadata associated with the table, such as:
- Classification
- Input format
- Output format
- SerDe information
- Schema
If any of these values change at the table level, the changes will be reflected onto the partitions.
Setting 'TableGroupingPolicy' to 'CombineCompatibleSchemas' ensures that if a schema has been altered at some point, such as a field being added or removed, the change doesn't result in the generation of a new table. Instead, the differing schemas are merged, resulting in a table combining the properties of each schema.
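As an aside, the escaped 'Configuration' string in the snippet above is hard to read. CloudFormation will happily accept the same JSON as a YAML block scalar, which avoids the escaping entirely:
Configuration: |
  {
    "Version": 1.0,
    "Grouping": {
      "TableGroupingPolicy": "CombineCompatibleSchemas"
    },
    "CrawlerOutput": {
      "Partitions": {
        "AddOrUpdateBehavior": "InheritFromTable"
      }
    }
  }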
See the following link which includes example configuration options: Crawler Configuration
RecrawlPolicy
This setting controls whether the crawler re-scans the entire dataset on every run, crawls only new folders/objects added since the previous run, or crawls only changes identified by S3 events (see the sketch after this list). The following options are allowed:
- CRAWL_EVERYTHING
- CRAWL_NEW_FOLDERS_ONLY
- CRAWL_EVENT_MODE
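As a sketch, switching the example crawler to incremental crawls would look like the following. Note that, to the best of my knowledge, Glue only allows the 'LOG' schema change behaviours when crawling new folders only:
RecrawlPolicy:
  RecrawlBehavior: "CRAWL_NEW_FOLDERS_ONLY"
SchemaChangePolicy:
  DeleteBehavior: "LOG"   # incremental crawls require LOG behaviours
  UpdateBehavior: "LOG"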
SchemaChangePolicy
The following options control what actions take place depending on changes to the source data:
Delete Behaviour
The following options are available for delete behaviour:
- Log
- If a table or a partition is no longer detected, only log that it was no longer found. Do not delete it.
- Delete From Database
- Delete any tables or partitions that are no longer detected
- Deprecate In Database
- Add a 'DEPRECATED' property to the table and mark it with a timestamp
Update Behaviour
The following options are available for update behaviour:
- Log
- If a change to a table or a partition is detected, only log that a change was detected. Do not update it.
- Update in Database
- Update the table based on changes detected to the table or partition
For information on this and the other options, see: Glue Crawler - CloudFormation
Table Prefix
A small thing here: be careful not to include '-' within the table name prefix. This caused issues for me when trying to perform partition fixes. I didn't dig into it too much, but if you encounter something similar, this may be worth checking.
We understand the main resources, but what else do we need?
Besides the main Glue resources already mentioned, there are a number of supporting resources.
Creating the Crawler Role and Policies
The following shows the permissions assigned against the role for our Crawler:
CrawlerRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service:
- "glue.amazonaws.com"
Action:
- "sts:AssumeRole"
ManagedPolicyArns:
- "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
CrawlerPolicy:
Type: AWS::IAM::Policy
Properties:
Roles:
- !Ref CrawlerRole
PolicyName: glue_crawler_bucket_access
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Sid: "AllowS3Access"
Action:
- "s3:GetObject"
- "s3:ListBucket"
Resource:
- !Sub "arn:aws:s3:::${stack name}-${Environment}-<bucket name>/*"
- Effect: Allow
Sid: "GlueEncryptionKeyPermissions"
Action:
- "kms:Encrypt"
Resource:
- !GetAtt SecurityConfigurationEncryptionKey.Arn
- Effect: Allow
Sid: "LogPermissions"
Action:
- "logs:AssociateKmsKey"
Resource:
- "*"
Here, we have taken advantage of the managed policy 'AWSGlueServiceRole' to do a lot of the heavy lifting. You can view the permissions granted by this policy here: AWSGlueServiceRole
Besides these, we specifically add permissions for the following resource actions:
- Read permissions on our S3 source bucket
- Permission to encrypt using an encryption key when writing to the Crawler log group
- Permission to associate our KMS key with the target log group (a tightened variant of this statement is sketched below)
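If you would rather not grant 'logs:AssociateKmsKey' on all log groups, the statement can be scoped down. A sketch, assuming the '/aws-glue/' log group prefix that Glue uses by default:
- Effect: Allow
  Sid: "LogPermissions"
  Action:
    - "logs:AssociateKmsKey"
  Resource:
    # Assumes crawler log groups live under the default /aws-glue/ prefix
    - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws-glue/*"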
Creating the KMS Key
An example of the KMS key definition is as follows:
SecurityConfigurationEncryptionKey:
Type: AWS::KMS::Key
Properties:
Description: "Symmetric key used to encrypt AWS Glue CloudWatch logs at rest"
EnableKeyRotation: true
KeySpec: SYMMETRIC_DEFAULT
KeyPolicy:
Version: "2012-10-17"
Statement:
- Sid: "EnableRootAccess"
Effect: Allow
Principal:
AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
Action:
- "kms:*"
Resource:
- "*"
- Sid: "AllowLogsKMSAccess"
Effect: Allow
Principal:
Service: "logs.${AWS::Region}.amazonaws.com"
Action:
- "kms:Decrypt*"
- "kms:Encrypt*"
- "kms:ReEncrypt*"
- "kms:GenerateDataKey*"
- "kms:Describe*"
Resource:
- "*"
Condition:
ForAnyValue:ArnEquals:
"kms:EncryptionContext:aws:logs:arn": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
- Sid: "AllowCrawlerKMSAccess"
Effect: Allow
Principal:
AWS: !GetAtt CrawlerRole.Arn
Action:
- "kms:Decrypt*"
- "kms:Encrypt*"
- "kms:ReEncrypt*"
- "kms:GenerateDataKey*"
- "kms:Describe*"
Resource:
- "*"
Condition:
ForAnyValue:ArnEquals:
"kms:EncryptionContext:aws:logs:arn": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
See the following regarding Glue and encryption: Glue - Data Protection
What does that look like altogether?
We now have all the required resources to get us up and running. The full template should look something like the following:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: "A template to create all the Analysis infrastructure."
Parameters:
Environment:
Description: "The environment type"
Type: "String"
AllowedValues:
- "dev"
- "build"
- "staging"
- "integration"
- "production"
ConstraintDescription: must be dev, build, staging, integration or production
Resources:
GlueDatabase:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Description: "Description of Database"
Name: "my-database"
GlueCrawler:
Type: AWS::Glue::Crawler
Properties:
Configuration: "{\"Version\": 1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"},\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}}}"
CrawlerSecurityConfiguration: !Ref GlueSecurityConfiguration
DatabaseName: !Ref GlueDatabase
Description: "Crawler for objects stored in my S3 bucket"
Name: "MyCrawler"
RecrawlPolicy:
RecrawlBehavior: "CRAWL_EVERYTHING"
Role: !Ref CrawlerRole
Schedule:
ScheduleExpression: cron(0 3 * * ? *)
SchemaChangePolicy:
DeleteBehavior: "DEPRECATE_IN_DATABASE"
UpdateBehavior: "UPDATE_IN_DATABASE"
TablePrefix: "my_database_"
Targets:
S3Targets:
- Path: !Sub "s3://${AWS::StackName}-${Environment}-<bucket name>/"
Exclusions:
- "**/excluded-folder/**"
GlueSecurityConfiguration:
Type: AWS::Glue::SecurityConfiguration
Properties:
EncryptionConfiguration:
CloudWatchEncryption:
CloudWatchEncryptionMode: 'SSE-KMS'
KmsKeyArn: !GetAtt SecurityConfigurationEncryptionKey.Arn
JobBookmarksEncryption:
JobBookmarksEncryptionMode: 'DISABLED'
S3Encryptions:
- S3EncryptionMode: 'DISABLED'
Name: "My Security Configuration"
CrawlerRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service:
- "glue.amazonaws.com"
Action:
- "sts:AssumeRole"
ManagedPolicyArns:
- "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
CrawlerPolicy:
Type: AWS::IAM::Policy
Properties:
Roles:
- !Ref CrawlerRole
PolicyName: glue_crawler_bucket_access
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Sid: "AllowS3Access"
Action:
- "s3:GetObject"
- "s3:ListBucket"
Resource:
- !Sub "arn:aws:s3:::${stack name}-${Environment}-<bucket name>/*"
- Effect: Allow
Sid: "GlueEncryptionKeyPermissions"
Action:
- "kms:Encrypt"
Resource:
- !GetAtt SecurityConfigurationEncryptionKey.Arn
- Effect: Allow
Sid: "LogPermissions"
Action:
- "logs:AssociateKmsKey"
Resource:
- "*"
SecurityConfigurationEncryptionKey:
Type: AWS::KMS::Key
Properties:
Description: "Symmetric key used to encrypt AWS Glue CloudWatch logs at rest"
EnableKeyRotation: true
KeySpec: SYMMETRIC_DEFAULT
KeyPolicy:
Version: "2012-10-17"
Statement:
- Sid: "EnableRootAccess"
Effect: Allow
Principal:
AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
Action:
- "kms:*"
Resource:
- "*"
- Sid: "AllowLogsKMSAccess"
Effect: Allow
Principal:
Service: "logs.${AWS::Region}.amazonaws.com"
Action:
- "kms:Decrypt*"
- "kms:Encrypt*"
- "kms:ReEncrypt*"
- "kms:GenerateDataKey*"
- "kms:Describe*"
Resource:
- "*"
Condition:
ForAnyValue:ArnEquals:
"kms:EncryptionContext:aws:logs:arn": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
- Sid: "AllowCrawlerKMSAccess"
Effect: Allow
Principal:
AWS: !GetAtt CrawlerRole.Arn
Action:
- "kms:Decrypt*"
- "kms:Encrypt*"
- "kms:ReEncrypt*"
- "kms:GenerateDataKey*"
- "kms:Describe*"
Resource:
- "*"
Condition:
ForAnyValue:ArnEquals:
"kms:EncryptionContext:aws:logs:arn": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
Create the stack in CloudFormation, and we can start looking at providing some data for Athena to work with.
What do we see in Athena?
Let's provide some example data to store in our bucket:
{"CHANGE":0.68,"PRICE":99.68,"TICKER_SYMBOL":"NFLX","SECTOR":"TECHNOLOGY"}
{"CHANGE":-1.73,"PRICE":47.02,"TICKER_SYMBOL":"QXZ","SECTOR":"RETAIL"}
{"CHANGE":-7.73,"PRICE":88.12,"TICKER_SYMBOL":"ALY","SECTOR":"ENERGY"}
{"CHANGE":0.83,"PRICE":28.89,"TICKER_SYMBOL":"CRM","SECTOR":"HEALTHCARE"}
{"CHANGE":-1.3,"PRICE":99.65,"TICKER_SYMBOL":"NFS","SECTOR":"ENERGY"}
{"CHANGE":-12.2,"PRICE":726.99,"TICKER_SYMBOL":"AMZN","SECTOR":"TECHNOLOGY"}
{"CHANGE":3.2,"PRICE":89.71,"TICKER_SYMBOL":"QXZ","SECTOR":"HEALTHCARE"}
{"CHANGE":-0.08,"PRICE":4.74,"TICKER_SYMBOL":"HJK","SECTOR":"TECHNOLOGY"}
{"CHANGE":0.38,"PRICE":45.94,"TICKER_SYMBOL":"WFC","SECTOR":"FINANCIAL"}
{"CHANGE":-2.23,"PRICE":114.8,"TICKER_SYMBOL":"IOP","SECTOR":"TECHNOLOGY"}
{"CHANGE":0.74,"PRICE":51.09,"TICKER_SYMBOL":"RFV","SECTOR":"FINANCIAL"}
{"CHANGE":8.89,"PRICE":199.89,"TICKER_SYMBOL":"TBV","SECTOR":"HEALTHCARE"}
Save the above as a .txt file and upload it into your bucket. I have uploaded it under a specific path in order to take advantage of partitioning:
2022/07/22/10
This will create partitions on the year, month, day and hour, which will greatly improve query performance when generating large volumes of data. For example, we were working with Firehose to funnel large amounts of data into S3 before it was consumed by Athena; if we hadn't taken advantage of partitioning, queries would have been extremely slow. If you work with a service like Firehose this will all be taken care of by default, but I felt it was worth showing here as it is important when using Athena in a production environment.
You can look into this in more detail here:
Now let's navigate in the console to AWS Glue > Crawlers:
We can see the Crawler we created is now waiting to be run. Select the Crawler and click the 'Run Crawler' button. If the permissions are set up correctly, you should see it complete and the 'Tables Added' column show a value of 1.
If we go now to the Athena service we should see the option to select our table:
Let's just select everything from our table with a simple 'SELECT * FROM' query (there are only a few rows).
We can see here that the crawler has correctly determined the columns and included the partitions inferred from the file path in the bucket. When running on a schedule, any changes to the source data will be handled and reflected automatically; we don't have to do anything else!
Conclusion
Hopefully the above will help people who are thinking of working with Glue or Athena while making use of CloudFormation, and will fill in the gaps not covered in the current official documentation. The power of Glue to consume source data from S3 and perform ETL operations, ready for us to consume via Athena, is fantastic, so if you have the opportunity, make use of it.
Thanks for reading.