Running SQL on DynamoDB through Athena
Introduction
Athena is a service that allows you to run SQL queries on different data sources such as Amazon S3. However, Athena does not support DynamoDB as a data source. To work around this, we will use a few AWS Glue services (crawlers, a Glue job, and the Data Catalog) and store the data in Parquet format in S3.
DynamoDB Crawler
A Glue crawler is a tool that populates the AWS Glue Data Catalog. A Glue Data Catalog database can contain multiple tables, and each table holds the schema and an index of where and how data is stored in a particular data source.
- Configure a Glue Data Catalog database
CrawlerDatabase:
  Type: AWS::Glue::Database
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseInput:
      Name: DynamoDBCrawler
      Description: Database for crawler
The DatabaseInput field defines the database where the DynamoDB schema will be stored. CatalogId is the AWS account ID.
- Configure a crawler for DynamoDB
The role required by the crawler can use the AWSGlueServiceRole managed policy ARN with glue.amazonaws.com as the principal. It also needs the following DynamoDB permissions (a sketch of such a role follows the list):
- GetItem
- Scan
- DescribeTable
- Query
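A minimal sketch of such a role in CloudFormation, assuming the permissions above; the role and policy names are placeholders, and the wildcard resource should be scoped down to your table's ARN:
CrawlerRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: glue.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
    Policies:
      - PolicyName: DynamoDBReadAccess
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:Scan
                - dynamodb:DescribeTable
                - dynamodb:Query
              Resource: '*' # scope this down to the DynamoDB table ARN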
DynamoDBCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: dynamoDBCrawler
    DatabaseName: # Name of the Glue Data Catalog database
    Role: # Role with DynamoDB permissions and the Glue service role
    Targets:
      DynamoDBTargets:
        - Path: # Name of the DynamoDB table
Defining the target of a crawler tells it which data source to crawl. When the crawler runs, it creates a Data Catalog table with the same name as the DynamoDB table.
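The crawler can be started from the Glue console or, assuming the crawler name used above, with the AWS CLI:
aws glue start-crawler --name dynamoDBCrawler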
Glue Job to write data in Parquet format
To deploy a Glue job, it is imperative to write a Python script that tells it what to do. Refer here: AWS Glue Tutorial. Basically, the script has three nodes: a dynamic frame node that reads the Data Catalog table named after the DynamoDB table, a second node to apply mapping, and a third node to write the data in Parquet format. Learn more about the Parquet format here: Parquet.
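Here is a minimal sketch of what that script could look like; the catalog table name, output path, and field mappings are placeholders to replace with your own values:
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Node 1: read the Data Catalog table created by the DynamoDB crawler
source = glueContext.create_dynamic_frame.from_catalog(
    database="DynamoDBCrawler",      # Data Catalog database defined above
    table_name="my_dynamodb_table",  # placeholder: table created by the crawler
)

# Node 2: map source fields to output fields (adjust to your schema)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "string"),
        ("name", "string", "name", "string"),
    ],
)

# Node 3: write the result to S3 in Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-parquet-bucket/output/"},  # placeholder bucket
    format="parquet",
)

job.commit()
This Python script needs to be stored in an S3 bucket. This can be done through the AWS console or, as I prefer, using the AWS CLI: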
aws s3 cp script.py s3://bucket-name
After uploading the script to the bucket, you can reference it in the Glue job:
GlueJob:
  Type: AWS::Glue::Job
  Properties:
    Name: # Job name
    GlueVersion: '3.0'
    Command:
      Name: glueetl
      ScriptLocation: s3://script-location-glue/script.py
    Role: # Role with S3 read and write access and the Glue service role
The role should have the AWSGlueServiceRole managed policy and S3 read and write access.
S3 Crawler
Once the job is complete, we can use another crawler to define the schema of the Parquet data in a Glue Data Catalog table. This table can be part of the same Data Catalog database defined above for DynamoDB.
S3Crawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: s3Crawler
    DatabaseName: # Name of the Glue Data Catalog database
    Role: # Role with the Glue service role and S3 access
    Targets:
      S3Targets:
        - Path: # Path of the S3 bucket with the Parquet output
Step Functions to run it all
Using Glue jobs with Step Functions is pretty straightforward. AWS Step Functions will wait for the Glue job to complete when you add ‘.sync’ to the Resource field.
RunGlueJob:
  Type: Task
  Parameters:
    JobName: !Ref GlueJob
  Resource: arn:aws:states:::glue:startJobRun.sync
  Next: UploadDataInParquetFormat
However, running crawlers is not as straightforward. It requires adding a GetCrawler request and using a Choice state to check whether the crawler is still running. Crawler tasks are run through the aws-sdk service integration:
arn:aws:states:::aws-sdk:glue:getCrawler
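Putting this together, a sketch of the polling loop for the S3 crawler could look like the following; the state names are placeholders, and the crawler is started through the same aws-sdk integration:
StartS3Crawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:startCrawler
  Parameters:
    Name: s3Crawler
  Next: GetCrawlerStatus
GetCrawlerStatus:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:getCrawler
  Parameters:
    Name: s3Crawler
  Next: IsCrawlerReady
IsCrawlerReady:
  Type: Choice
  Choices:
    - Variable: $.Crawler.State
      StringEquals: READY
      Next: QueriesReady
  Default: WaitForCrawler
WaitForCrawler:
  Type: Wait
  Seconds: 30
  Next: GetCrawlerStatus
QueriesReady:
  Type: Succeed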
The Choice state can check the variable $.Crawler.State for READY to see if the crawler has stopped running, and adding a Wait state between checks can help reduce unnecessary requests. After a workflow execution, you are good to run queries from the Athena console after selecting the table created from the S3 bucket, or you can use the aws-sdk for Athena to run queries.
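For example, assuming the S3 crawler created a table named my_parquet_table, a query could also be started from the CLI; the result location bucket is a placeholder:
aws athena start-query-execution \
  --query-string "SELECT * FROM my_parquet_table LIMIT 10" \
  --query-execution-context Database=DynamoDBCrawler \
  --result-configuration OutputLocation=s3://my-athena-results-bucket/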