GitHunt
EP

epomatti/aws-glue-athena

Glue ETL crawler and jobs with Athena queries

AWS Glue + Athena

Glue example extraction from RDS and query with Athena.

Create the infrastructure

Create the Terraform variables file:

cp config/template.tfvars .auto.tfvars

Apply Terraform:

terraform init
terraform apply

Once ready, enter the Glue Studio and test the connector to the RDS database.

https://mvnrepository.com/artifact/org.postgresql/postgresql
policies https://docs.aws.amazon.com/glue/latest/dg/console-connections-jdbc-drivers.html

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-jdbc-home.html
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html#UsingWithRDS.SSL.CertificatesAllRegions

Glue

Extra Configuration

  • VPC Gateway Endpoints
  • Security Group Self-reference

https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html
https://repost.aws/knowledge-center/glue-s3-endpoint-validation-failed
https://repost.aws/knowledge-center/glue-test-connection-failed

Initiate the Database

Connect to the database using the EC2 jump server:

psql "postgresql://demouser:p4ssw0rd@<host>:5432/demodb"

Apply the prepare-database.sql script to generate data.

Glue ETL Job

First, run the crawler to feed the database catalog.

aws glue start-crawler --name 'rds-aurora-crawler'

Connect to the AWS Glue Studio and go to the Jobs blade. Create a new Job:

  • Source: AWS Glue Database Catalog
  • Target: S3

Enter JSON for the output format, and fill it in the required information.

Save the job. File auto-generated-script-example.py is reference of what Glue will generate.

Run the ETL job and check the output files in S3:

{"favoriteFood":"Pasta","sex":"M","id":2,"birthday":"1998-03-15","name":"John"}

Athena

Athena needs an S3 data source, so querying the existing datasource is not possible since it runs over Aurora.

Use Glue again to prepare an Athena table with an S3 source:

  1. Create a new database on Glue.
  2. Create a new crawler that will read the S3 JSON data and feed the new Glue database.
  3. Run the crawler.
  4. Go to Athena and add a query result location on S3.

The table should be automatically created and you should now be able to run a query against the S3 data:

SELECT* FROM "crawler-s3"."transform" WHERE favoritefood='Lasagna';

Done ๐Ÿ† Athena will run your queries over S3 using SQL:


Clean-up

Delete the manually created Glue Jobs, Crawlers, Database, Table, S3, CloudWatch Logs.

Run terraform destroy -auto-approve to remove the infrastructure.

Languages

HCL93.6%Python5.8%Shell0.6%

Contributors

MIT License
Created August 14, 2022
Updated November 6, 2025