AWS Glue Troubleshooting

Notes: Read each line of problem statement

S3-Glue-Redshift

Setting Up Redshift cluster and connection in AWS Glue

create cluster → Node type ( dc2.large) → No of nodes (1)

Database Configurations- Manually add admin password → awsuser , Awsuser1

Cluster permissions → new IAM Role with Any S3 Policy, keep all default and

create cluster → see “available” → click on Actions → turn on public access.

In cluster security group, add inbound rule to allow custom TCP 5439 port ipv4 address

Creating a VPC Endpoint for s3, glue and redshift integration.

VPC → endpoint ( left side) → Name → s3-glue-redshift-endpoint

Service category → AWS services

Service Name - com.amazonaws.us-east-1.s3 - Gateway

VPC → select default, Route table → select default table, leave all as default , create endpoint.

Navigate to the AWS Data Catalog

databases → add database → name: employees_db

Connections → new connection → redshift → select redshift connection with user name and password

Connection Name: redshift_connection → save → click connections → actions → test connection

Creating Crawlers in AWS Glue

S3 : crawlers → create crawler → name → Data Store - S3 → include path (same exact path as S3) → existing IAM role → select db : employees_db → Run crawler

target Redshift Table:

Note: Create Redshift output tables before creating the crawlers for redshift tables, kindle check in the tasks for the output schema

crawlers → create crawler → name → Data Store - Jdbc , Amazon redshift → Jdbc Connection: redshift_connection → include path (dev/public/employees_cleaned_data) → existing IAM role → select db : employees_db → Run crawler

create job with s3_glue_redshift_job

Source : Input → retrieve data from the emp data in aws glue data catalog

Transform: filter → Schema verification

Output: Load data to Glue Catalog in employees_cleaned_data

use this this data for data transformation

use the cleaned data node for the next task.