A resilient AWS ingestion pipeline with Glue ETL

Design goals

Validate at the door. Don't ingest schema-violating data.
Never block the pipeline on a bad record. Dead-letter and move on.
Event-driven, not cron-driven. Files trigger their own processing.
Idempotent. Re-runs don't re-process work that already landed.
Athena-queryable. Once written, the data is immediately discoverable and queryable.

The pipeline, end to end

Five logical stages — each decoupled by S3 or SQS so failures stay local:

Raw upload lands in S3: raw-impure.
Lambda validator fires on object-create. It reads the matching schema from the Glue Schema Registry. Pass → S3: raw. Fail → S3: deadletter.
Glue Crawler watches new objects via SQS and writes one or more Glue Catalog tables describing the raw data.
Glue ETL job runs daily with bookmarks enabled. It reads the raw catalog tables, applies one transformation per table family, and writes Parquet back to S3 — flat data direct, nested data via a crawler that discovers schema after the write.
Athena reads the Parquet catalog tables. SQL-like queries; pay-per-scan.

upload → S3: raw-impure
          │  (object-create trigger)
          ▼
       Lambda validator  ← Glue Schema Registry (get schema)
          │
   ┌──────┴──────┐
   ▼             ▼
S3: deadletter  S3: raw
                 │  (object-create → SQS)
                 ▼
            Glue Crawler ── writes ──▶ Glue Catalog: raw_*
                                          │
                                          ▼
                                     Glue ETL job (daily, bookmark)
                                          │
                          ┌───────────────┴───────────────┐
                          ▼                               ▼
                  S3: parquet_data_1            S3: parquet_data_2 (nested)
                  catalog: parquet_1                     │
                                                  (object-create → SQS)
                                                         ▼
                                                  Glue Crawler
                                                  catalog: parquet_2
                                                         │
                                                         ▼
                                                       Athena

Why each piece exists

Schema Registry + Lambda validator

The two cheapest production incidents to prevent are "we ingested rows with the wrong columns" and "we ingested data that crashed the ETL job halfway through." A validator Lambda backed by the Schema Registry catches both. Pass-through latency is single-digit milliseconds. Bad records go to a separate bucket where they can be reviewed, replayed, or simply deleted.

SQS in front of the crawler

Glue Crawlers don't scale by being invoked harder. Putting an SQS queue in front means you can fan-in a burst of new objects and the crawler processes them in its own time, without a thundering-herd of StartCrawler calls. Bursts are absorbed; the rest of the pipeline keeps moving.

ETL bookmarks

Glue's bookmark feature is the single line of config that turns the daily job from "rebuild everything" into "process what arrived since last run." That's the difference between a job that gets cheaper as the data grows and one that gets exponentially more expensive.

Two paths after transform — flat vs nested

Flat data writes Parquet and updates the catalog in one step — schema is already known. Nested data writes Parquet and re-runs a crawler over it, because Glue's catalog writer doesn't handle arbitrarily nested schemas well. Routing this at write time is much cleaner than discovering it later.

What it actually moved

Cost down ~66%. Mostly from removing perpetually-running EMR clusters and replacing them with on-demand Glue ETL.
Athena query latency 7s → 1–2s by structuring the catalog correctly: file-size-aware partitions, column projection on time-stamped fields, Parquet over JSON.
Operational quiet. The pipeline runs unattended. Bad inputs route themselves out of the way. Re-running a day is a button.

What I'd do differently next time

Make the dead-letter bucket noisy. Surface dead-letter counts as a metric — silence here is dangerous, not good.
Skip the crawler for the parquet output. If the ETL job already knows the output schema, write it directly to the catalog. The crawler is a stop-gap for nested data; don't pay it for flat data.
Compaction on a schedule. Small Parquet files hurt Athena's read cost. A weekly compaction job earns its keep.

The shape of the lesson: a good ingestion pipeline is mostly about decoupling — every stage hands off via S3 or a queue, and any one stage can fail without dragging the rest down with it. The transformations are easy. The routing is the system.