
Notes · AWS · Data Pipeline

A resilient AWS ingestion pipeline with Glue ETL

Most data-pipeline pain isn't in the transformation; it's in the bad inputs you never validated and the partial runs you never re-ran. This is a walkthrough of the pipeline I built at Juego: event-driven, dead-lettered, catalog-aware, and roughly 66% cheaper than what came before.

Design goals

The pipeline, end to end

Five logical stages — each decoupled by S3 or SQS so failures stay local:

  1. Raw upload lands in S3: raw-impure.
  2. Lambda validator fires on object-create. It reads the matching schema from the Glue Schema Registry. Pass → S3: raw. Fail → S3: deadletter.
  3. Glue Crawler watches new objects via SQS and writes one or more Glue Catalog tables describing the raw data.
  4. Glue ETL job runs daily with bookmarks enabled. It reads the raw catalog tables, applies one transformation per table family, and writes Parquet back to S3. Flat data is written straight to its catalog table; nested data goes through a second crawler that discovers the schema after the write.
  5. Athena reads the Parquet catalog tables. Standard SQL queries, billed by the amount of data scanned.
upload → S3: raw-impure
          │  (object-create trigger)
          ▼
       Lambda validator  ← Glue Schema Registry (get schema)
          │
   ┌──────┴──────┐
   ▼             ▼
S3: deadletter  S3: raw
                 │  (object-create → SQS)
                 ▼
            Glue Crawler ── writes ──▶ Glue Catalog: raw_*
                                          │
                                          ▼
                                     Glue ETL job (daily, bookmark)
                                          │
                          ┌───────────────┴───────────────┐
                          ▼                               ▼
                  S3: parquet_data_1            S3: parquet_data_2 (nested)
                  catalog: parquet_1                     │
                                                  (object-create → SQS)
                                                         ▼
                                                  Glue Crawler
                                                  catalog: parquet_2
                                                         │
                                                         ▼
                                                       Athena
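
Stage 5 is where the per-scan billing becomes visible, so it can help to read the scanned-bytes figure back after each query. The sketch below is an illustration only, not the code used at Juego: the database name, output location, and the price of 5 USD per TB are assumptions (check current pricing for your region), and the cost helper ignores Athena's per-query minimum. The client is passed in so the function can be exercised without AWS access.

```python
import time

PRICE_PER_TB_USD = 5.0  # assumption: verify against current regional pricing


def run_query(athena, sql, database, output_location,
              poll_seconds=1.0, sleep=time.sleep):
    """Run one Athena query and return (query_id, bytes_scanned)."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")
    return query_id, execution["Statistics"]["DataScannedInBytes"]


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Rough scan cost; ignores the per-query minimum charge."""
    return bytes_scanned / 1024**4 * price_per_tb
```

With a real client this would be called as `run_query(boto3.client("athena"), "SELECT ...", "ingest", "s3://query-results/")`, where both names are placeholders. Parquet matters here because Athena only scans the columns a query touches.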

Why each piece exists

Schema Registry + Lambda validator

The two cheapest production incidents to prevent are "we ingested rows with the wrong columns" and "we ingested data that crashed the ETL job halfway through." A validator Lambda backed by the Schema Registry catches both. Pass-through latency is single-digit milliseconds. Bad records go to a separate bucket where they can be reviewed, replayed, or simply deleted.
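
A minimal sketch of the validator's routing decision might look like the following. It rests on several assumptions that are not in the original pipeline description: uploads are newline-delimited JSON, the registry holds Avro-style record schemas, the schema name is the first segment of the object key, and the registry and bucket names are placeholders. The type check covers a handful of primitive types only and is not a full Avro validator.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical names; the post does not give the real ones.
REGISTRY = "ingest-registry"
RAW_BUCKET = "raw"
DEADLETTER_BUCKET = "deadletter"

# A few Avro primitive types mapped to Python types (deliberately incomplete).
AVRO_TO_PY = {"string": str, "int": int, "long": int,
              "double": (int, float), "boolean": bool}


def load_schema(glue, registry, name):
    """Fetch the latest version of a schema from the Glue Schema Registry."""
    resp = glue.get_schema_version(
        SchemaId={"RegistryName": registry, "SchemaName": name},
        SchemaVersionNumber={"LatestVersion": True},
    )
    return json.loads(resp["SchemaDefinition"])


def is_valid(record, schema):
    """True if every declared field is present with the expected primitive type."""
    for field in schema["fields"]:
        ftype = field["type"]
        expected = AVRO_TO_PY.get(ftype) if isinstance(ftype, str) else None
        value = record.get(field["name"])
        if expected is None:
            return False  # unions and nested records are out of scope here
        if isinstance(value, bool) and expected is not bool:
            return False  # bool is a subclass of int in Python
        if not isinstance(value, expected):
            return False
    return True


def route(s3, glue, bucket, key):
    """Copy one uploaded object to the raw or dead-letter bucket; return the target."""
    schema = load_schema(glue, REGISTRY, key.split("/")[0])
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    try:
        records = [json.loads(line) for line in body.splitlines() if line.strip()]
        # An empty file is treated as invalid in this sketch.
        ok = bool(records) and all(is_valid(r, schema) for r in records)
    except (ValueError, AttributeError):
        ok = False  # not JSON, or a line that is not a JSON object
    target = RAW_BUCKET if ok else DEADLETTER_BUCKET
    s3.copy_object(Bucket=target, Key=key,
                   CopySource={"Bucket": bucket, "Key": key})
    return target


def handler(event, context):
    import boto3  # imported here so the functions above stay testable offline

    s3, glue = boto3.client("s3"), boto3.client("glue")
    for rec in event["Records"]:
        key = unquote_plus(rec["s3"]["object"]["key"])  # event keys are URL-encoded
        route(s3, glue, rec["s3"]["bucket"]["name"], key)
```

Because the object keeps its key when it is copied, replaying a dead-lettered file after a fix is a matter of copying it back into the raw-impure bucket so the trigger fires again.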

SQS in front of the crawler

Glue Crawlers don't scale by being invoked harder: a crawler runs one instance at a time, and StartCrawler calls made while it is already running are rejected. Putting an SQS queue in front lets you fan in a burst of new objects so the crawler works through them at its own pace, without a thundering herd of StartCrawler calls. Bursts are absorbed; the rest of the pipeline keeps moving.
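
One way to wire this up with boto3 is sketched below, assuming the crawler uses Glue's S3 event mode. The crawler name, role, database, path, and queue ARN are all placeholders supplied by the caller, and the `raw_` table prefix is a guess based on the table names in the diagram. The two builder functions only assemble request arguments, so they can be inspected without touching AWS.

```python
def queue_notification(queue_arn):
    """S3 notification config that sends every object-create event to one queue."""
    return {
        "QueueConfigurations": [
            {"QueueArn": queue_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    }


def crawler_request(name, role_arn, database, s3_path, queue_arn):
    """Arguments for glue.create_crawler with S3 event mode enabled."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "EventQueueArn": queue_arn}]
        },
        # Event mode: a run visits only the objects named in queued events
        # instead of listing the whole prefix again.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        "TablePrefix": "raw_",  # assumption, inferred from the diagram
    }


def wire_up(s3, glue, bucket, name, role_arn, database, s3_path, queue_arn):
    """Point the bucket's events at the queue, then create the crawler."""
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=queue_notification(queue_arn),
    )
    glue.create_crawler(
        **crawler_request(name, role_arn, database, s3_path, queue_arn)
    )
```

Two things this sketch leaves out: the queue's access policy has to allow S3 to send messages to it, and the crawler's role needs permission to read and delete messages from the queue.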

ETL bookmarks

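Glue's job bookmark is the small piece of config that turns the daily job from "rebuild everything" into "process what arrived since the last run." It is enabled with the job parameter --job-bookmark-option job-bookmark-enable; the script also has to pass a transformation_ctx to each source and call job.commit() at the end, or no bookmark state is saved. That's the difference between a job whose cost tracks each day's new data and one whose cost per run grows with the entire history.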
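
For reference, a bookmark-aware Glue script has roughly the shape below. This is a skeleton under assumptions, not the production job: the database name, output root, and table list are placeholders, and the per-table transformation is omitted. The Glue and Spark modules are imported inside the function because they only exist in the Glue runtime, and the job must be started with `--job-bookmark-option job-bookmark-enable` for any of this to take effect.

```python
import sys

# Hypothetical names; the post does not give the real ones.
DATABASE = "ingest"
OUTPUT_ROOT = "s3://parquet-data-1"


def ctx(step, table):
    """Stable transformation_ctx; Glue keys bookmark state on this string."""
    return f"{step}_{table}"


def run_job(argv, tables):
    # These modules exist only inside the Glue runtime, hence the local imports.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_ctx = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_ctx)
    job.init(args["JOB_NAME"], args)  # loads bookmark state from the last run

    for table in tables:
        frame = glue_ctx.create_dynamic_frame.from_catalog(
            database=DATABASE,
            table_name=table,
            transformation_ctx=ctx("read", table),  # required for bookmarking
        )
        # The per-table-family transformation would go here.
        glue_ctx.write_dynamic_frame.from_options(
            frame=frame,
            connection_type="s3",
            connection_options={"path": f"{OUTPUT_ROOT}/{table}/"},
            format="parquet",
            transformation_ctx=ctx("write", table),
        )

    job.commit()  # persists the new bookmark; without it the next run starts over


# Glue passes --JOB_NAME on the command line; the guard keeps local imports inert.
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    run_job(sys.argv, ["raw_events"])
```

Keeping the `transformation_ctx` strings stable across deployments matters: renaming one makes Glue treat that source as never having been read.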

Two paths after transform — flat vs nested

Flat data writes Parquet and updates the catalog in one step — schema is already known. Nested data writes Parquet and re-runs a crawler over it, because Glue's catalog writer doesn't handle arbitrarily nested schemas well. Routing this at write time is much cleaner than discovering it later.
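
The post does not say how a table is assigned to the flat or nested path, so the following is only one plausible way to make that decision: look at the column types the raw crawler recorded in the Glue Catalog and treat any struct, array, or map column as nested. The database and table names in the usage line are placeholders.

```python
# Hive-style type prefixes the Glue Catalog uses for complex columns.
NESTED_PREFIXES = ("struct<", "array<", "map<")


def is_nested(columns):
    """True if any catalog column has a complex (non-primitive) type."""
    return any(
        col["Type"].strip().lower().startswith(NESTED_PREFIXES)
        for col in columns
    )


def choose_path(glue, database, table):
    """Return "nested" or "flat" for one raw catalog table."""
    resp = glue.get_table(DatabaseName=database, Name=table)
    columns = resp["Table"]["StorageDescriptor"]["Columns"]
    return "nested" if is_nested(columns) else "flat"
```

A call such as `choose_path(boto3.client("glue"), "ingest", "raw_events")` could then select the output prefix before the write. On the flat branch, a Glue sink created with `enableUpdateCatalog` is one option for updating the table in the same step; on the nested branch, writing under a prefix watched by the second crawler leaves schema discovery to it.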

What it actually moved

What I'd do differently next time

The shape of the lesson: a good ingestion pipeline is mostly about decoupling — every stage hands off via S3 or a queue, and any one stage can fail without dragging the rest down with it. The transformations are easy. The routing is the system.