Design goals
- Validate at the door. Don't ingest schema-violating data.
- Never block the pipeline on a bad record. Dead-letter and move on.
- Event-driven, not cron-driven. Files trigger their own processing.
- Idempotent. Re-runs don't re-process work that already landed.
- Athena-queryable. Once written, the data is immediately discoverable and queryable.
The pipeline, end to end
Five logical stages — each decoupled by S3 or SQS so failures stay local:
- Raw upload lands in
S3: raw-impure. - Lambda validator fires on object-create. It reads the matching schema from the Glue Schema Registry. Pass →
S3: raw. Fail →S3: deadletter. - Glue Crawler watches new objects via SQS and writes one or more Glue Catalog tables describing the raw data.
- Glue ETL job runs daily with bookmarks enabled. It reads the raw catalog tables, applies one transformation per table family, and writes Parquet back to S3 — flat data direct, nested data via a crawler that discovers schema after the write.
- Athena reads the Parquet catalog tables. SQL-like queries; pay-per-scan.
upload → S3: raw-impure
│ (object-create trigger)
▼
Lambda validator ← Glue Schema Registry (get schema)
│
┌──────┴──────┐
▼ ▼
S3: deadletter S3: raw
│ (object-create → SQS)
▼
Glue Crawler ── writes ──▶ Glue Catalog: raw_*
│
▼
Glue ETL job (daily, bookmark)
│
┌───────────────┴───────────────┐
▼ ▼
S3: parquet_data_1 S3: parquet_data_2 (nested)
catalog: parquet_1 │
(object-create → SQS)
▼
Glue Crawler
catalog: parquet_2
│
▼
Athena
Why each piece exists
Schema Registry + Lambda validator
The two cheapest production incidents to prevent are "we ingested rows with the wrong columns" and "we ingested data that crashed the ETL job halfway through." A validator Lambda backed by the Schema Registry catches both. Pass-through latency is single-digit milliseconds. Bad records go to a separate bucket where they can be reviewed, replayed, or simply deleted.
SQS in front of the crawler
Glue Crawlers don't scale by being invoked harder. Putting an SQS queue in front means you can fan-in a burst of new objects and the crawler processes them in its own time, without a thundering-herd of StartCrawler calls. Bursts are absorbed; the rest of the pipeline keeps moving.
ETL bookmarks
Glue's bookmark feature is the single line of config that turns the daily job from "rebuild everything" into "process what arrived since last run." That's the difference between a job that gets cheaper as the data grows and one that gets exponentially more expensive.
Two paths after transform — flat vs nested
Flat data writes Parquet and updates the catalog in one step — schema is already known. Nested data writes Parquet and re-runs a crawler over it, because Glue's catalog writer doesn't handle arbitrarily nested schemas well. Routing this at write time is much cleaner than discovering it later.
What it actually moved
- Cost down ~66%. Mostly from removing perpetually-running EMR clusters and replacing them with on-demand Glue ETL.
- Athena query latency 7s → 1–2s by structuring the catalog correctly: file-size-aware partitions, column projection on time-stamped fields, Parquet over JSON.
- Operational quiet. The pipeline runs unattended. Bad inputs route themselves out of the way. Re-running a day is a button.
What I'd do differently next time
- Make the dead-letter bucket noisy. Surface dead-letter counts as a metric — silence here is dangerous, not good.
- Skip the crawler for the parquet output. If the ETL job already knows the output schema, write it directly to the catalog. The crawler is a stop-gap for nested data; don't pay it for flat data.
- Compaction on a schedule. Small Parquet files hurt Athena's read cost. A weekly compaction job earns its keep.