Designing an OpenRTB-Aligned Event Schema

If you're going to hand clients their raw event data, the schema is the product. Align it with OpenRTB, keep it flat, document every field, and give it one join key; client teams can then run independent analysis on day one. Get it wrong and "data ownership" becomes an unusable dump.

Author: Ad360 engineering
Discipline: Platform engineering

Owning your event data is a strategy. The schema is the thing that makes it real. You can promise a client their raw events, stream them faithfully into their own cloud, and still hand them something unusable — a proprietary blob with undocumented fields, nested structures no query engine wants, and no reliable way to stitch the events together. "Data ownership" without a usable schema is just a more expensive dashboard.

So the schema is the product. This is the engineering view of how to structure delivered event data so that a client's data team can run independent analysis on day one. It draws on the design choices in a real, dated client data specification rather than on abstractions.

Principle 1: align with OpenRTB

The single most important decision is to align the schema with the protocol the data already arrives in. Bid requests arrive as OpenRTB; aligning the delivered auction records with OpenRTB 2.x conventions means the client's analysts are working with a vocabulary the industry already knows, not a vendor's private dialect.

In the Ad360 data spec, auction data follows standard OpenRTB specifications, and fields available in the upstream bid request are exposed in the dataset where present and supported by the supply source. That last clause matters: it is honest about the reality that field availability varies by exchange, inventory type, and integration. An OpenRTB-aligned schema inherits both the structure and the honesty of the protocol — including the fact that not every field is always populated.
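To make "where present" concrete, here is a minimal sketch of reading optional OpenRTB fields without assuming they exist. The helper `get_path` and the flat column names (`device_os`, `geo_country`, `site_domain`) are illustrative assumptions, not names taken from the Ad360 spec; the nested paths (`device.os`, `device.geo.country`, `site.domain`) are standard OpenRTB 2.x attributes.

```python
from typing import Any, Optional


def get_path(obj: dict, path: str) -> Optional[Any]:
    """Return the value at a dotted path, or None if any step is absent."""
    current: Any = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Illustrative bid-request fragment; real requests vary by exchange.
bid_request = {
    "id": "req-123",
    "device": {"os": "iOS", "geo": {"country": "FRA"}},
    # No "site" object here, as would be the case for in-app inventory.
}

# Hypothetical flat row. Mapping the OpenRTB "id" to request_id is an assumption.
row = {
    "request_id": get_path(bid_request, "id"),
    "device_os": get_path(bid_request, "device.os"),
    "geo_country": get_path(bid_request, "device.geo.country"),
    "site_domain": get_path(bid_request, "site.domain"),  # absent, so None
}
```

An absent field becomes an explicit null in the flat row rather than an error or a made-up default, which keeps the schema honest about availability.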

Principle 2: model the entities the way the funnel works

The schema mirrors the advertising funnel, delivered as distinct datasets:

  • auctions — one row per auction/bid-request context: request, inventory, device, geo, and the demand hierarchy (agency, advertiser, campaign, line item, creative).
  • impressions — one row per impression event, with delivery and commercial context.
  • clicks — one row per click event.
  • conversions — one row per conversion/audience event, tied to segment and user identity.

Critically, these are delivered separately and not pre-joined. The internal joined table exists for analytics, but it is not what gets handed over. Clients reconstruct the funnel themselves by joining on a shared key. That is both a transparency guarantee and a schema-design constraint: every dataset that takes part in the join must carry the key.

Principle 3: one join key, everywhere

The schema lives or dies on its join key. Here it is request_id — a unique bid-request identifier present on auctions, and carried onto impressions and clicks so they can be joined back to their auction context. One key, learnable in a sentence, that assembles the whole funnel. The discipline is ruthless consistency: the key must appear, unchanged, on every record that needs to participate in a join. A schema with five different ways to relate events is a schema nobody will join correctly.
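As a rough illustration of what one key buys, the sketch below rebuilds a per-campaign funnel from three separately delivered datasets using only request_id. The sample rows, the `campaign_id` column name, and the assumption of at most one impression and one click per request are all hypothetical simplifications.

```python
from collections import defaultdict

# Tiny in-memory stand-ins for the delivered datasets (made-up data).
auctions = [
    {"request_id": "r1", "campaign_id": "c1"},
    {"request_id": "r2", "campaign_id": "c1"},
    {"request_id": "r3", "campaign_id": "c2"},
]
impressions = [{"request_id": "r1"}, {"request_id": "r3"}]
clicks = [{"request_id": "r1"}]


def funnel_by_campaign(auctions, impressions, clicks):
    """Join impressions and clicks back to auctions on request_id."""
    impressed = {row["request_id"] for row in impressions}
    clicked = {row["request_id"] for row in clicks}
    funnel = defaultdict(lambda: {"auctions": 0, "impressions": 0, "clicks": 0})
    for auction in auctions:
        counts = funnel[auction["campaign_id"]]
        counts["auctions"] += 1
        # A membership test yields a bool, which adds as 0 or 1.
        counts["impressions"] += auction["request_id"] in impressed
        counts["clicks"] += auction["request_id"] in clicked
    return dict(funnel)


funnel = funnel_by_campaign(auctions, impressions, clicks)
```

The same logic is a pair of left joins on request_id in SQL; the point is that nothing other than the key is needed to relate the datasets.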

Principle 4: flat, scalar records (with arrays only where they earn it)

Query engines and warehouses love flat, scalar fields and dislike deep nesting. The spec reflects this: a flat structure with scalar fields, with array fields reserved for taxonomy segment lists, where a single value genuinely cannot express the data. Concretely, the delivered fields include:

  • demand-hierarchy identifiers (agency/advertiser/campaign/line-item/creative) and channel;
  • auction/inventory context (auction type, creative type/position, video context);
  • device, OS, and geo (including accuracy);
  • model-input estimates (exchange-provided viewability, CTR estimate);
  • banner-size indicators as 0/1 one-hot flags;
  • taxonomy segment fields — for each scope (site/app/user) and family (IAB Tech Lab audience/content, Chromium Topics), both a *_list (array of IDs) and a *_count (integer).

The one-hot banner flags and the list/count taxonomy pattern are small decisions that make a large difference: they let an analyst filter and aggregate with plain SQL instead of parsing nested objects.
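A small sketch of why this pays off: with flat rows, filtering and aggregation are one-liners. The column names below (`banner_300x250`, `site_content_list`, `site_content_count`) and the segment IDs are invented for illustration and are not the spec's actual field names.

```python
from collections import Counter

# Made-up flat rows following the one-hot and list/count pattern.
rows = [
    {"channel": "display", "banner_300x250": 1,
     "site_content_list": ["101", "202"], "site_content_count": 2},
    {"channel": "display", "banner_300x250": 1,
     "site_content_list": [], "site_content_count": 0},
    {"channel": "video", "banner_300x250": 0,
     "site_content_list": ["101"], "site_content_count": 1},
    {"channel": "display", "banner_300x250": 1,
     "site_content_list": ["101"], "site_content_count": 1},
]

# Filter on scalar columns only: no nested-object parsing required.
selected = [
    r for r in rows
    if r["banner_300x250"] == 1 and r["site_content_count"] > 0
]

# Aggregate by a scalar column.
by_channel = Counter(r["channel"] for r in selected)

# Only the taxonomy analysis needs to touch the array column.
segment_freq = Counter(seg for r in selected for seg in r["site_content_list"])
```

The paired count column lets the common question ("does this row have any segments?") stay a scalar comparison, so the array is unpacked only when the segment IDs themselves matter.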

Principle 5: separate raw from enriched

Trust requires being able to tell what the platform received from what it computed. The schema separates the two physically: a raw/ prefix for unmodified event records, and an enriched/ prefix for records with derived fields (viewability, CTR estimate, banner one-hots, taxonomy counts, web-safe identifiers, demand-hierarchy context). This lets a client audit every enrichment against its source. Derivation is welcome; hiding derivation is not. The prefix split makes enrichment inspectable rather than implicit.
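One way a client might audit the split is sketched below. It assumes, hypothetically, that an enriched record repeats the raw fields unchanged and that each taxonomy count is derived from the list sitting next to it; the function name and field names are illustrative and not part of the spec.

```python
from typing import Iterable, List, Tuple


def audit_enrichment(
    raw: dict,
    enriched: dict,
    list_count_pairs: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return a list of discrepancies between a raw record and its enriched copy."""
    problems = []
    # Every raw field should survive enrichment unmodified (assumption).
    for field, value in raw.items():
        if field not in enriched:
            problems.append(f"missing raw field: {field}")
        elif enriched[field] != value:
            problems.append(f"raw field altered: {field}")
    # Every derived count should equal the length of its list.
    for list_field, count_field in list_count_pairs:
        expected = len(enriched.get(list_field) or [])
        if enriched.get(count_field) != expected:
            problems.append(f"count mismatch: {count_field}")
    return problems


raw_record = {"request_id": "r1", "user_topics_list": ["1", "2"]}
good = {**raw_record, "user_topics_count": 2}
bad = {**raw_record, "user_topics_count": 3}
pairs = [("user_topics_list", "user_topics_count")]

good_result = audit_enrichment(raw_record, good, pairs)
bad_result = audit_enrichment(raw_record, bad, pairs)
```

A check like this is only possible because the raw record is delivered alongside the enriched one; without the split there is nothing to audit against.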

Principle 6: format and partition for analytics

Finally, the physical layout has to suit how the data is actually queried:

  • Newline-delimited JSON for raw delivery (universally ingestible), with Parquet as an optional optimized format for analytics at scale.
  • Partitioning by time (year/month/day/hour), with additional partition keys for demand hierarchy and channel — so queries scan only the slice they need.

These choices are what let a client point an off-the-shelf query engine at their own bucket and get answers, instead of building a bespoke ingestion pipeline first.
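As a sketch of the layout, the snippet below builds an hourly partition prefix and parses a newline-delimited JSON payload. The Hive-style `key=value` path segments, the use of UTC, and the function names are assumptions for illustration; the source specifies only the partition granularity (year/month/day/hour), not the exact path format.

```python
import json
from datetime import datetime, timezone


def partition_prefix(dataset: str, event_time: datetime) -> str:
    """Build an hourly partition prefix (assumed Hive-style, assumed UTC)."""
    t = event_time.astimezone(timezone.utc)
    return f"{dataset}/year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"


def parse_ndjson(text: str) -> list:
    """Parse newline-delimited JSON: one object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


prefix = partition_prefix(
    "impressions", datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc)
)
records = parse_ndjson('{"request_id": "r1"}\n{"request_id": "r2"}\n')
```

A query for a single hour then needs to list only one prefix, which is what keeps scans proportional to the slice being asked about.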

Common misconceptions

  • "A schema is just a list of fields." It's a contract: entities, a join key, formats, partitioning, and the raw/enriched boundary together.
  • "Pre-joining is a convenience to offer." Delivering un-joined datasets joinable on one key is the transparency feature, not a gap.
  • "Nested JSON is fine." Flat scalar records (arrays only for taxonomies) are vastly easier to query at scale.
  • "Every OpenRTB field will be populated." Availability varies by exchange/inventory; an honest schema documents that rather than implying completeness.
  • "Format doesn't matter if the data's there." NDJSON + optional Parquet + sane partitioning is the difference between usable and theoretical.

What good operation looks like

  • Align to OpenRTB so analysts use a known vocabulary.
  • Deliver separate datasets with one consistent join key (request_id).
  • Keep records flat and scalar; use arrays only for taxonomy lists, with paired counts.
  • Separate raw from enriched so every derivation is auditable.
  • Ship NDJSON + optional Parquet, partitioned for the queries clients actually run.
  • Document every field — type, meaning, and availability caveats.
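
The last point can be made checkable. Below is a minimal sketch of a machine-readable field dictionary plus a check for undocumented fields. The `FieldDoc` structure, the example entries, and their wording are hypothetical, not taken from the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDoc:
    name: str
    type: str          # e.g. "string", "integer", "array<string>"
    meaning: str
    availability: str  # caveat on when the field is populated


# Illustrative entries only.
FIELD_DOCS = {
    doc.name: doc
    for doc in [
        FieldDoc("request_id", "string",
                 "Unique bid-request identifier; the join key.",
                 "Always present."),
        FieldDoc("site_domain", "string",
                 "Domain of the site carrying the inventory.",
                 "Present only where the supply source provides it."),
    ]
}


def undocumented_fields(record: dict, docs: dict) -> list:
    """Return the names of fields in a record that have no documentation entry."""
    return sorted(set(record) - set(docs))


missing = undocumented_fields(
    {"request_id": "r1", "site_domain": None, "mystery_flag": 1}, FIELD_DOCS
)
```

Running a check like this against sample deliveries is one way to keep the documentation and the data from drifting apart.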

Open questions

  • Can the industry converge on a standard, portable event schema so buyers don't relearn each vendor's shape?
  • How should schema versioning be handled so client pipelines don't break on field changes?
  • What is the right documentation contract for field availability that varies by supply source?

A delivered event schema is where data-ownership ideals become an engineering reality — or fail to. Align it to OpenRTB, model the funnel as separate datasets, give it one relentless join key, keep it flat, split raw from enriched, and partition it for real queries. Do that, and "you own your data" becomes "your analysts shipped an analysis this afternoon." Skip it, and ownership is just a bigger file you can't read.