In-Path Inference: Running ML Inside the Bid's Millisecond Budget

There are two ways to put machine learning in a bidder: score everything offline and look it up, or run the model live on each request. The second — in-path inference — is harder and far more powerful. Here's what it takes to return a model prediction inside a few milliseconds.

Author: Ad360 engineering
Discipline: Platform engineering

There are two honest ways to put machine learning into a bidder, and the difference between them is the difference between a brochure and a system. The first is to score things offline — run users or contexts through a model overnight, write the answers to a table, and look them up at bid time. The second is to run the model live, on each request, as part of the decision. This second approach — in-path inference — is the hard one, and it is where real-time ML earns its name.

In-path means the model executes inside the bid's latency budget, as a discrete stage in the live decision, on the actual request being evaluated. Not a precomputed lookup. Not a separate round trip that happens "soon." A model evaluation, on this impression, with this context, returning a number fast enough to participate in an auction that closes in milliseconds.

Why in-path is harder — and better

Offline scoring is a legitimate and useful technique, but it has a ceiling: it can only know what it computed in advance. It cannot react to the specific combination of signals in this request, because that combination may never have been pre-scored. In-path inference can, because it evaluates the model on the live feature vector at decision time.
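The gap between the two approaches can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the feature names, the PRECOMPUTED table, and the toy live_score function are illustrative stand-ins, not Ad360's model or code. The point is only that a lookup has nothing to say about a combination it never saw, while a live evaluation always produces a score.

```python
import math

# Hypothetical offline table: scores computed in advance, only for the
# combinations that existed when the batch job ran.
PRECOMPUTED = {
    ("sports", "mobile"): 0.031,
    ("news", "desktop"): 0.012,
}

def offline_lookup(site_category, device):
    """Return a precomputed score, or None if this combination was never scored."""
    return PRECOMPUTED.get((site_category, device))

def live_score(site_category, device, hour_of_day):
    """Toy stand-in for a model evaluated on the live feature vector."""
    z = -4.0
    z += 0.8 if site_category == "sports" else 0.0
    z += 0.5 if device == "mobile" else 0.0
    z += 0.3 if 18 <= hour_of_day <= 22 else 0.0
    return 1.0 / (1.0 + math.exp(-z))

# A request whose combination of signals was never pre-scored.
request = {"site_category": "sports", "device": "tablet", "hour_of_day": 20}

looked_up = offline_lookup(request["site_category"], request["device"])  # None
scored = live_score(**request)  # a probability, computed at decision time
```

The lookup could fall back to a default, but a default is a guess about an average request, not a prediction for this one.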

The cost is brutal latency pressure. An offline job has hours. An in-path model has single-digit milliseconds, shared with everything else the bid has to do — eligibility, filtering, pacing, allocation. That constraint shapes every design decision, and most "AI bidding" claims quietly avoid it by being offline lookups in disguise.

Where it sits in the decision

In a real bidder, inference is not the whole decision — it is one late stage of it. In Ad360's production funnel, "Run Inference" appears as a discrete, named stage, near the end of the sequence, after eligibility, targeting filters, pacing, allocation, and winner selection. The ordering is deliberate and economic: inference is comparatively expensive, so it runs on the small fraction of opportunities that have already survived every cheaper gate. There is no point spending model compute on an impression the campaign was never eligible to win.

That placement is itself a performance technique. Front-load the cheap, decisive filters; defer the expensive model to the end. The funnel is ordered, in part, by trading each stage's compute cost against how many candidates it eliminates.
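A minimal sketch of that ordering, in Python, might look like the following. The Candidate fields, the stage names, and the placeholder run_inference are assumptions made for illustration; the allocation and winner-selection stages are omitted to keep the example short, and this is not the production funnel.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    line_item_id: str
    eligible: bool
    matches_targeting: bool
    within_pacing: bool

# Records every line item that actually reached the expensive stage.
inference_log = []

def run_inference(candidate):
    """Placeholder for the expensive model call (hypothetical fixed score)."""
    inference_log.append(candidate.line_item_id)
    return 0.05

def evaluate(candidates):
    # Cheap gates first: each is a boolean check that eliminates candidates.
    survivors = [c for c in candidates if c.eligible]
    survivors = [c for c in survivors if c.matches_targeting]
    survivors = [c for c in survivors if c.within_pacing]
    # Expensive stage last: only the survivors reach the model.
    return {c.line_item_id: run_inference(c) for c in survivors}

candidates = [
    Candidate("li-1", True, True, True),
    Candidate("li-2", False, True, True),
    Candidate("li-3", True, False, True),
    Candidate("li-4", True, True, False),
]
predictions = evaluate(candidates)
```

Four candidates enter and one model call is made. Reversing the order would produce the same winner at four times the inference cost.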

How the prediction is served

In-path inference needs a serving layer engineered for speed. Ad360's is a dedicated gRPC service (AiInferenceService.Predict) listening on a fixed port (50051), returning a single probability per request. Several properties make it viable in-path:

gRPC, not REST. A binary, multiplexed protocol with low serialization overhead — chosen because every microsecond on the wire counts.
Preloaded models. The per-line-item models are resident in memory, not loaded on the critical path; loading a model mid-bid would blow the budget.
Horizontal scaling. A server rank concept (INFERENCE_SERVER_RANK) lets inference scale out across instances to absorb QPS without latency creep.
A fast model class. The live model is gradient-boosted trees (XGBoost), whose inference is cheap to evaluate — a deliberate fit to the latency budget, where a heavy neural network might not be.

None of these are accidental. They are the consequences of taking "inside the millisecond budget" as a hard requirement rather than an aspiration.
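Two of these properties, resident models and a cheap model class, can be illustrated with a small pure-Python sketch. It is a simplified stand-in for a boosted-tree ensemble, not XGBoost itself: the tuple-based tree encoding, the MODELS registry, and the line-item ID are all hypothetical, and a real library also handles missing values, categorical splits, and vectorized evaluation.

```python
import math

# A tree node is either a leaf (a float) or a split:
# (feature_index, threshold, left_subtree, right_subtree).
def eval_tree(node, features):
    """Walk one tree from the root to a leaf: one comparison per level."""
    while not isinstance(node, float):
        feature_index, threshold, left, right = node
        node = left if features[feature_index] < threshold else right
    return node

def predict(trees, features, base_score=0.0):
    """Sum the leaf values of all trees, then squash to a probability."""
    margin = base_score + sum(eval_tree(tree, features) for tree in trees)
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical registry, filled once at startup and kept in memory.
MODELS = {
    "line-item-42": [
        (0, 0.5, -1.2, 0.4),
        (1, 10.0, (0, 0.2, -0.3, 0.1), 0.6),
    ],
}

def predict_for(line_item_id, features):
    trees = MODELS[line_item_id]  # dictionary lookup: no disk or network I/O
    return predict(trees, features)

p = predict_for("line-item-42", [0.7, 3.0])
```

Each prediction costs a handful of comparisons and additions per tree, which is why this model class fits comfortably where a large neural network might not.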

The latency reality

The budget is real and measured. The bidder runs against a sub-50ms target, and production telemetry shows per-request processing time, for requests that return ads, sitting largely in the 1–5ms range, with occasional spikes into the low tens of milliseconds around top-of-hour load. Inference has to fit inside that, alongside everything else. Exceed the budget and the exchange times out: the bid never lands, and it does not matter how good the prediction would have been.
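One common way to make a budget like this enforceable is to carry an explicit deadline through the request and check it before the expensive stage. The sketch below assumes a 50ms budget (taken from the target above) and a hypothetical 5ms reserve for inference; the Deadline class and decide function are illustrative names, not part of Ad360's bidder. The clock is injectable so the behavior can be shown deterministically.

```python
import time

BUDGET_SECONDS = 0.050  # the sub-50ms target described in the text

class Deadline:
    """Tracks how much of a request's budget is left."""

    def __init__(self, budget_seconds, clock=time.perf_counter):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

def decide(deadline, run_inference, inference_reserve_seconds=0.005):
    """Run inference only if enough budget remains; otherwise do not bid."""
    if deadline.remaining() < inference_reserve_seconds:
        return None  # a late answer would be discarded by the exchange anyway
    # Pass the remaining budget down so the call itself can time out.
    return run_inference(timeout=deadline.remaining())

# Demonstration with a fake clock (values in seconds).
fake_now = [0.0]
deadline = Deadline(BUDGET_SECONDS, clock=lambda: fake_now[0])

fake_now[0] = 0.010  # 10ms used by earlier stages: plenty left
early = decide(deadline, lambda timeout: 0.07)

fake_now[0] = 0.048  # 48ms used: less than the reserve remains
late = decide(deadline, lambda timeout: 0.07)
```

In a gRPC client, the remaining budget would typically be passed as the call's timeout, so a slow inference server fails fast instead of holding the bid past the point where it can still land.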

This is why in-path inference is the genuinely hard part of "AI bidding." Building a model that scores well offline is well-understood work. Serving it live, on every qualifying request, fast enough to win the auction, and without destabilizing a system already running at volume, is the engineering that is much harder to get right.

Common misconceptions

"All ML bidding is in-path." Much of it is offline scoring looked up at bid time — useful, but not the same thing.
"In-path inference means a big neural network thinking hard." In a millisecond budget, fast models (boosted trees) and lean serving matter more than model size.
"You can just call a REST endpoint." Protocol, serialization, model loading, and scaling all have to be engineered for the budget; a naive call won't fit.
"Inference runs on every request." It runs late, on the small fraction that survive the cheaper gates — by design.

What good operation looks like

Place inference late in the funnel, after cheap filters have done the elimination.
Preload models and keep them resident; never load on the critical path.
Choose a serving protocol and model class that fit the latency budget (gRPC, fast models).
Scale horizontally so QPS growth doesn't erode latency.
Measure end-to-end latency continuously; the budget is the boundary every decision lives inside.
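For the measurement point in particular, percentiles matter more than averages, because the spikes are what break the budget. A minimal nearest-rank percentile in Python is sketched below; the latency samples are invented for illustration and are not production measurements.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile of a list of samples; q is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Invented per-request latencies in milliseconds, with one spike.
latencies_ms = [1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 3.8, 4.1, 4.6, 18.0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median sits near 3ms while the 99th percentile is the 18ms spike. Tracking only the median would hide exactly the requests most at risk of timing out.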

Open questions

When does a heavier model (deep learning) justify its latency cost in a sub-50ms budget?
How do you batch or share inference across near-identical requests without hurting freshness?
Where is the crossover between in-path and hybrid (precomputed features + live scoring) architectures?

Offline scoring is the safe, common way to claim "AI bidding." In-path inference is the hard way that actually delivers it — a live model evaluation on the real request, inside the few milliseconds the auction allows. The brochure version hides the latency budget. The real version is built around it.