
Multi-Armed Bandits in Campaign Allocation

How should a campaign split budget across options? Spend everything on today's leader and you never discover a better one. Spread evenly and you waste money on losers. The multi-armed bandit is the answer to this exploration–exploitation problem.

Author: Ad360 engineering
Discipline: Platform engineering

Here is a problem every campaign faces and few people name precisely. You have a budget and several options to spend it on — creatives, placements, audiences, line items. You don't yet know which performs best. If you pour everything into whichever option looks best right now, you might be backing an early fluke and you'll never discover the option that would have won. If you split the budget evenly forever to "be fair," you'll keep funding obvious losers long after you should have stopped. Both instincts are wrong, and the tension between them has a name: exploration versus exploitation.

The multi-armed bandit is the formal answer to that tension. The name comes from the image of a gambler facing a row of slot machines ("one-armed bandits"), each with an unknown payout. It is a framework for learning which option is best while you are still spending, balancing the need to explore unproven options against the desire to exploit proven ones. Budget allocation fits it naturally: each option is an arm, each unit of spend is a pull, and the observed outcome is the payout.

The exploration–exploitation dilemma

Spell out the trade-off, because it's the whole problem:

Pure exploitation — always spend on the current best — is greedy and brittle. It locks onto early winners, ignores options it hasn't tried enough, and can never recover from backing the wrong horse early.
Pure exploration — keep trying everything equally — never converges. It treats a known loser the same as a promising unknown, wasting budget indefinitely.

A bandit threads between them. Early on, when it knows little, it explores more — sampling options to learn their payouts. As evidence accumulates, it shifts toward exploitation — concentrating budget on what's proven — while still occasionally checking whether something has changed. The allocation is dynamic: it moves with the evidence, instead of being fixed up front.
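One way to see this shift is Thompson sampling, a widely used bandit rule. The post does not say which rule Ad360 uses, so treat this as an illustration only. The sketch keeps a Beta distribution per arm, draws a plausible payout rate from each, and funds the arm with the highest draw. The payout rates, function names, and round count are all invented for the example.

```python
import random


def thompson_pick(successes, failures, rng):
    """Sample a plausible payout rate per arm and return the best arm's index."""
    # Beta(s + 1, f + 1) is the posterior under a uniform prior (an assumption).
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, rounds, seed=0):
    """Run a toy campaign and return how many times each arm was funded."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [0] * n
    failures = [0] * n
    pulls = [0] * n
    for _ in range(rounds):
        arm = thompson_pick(successes, failures, rng)
        pulls[arm] += 1
        # Simulated outcome: a real system would observe this from delivery data.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical payout rates for three creatives.
    print(simulate([0.10, 0.30, 0.60], rounds=2000))
```

With little evidence the draws are spread wide, so every arm gets tried. As counts grow, the draws tighten around each arm's observed rate and the strongest arm wins most rounds, though the others are still sampled now and then.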

What the bandit actually tracks

A bandit isn't magic; it's bookkeeping plus a decision rule. Each option (each "arm") needs a record of how it has performed, and a way to turn that record into the next allocation decision. Ad360's optimization library implements exactly this primitive — BanditStats — tracking, per arm:

rewards — the good outcomes attributed to the arm,
penalties — the bad outcomes,
observations — how many times the arm has been tried,
action and selection counts — how often it has been chosen.

That's the raw material of an explore/exploit decision: you can't decide how much to trust an arm without knowing both how well it did and how much evidence you have for it. An arm with a great reward rate over three observations is not the same as one with a good rate over three thousand — and the counts are what let the bandit tell the difference.
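To make the three-versus-three-thousand point concrete, here is a minimal sketch of a per-arm record with a Wilson score interval on its reward rate. ArmStats and its methods are hypothetical names for illustration; this is not the BanditStats API, and it assumes each observation ends in exactly one reward or one penalty.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ArmStats:
    """Hypothetical per-arm record, loosely mirroring the fields listed above."""
    rewards: int = 0
    penalties: int = 0
    observations: int = 0
    selections: int = 0

    def record(self, got_reward: bool) -> None:
        # Assumption: every observation is either a reward or a penalty.
        self.observations += 1
        if got_reward:
            self.rewards += 1
        else:
            self.penalties += 1

    def reward_rate(self) -> float:
        return self.rewards / self.observations if self.observations else 0.0

    def wilson_interval(self, z: float = 1.96) -> Tuple[float, float]:
        """Approximate 95% interval for the true reward rate (for z = 1.96)."""
        n = self.observations
        if n == 0:
            return (0.0, 1.0)  # no evidence: the rate could be anything
        p = self.rewards / n
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return (max(0.0, centre - half), min(1.0, centre + half))


if __name__ == "__main__":
    few = ArmStats(rewards=3, penalties=0, observations=3, selections=3)
    many = ArmStats(rewards=2400, penalties=600, observations=3000, selections=3000)
    print(few.reward_rate(), few.wilson_interval())
    print(many.reward_rate(), many.wilson_interval())
```

The small arm's perfect record still leaves a wide interval, while the larger arm's interval is narrow. A decision rule built on these intervals would treat the first arm as promising but unproven.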

Defining reward is the hard part

The bandit math is well understood; the judgment is in what counts as reward. A bandit optimizes relentlessly toward whatever you tell it to maximize, which means a badly chosen reward produces confidently wrong allocation. If reward is "clicks," the bandit will chase clickbait. If it's "last-touch conversions," it will over-fund whatever sits closest to conversion regardless of whether it caused anything (the attribution-vs-incrementality trap). Defining reward well — ideally something close to incremental value, with penalties for the outcomes you don't want — is where the real thinking lives. The algorithm is the easy part; the objective is the dangerous one.
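As a rough illustration of a reward closer to incremental value, the sketch below compares an arm's conversion rate against a holdout group that did not see it, then subtracts a cost for unwanted outcomes. The function name, the holdout design, and the value and penalty weights are assumptions made for the example, not something Ad360's library is known to provide.

```python
def incremental_reward(
    exposed_conversions,
    exposed_users,
    holdout_conversions,
    holdout_users,
    value_per_conversion=1.0,
    penalty_events=0,
    penalty_cost=0.0,
):
    """Value of conversions beyond the holdout baseline, minus penalty costs."""
    if exposed_users <= 0 or holdout_users <= 0:
        raise ValueError("both groups need at least one user")
    exposed_rate = exposed_conversions / exposed_users
    baseline_rate = holdout_conversions / holdout_users
    # Conversions the arm plausibly caused, not just the ones it touched.
    incremental_conversions = (exposed_rate - baseline_rate) * exposed_users
    return incremental_conversions * value_per_conversion - penalty_events * penalty_cost


if __name__ == "__main__":
    # Arm A touches many conversions but matches the holdout rate.
    print(incremental_reward(500, 10_000, 50, 1_000))
    # Arm B touches fewer conversions but lifts the rate over its holdout.
    print(incremental_reward(300, 10_000, 10, 1_000))
```

Under a last-touch count, arm A would look stronger (500 conversions against 300). Measured against its holdout, it adds nothing, while arm B shows real lift. That reversal is the attribution-versus-incrementality trap in miniature.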

Where the guardrails go

A bandit that reallocates budget autonomously is an optimization agent acting without a human in each decision. That is powerful and, left ungoverned, risky: it can shift large amounts of spend quickly, including in the wrong direction if reward is misdefined or the signal is noisy. So bandits belong inside the same governance envelope as any autonomous optimization: bounded authority (limits on how fast and how far it can move budget), the ability to inspect and override, and an audit trail of what it shifted and why. The bandit explores and exploits; the human sets the bounds within which it is allowed to do so. (This is the same human-in-command discipline that governs any agentic system.)
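A minimal sketch of what bounded authority, override, and audit could look like in code follows. It assumes budget is expressed as shares that sum to 1, and every name in it (bounded_reallocation, AuditEntry, max_step, paused) is hypothetical rather than part of any Ad360 API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AuditEntry:
    """One line of the audit trail: what moved, by how much, and why."""
    arm: str
    before: float
    after: float
    reason: str


def bounded_reallocation(
    current: Dict[str, float],
    proposed: Dict[str, float],
    max_step: float = 0.10,
    paused: bool = False,
    reason: str = "bandit update",
) -> Tuple[Dict[str, float], List[AuditEntry]]:
    """Move shares toward `proposed`, changing no arm by more than `max_step`."""
    if not current or current.keys() != proposed.keys():
        raise ValueError("current and proposed must cover the same arms")
    if paused:
        # Human override: hold the allocation exactly where it is.
        return dict(current), []
    deltas = {arm: proposed[arm] - current[arm] for arm in current}
    largest = max(abs(d) for d in deltas.values())
    # Scale the whole move so the biggest single change equals max_step.
    # Scaling every delta equally keeps the shares summing to the same total.
    scale = 1.0 if largest <= max_step else max_step / largest
    updated = {arm: current[arm] + scale * deltas[arm] for arm in current}
    audit = [
        AuditEntry(arm, current[arm], updated[arm], reason)
        for arm in current
        if updated[arm] != current[arm]
    ]
    return updated, audit


if __name__ == "__main__":
    now = {"a": 0.5, "b": 0.3, "c": 0.2}
    wanted = {"a": 0.1, "b": 0.2, "c": 0.7}
    shares, log = bounded_reallocation(now, wanted, max_step=0.10)
    print(shares)
    for entry in log:
        print(entry)
```

The bandit still decides the direction of the move; the cap only limits how far one update can go, and the pause flag gives a human a way to stop it outright.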

An old idea, quietly at work

The bandit is not a product of the recent "agentic AI" wave. It is a classical reinforcement-learning primitive that has been an autonomous allocator in advertising for years. An algorithm that learns which options to fund and shifts spend toward them, updating from observed reward with no human in the per-decision loop, is exactly the behavior now marketed as "agentic budget management." Recognizing the bandit for what it is helps separate genuine autonomous-optimization substance from rebranding.

Common misconceptions

"Just spend on the best performer." Pure exploitation locks onto early flukes and never finds better options.
"Test everything equally, then decide." Pure exploration wastes budget on known losers; bandits shift dynamically as evidence grows.
"The reward metric is obvious." A poorly chosen reward makes the bandit confidently optimize the wrong thing (e.g. clickbait, non-incremental conversions).
"A bandit can run unsupervised." It's an autonomous agent moving real money; it needs bounds, override, and audit.
"Bandits are new AI." They're a long-standing RL primitive — the autonomous workhorse behind much "agentic" allocation.

What good operation looks like

Frame allocation explicitly as exploration vs exploitation, not "pick the winner."
Track both performance and evidence per arm (reward, penalty, observation counts).
Invest most in defining reward well — ideally incremental value, with penalties — because the bandit optimizes it literally.
Put the bandit inside a governance envelope: bounded movement, override, audit.
Watch for noisy or non-stationary conditions where naive bandits over-commit.
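One common hedge against non-stationary conditions is to discount old evidence so that recent outcomes weigh more. It is offered here as an option to consider, not a settled answer. The sketch compares a plain cumulative reward rate with an exponentially discounted one on an arm that pays off and then stops. The function names and the gamma value are assumptions for the example.

```python
def discounted_update(rewards, observations, got_reward, gamma=0.99):
    """Decay past evidence by `gamma`, then add the newest observation."""
    rewards = gamma * rewards + (1.0 if got_reward else 0.0)
    observations = gamma * observations + 1.0
    return rewards, observations


def compare_rates(outcomes, gamma=0.99):
    """Return (plain_rate, discounted_rate) for a sequence of True/False outcomes."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    plain_rewards = 0
    disc_rewards, disc_obs = 0.0, 0.0
    for got_reward in outcomes:
        plain_rewards += 1 if got_reward else 0
        disc_rewards, disc_obs = discounted_update(
            disc_rewards, disc_obs, got_reward, gamma
        )
    return plain_rewards / len(outcomes), disc_rewards / disc_obs


if __name__ == "__main__":
    # Hypothetical arm: 500 good outcomes, then 100 bad ones in a row.
    history = [True] * 500 + [False] * 100
    plain, discounted = compare_rates(history, gamma=0.95)
    print(plain, discounted)
```

The cumulative rate still looks healthy long after the arm has stopped paying, which is how a naive bandit over-commits. The discounted rate drops quickly, at the cost of being noisier when conditions are in fact stable.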

Open questions

How should bandits handle non-stationarity — when the best option changes over time?
What's the right reward definition to align a bandit with incremental business value rather than vanity metrics?
How do you bound a bandit's authority so it optimizes aggressively without destabilizing delivery?

The instinct to back today's winner and the instinct to keep everything fair are both traps — one too greedy, one too timid. The multi-armed bandit is the disciplined middle: learn while you spend, shift toward what works, never stop checking. It's elegant math sitting on top of a hard judgment (what is reward?) and a non-negotiable requirement (govern the autonomy). Get those right and a bandit turns budget allocation from a guess into a system that gets smarter every day it runs.