Position Paper · arXiv preprint

Agentic AI Systems Should Be Designed as Marginal Token Allocators

Siqi Zhu

University of Illinois Urbana-Champaign

Abstract

This position paper argues that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. Following a single request through four economic layers — a router, an agent, a serving stack, and a training pipeline — we show that all four are solving the same first-order condition: marginal benefit equals marginal cost plus latency cost plus risk cost, with different index sets and different prices.

Adopting marginal token allocation as the shared accounting object explains why systems that locally minimize tokens globally misallocate them, predicts a small set of recurring failure modes, and points to a concrete research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.

TL;DR

Routers, agents, serving stacks, and trainers look like four different engineering problems. They are not. They are four readings of one allocation problem, evaluated at four shadow prices that today no single layer can see. The prescription is not centralization; it is shared price discovery.

The Core Equation

Let an LLM system face a stream of tasks. For each task it has a finite set of token uses, indexed by $i$ — cheap model, frontier model, retrieval, planning, tool call, verifier, prefill, decode, KV transfer, RL rollout, gradient update. Each use $i$ has marginal quality $\Delta Q_i$, marginal cost $\Delta C_i$, marginal latency $\Delta L_i$, and marginal risk $\Delta R_i$. The system should spend the next token on:

Marginal Token Allocation
$$ i^{*} \;=\; \arg\max_{i}\;\Big[\, V\,\Delta Q_i \;-\; \Delta C_i \;-\; \lambda\,\Delta L_i \;-\; \rho\,\Delta R_i\,\Big] $$

where $V$ is task value (set by the user), $\lambda$ is the shadow price of latency (set by the SLA), and $\rho$ is the shadow price of risk (set by the safety team). At an interior optimum, the Marshallian equimarginal condition holds — the marginal benefit of a token equals its full marginal cost, once latency and risk are properly priced.

The four prices in this equation are not chosen by fiat; they are the dual variables (Lagrange multipliers) of the constrained token-allocation problem. By the first welfare theorem, if all four layers maximize their own components taking the same prices as given, the resulting allocation is Pareto efficient.
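To make the accounting object concrete, here is a minimal Python sketch of the selection rule, assuming each candidate token use reports point estimates of its four marginals; the `TokenUse` type and all numbers are hypothetical illustrations, not measurements.

```python
from dataclasses import dataclass

@dataclass
class TokenUse:
    name: str
    dQ: float  # marginal quality
    dC: float  # marginal dollar cost
    dL: float  # marginal latency (seconds)
    dR: float  # marginal risk

def next_token_use(uses, V, lam, rho):
    """Master equation: pick the use maximizing V*dQ - dC - lam*dL - rho*dR."""
    def surplus(u):
        return V * u.dQ - u.dC - lam * u.dL - rho * u.dR
    best = max(uses, key=surplus)
    # Interior-optimum stopping rule: spend no token whose priced
    # marginal cost exceeds its marginal benefit.
    return best if surplus(best) > 0 else None

# Illustrative numbers only.
uses = [
    TokenUse("cheap_model",    dQ=0.10, dC=0.001, dL=0.05, dR=0.00),
    TokenUse("frontier_model", dQ=0.30, dC=0.030, dL=0.40, dR=0.00),
    TokenUse("verifier",       dQ=0.15, dC=0.005, dL=0.20, dR=-0.10),
]
print(next_token_use(uses, V=1.0, lam=0.1, rho=0.5))
```

Returning `None` when no use clears its full marginal cost is the stopping rule implied by the equimarginal condition, not an extra heuristic.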

One Request, Four Layers

Consider a developer who types "the CI test on auth/login is failing — fix it" into a coding agent. Before a single line of code is touched, the system has already made four economic decisions, each priced differently:

LAYER 1 · DEMAND

Routing as Screening

Choose model tier under hidden user type. Solves a Mirrlees–Spence screening problem; observes $V$ and $\Delta C_i$.
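A minimal sketch of the router's slice of the problem, under the simplifying assumption that the hidden user type has already been collapsed into a point estimate $\hat{V}$; the tier menu and numbers are hypothetical.

```python
# Hypothetical tier menu: (name, marginal quality over baseline, dollar cost).
TIERS = [("small", 0.00, 0.001), ("mid", 0.08, 0.010), ("frontier", 0.15, 0.120)]

def route(V_hat: float) -> str:
    """Screening reduced to the router's two observables: choose the tier
    maximizing V*dQ - dC at the estimated task value V_hat."""
    return max(TIERS, key=lambda t: V_hat * t[1] - t[2])[0]

print(route(0.05), route(1.0), route(5.0))  # small, mid, frontier
```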

LAYER 2 · ACTION

Agent as Principal–Agent

Decide autonomy and split tokens among read, plan, edit, verify. A team-production problem; observes $\rho$ and $V$.
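A sketch of the token split, assuming each phase has diminishing returns (here a hypothetical logarithmic quality curve); greedy allocation then equalizes marginal gains across read, plan, edit, and verify.

```python
import math

# Hypothetical diminishing-returns curves: Q_i(t) = a_i * log(1 + t).
GAINS = {"read": 0.8, "plan": 0.5, "edit": 1.0, "verify": 0.6}

def split_budget(budget: int, step: int = 100) -> dict:
    """Greedy equimarginal allocation: every step of tokens goes to the phase
    with the highest marginal quality gain, so gains equalize at the optimum."""
    alloc = {k: 0 for k in GAINS}
    for _ in range(budget // step):
        def marginal(k):
            t = alloc[k]
            return GAINS[k] * (math.log(1 + t + step) - math.log(1 + t))
        alloc[max(GAINS, key=marginal)] += step
    return alloc

print(split_budget(4000))  # edit gets the most tokens, plan the fewest
```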

LAYER 3 · SUPPLY

Serving as Production

Produce tokens via prefill, decode, KV cache. A multi-stage production problem with congestion externalities; observes $\lambda$ and $\Delta C_i$.
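A sketch of one congestion price, assuming an M/M/1-style delay curve for the decode stage; the functional form and constants are illustrative, not a serving-stack measurement.

```python
def decode_shadow_price(utilization: float, base_price: float, lam: float) -> float:
    """Congestion-priced decode token: list price plus the latency externality.
    Marginal delay grows like 1/(1-u)^2 near saturation (M/M/1 heuristic)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must lie in [0, 1)")
    return base_price + lam / (1.0 - utilization) ** 2

for u in (0.2, 0.8, 0.95):
    print(u, round(decode_shadow_price(u, base_price=1.0, lam=0.05), 2))
```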

LAYER 4 · CAPITAL

Caches & RL as Investment

Decide what to cache and what to learn from. A neoclassical capital-accumulation Bellman problem; observes $\Delta C_i$ and $\rho$.
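A sketch of the investment decision for a prefix/KV cache, assuming a hypothetical per-period hit rate and a fixed discount factor; the rule is to cache iff the discounted stream of expected recompute savings beats the discounted storage cost.

```python
def should_cache(hit_rate: float, tokens_saved: int, cost_per_token: float,
                 storage_cost: float, discount: float = 0.95, horizon: int = 30) -> bool:
    """Capital view of a KV/prefix cache: invest iff the discounted stream of
    expected recompute savings beats the discounted storage cost."""
    pv = lambda flow: sum((discount ** t) * flow for t in range(horizon))
    return pv(hit_rate * tokens_saved * cost_per_token) > pv(storage_cost)

print(should_cache(hit_rate=0.3, tokens_saved=2000, cost_per_token=1e-5,
                   storage_cost=0.002))  # True: 0.006/period saved vs 0.002 stored
```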

Each layer charges a different price for what looks, on the API invoice, like the same token. The router prices in dollars per million tokens; the agent in the expected risk of an irreversible action; the serving stack in queueing delay; the trainer in marginal capability gain over a discount horizon. No single layer sees all four prices. This is the structural reason locally rational decisions compose into globally irrational allocations.

Predicted Failure Modes

The unified view sharpens what counts as a failure: a system fails when its allocation deviates predictably from the master equation. The seven recurring failure modes below, observed across heterogeneous LLM systems, are corner cases of that equation in which one of the four prices is pinned at zero or at infinity by a layer that cannot see it.

| Failure mode | Allocation violated | Where observed |
|---|---|---|
| Over-routing | Marginal $V\Delta Q_m < \Delta C_m$ for chosen $m$ | Frontier-default deployments |
| Under-routing | $V\Delta Q_m \gg \Delta C_m$ ignored | Cost-minimizing routers |
| Over-delegation | $\partial R/\partial a$ exceeds $V\,\partial p/\partial a$ | Auto-execute coding/email agents |
| Under-verification | $V\Delta Q_v - \rho\Delta R_v$ positive but $T_v=0$ | Skip-the-tests pipelines |
| Serving congestion | $\lambda \Delta L_i$ un-priced in $\Delta C_i$ | Flat-rate inference APIs |
| Stale RL rollouts | $\delta A_t$ exceeds $g(\cdot)$ at the margin | Long async PPO loops |
| Cache misuse | Reused KV with mismatched $V(x)$ | Naive prefix-cache reuse |

Design Implications

Five principles follow directly from the master equation:

1. Token-aware evaluation

Report the four prices ($V$, $\Delta C_i$, $\lambda$, $\rho$) and the realized allocation per request, not only aggregate accuracy and dollar cost.
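One possible shape for such a report, as a per-request ledger; the field names here are hypothetical, chosen only to mirror the four prices and the realized allocation.

```python
from dataclasses import dataclass, field

@dataclass
class RequestLedger:
    """Per-request record: the four prices plus the realized allocation."""
    V: float                                            # task value
    lam: float                                          # latency shadow price
    rho: float                                          # risk shadow price
    cost_per_use: dict = field(default_factory=dict)    # dC_i by token use
    tokens_per_use: dict = field(default_factory=dict)  # realized allocation

ledger = RequestLedger(
    V=2.0, lam=0.1, rho=0.5,
    cost_per_use={"frontier_model": 0.03, "verifier": 0.005},
    tokens_per_use={"frontier_model": 1800, "verifier": 400},
)
```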

2. Risk-adjusted routing

Publish a regret bound against the routing optimum or an incentive-compatible menu, not a cost–quality scatter plot.
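A sketch of the empirical half of such a bound: measured regret of a routing log against the ex-post optimum, under the assumption that per-tier marginals are known; names and numbers are illustrative.

```python
def surplus(tier: dict, V: float) -> float:
    return V * tier["dQ"] - tier["dC"]

def routing_regret(chosen: list, tiers: dict, V: float) -> float:
    """Average gap between the ex-post optimal tier's surplus and the chosen one's."""
    opt = max(surplus(t, V) for t in tiers.values())
    return sum(opt - surplus(tiers[name], V) for name in chosen) / len(chosen)

tiers = {"small": {"dQ": 0.00, "dC": 0.001}, "frontier": {"dQ": 0.15, "dC": 0.120}}
print(routing_regret(["small", "small", "frontier"], tiers, V=2.0))  # ~0.121
```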

3. Autonomy pricing

Make the action class explicit and price irreversible actions higher than reversible ones: read $\to$ free, draft $\to$ free, commit $\to$ confirm, deploy/transfer $\to$ multi-party approval.
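A minimal sketch of this escalation as a lookup, with a deliberately conservative default for unlisted action classes; the class names follow the menu above.

```python
# Action classes and their gates, mirroring the escalation above.
GATES = {"read": "free", "draft": "free", "commit": "confirm",
         "deploy": "multi_party", "transfer": "multi_party"}

def gate(action_class: str) -> str:
    """Price irreversibility: unknown classes default to a human confirm."""
    return GATES.get(action_class, "confirm")

assert gate("read") == "free" and gate("deploy") == "multi_party"
```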

4. Congestion-priced serving

Expose shadow prices for prefill, decode, and KV resources, so upstream allocators can read them in real time and respond to binding constraints rather than to a flat per-token list price.
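A sketch of the consumer side, assuming the serving stack publishes per-stage shadow prices in some payload (the field names are hypothetical); the upstream allocator folds them into $\Delta C_i$ instead of trusting the flat list price.

```python
def effective_token_cost(list_price: float, shadow: dict) -> float:
    """What an upstream allocator should plug into dC_i: the list price plus
    whatever per-stage shadow prices the serving stack currently exposes."""
    return list_price + sum(shadow.get(k, 0.0) for k in ("prefill", "decode", "kv"))

# Hypothetical payload published by the serving stack at this instant.
live = {"prefill": 0.0, "decode": 0.004, "kv": 0.001}
print(effective_token_cost(0.002, live))  # 0.007: 3.5x the flat list price
```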

5. RL token budgeting

Equalize marginal capability gain across rollouts, verifiers, and updates; depreciate stale rollouts at the rate $\delta$ implied by drift, not at an arbitrary epoch boundary.
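A sketch of drift-implied depreciation, under the labeled assumption that the per-step rate $\delta$ can be tied to measured KL divergence per training step; constants are illustrative.

```python
import math

def rollout_weight(kl_per_step: float, age_steps: int) -> float:
    """Depreciate a stale rollout at the drift-implied rate:
    weight = exp(-delta * age), with delta tied to measured KL per step
    (an assumed proportionality; calibrate in practice), rather than
    dropping rollouts at an arbitrary epoch boundary."""
    delta = kl_per_step
    return math.exp(-delta * age_steps)

print(round(rollout_weight(kl_per_step=0.02, age_steps=10), 3))  # 0.819
```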

Citation

If you find this work useful, please cite:

@misc{zhu2026marginaltoken,
  title  = {Agentic AI Systems Should Be Designed as Marginal Token Allocators},
  author = {Siqi Zhu},
  year   = {2026},
  note   = {Position paper, preprint}
}