University of Illinois Urbana-Champaign
This position paper argues that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. Following a single request through four economic layers — a router, an agent, a serving stack, and a training pipeline — we show that all four are solving the same first-order condition: marginal benefit equals marginal cost plus latency cost plus risk cost, with different index sets and different prices.
Adopting marginal token allocation as the shared accounting object explains why systems that locally minimize tokens globally misallocate them, predicts a small set of recurring failure modes, and points to a concrete research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
Routers, agents, serving stacks, and trainers look like four different engineering problems. They are not. They are four readings of one allocation problem, evaluated at four shadow prices that today no single layer can see. The prescription is not centralization; it is shared price discovery.
Let an LLM system face a stream of tasks. For each task it has a finite set of token uses, indexed by $i$ — cheap model, frontier model, retrieval, planning, tool call, verifier, prefill, decode, KV transfer, RL rollout, gradient update. Each use $i$ has marginal quality $\Delta Q_i$, marginal cost $\Delta C_i$, marginal latency $\Delta L_i$, and marginal risk $\Delta R_i$. The system should spend the next token on:

$$i^{*} \;=\; \arg\max_i \left[\, V\,\Delta Q_i \;-\; \Delta C_i \;-\; \lambda\,\Delta L_i \;-\; \rho\,\Delta R_i \,\right],$$
where $V$ is task value (set by the user), $\lambda$ is the shadow price of latency (set by the SLA), and $\rho$ is the shadow price of risk (set by the safety team). At an interior optimum, the Marshallian equimarginal condition holds — the marginal benefit of a token equals its full marginal cost, once latency and risk are properly priced.
The four prices in this equation are not chosen by fiat; they are the dual variables (Lagrange multipliers) of the constrained primal of token allocation, and by the first welfare theorem, if all four layers maximize their own component taking the same prices as given, the allocation is Pareto efficient.
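The master equation can be made concrete with a short sketch. The following is an illustrative implementation, not a reference one: the `TokenUse` records and all numeric values are hypothetical, and the stopping rule (return nothing once no use clears its full marginal cost) is the interior-optimum condition stated above.

```python
from dataclasses import dataclass

@dataclass
class TokenUse:
    name: str
    dq: float  # marginal quality  (Delta Q_i)
    dc: float  # marginal cost     (Delta C_i)
    dl: float  # marginal latency  (Delta L_i)
    dr: float  # marginal risk     (Delta R_i)

def next_token_use(uses, V, lam, rho):
    """Return the use maximizing V*dQ - dC - lam*dL - rho*dR,
    or None when no use has positive adjusted marginal benefit
    (the equimarginal stopping rule)."""
    def net(u):
        return V * u.dq - u.dc - lam * u.dl - rho * u.dr
    best = max(uses, key=net)
    return best if net(best) > 0 else None

# Hypothetical menu of token uses for one task.
uses = [
    TokenUse("cheap_model",    dq=0.10, dc=0.01, dl=0.1, dr=0.00),
    TokenUse("frontier_model", dq=0.30, dc=0.20, dl=0.5, dr=0.00),
    TokenUse("verifier",       dq=0.05, dc=0.02, dl=0.2, dr=-0.05),
]
choice = next_token_use(uses, V=1.0, lam=0.1, rho=0.5)
```

Under these illustrative prices, the cheap model's net benefit (0.08) beats the frontier model's (0.05): the frontier upgrade's quality gain does not cover its cost and latency at this $V$.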
Consider a developer who types "the CI test on auth/login is failing — fix it" into a coding agent. Before a single line of code is touched, the system has already made four economic decisions, each priced differently:
- **Router** — chooses model tier under hidden user type; solves a Mirrlees–Spence screening problem; observes $V$ and $\Delta C_i$.
- **Agent** — decides autonomy and splits tokens among read, plan, edit, verify; a team-production problem; observes $\rho$ and $V$.
- **Serving stack** — produces tokens via prefill, decode, and KV cache; a multi-stage production problem with congestion externalities; observes $\lambda$ and $\Delta C_i$.
- **Trainer** — decides what to cache and what to learn from; a neoclassical capital-accumulation Bellman problem; observes $\Delta C_i$ and $\rho$.
Each layer charges a different price for what looks, on the API invoice, like the same token. The router prices in dollars per million; the agent in expected risk of an irreversible action; the serving stack in queueing delay; the trainer in marginal capability gain over a discount horizon. No single layer sees all four prices. This is the structural reason locally rational decisions compose into globally irrational allocations.
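The composition failure can be sketched in a few lines. All numbers below are hypothetical; the point is structural: a layer that zeroes out the prices it cannot see makes a locally rational choice that disagrees with the full four-price optimum.

```python
def net(u, V, lam, rho):
    """Adjusted marginal benefit of a candidate u = (dQ, dC, dL, dR)."""
    dq, dc, dl, dr = u
    return V * dq - dc - lam * dl - rho * dr

# Hypothetical candidates: (dQ, dC, dL, dR).
candidates = {
    "cheap_model":    (0.10, 0.01, 0.1, 0.02),
    "frontier_model": (0.30, 0.05, 0.9, 0.10),
}
V, lam, rho = 1.0, 0.3, 1.0

# A router that sees V and dC but prices latency and risk at zero
# prefers the frontier model...
router_pick = max(candidates, key=lambda k: net(candidates[k], V, 0.0, 0.0))

# ...while the full four-price accounting prefers the cheap model,
# because the frontier model's latency and risk costs exceed its edge.
global_pick = max(candidates, key=lambda k: net(candidates[k], V, lam, rho))
```

Here `router_pick` is `"frontier_model"` while `global_pick` is `"cheap_model"`: the same request, the same candidates, two answers, differing only in which shadow prices are visible.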
The unified view sharpens what counts as a failure: a system fails when its allocation deviates predictably from the master equation. Seven recurring failure modes across heterogeneous LLM systems are corner cases of the same equation when one of the four prices is held at zero or at infinity by a layer that does not see it.
| Failure mode | Violated marginal condition | Where observed |
|---|---|---|
| Over-routing | Marginal $V\Delta Q_m < \Delta C_m$ for chosen $m$ | Frontier-default deployments |
| Under-routing | $V\Delta Q_m \gg \Delta C_m$ ignored | Cost-minimizing routers |
| Over-delegation | $\partial R/\partial a$ exceeds $V\,\partial p/\partial a$ | Auto-execute coding/email agents |
| Under-verification | $V\Delta Q_v - \rho\Delta R_v$ positive but $T_v=0$ | Skip-the-tests pipelines |
| Serving congestion | $\lambda \Delta L_i$ un-priced in $\Delta C_i$ | Flat-rate inference APIs |
| Stale RL rollouts | $\delta A_t$ exceeds $g(\cdot)$ at the margin | Long async PPO loops |
| Cache misuse | Reused KV with mismatched $V(x)$ | Naive prefix-cache reuse |
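The table's conditions are directly checkable. As one hedged illustration (function name, `slack` threshold, and all numbers are assumptions, not part of the paper), here is an audit for the first two rows: over-routing, where the chosen model's marginal quality does not cover its cost, and under-routing, where a skipped upgrade's benefit far exceeds its cost.

```python
def audit_routing(V, chosen, alternatives, slack=5.0):
    """chosen and alternatives are (dQ, dC) pairs.
    Returns the list of violated table conditions."""
    flags = []
    dq, dc = chosen
    if V * dq < dc:
        flags.append("over-routing")       # V*dQ_m < dC_m for chosen m
    for dq_a, dc_a in alternatives:
        # "V*dQ >> dC ignored": benefit clears cost by a large factor
        # and beats the chosen model's benefit.
        if V * dq_a > slack * dc_a and V * dq_a > V * dq:
            flags.append("under-routing")
            break
    return flags

# Hypothetical request: a weak model was chosen while a far better
# upgrade was available at similar cost.
flags = audit_routing(V=1.0, chosen=(0.02, 0.05), alternatives=[(0.40, 0.06)])
```

Both flags fire for this request, which is exactly the pattern of a cost-minimizing router serving a high-value task.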
Five principles follow directly from the master equation:
1. Report the four prices ($V$, $\Delta C_i$, $\lambda$, $\rho$) and the realized allocation per request, not only aggregate accuracy and dollar cost.
2. Publish a regret bound against the routing optimum or an incentive-compatible menu, not a cost–quality scatter plot.
3. Make the action class explicit and price irreversible actions higher than reversible ones — read $\to$ free, draft $\to$ free, commit $\to$ confirm, deploy/transfer $\to$ multi-party.
4. Expose shadow prices for prefill, decode, and KV resources, so upstream allocators can read them in real time and respond to binding constraints rather than to a flat per-token list price.
5. Equalize marginal capability gain across rollouts, verifiers, and updates; depreciate stale rollouts at the rate $\delta$ implied by drift, not at an arbitrary epoch boundary.
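The autonomy-pricing ladder in the third principle reduces to a small lookup. This is a minimal sketch with assumed gate names; a real deployment would attach approvers, timeouts, and audit logs to each gate.

```python
# Approval ladder from the autonomy-pricing principle: reversible
# actions are free, irreversible ones require escalating approval.
APPROVAL_LADDER = {
    "read":     "free",
    "draft":    "free",
    "commit":   "confirm",      # single human confirmation
    "deploy":   "multi_party",  # multiple approvers required
    "transfer": "multi_party",
}

def gate(action: str) -> str:
    """Map an action class to its required approval level.
    Unknown action classes default to the strictest gate, so an
    unpriced action can never slip through as free."""
    return APPROVAL_LADDER.get(action, "multi_party")
```

Defaulting unknown actions to the strictest gate is the fail-safe reading of the principle: an action whose risk price $\rho\,\Delta R$ has not been assessed is treated as if that price were infinite until a human sets it.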
If you find this work useful, please cite: