The Last Mile is Action: Agentic AI in Execution

The Decision That Sat for Six Weeks

A VP of Customer Operations I worked with last spring had a queue-routing change in flight. The shape was straightforward: shift roughly 18% of inbound contact-center traffic away from a tier-one outsourced vendor whose handle times had drifted twenty seconds above contract over a quarter, and onto a smaller domestic vendor whose quality scores were better and whose unit cost, after the routing change, was still cheaper on a fully-loaded basis. The recommendation had been produced by a workforce-management analyst, vetted by an internal cost model, debated for an hour in a Tuesday operations review, and approved on a slide that said "Approved — implement next cycle."

That was six weeks before our conversation. The routing rules had not moved.

When I asked her what was happening, she walked me through the same shape of answer I described in Part 2, only one rung further down. The committee approved. The named owner was identified: workforce management. WFM filed a change request with the contact-center platform team. The platform team had a freeze in week two for an unrelated holiday-season cutover. Week three the platform team's lead changed jobs. Week four the new lead asked, reasonably, whether the routing change had been re-validated against the new contract addendum that vendor procurement had just signed. Week five the re-validation produced a slightly different number (savings now $312K annual instead of $340K), which kicked the change back to the committee for a re-approval that was scheduled for week seven. Meanwhile the vendor whose performance had drifted was still receiving 18% of traffic that the organization had decided, six weeks earlier, not to send them.

This is the third stall in the series, and it has a different shape than the first two: not an information problem (the cost model was clear), not a recommendation problem (the analyst produced a ranked option set with a default), not even a governance problem in the sense of Part 2 (there was a named owner), but a latency problem and a supervision problem, tightly coupled. The decision had been made. The system that the decision needed to flow into could not absorb it at the speed at which it had been made. By the time the system absorbed it, the decision had gone stale and had to be made again.

The seam between decision and action (the last mile, in the language the Decisioning Platform research uses) is where the rest of the framework either pays out or quietly evaporates. The first two articles argued for changing what the team produces and what the leader is accountable for. This article argues for changing what the execution surface looks like, and what the human's job becomes when the surface is fast enough that no individual event can be supervised one at a time.

Why the Seam Is Execution, Not Deliberation

The instinct, when a decision sits in execution for six weeks, is to ask whether we should have decided differently. We should not have. The decision was right. It was, by the time it landed, on the wrong side of the freshness clock; the failure mode was not analytical, and adding more deliberation to the front of the workflow only makes the latency worse.

What the contact-center example illustrates is a category of failure that the Decisioning Platform research names directly. The platform's roadmap describes three phases: foundational automation and data unification, predictive modeling and guided intelligence, and (in Phase 3) autonomous execution and continuous learning, where "fully autonomous systems carry out decisions moment-by-moment" and "every interaction is fed back into the system's Data Flywheel." The seam between Phase 2 and Phase 3 is the seam this article is about. Phase 2 produces a recommendation surface that a human approves. Phase 3 produces an execution surface that a human supervises. Those are not the same artifact, and most organizations are still operating Phase 2 architecture against Phase 3 expectations.

What changes between the two is not the wisdom of the decision. It is the latency between deciding and acting, and the instrumentation required when that latency collapses. In the Phase 2 world the contact-center change took six weeks because each handoff between WFM, platform engineering, vendor procurement, and the committee was a serialized human step, and each step had its own queue, its own reviewer, and its own opportunity for the change to lose context. In the Phase 3 world the same routing change is implemented by an agent that has documented authority to adjust traffic-shaping rules within a defined envelope, executes the change inside a maintenance window without re-litigating the cost model, and reports the outcome (actual handle times, actual unit costs, actual quality scores) back to the supervisor as a population statistic rather than as a single approval ticket.

The technical substrate that makes Phase 3 possible has gotten genuinely better in the past two years. Anthropic's recent work on agentic computer use (the model controlling a real desktop, navigating real applications, taking real actions on behalf of a user) is one signal that the capability layer is no longer the binding constraint. OpenAI and other labs have published similar agent framings. The European Nexus for Strategic Intelligence's Agentic Startups: The Opportunity Clusters maps a broad set of vertical agent categories (customer operations, finance ops, IT support, sales) where the execution layer is now buildable on commodity infrastructure. The Decisioning Platform research summarizes the shift directly: branding is moving from "providing information" to "providing intelligence and execution," with autonomous role-based agents (iWorkers, in some vendor framings) that can "adjust bids, shift budgets, or monitor health signals 24/7." That is the language of an execution surface, not a recommendation surface.

The capability layer is not the binding constraint. The binding constraint is what happens to human supervision when an execution surface can act faster than a human can review individual events.

This is the part of the shift that organizations underrate, and it is worth being precise about. In the recommendation-and-decide world a human supervises the decision before the action. They look at the recommendation, weigh it, and approve or override. The supervisory unit is the individual decision. In the agentic-execution world a human can no longer be the gating step on every action (the throughput would collapse), so the supervisory unit shifts from the individual event to the firing pattern of the agent. The human is no longer asking "is this specific action correct?" They are asking "is the population of actions this agent has taken in the last hour, day, or week consistent with the policy I set?" That is a different cognitive job. It is closer to running a clinical trial, or supervising a high-frequency trading desk, than it is to approving a slide.

The Decisioning Platform research's "No Black Box" governance pillar (that every decision must be logged, audited, and explainable) is the ground floor of this shift but not the whole of it. Logging is necessary but not sufficient. What the supervisor needs in Phase 3 is not a log of every action; it is distributional telemetry over the agent's behavior. Drift detection. Rate monitoring. Population-level quality scores. The kind of statistical surface that lets a human notice that something is changing in the agent's firing pattern before an individual error becomes a public incident. The EU AI Act's high-risk-systems requirements are converging on this same point from the regulatory side: human oversight of an automated decision system is not satisfied by a human approving each output, because at scale that is impossible. It is satisfied by a human being in a position to "oversee, override, and bear responsibility for" the system, which is a population-level property, not an event-level one.

The seam between decision and action, in 2026, is therefore not about wisdom. It is about whether the organization has built an execution layer that can absorb decisions at the speed they are made, and whether the supervisor of that layer has the right surface (distributional, statistical, drift-aware) to do their actual job.

What Action Seeking Actually Looks Like

I want to be precise about the operational pattern, because "agentic execution" is a phrase that covers everything from an autonomous trading desk to Zapier with a chat interface, and the difference matters. By "action seeking" I mean a specific configuration that has, in my experience, four parts. All four are necessary; the pattern fails reproducibly when any one is missing.

Part one: a defined policy envelope, not a defined script. The agent is given a policy (a set of rules and constraints about what it is and is not allowed to do, with explicit numerical bounds) rather than a step-by-step playbook. In the contact-center routing example, the routing agent may shift traffic between approved vendors within ±20% of the baseline allocation, may not exceed any single vendor's contracted ceiling, and may not make a change that drops any traffic class below 95% of its baseline quality score. That is a policy. It is not a script. The agent is allowed to make any decision inside the envelope; it is required to pause and report if an action would cross the envelope. This is the same architectural shape as AWS auto-scaling, where a deployed policy ("scale up if average CPU exceeds 70% for five minutes, scale down if it drops below 30% for ten") allows the platform to take thousands of scaling actions per day inside the envelope without a human approving each one. The human's job is to set the envelope, not to approve the events.
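A minimal sketch of what such an envelope could look like in code, assuming the ±20% bound is read as percentage points of total traffic per vendor. The class name, field names, and vendor labels are all hypothetical; nothing here is taken from a real routing platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingEnvelope:
    """Hypothetical policy envelope for a routing agent (illustrative names)."""
    baseline_share: dict[str, float]      # vendor -> baseline fraction of traffic
    contracted_ceiling: dict[str, float]  # vendor -> max fraction allowed by contract
    max_shift: float = 0.20               # assumption: +/-20 points around baseline
    min_quality_ratio: float = 0.95       # floor: 95% of baseline quality score


def check_proposal(env, proposed_share, quality_ratio):
    """Return a list of violations; an empty list means 'inside the envelope'.

    proposed_share: vendor -> proposed fraction of traffic
    quality_ratio:  traffic class -> projected quality / baseline quality
    """
    violations = []
    if abs(sum(proposed_share.values()) - 1.0) > 1e-6:
        violations.append("shares do not sum to 100% of traffic")
    for vendor, share in proposed_share.items():
        if vendor not in env.baseline_share:
            violations.append(f"{vendor}: not an approved vendor")
            continue
        if abs(share - env.baseline_share[vendor]) > env.max_shift + 1e-9:
            violations.append(f"{vendor}: shift exceeds allowed band")
        if share > env.contracted_ceiling.get(vendor, 1.0) + 1e-9:
            violations.append(f"{vendor}: above contracted ceiling")
    for traffic_class, ratio in quality_ratio.items():
        if ratio < env.min_quality_ratio:
            violations.append(f"{traffic_class}: quality below floor")
    return violations


def decide(env, proposed_share, quality_ratio):
    """Execute inside the envelope; otherwise pause and report the reasons."""
    violations = check_proposal(env, proposed_share, quality_ratio)
    if violations:
        return "pause_and_report", violations
    return "execute", []
```

The design point is that `decide` never asks a human about an in-envelope action. The only path that reaches a person is the one that would cross a bound.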

Part two: explicit human-override windows that interrupt mid-sequence. This is the part most often missing from naive agentic deployments, and it is the one I would refuse to ship a production agent without. The supervisor must be able to interrupt the agent in the middle of a sequence, not only at the beginning or the end. If the agent is going to execute a five-step migration of inbound traffic from one vendor to another over a forty-five-minute window, the supervisor must be able to halt the migration at minute twelve if something looks wrong, with the partial state cleanly recoverable. Cloudflare's automated mitigation patterns, where a DDoS-mitigation policy can be aggressively pre-empted by a human operator via a single dashboard control even after the policy has begun firing, are a clean version of this. Stripe's machine-learning-driven payment-decisioning systems publish the same property: rules can be overridden in flight, partial states are recoverable, and the override is logged with a rationale (echoing the dissent-on-record discipline from Part 2). The midstream override is what distinguishes a supervisor from a button-pusher. Without it, "human in the loop" is a slogan rather than a control.
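The midstream override can be sketched as a sequence runner that checks a halt signal before every step and unwinds completed steps when the signal is set. This is an illustration under assumptions: `Step`, `Override`, and `run_migration` are hypothetical names, and rolling back to the starting state is only one possible recovery policy (holding at the last good step is another).

```python
import threading
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One reversible step of a multi-step change (hypothetical structure)."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class Override:
    """Supervisor-side control: a halt flag plus the rationale for the record."""
    halt: threading.Event = field(default_factory=threading.Event)
    rationale: str = ""

    def trigger(self, rationale: str) -> None:
        self.rationale = rationale
        self.halt.set()


def run_migration(steps, override, audit_log):
    """Apply steps in order, checking for a human halt before each one.

    On halt, undo completed steps in reverse order so the partial state
    is recoverable, and log the supervisor's rationale.
    """
    done = []
    for step in steps:
        if override.halt.is_set():
            for prior in reversed(done):
                prior.undo()
                audit_log.append(f"undone: {prior.name}")
            audit_log.append(f"halted before {step.name}: {override.rationale}")
            return "rolled_back"
        step.apply()
        done.append(step)
        audit_log.append(f"applied: {step.name}")
    return "completed"
```

In a real deployment the halt flag would be set from a separate thread or service while the runner waits between steps. The check-before-each-step structure is what makes "minute twelve" a valid place to stop.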

Part three: distributional telemetry as the supervisory surface. The supervisor's primary instrument is not a log of individual actions. It is a small set of population-level statistics: firing rate (how often the agent is acting), drift (how the distribution of the agent's actions is changing over time), error rate (how often the agent's actions produce a downstream signal that the policy was violated), and a few task-specific quality moments. The same approach is documented in Sequential Monte Carlo reasoning for inference-time supervision: keep multiple hypotheses, track the distribution, and intervene on distributional shift. The supervisor is asking, every morning: did the agent's behavior yesterday look like the agent's behavior the week before? If the answer is no, drill in. If the answer is yes, do not.

This is the shift that is hardest to internalize for leaders who came up in the dashboard-and-approval era. The supervisor's question stops being "was this action correct?" and becomes "is the rate of correct actions consistent with the policy I set, and is it changing in a direction that should worry me?" That is closer to the cognitive work of an SRE looking at p99 latency over time than the work of a manager approving expense reports. It is a real change in what the leader is looking at, what they are looking for, and how they are spending their attention.
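The morning question ("did yesterday look like the week before?") can be sketched with two population-level numbers: the change in firing rate, and the total variation distance between yesterday's action mix and the baseline mix. Total variation distance is my choice for the illustration, not something the article's sources prescribe, and the thresholds and function names are hypothetical.

```python
from collections import Counter


def action_distribution(actions):
    """Map each action type to its share of all actions taken."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    kinds = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in kinds)


def morning_check(baseline_actions, baseline_days, yesterday_actions,
                  drift_threshold=0.10, rate_tolerance=0.25):
    """Compare yesterday's behavior with the baseline window.

    Thresholds are illustrative assumptions; real values would be
    calibrated against the agent's normal day-to-day variation.
    """
    if not baseline_actions or not yesterday_actions:
        raise ValueError("both windows need at least one action")
    baseline_rate = len(baseline_actions) / baseline_days
    rate_change = (len(yesterday_actions) - baseline_rate) / baseline_rate
    drift = total_variation(action_distribution(baseline_actions),
                            action_distribution(yesterday_actions))
    drill_in = drift > drift_threshold or abs(rate_change) > rate_tolerance
    return {"drift": drift, "rate_change": rate_change, "drill_in": drill_in}
```

A supervisor reading this output looks at three values per agent per day, not at thousands of log lines, and opens the logs only when `drill_in` is true.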

Part four: a feedback loop that updates the policy envelope, not just the agent. The Decisioning Platform research describes the Data Flywheel: every interaction fed back into the system, every outcome refining the model. This is correct as far as it goes, but the part I want to add explicitly is that the feedback should refine the policy envelope, not only the agent's internal weights. If the supervisor notices over a quarter that the agent is consistently bumping the upper edge of the routing envelope at peak hours, the right response is to widen or narrow the envelope deliberately, with a written rationale, and to do so as a governed change. The envelope is a contract between the human and the agent. The contract should evolve. It should evolve transparently, with version history, and with the supervisor (not the agent) making the call. This is the architectural counterpart to the dissent-on-record pattern in Part 2: a written, auditable trail of how the human's mental model of the system has changed over time.
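One way to picture the envelope as an evolving contract is an append-only version history in which every amendment needs a named human approver and a written rationale. The sketch below is hypothetical throughout (`EnvelopeVersion`, `EnvelopeHistory`, and the approver names are invented for illustration) and tracks a single bound to stay short.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EnvelopeVersion:
    """One immutable version of the envelope (only one bound shown)."""
    version: int
    max_shift: float
    approved_by: str
    rationale: str
    effective: date


class EnvelopeHistory:
    """Append-only record of how the envelope has changed, and why."""

    def __init__(self, initial: EnvelopeVersion, human_approvers: set[str]):
        self._versions = [initial]
        self._human_approvers = set(human_approvers)

    @property
    def current(self) -> EnvelopeVersion:
        return self._versions[-1]

    @property
    def versions(self) -> tuple:
        return tuple(self._versions)

    def amend(self, max_shift, approved_by, rationale, effective):
        """Record a governed change; the agent itself cannot approve one."""
        if approved_by not in self._human_approvers:
            raise PermissionError(f"{approved_by} may not amend the envelope")
        if not rationale.strip():
            raise ValueError("an amendment needs a written rationale")
        new = EnvelopeVersion(self.current.version + 1, max_shift,
                              approved_by, rationale.strip(), effective)
        self._versions.append(new)
        return new
```

Earlier versions are never edited in place, so the history doubles as the audit trail of how the supervisor's view of the system changed over time.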

A note on what this is not. Action seeking is not "let the agent do whatever it wants and intervene if anything goes wrong." That posture fails reproducibly. The pattern above is a much narrower claim: agents can act inside an envelope at machine speed, humans set the envelope and supervise its boundaries at human speed, and the seam between the two is instrumented with distributional telemetry rather than event-level approval. This is the shape of every production agentic system I have looked at that has not blown up. The ones that have blown up (and there have been a few well-publicized examples in the past year) share a common failure mode: a missing or mis-sized envelope, or a missing midstream override, or a supervisor who was looking at logs instead of distributions.

The substrate matters less than the contract, which is the same point I made about governance in Part 2. You can build action seeking on commodity workflow tools if the four pieces above are present. You can fail to build it on the most expensive agentic platform in the market if any one of the four is missing. The technology is necessary. It is not what determines whether the seam closes.

What I Am Still Figuring Out

The four-part pattern works well when failures are predictable. In deterministic systems, failures usually follow patterns you can see in the data: an input outside an expected range, a value crossing a threshold, a rate that drifts. Telemetry catches those.

AI systems fail differently. Most of them work on unstructured data, which means the failure can come from something as small as a typo, or a phrasing the model misinterpreted, or a context shift the system did not flag. None of those show up cleanly as "inputs outside an expected range." They show up as one wrong action that no statistical view of the previous thousand actions would have predicted.

I do not yet have a clean answer for how to supervise this part. The pattern in this article is necessary; it is not yet sufficient. If you have run a production agentic system and have learned how to catch the failure modes that do not look like statistical drift, I would genuinely like to hear how you set up that early-warning surface. The literature is still shallow; operator notes are the better source.

Closing Reflection: The I-R-D-A Axis

This series has walked one axis. Information to Recommendation to Decision to Action. I → R → D → A. Four stages, three seams. The argument has been that information-rich organizations stall not at the rungs but at the seams, and that each seam has a different shape, a different failure mode, and a different fix.

The seam between information and recommendation is cognitive contract: what the analyst hands the leader. The seam between recommendation and decision is governance: who is named, with what authority, with what downside. The seam between decision and action is execution latency and statistical supervision: whether the system can absorb the decision at the speed it was made, and whether the supervisor has the right instruments to watch a population of agent actions rather than approve individual ones.

Each seam closes with a different artifact. A ranked-options view. A decision-rights register with dissent on record. A policy envelope with distributional telemetry and a midstream override. The artifacts are different. The principle underneath them is the same: at every seam, the work is to make explicit a contract that has been implicit, and to put a single named human in the position of owning that contract and being able to evolve it.

There is a meta-story I want to flag, because it is too cute not to. This site is built and run with AI in the loop. The drafts are produced by Claude. The voice is calibrated by reading prior published articles and matching the patterns. The infographics are generated as SVG by an agent following a constrained palette. The engine itself is a pipeline of recommendation, decision, and action seams; the human in the loop, which is to say me, is doing exactly the supervisory work this article describes. Setting envelopes. Reviewing distributions. Overriding mid-sequence when something drifts. The framework is not just a thesis. It is the architecture of how this brand publishes.

→ Part 1: From Dashboards to Recommendations. Why information-rich orgs stall.

→ Part 2: The Trust Gap. Why recommendations don't always become decisions.

→ See the full framework: The Decision Effectiveness Framework.

