
Agentic Work Must Be Time-Bound

A durable workflow can be paused forever, and that is exactly the footgun. The longer an agent runs, the more loops, code drift and world drift it picks up. The fix is to make agentic work strictly time-bound, so a failed attempt restarts in a clean conversation instead of dragging stale assumptions forward.
Petko D. Petkov, on a break from CISO duties, building cbk.ai

AI models can run for longer and longer. The assumption that rides along with that is that the whole job has to happen in one execution, in one shot.

This is the case durable workflows make. A durable workflow is a state machine. Each inference completion gets scheduled in a queue, dispatched, and the state of that operation is written to a database, then cycled back in, and so on. It is a sound piece of architecture and we use it internally for what we call Tasks. The selling point, the thing that makes it interesting, is that the execution can be paused. That is also the footgun.
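The cycle described above can be sketched in a few lines. This is a minimal illustration, not how Tasks or any particular framework is implemented: the `DurableWorkflow` class, the `run_step` stand-in for an inference completion, the in-process queue and the SQLite table are all assumptions made for the example.

```python
import json
import sqlite3
from collections import deque


def run_step(state: dict) -> dict:
    """Hypothetical stand-in for one inference completion."""
    state = dict(state)
    state["count"] += 1
    state["done"] = state["count"] >= 3
    return state


class DurableWorkflow:
    """Queue a step, dispatch it, persist the state, cycle it back in."""

    def __init__(self, db: sqlite3.Connection):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs (id TEXT PRIMARY KEY, state TEXT)"
        )
        self.queue = deque()

    def _save(self, run_id: str, state: dict) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO runs (id, state) VALUES (?, ?)",
            (run_id, json.dumps(state)),
        )
        self.db.commit()

    def load(self, run_id: str) -> dict:
        row = self.db.execute(
            "SELECT state FROM runs WHERE id = ?", (run_id,)
        ).fetchone()
        return json.loads(row[0])

    def start(self, run_id: str, initial: dict) -> None:
        self._save(run_id, initial)
        self.queue.append(run_id)

    def tick(self) -> bool:
        """Dispatch one queued step. Returns False when nothing is queued."""
        if not self.queue:
            return False
        run_id = self.queue.popleft()
        state = run_step(self.load(run_id))
        self._save(run_id, state)
        if not state["done"]:
            self.queue.append(run_id)  # cycle the run back in
        return True
```

Pausing is simply not calling `tick()` for a while. Nothing in this loop knows or cares how long that while is, which is the property the rest of this post argues against.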

The approach makes sense when a task is split into small steps that run consistently. It makes less and less sense the longer the execution goes, because the longer it runs, the higher the chance that something goes badly wrong. And there are many things that can go really, really wrong.

Start with the obvious one, the one none of the frameworks I have reviewed actually handle. Execution loops. The workflow is persisted, so in theory it can run forever. A rogue workflow stuck in a loop is a real risk, and an expensive one. This is the same family of failure I wrote about in "loops, cycles, and runaways", except now it has a database keeping it alive.

Then there is state. The state is preserved at the step level, but that does not mean the code you have deployed to operate on that step still accommodates it. Change the agent and you can introduce a state that can no longer be reconciled, and the whole execution fails. There are ways to detect code drift. I am not going to go into them here. Just know it is not a free lunch.
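For illustration only, one common way to at least notice this kind of drift is to stamp every persisted step with the version of the state layout that wrote it, and refuse to resume on a mismatch. The names here (`CODE_STATE_VERSION`, `PersistedStep`, `StateDriftError`) are hypothetical and not taken from any framework.

```python
from dataclasses import dataclass

# Hypothetical: the state layout version the currently deployed agent code understands.
CODE_STATE_VERSION = 2


class StateDriftError(Exception):
    """The persisted state was written by code with a different state layout."""


@dataclass
class PersistedStep:
    state_version: int  # version of the code that wrote this step
    payload: dict


def resume(step: PersistedStep) -> dict:
    """Return the payload only if the deployed code can still interpret it."""
    if step.state_version != CODE_STATE_VERSION:
        raise StateDriftError(
            f"state written by v{step.state_version}, "
            f"deployed code expects v{CODE_STATE_VERSION}"
        )
    return step.payload
```

This only detects the mismatch. Reconciling old state with new code (migrations, pinning old runs to old code) is the part that is not free.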

The next one is external state change. The longer a workflow sits paused, the more the world drifts underneath it. For example, an agent pulls the current time through a tool call. The workflow then pauses for a day or two. When the agent wakes up, time has moved on, but the agent has no idea unless it asks for the time again. You have to build that in. You can stamp the time into the system prompt, which is bad for token caching, or you can force a time signal in somewhere else, but either way you have to solve it. Time is just the easy example. In a more complex system the drift is far worse. It is the oldest bug in the book: time-of-check to time-of-use (TOCTOU), a plain race condition.
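A rough sketch of what "building that in" can look like: every reading of external state carries the moment it was taken, and anything older than a tolerance is fetched again before use. `Observation`, `use_or_refetch` and `max_age_s` are illustrative names, and the injectable `now` parameter exists only so the behaviour can be exercised without waiting.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Observation:
    value: Any
    observed_at: float  # wall-clock seconds, so it survives a pause and restart


def use_or_refetch(
    obs: Optional[Observation],
    fetch: Callable[[], Any],
    max_age_s: float,
    now: Callable[[], float] = time.time,
) -> Observation:
    """Return obs if it is still fresh, otherwise check the world again."""
    current = now()
    if obs is None or current - obs.observed_at > max_age_s:
        return Observation(fetch(), current)
    return obs
```

The gap between check and use never closes completely. All a guard like this does is put an upper bound on it.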

The fix is obvious and somehow not obvious at the same time. Stop thinking of an agent as a step function that can run to infinity. Think of its work as time-bound work that has to happen inside a limited window. In CBK the Trigger integration, and every other integration, carries a maximum execution time of fifteen minutes. The agent has to get the job done inside that window. If it cannot, it can try again, moments or days later, but that attempt lands in a completely new conversation, one that carries none of the assumptions of the last one. That is what actually solves the problem.
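As a sketch of the idea, and not of CBK's actual implementation: each attempt builds a brand-new conversation and works against a deadline. The fifteen-minute figure comes from the text above. `run_attempt`, `DeadlineExceeded` and the shape of the `step` callable are assumptions for illustration.

```python
import time
from typing import Callable, List, Optional

MAX_EXECUTION_S = 15 * 60  # the fifteen-minute window described above


class DeadlineExceeded(Exception):
    """The attempt did not finish inside its window."""


def run_attempt(
    step: Callable[[List[str]], Optional[str]],
    max_seconds: float = MAX_EXECUTION_S,
    now: Callable[[], float] = time.monotonic,
) -> str:
    """Run one attempt. `step` is a hypothetical agent turn that appends to the
    conversation and returns a result when finished, or None to keep going."""
    conversation: List[str] = []  # always fresh, nothing carried over
    deadline = now() + max_seconds
    # Cooperative check between turns: a single hung turn is not interrupted here.
    while now() < deadline:
        result = step(conversation)
        if result is not None:
            return result
    raise DeadlineExceeded(f"no result within {max_seconds} seconds")
```

A retry is just another call to `run_attempt`, moments or days later. Because the conversation is created inside the function, there is no way for the previous attempt's assumptions to leak into the next one.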

For a realtime system like Slack this falls out naturally. The user needs some kind of response quickly. It does not mean a complex job finishes in fifteen minutes, but any planning, anything technical, fits inside it. For a trigger you configure how often it fires, and depending on the session duration those triggers land in different states. That sounds like extra work for the agent, and it is exactly the point. It has to fetch the time again, because the world has drifted and the old reading is no longer reliable.

This pattern runs through every one of our subsystems except Tasks. Tasks are the one place built to run for a long time, on a state machine much like a durable workflow, with one difference. They are time-bound too. The limit is configurable: the default is one hour and a limited number of iterations, and it can never be unlimited or unbounded.
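One way to make "never unbounded" a structural property rather than a convention is to reject unbounded budgets at construction time. This is a minimal sketch under assumptions: the one-hour default is from the text, while `TaskBudget`, `run_task` and the iteration default of 50 are made up for the example.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TaskBudget:
    max_seconds: float = 3600.0  # one-hour default, as described above
    max_iterations: int = 50     # hypothetical default; the text only says "limited"

    def __post_init__(self) -> None:
        # An infinite, zero or negative budget is rejected outright.
        if not (math.isfinite(self.max_seconds) and self.max_seconds > 0):
            raise ValueError("max_seconds must be finite and positive")
        if self.max_iterations <= 0:
            raise ValueError("max_iterations must be a positive integer")


def run_task(
    step: Callable[[], bool],
    budget: TaskBudget = TaskBudget(),
    now: Callable[[], float] = time.monotonic,
) -> str:
    """Call step() until it reports completion or either limit is hit."""
    deadline = now() + budget.max_seconds
    for _ in range(budget.max_iterations):
        if now() >= deadline:
            return "timed_out"
        if step():
            return "done"
    return "iterations_exhausted"
```

The iteration cap is also what stops the execution-loop failure from earlier: a task stuck in a cycle runs out of iterations instead of running forever.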

You might ask what happens when a workflow needs an approval. It does not work, and I went into why in "human in the loop, just not like this". In theory it sounds great. In practice it is another footgun. The moment an agent needs an approval, the whole concept of authorisation has to change.

None of this stops an agent from failing badly. It can still fail in ways you did not see coming. What it does is shrink the surface. Context and world drift, authentication and authorisation, all of it gets smaller.

We did not build it this way to be different. We learned the hard way that it is the only thing that makes sense. So why are the framework authors not doing the same? Because they are building AI frameworks, and a platform is a different thing. A framework hands these decisions to the developer, who then shoots himself in the foot and pays a high price. A platform cannot afford that, because it shows up as a poor customer experience, so we are the ones who have to fix it.

An agent with no deadline is just a problem waiting to happen.