Planner-Executor-Evaluator Harness
A three-agent task harness inspired by Anthropic's research on long-running application development. A System bot orchestrates three specialist agents - Planner, Executor, and Evaluator - using bot/call abilities to decompose complex goals, implement them iteratively, and verify quality through an independent evaluation loop. Attach a trigger integration to run the harness on a schedule or fire it manually.
Anthropic's engineering team published research showing that naive single-agent execution falls short on complex tasks for two reasons: models lose coherence as the context window fills, and they are unreliable judges of their own output. Their solution is a multi-agent harness with three roles - planner, generator, and evaluator - each addressing a specific gap. The planner expands a short prompt into a full spec so the generator does not under-scope. The generator works in focused sprints against that spec. And a separate evaluator grades the output against concrete criteria, catching bugs and quality gaps the generator misses.
The key insight is that separating evaluation from generation is far more effective than asking an agent to self-assess. Agents reliably praise their own work, but an external evaluator can be tuned for skepticism and given concrete grading criteria. When the evaluator fails a sprint, its detailed feedback flows back to the generator as input for the next iteration - creating a GAN-inspired loop that drives quality upward over cycles.
Why This Works Better as a Blueprint
Rather than hardcoding the planner-executor-evaluator pattern into the task execution engine, this blueprint implements it as a composable multi-agent system using standard ChatBotKit primitives. This matters for several reasons:
- Flexibility - Not every task needs all three agents. Simple tasks can skip the planner. Research tasks might swap the executor for a researcher and synthesizer. Creative tasks might weight the evaluator more heavily. A blueprint lets users compose whatever topology fits their problem.
- Tunable evaluation - The Evaluator bot's backstory encodes domain-specific grading criteria. Users can customize these criteria for their use case - code quality metrics, design principles, compliance checks, or whatever their domain requires. This is far more powerful than a generic built-in evaluator.
- Shareable patterns - Proven orchestration patterns become blueprints that teams clone and customize. Publish a code review harness, a content creation harness, or a data pipeline harness - each with its own evaluation criteria and agent configuration.
- No engine changes - The existing task system provides scheduling, execution limits, stalled detection, and session management. The blueprint provides the multi-agent topology on top. Clean separation of concerns.
How It Works
The System bot is the orchestrator. It owns a skillset with three bot/call abilities - Plan, Execute, and Evaluate. When triggered (manually or on a schedule), the System bot follows a strict protocol:
1. Call the Planner with the goal. The Planner expands it into a structured spec with numbered tasks and acceptance criteria.
2. For each task in the plan, call the Executor with the task description and acceptance criteria. The Executor does the work and reports what it completed.
3. After each execution, call the Evaluator with the original task, acceptance criteria, and the Executor's output. The Evaluator grades against the criteria and returns PASS or FAIL with specific feedback.
4. If the Evaluator returns FAIL, the System bot calls the Executor again with the feedback. Maximum 3 evaluate-fix cycles per task before moving on.
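The protocol above can be sketched as a plain control loop. This is an illustrative sketch only: `call_agent` stands in for whatever mechanism dispatches a bot/call ability, and the agent names, payload keys, and `PASS`/`FAIL` grade shape are assumptions, not the ChatBotKit API.

```python
# Hypothetical sketch of the System bot's orchestration protocol.
# call_agent is a placeholder for a bot/call ability invocation;
# all names and payload shapes here are illustrative assumptions.

MAX_FIX_CYCLES = 3  # evaluate-fix cycles allowed per task before moving on


def call_agent(agent, payload):
    """Placeholder for dispatching a bot/call ability to a named agent."""
    raise NotImplementedError


def run_harness(goal, call=call_agent):
    # Step 1: the Planner expands the goal into a structured spec.
    plan = call("planner", {"goal": goal})
    results = []
    for task in plan["tasks"]:
        # Step 2: the Executor does the work for this task.
        output = call("executor", {"task": task})
        for _ in range(MAX_FIX_CYCLES):
            # Step 3: the Evaluator grades the output against the criteria.
            verdict = call("evaluator", {"task": task, "output": output})
            if verdict["grade"] == "PASS":
                break
            # Step 4: on FAIL, feed the evaluator's feedback back to the Executor.
            output = call("executor", {"task": task, "feedback": verdict["feedback"]})
        results.append({"task": task, "output": output})
    return results
```

Note that the evaluator's feedback is passed to the executor as explicit input rather than relying on shared conversation history, which is what makes the loop tunable and auditable.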
Adapting the Pattern
- Add a space/storage/rw ability to let agents share artifacts through persistent files rather than conversation context
- Connect the trigger integration to a schedule for autonomous overnight builds
- Swap in different models per agent - a cheaper model for planning, a stronger model for execution, a fast model for evaluation
- Add more specialist executors and route tasks to them by type
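The last adaptation, routing tasks to specialist executors by type, can be sketched with a simple lookup. The agent names and the idea of a `type` field on each task are assumptions for illustration; the actual plan schema is whatever your Planner's backstory instructs it to produce.

```python
# Hypothetical sketch: route each planned task to a specialist executor
# based on a declared task type. All agent names are illustrative.

EXECUTORS = {
    "code": "code-executor",
    "research": "research-executor",
    "writing": "writing-executor",
}


def route(task, default="executor"):
    """Pick the executor agent for a task; fall back to the generalist."""
    return EXECUTORS.get(task.get("type"), default)
```

The System bot would then call `route(task)` instead of a fixed "executor" agent in its orchestration loop, while the evaluation step stays shared across all specialists.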
Reference: https://www.anthropic.com/engineering/harness-design-long-running-apps
Backstory
Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.
Skillset
This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.
Plan
Call the Planner agent to decompose a goal into a structured plan with numbered tasks and acceptance criteria
Execute
Call the Executor agent to implement a task and report completed work
Evaluate
Call the Evaluator agent to grade completed work against acceptance criteria and return PASS or FAIL with feedback
Terraform Code
This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.
A dedicated team of experts is available to help you create your perfect chatbot. Reach out via email or chat for more information.