Planner-Executor-Evaluator Harness

A three-agent task harness inspired by Anthropic's research on long-running application development. A System bot orchestrates three specialist agents - Planner, Executor, and Evaluator - using bot/call abilities to decompose complex goals, implement them iteratively, and verify quality through an independent evaluation loop. Attach a trigger integration to run the harness on a schedule or fire it manually.

multi-agent
planner
executor

Anthropic's engineering team published research showing that naive single-agent execution falls short on complex tasks for two reasons: models lose coherence as the context window fills, and they are unreliable judges of their own output. Their solution is a multi-agent harness with three roles - planner, generator, and evaluator - each addressing a specific gap. The planner expands a short prompt into a full spec so the generator does not under-scope. The generator works in focused sprints against that spec. A separate evaluator then grades the output against concrete criteria, catching bugs and quality gaps the generator misses. In this blueprint, the generator role is played by the Executor.

The key insight is that separating evaluation from generation is far more effective than asking an agent to self-assess. Agents reliably praise their own work, but an external evaluator can be tuned for skepticism and given concrete grading criteria. When the evaluator fails a sprint, its detailed feedback flows back to the generator as input for the next iteration - creating a GAN-inspired loop that drives quality upward over cycles.

Why This Works Better as a Blueprint

Rather than hardcoding the planner-executor-evaluator pattern into the task execution engine, this blueprint implements it as a composable multi-agent system using standard ChatBotKit primitives. This matters for several reasons:

  1. Flexibility - Not every task needs all three agents. Simple tasks can skip the planner. Research tasks might swap the executor for a researcher and synthesizer. Creative tasks might weight the evaluator more heavily. A blueprint lets users compose whatever topology fits their problem.

  2. Tunable evaluation - The Evaluator bot's backstory encodes domain-specific grading criteria. Users can customize these criteria for their use case - code quality metrics, design principles, compliance checks, or whatever their domain requires. This is far more powerful than a generic built-in evaluator.

  3. Shareable patterns - Proven orchestration patterns become blueprints that teams clone and customize. Publish a code review harness, a content creation harness, or a data pipeline harness - each with its own evaluation criteria and agent configuration.

  4. No engine changes - The existing task system provides scheduling, execution limits, stalled detection, and session management. The blueprint provides the multi-agent topology on top. Clean separation of concerns.

How It Works

The System bot is the orchestrator. It owns a skillset with three bot/call abilities - Plan, Execute, and Evaluate. When triggered (manually or on a schedule), the System bot follows a strict protocol:

  1. Call the Planner with the goal. The Planner expands it into a structured spec with numbered tasks and acceptance criteria.

  2. For each task in the plan, call the Executor with the task description and acceptance criteria. The Executor does the work and reports what it completed.

  3. After each execution, call the Evaluator with the original task, acceptance criteria, and the Executor's output. The Evaluator grades against the criteria and returns PASS or FAIL with specific feedback.

  4. If the Evaluator returns FAIL, the System bot calls the Executor again with the feedback. It allows at most 3 evaluate-fix cycles per task before moving on to the next task.
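
The protocol above can be sketched as a plain control loop. This is only an illustration of the logic: in the blueprint the System bot follows the protocol itself through its bot/call abilities, and no such code is required. The names Task, TaskResult, and run_harness are hypothetical, the three agents are stand-in Python functions rather than real bots, and the sketch assumes that one "cycle" means one Executor call followed by one Evaluator call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_CYCLES = 3  # the per-task evaluate-fix limit from the protocol


@dataclass
class Task:
    description: str
    criteria: str


@dataclass
class TaskResult:
    task: Task
    output: str
    passed: bool
    cycles: int


def run_harness(
    goal: str,
    plan: Callable[[str], List[Task]],
    execute: Callable[[Task, str], str],
    evaluate: Callable[[Task, str], Tuple[bool, str]],
) -> List[TaskResult]:
    """Plan once, then run each task through execute/evaluate until PASS or the limit."""
    results = []
    for task in plan(goal):
        feedback = ""  # empty on the first attempt
        output = ""
        passed = False
        cycles = 0
        while cycles < MAX_CYCLES and not passed:
            cycles += 1
            output = execute(task, feedback)           # step 2 (and step 4 on retries)
            passed, feedback = evaluate(task, output)  # step 3
        results.append(TaskResult(task, output, passed, cycles))
    return results


# --- Usage with stand-in agents (no real bots are called) ---
def demo_plan(goal: str) -> List[Task]:
    return [Task("Write a greeting", "Output contains the word 'hello'")]


def demo_execute(task: Task, feedback: str) -> str:
    # The first attempt misses the criterion; the retry acts on the feedback.
    return "hello there" if feedback else "hi there"


def demo_evaluate(task: Task, output: str) -> Tuple[bool, str]:
    if "hello" in output:
        return True, ""
    return False, "Output must contain the word 'hello'"


results = run_harness("Greet the user", demo_plan, demo_execute, demo_evaluate)
print(results[0].passed, results[0].cycles)  # prints: True 2
```

A task that never passes is still recorded, with passed set to False after the third cycle, which mirrors the "moving on" behaviour in step 4.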

Adapting the Pattern

  • Add a space/storage/rw ability to let agents share artifacts through persistent files rather than conversation context
  • Connect the trigger integration to a schedule for autonomous overnight builds
  • Swap in different models per agent - a cheaper model for planning, a stronger model for execution, a fast model for evaluation
  • Add more specialist executors and route tasks to them by type
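
The last adaptation, routing tasks to specialist executors by type, might look like the sketch below. It is an assumption-heavy illustration: the task type labels ("code", "docs") and the function names are hypothetical, and in the blueprint each specialist would presumably be another bot reached through its own bot/call ability rather than a Python function.

```python
from typing import Callable, Dict

Executor = Callable[[str], str]


def make_router(executors: Dict[str, Executor], default: Executor) -> Callable[[str, str], str]:
    """Return a function that sends a task to the executor registered for its type."""
    def route(task_type: str, task: str) -> str:
        # Unknown task types fall back to the general-purpose executor.
        return executors.get(task_type, default)(task)
    return route


# Stand-in specialists (hypothetical); each one just labels the work it received.
def code_executor(task: str) -> str:
    return f"[code] {task}"


def docs_executor(task: str) -> str:
    return f"[docs] {task}"


def general_executor(task: str) -> str:
    return f"[general] {task}"


route = make_router({"code": code_executor, "docs": docs_executor}, default=general_executor)
print(route("code", "Add a health-check endpoint"))  # prints: [code] Add a health-check endpoint
print(route("design", "Sketch the landing page"))    # prints: [general] Sketch the landing page
```

Keeping a default executor means a plan that contains an unexpected task type still gets worked on instead of stalling.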

Reference: https://www.anthropic.com/engineering/harness-design-long-running-apps

Backstory

Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.

You are the Planner. Your job is to take a short goal or prompt and expand it into a comprehensive, structured plan that a separate Executor agent will implement.

## YOUR ROLE

You do NOT implement anything. You plan. You decompose. You define what "done" looks like.

## PLANNING PROTOCOL

When given a goal:

1. ANALYZE the goal and identify all the distinct pieces of work needed
2. DECOMPOSE into numbered tasks, ordered by dependency (independent tasks first, dependent tasks later)
3. For each task, define:
   - **Task N: [Title]** - A clear, actionable title
   - **Description** - What needs to be done, with enough detail that an executor can work without asking questions
   - **Acceptance Criteria** - Specific, testable conditions that must be true when the task is complete. Write these as checkable statements: "The API returns a 200 status with a JSON body containing..."
   - **Dependencies** - Which prior tasks (if any) must be complete first

## RULES

- Be ambitious about scope but realistic about complexity per task
- Each task should be completable in a single focused session
- Acceptance criteria must be concrete and verifiable - not vague ("works well") but specific ("returns valid JSON matching the schema")
- Do not specify implementation details unless they are critical constraints. Let the Executor choose the path.
- If the goal is ambiguous, make reasonable assumptions and state them explicitly
- Number tasks sequentially: Task 1, Task 2, Task 3, etc.
- Aim for 3-10 tasks depending on complexity. If you need more than 10, the goal is too broad - suggest splitting it.
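
The plan structure this backstory asks for can be pictured as a small data type with a few checks. The sketch below is illustrative only: PlannedTask and validate_plan are hypothetical names, the Planner actually returns free text rather than objects, and the 3-10 task range is treated here as a reported problem even though the backstory states it as a guideline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlannedTask:
    number: int
    title: str
    description: str
    acceptance_criteria: List[str]
    dependencies: List[int] = field(default_factory=list)  # numbers of prior tasks


def validate_plan(tasks: List[PlannedTask]) -> List[str]:
    """Return a list of problems; an empty list means the plan follows the rules."""
    problems = []
    if not 3 <= len(tasks) <= 10:
        problems.append(f"expected 3-10 tasks, got {len(tasks)}")
    for position, task in enumerate(tasks, start=1):
        # Tasks must be numbered sequentially: Task 1, Task 2, Task 3, ...
        if task.number != position:
            problems.append(f"task at position {position} is numbered {task.number}")
        if not task.acceptance_criteria:
            problems.append(f"Task {task.number} has no acceptance criteria")
        # Dependencies may only point at tasks that come earlier in the plan.
        for dep in task.dependencies:
            if dep < 1 or dep >= task.number:
                problems.append(f"Task {task.number} depends on Task {dep}, which is not a prior task")
    return problems


plan = [
    PlannedTask(1, "Define schema", "Design the tables", ["Schema file exists"]),
    PlannedTask(2, "Build API", "Expose CRUD routes", ["GET /items returns 200"], [1]),
    PlannedTask(3, "Write tests", "Cover the routes", ["All tests pass"], [2]),
]
print(validate_plan(plan))  # prints: []
```

A check like this could run between the Plan and Execute steps so that a malformed plan is caught before any execution cycles are spent on it.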

Skillset

This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.

  • 👹 Plan - Call the Planner agent to decompose a goal into a structured plan with numbered tasks and acceptance criteria
  • ✂️ Execute - Call the Executor agent to implement a task and report completed work
  • 👍 Evaluate - Call the Evaluator agent to grade completed work against acceptance criteria and return PASS or FAIL with feedback
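
Because the Evaluate ability returns PASS or FAIL with feedback, anything consuming its reply has to separate the verdict from the feedback. The exact reply format is not specified here, so the sketch below assumes the reply starts with the word PASS or FAIL and treats everything after it as feedback. parse_verdict is a hypothetical helper, not part of ChatBotKit.

```python
from typing import Tuple


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Split an evaluator reply into (passed, feedback).

    Assumes the reply begins with PASS or FAIL. Anything else is treated as a
    failure so that an unclear evaluation never counts as a pass.
    """
    text = reply.strip()
    first_line, _, rest = text.partition("\n")
    words = first_line.split(maxsplit=1)
    verdict = words[0].upper().rstrip(":.,-") if words else ""
    if verdict not in ("PASS", "FAIL"):
        return False, f"Unrecognized evaluator reply: {text}"
    inline = words[1].strip() if len(words) > 1 else ""
    feedback = "\n".join(part for part in (inline, rest.strip()) if part)
    return verdict == "PASS", feedback


print(parse_verdict("PASS"))
print(parse_verdict("FAIL: missing tests\nAdd unit tests for the parser"))
```

Defaulting to failure on an unrecognized reply keeps the evaluation loop skeptical, in line with the idea that the evaluator should be harder to satisfy than the agent that produced the work.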

Terraform Code

This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.

Copy this Terraform configuration to deploy the blueprint resources:

Next steps:

  1. Save the code above to a file named main.tf
  2. Set your API key: export CHATBOTKIT_API_KEY=your-api-key
  3. Run terraform init to initialize
  4. Run terraform plan to preview changes
  5. Run terraform apply to deploy

Learn more about the Terraform provider
