Real-Time LLM-as-a-Judge with the Extract Integration
You can measure how well your AI agent performs without writing a single line of evaluation code. The trick is to point a second model at each finished conversation and have it grade the exchange against a rubric. This pattern is called LLM-as-a-judge, and ChatBotKit's Data Extraction integration gives you everything you need to run it in real time.
This becomes especially powerful for tool-using agents. A ChatBotKit conversation records every tool call the agent makes along with its result, so the transcript already contains a full operational trace. An LLM judge can read that trace and turn it into hard numbers: how many records the agent wrote, how often calls failed, how many times it retried, and where the failures came from.
In this tutorial you will build an agent that manages crmkit - an agent-first CRM - over MCP, then wire an Extract integration to it that acts as the judge. After every conversation, the judge reads the tool activity and extracts operational metrics you can chart and monitor.
What You'll Build
A three-part system, expressed as a single blueprint:
- CRM Sync Agent - a bot that loads crmkit's MCP tools and uses them to create and update company records.
- Operations Monitor - an Extract integration connected to the agent. Its schema is an operational rubric, and the numbers it pulls from each transcript become collected metrics.
- Operations Metrics chart - an Extract Chart tool linked to the judge, so the metrics are visible right on the blueprint canvas.
Here is the blueprint you will create. Pan and zoom to see how the pieces connect - the judge attaches to the agent, and the chart attaches to the judge.
Prerequisites
- A ChatBotKit account
- Basic familiarity with the blueprint designer
- A look at the Data Extraction integration if you have not used it before
Step 1: Build the CRM Agent
The judge needs an agent that does real work, and a tool-using agent gives it the richest trace to grade. In the blueprint above the agent is four connected resources:
- Bot (CRM Sync Agent) - the worker. Its backstory tells it to load the crmkit tools, query before writing, retry once on a version conflict, and report failures honestly. That honesty matters: the cleaner the agent is about reporting tool results, the more accurately the judge can count them.
- Skillset (CRM Toolkit) - the container for the agent's abilities, linked to the bot via skillsetId.
- Ability (Load crmkit Tools) - an mcp/load[crmkit] ability that dynamically pulls in crmkit's MCP toolset (create/read/update companies, contacts, deals) at runtime.
- Secret (crmkit) - an OAuth secret pointing at https://api.crmkit.ai/mcp, linked to the ability via secretId, that authorizes the MCP connection.
Each crmkit operation the agent runs - and each failure it hits - lands in the conversation transcript as a tool request and response. That trace is the raw material the judge reads.
Step 2: Define the Operational Rubric
The Extract integration's schema is your rubric. Each property is a thing the judge measures, and the description is the instruction it follows. Two properties turn a field into a tracked metric:
- collect: true - records the value as a metric (numeric fields only).
- display - formats the value on the chart: number, percent, or currency/<code>.
The Operations Monitor reads the tool activity and extracts six numbers plus one summary:
| Field | Type | Display | What it captures |
|---|---|---|---|
| recordsCreated | number | number | New entries written to the database |
| recordsUpdated | number | number | Existing records changed |
| errorRate | 0-1 | percent | Share of tool calls that failed |
| retryCount | number | number | How hard the agent had to work to succeed |
| crmkitErrorCount | number | number | Failures that originated in the CRM |
| otherErrorCount | number | number | Failures from network, fetch, or the agent |
| errorSummary | text | - | What failed and where, for spot checks |
This is the part that makes the example interesting. The judge is not scoring a vibe - it is parsing a semi-structured trace into operational telemetry. Splitting failures into crmkitErrorCount and otherErrorCount answers the question every on-call engineer asks first: is this our problem or theirs? A spike in crmkit errors means open a ticket with the CRM; a spike in other errors means look at your own agent.
Tip: The percent display expects values as fractions, so 0.25 renders as 25%. Tell the judge to score errorRate between 0 and 1 in the field description, as the schema above does, and the chart reads in clean percentages.
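To make the fraction convention concrete, here is a minimal sketch of the arithmetic behind errorRate and how a 0-1 value turns into a percentage. The function names are hypothetical, and treating a conversation with no tool calls as 0 is an assumption rather than documented behavior.

```python
def error_rate(failed_calls: int, total_calls: int) -> float:
    """Share of tool calls that failed, as a 0-1 fraction."""
    if total_calls == 0:
        # Assumption: a conversation with no tool calls scores 0, not "undefined".
        return 0.0
    return failed_calls / total_calls


def as_percent(fraction: float) -> str:
    """Render a 0-1 fraction as a whole-number percentage string."""
    return f"{fraction:.0%}"


# One failed call out of four is stored as 0.25 and shown as 25%.
rate = error_rate(failed_calls=1, total_calls=4)
print(rate, as_percent(rate))
```

If the judge returned 25 instead of 0.25, the same formatting would show 2500%, which is why the field description should pin the scale.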
The errorSummary field is not collected, so it never shows up on a chart. It is stored alongside the numbers in the conversation metadata, which makes auditing easy: when crmkitErrorCount jumps, you read the summaries to see exactly which calls broke and why.
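As a rough illustration, the rubric can be written down as a plain Python dict. The field names, the collect flag, and the display values come from this tutorial; the dict layout, the "type" key, and most of the description wording are assumptions made for the sketch, not the integration's exact schema syntax.

```python
# Illustrative rubric only: layout and description wording are assumptions.
OPERATIONS_RUBRIC = {
    "recordsCreated": {"type": "number", "collect": True, "display": "number",
                       "description": "Count of new records the agent successfully created."},
    "recordsUpdated": {"type": "number", "collect": True, "display": "number",
                       "description": "Count of existing records the agent successfully changed."},
    "errorRate": {"type": "number", "collect": True, "display": "percent",
                  "description": "Failed tool calls divided by total tool calls, between 0 and 1."},
    "retryCount": {"type": "number", "collect": True, "display": "number",
                   "description": "Number of tool calls that repeat an earlier failed call."},
    "crmkitErrorCount": {"type": "number", "collect": True, "display": "number",
                         "description": "Errors that originated inside crmkit, such as "
                                        "4xx/5xx responses or version conflicts."},
    "otherErrorCount": {"type": "number", "collect": True, "display": "number",
                        "description": "Failures from network, fetch, or the agent itself."},
    "errorSummary": {"type": "string",
                     "description": "What failed and where, in one or two sentences."},
}


def collected_fields(rubric: dict) -> list:
    """Names of the fields that would be logged as metrics."""
    return [name for name, spec in rubric.items() if spec.get("collect")]


def rubric_problems(rubric: dict) -> list:
    """Flag fields that break the two rules stated above."""
    problems = []
    for name, spec in rubric.items():
        if spec.get("collect") and spec.get("type") != "number":
            problems.append(f"{name}: collect requires a numeric field")
        display = spec.get("display")
        known = display in ("number", "percent") or (
            isinstance(display, str) and display.startswith("currency/"))
        if display is not None and not known:
            problems.append(f"{name}: unknown display {display!r}")
    return problems
```

Here collected_fields returns the six numeric metrics and leaves errorSummary out, which mirrors why the summary never appears on the chart.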
Step 3: Connect the Judge to the Agent
An Extract integration grades whatever bot it is attached to. In the blueprint this is the botId field on the integration pointing at #bot:::crm-agent. On the canvas, you draw a line from the judge to the agent. That single connection is what makes the judge real-time: it now sees every conversation that bot has.
Step 4: Set the Trigger to Automatic
The trigger: automatic setting is what makes this run on its own. With automatic triggering, the judge fires after each conversation completes and logs the metrics without any manual step. This is the "real-time" part - your dashboard reflects the agent's behavior as soon as conversations finish, so a climbing errorRate or a burst of retryCount shows up while you can still act on it.
Already have a backlog of conversations? Use the Trigger button on the integration page to apply the rubric to the most recent 100 conversations and backfill the chart, which gives you a baseline before live scoring takes over.
Choose a capable judge model. The blueprint runs the agent on claude-4.6-sonnet for fast, cost-effective CRM work, and runs the judge on claude-4.8-opus because reading a tool trace and attributing errors rewards stronger reasoning. The judge runs once per conversation, so the extra capability is cheap relative to live traffic.
Step 5: Chart the Metrics
Drop an Extract Chart tool onto the canvas and connect it to the Operations Monitor integration - that is the extractIntegrationId link in the blueprint's tools section. The chart reads the judge's collected fields and draws a daily series for each one, formatted with that field's display setting. Records created and updated plot as counts, error rate plots as a percentage, and the two error-location counts plot side by side so you can see at a glance whether crmkit or your own stack is the bigger source of trouble.
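For intuition about what a daily series is, the sketch below groups scored conversations by calendar day and aggregates one field. It is not the chart's implementation: the scoredAt key is hypothetical, and the tutorial does not say how the chart combines values within a day, so summing counts and averaging errorRate are both assumptions.

```python
from collections import defaultdict
from datetime import datetime


def daily_series(results: list, field: str, how: str = "sum") -> dict:
    """Bucket scored conversations by day and aggregate a single field."""
    buckets = defaultdict(list)
    for result in results:
        # "scoredAt" is a hypothetical ISO 8601 timestamp on each scored result.
        day = datetime.fromisoformat(result["scoredAt"]).date().isoformat()
        buckets[day].append(result[field])
    if how == "sum":
        return {day: sum(values) for day, values in sorted(buckets.items())}
    # Any other value of "how" falls back to the daily mean.
    return {day: sum(values) / len(values) for day, values in sorted(buckets.items())}


scored = [
    {"scoredAt": "2025-01-01T09:30:00", "recordsCreated": 3, "errorRate": 0.0},
    {"scoredAt": "2025-01-01T16:10:00", "recordsCreated": 2, "errorRate": 0.5},
    {"scoredAt": "2025-01-02T11:00:00", "recordsCreated": 1, "errorRate": 0.25},
]
print(daily_series(scored, "recordsCreated"))
print(daily_series(scored, "errorRate", how="mean"))
```

Counts are natural to sum, while a rate usually makes more sense averaged; whichever the chart does, keeping the two kinds of field on separate scales is what keeps the lines readable.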
Keeping the chart on the blueprint means the operational signal lives next to the design it measures. Anyone who opens the blueprint sees both the agent and how reliably it is running, with no separate dashboard to hunt for.
How It Works
The whole system is a feedback loop:
- A user asks the CRM Sync Agent to add or update companies, and it runs crmkit tool calls to do so.
- Every tool request and response - successes, errors, and retries alike - is recorded in the conversation transcript.
- The conversation completes and goes idle.
- The Operations Monitor, attached via botId, reads the full transcript including that tool trace.
- Guided by the schema descriptions, the judge model counts records, computes the error rate, tallies retries, attributes each failure to crmkit or elsewhere, and writes a summary.
- Fields marked collect: true are logged as metrics; the summary lands in conversation metadata.
- The Extract Chart renders the accumulating metrics as daily series.
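The counting in the steps above can be sketched deterministically, which is handy as a spot check against the judge's numbers. This is not how the judge works internally (it is a model reading the transcript), and the trace shape, the tool names, and the ok, source, and retry keys are all hypothetical.

```python
def score_trace(calls: list) -> dict:
    """Compute the six rubric numbers from a simplified list of tool calls."""
    total = len(calls)
    failed = [call for call in calls if not call["ok"]]
    return {
        # Assumption: tool names start with "create_" or "update_".
        "recordsCreated": sum(1 for c in calls if c["ok"] and c["tool"].startswith("create_")),
        "recordsUpdated": sum(1 for c in calls if c["ok"] and c["tool"].startswith("update_")),
        "errorRate": len(failed) / total if total else 0.0,
        "retryCount": sum(1 for c in calls if c.get("retry")),
        "crmkitErrorCount": sum(1 for c in failed if c.get("source") == "crmkit"),
        "otherErrorCount": sum(1 for c in failed if c.get("source") != "crmkit"),
    }


# A made-up conversation: one create, one update that conflicts and is retried,
# and one read that fails on the network.
trace = [
    {"tool": "create_company", "ok": True},
    {"tool": "update_company", "ok": False, "source": "crmkit"},
    {"tool": "update_company", "ok": True, "retry": True},
    {"tool": "get_company", "ok": False, "source": "network"},
]
print(score_trace(trace))
```

Running a check like this over a handful of conversations and comparing it with the judge's output is one way to build trust in the rubric before relying on the chart.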
The agent and the judge never share a model call - the judge is a clean second pass over a finished conversation, which keeps its accounting independent of the agent it is grading.
Why This Matters
Counting tool calls by hand does not scale, and traditional analytics only tell you how many conversations happened, not what the agent actually accomplished or where it struggled. An LLM-as-a-judge closes that gap. Because every conversation is read the same way against the same rubric, the numbers are comparable over time, and that comparability is what lets you monitor an agent in production:
- Throughput - watch recordsCreated and recordsUpdated to confirm the agent is getting real work done as traffic grows.
- Reliability - track errorRate and retryCount to catch the agent quietly degrading before users complain.
- Failure attribution - the crmkitErrorCount versus otherErrorCount split tells you whether to escalate to the CRM provider or fix your own agent, which is usually the slowest question to answer during an incident.
- Regression detection - after you change a backstory, model, or tool, compare the metric lines before and after to see whether the change actually helped.
- Automated alerting - set a request URL on the integration to push each scored result to your own endpoint, then fire a Slack alert or open a ticket when error rate crosses a threshold.
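For the alerting case, the decision logic on your own endpoint can be very small. The sketch below assumes the scored result arrives as a flat dict keyed by the rubric's field names; that payload shape, the 0.2 threshold, and both function names are assumptions, so check what your endpoint actually receives before relying on it.

```python
def should_alert(result: dict, threshold: float = 0.2) -> bool:
    """True when the scored conversation's errorRate exceeds the threshold."""
    rate = result.get("errorRate")
    return isinstance(rate, (int, float)) and rate > threshold


def alert_text(result: dict) -> str:
    """Build a one-line message that says where most failures came from."""
    crm = result.get("crmkitErrorCount", 0)
    other = result.get("otherErrorCount", 0)
    # Ties are attributed to your own stack, since that is the side you can fix.
    owner = "crmkit" if crm > other else "agent or network"
    summary = result.get("errorSummary", "no summary")
    return f"errorRate {result['errorRate']:.0%}, mostly {owner} failures: {summary}"


scored = {
    "errorRate": 0.5,
    "crmkitErrorCount": 2,
    "otherErrorCount": 1,
    "errorSummary": "update_company hit a version conflict twice",
}
if should_alert(scored):
    print(alert_text(scored))
```

The message text is what you would hand to Slack or a ticketing system; sending it is left out because that part depends entirely on your own tooling.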
Tips for Reliable Scoring
- Make the agent report tool results honestly. The judge can only count what the transcript records. A backstory that insists on reporting failures plainly, like the one above, directly improves metric accuracy.
- Write the rubric like instructions to an auditor. State exactly what to count and how to classify it. "Errors that originated inside crmkit, such as 4xx/5xx responses or version conflicts" beats a bare "crmkit errors."
- Keep scales consistent. Counts as plain numbers and proportions as 0-1 fractions make the chart easy to read at a glance.
- Anchor with a summary field. Forcing the judge to justify its counts improves the counts themselves and gives you something to audit.
- Start narrow. A few metrics you trust beat a dozen you have to second-guess. Add more once the first set is stable.
Wrapping Up
With one Extract integration acting as a judge, every run of your CRM agent is graded after the conversation ends, and a raw tool trace becomes operational telemetry. Build the agent, write the rubric as a schema, connect the judge with botId, set the trigger to automatic, and chart the result. From there the accumulated metrics become a live read on throughput and reliability that you can watch, alert on, and improve against, and the same pattern works for any tool-using agent, not just one that talks to crmkit.
For deeper measurement patterns, see How to Measure ROI with the Data Extraction Integration and the full Data Extraction documentation.