Stop tuning your AI by listening to output. Tune by watching the bell curve. Bell Tuning™ reads the statistical shape of your retrieval, agent, and tool-call traces and catches silent RAG failures, tool-call pathologies, and context rot before they ship. Free open-source instrument + $2,500 paid audits.
Or try the free instrument → npx contrarianai-context-inspector --install-mcp
Tuning an AI isn't about merely listening to its output; it's about observing the bell. The bell curve, that is.
Many teams assess AI by recording responses and judging whether they sound correct. That is a lagging indicator. By the time an answer is judged incorrect, the context window may already have deteriorated over several turns, making recovery difficult if not impossible.
A more effective signal lies within the context itself.
By scoring each segment of an AI's context window for domain alignment and plotting the distribution, you create a bell curve. The shape of this bell provides insight into the system's health before the output reveals any issues.
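As a rough illustration of the idea, the sketch below scores each context segment against a reference description of the domain and summarizes the resulting distribution by its mean and standard deviation. The bag-of-words cosine score and the names `bell_summary` and `domain_reference` are assumptions made for this example; it is not the scoring method the Context Inspector uses.

```python
import math
import re
import statistics
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bell_summary(segments: list, domain_reference: str) -> dict:
    """Score every segment against the reference and summarize the distribution."""
    if not segments:
        raise ValueError("need at least one context segment")
    reference = tokenize(domain_reference)
    scores = [cosine(tokenize(segment), reference) for segment in segments]
    return {
        "scores": scores,
        "mean": statistics.fmean(scores),    # where the bell sits (left/right)
        "sigma": statistics.pstdev(scores),  # how wide the bell is
    }


# Hypothetical context segments; the last one is off-domain.
summary = bell_summary(
    [
        "refund policy for order 1042",
        "order 1042 shipped late, refund issued",
        "best pizza toppings",
    ],
    "order refund policy shipping",
)
```

A histogram of `summary["scores"]` is the bell. In practice an embedding-based similarity would likely replace the word-overlap score, but the mean and sigma are read the same way.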
The ideal shape is specific to each application and may change over time. It reflects how closely the context matches the content your app should be working from.
This practice, which I call Bell Tuning, focuses on adjusting your AI workflow based on the bell's shape rather than the output noise. What's the ideal shape of the curve? You decide.
With Claude's assistance, I built an instrument that reports the bell's shape in real time, so you can keep adjusting toward your ideal curve. It ships as an MCP server and installs into Claude Desktop, Cursor, Windsurf, Cline, or Claude Code with a single command:
npx contrarianai-context-inspector --install-mcp
The tool is open source, MIT licensed, and research-backed — with a white paper available in the repository.
If you're running RAG, multi-agent systems, long-context chatbots, or any workflow where context accumulates across turns — you should be Bell Tuning.
Watch the shape:
| Bell shape | What it means | Action |
|---|---|---|
| Tight, right-shifted | Context is on-domain. Healthy. | Keep going. |
| Wider, drifting left | Contamination, summary loss, or topic drift entering. | Tune now — refresh, evict, re-ground. |
| Flat near zero | Original content is gone. System is still answering — on noise. | Reset. Output cannot be trusted. |
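The table can be read as a simple decision rule over the bell's mean and spread. The sketch below is one way to encode it. The function name and every threshold are hypothetical placeholders, since the ideal shape is something you define per application.

```python
def classify_bell(
    mean: float,
    sigma: float,
    healthy_mean: float = 0.6,  # assumed: at or above this, the bell is "right-shifted"
    tight_sigma: float = 0.15,  # assumed: at or below this, the bell is "tight"
    floor_mean: float = 0.1,    # assumed: at or below this, the bell is "near zero"
) -> str:
    """Map a bell's mean and sigma to the action column of the table."""
    if mean <= floor_mean:
        return "reset"       # original content is gone
    if mean >= healthy_mean and sigma <= tight_sigma:
        return "keep going"  # tight and right-shifted
    return "tune now"        # wider or drifting left


print(classify_bell(mean=0.80, sigma=0.05))  # keep going
print(classify_bell(mean=0.45, sigma=0.30))  # tune now
print(classify_bell(mean=0.03, sigma=0.02))  # reset
```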
The proof: 40-step contamination experiment (white paper in repo).
→ Step 11: bell σ jumps 56%. Output still scores 0.85. The graph saw it. Output didn't.
→ Steps 12–14: bell flattening. Output still passing. Three steps of warning.
→ Step 15: bell collapses. Output hits 0.00. Never recovers.
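The early warning at step 11 is a relative jump in σ from one step to the next. A minimal sketch of that check follows. The function name, the 50% threshold, and the sample values are made up for illustration and are not the white paper's data.

```python
from typing import Optional


def first_sigma_jump(sigmas: list, threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based step where sigma first rises by more than
    `threshold` (as a fraction) relative to the previous step, else None."""
    for i in range(1, len(sigmas)):
        previous, current = sigmas[i - 1], sigmas[i]
        if previous > 0 and (current - previous) / previous > threshold:
            return i + 1
    return None


# Synthetic per-step sigma values: step 3 is 56% wider than step 2.
illustrative_sigmas = [0.10, 0.10, 0.156, 0.16]
print(first_sigma_jump(illustrative_sigmas))  # 3
```

Alerting on the change in σ rather than on its absolute value means the check does not depend on knowing the ideal bell width in advance.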
Try Bell Tuning yourself — the instrument is free and open source.
Read the manifesto → · View on GitHub · White Paper
… of "AI agents" are deterministic workflows in disguise, burning tokens on reasoning that should be an if/else statement.
… tokens is where context actually degrades, not the 200K on the box. Your AI is losing data before hitting the limit.
… cost difference between running every subagent on Opus vs. right-sizing to Haiku. Most teams use the expensive model for everything.
… errors logged when tool descriptions silently misroute calls. No alerts. No logs. Just wrong answers with total confidence.
The free instrument tells you the bell is wrong. These engagements tell you why — and fix it. Fixed scope, fixed price, personal guarantee.
4 engineers. 8 months of development. A multi-agent system that "worked great in staging." In production: dropping order numbers during context summarization, looping on conflicting instructions, approving its own broken output, burning $47K/month in tokens because every subagent ran on the most expensive model.
The fix wasn't a rewrite. It was structural: replaced 3 autonomous agents with deterministic workflows (they never needed reasoning), added a sprint system to prevent context rot, separated builder from evaluator, right-sized model selection per task.
Agents parsing natural language for completion instead of stop_reason. Subagents assuming shared memory. Self-evaluation bias — the builder grading its own homework.
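To make the first pattern concrete, here is a minimal sketch of an agent loop step that branches on `stop_reason` instead of on what the model's text says. It assumes a response shaped like an Anthropic Messages API result, passed in as a plain dict. The function name and the returned action labels are hypothetical, and the field values should be checked against your provider's documentation.

```python
def next_action(response: dict) -> str:
    """Decide the loop's next step from the structured stop_reason field."""
    stop_reason = response.get("stop_reason")
    if stop_reason == "tool_use":
        return "run_tools"   # the model is asking for a tool call
    if stop_reason == "end_turn":
        return "done"        # the model finished its turn
    if stop_reason == "max_tokens":
        return "truncated"   # output was cut off; do not treat it as complete
    return "unexpected"      # anything else is surfaced, not guessed at


# The text claims completion, but the structured signal says otherwise.
response = {
    "stop_reason": "tool_use",
    "content": [{"type": "text", "text": "Task complete!"}],
}
print(next_action(response))  # run_tools
```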
Progressive summarization destroying dollar amounts and order numbers. Lost-in-the-middle effect burying critical instructions. Memory contradictions.
Tool descriptions causing silent misroutes. More than 4-5 tools per agent degrading selection. No distinction between "nothing found" and "the API failed."
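The last point, telling "nothing found" apart from "the API failed", comes down to returning a structured status instead of an empty list in both cases. A minimal sketch, with `ToolResult` and `run_lookup` as hypothetical names:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolResult:
    status: str                      # "ok", "empty", or "error"
    items: list = field(default_factory=list)
    error: str = ""


def run_lookup(lookup: Callable[[str], list], query: str) -> ToolResult:
    """Call a lookup function and report empty results and failures separately."""
    try:
        items = lookup(query)
    except Exception as exc:
        # The call itself failed: say so instead of returning "no results".
        return ToolResult(status="error", error=f"{type(exc).__name__}: {exc}")
    if not items:
        return ToolResult(status="empty")  # the call worked; nothing matched
    return ToolResult(status="ok", items=list(items))


def failing_lookup(query: str) -> list:
    raise TimeoutError("upstream API timed out")


print(run_lookup(lambda q: [], "order 1042").status)  # empty
print(run_lookup(failing_lookup, "order 1042").status)  # error
```

With the status passed back to the model, an agent can retry or escalate on `error` and only tell the user "no matching records" on `empty`.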
"Revenue" means 3 different things across 3 teams. The AI picks a table and gives confident, plausible, wrong answers.
Full file reads at 3,000 tokens when grep costs 200. MCP servers consuming 2,000-8,000 tokens before any work starts. No batch API usage.
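The file-read point can be illustrated by comparing what enters the context in each case. The sketch below returns only matching lines, grep-style, and compares rough sizes. The four-characters-per-token estimate is a crude heuristic assumed for this example, and the helper names and sample file are hypothetical.

```python
import re


def grep_lines(text: str, pattern: str) -> list:
    """Return (line_number, line) pairs for lines matching `pattern`."""
    compiled = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if compiled.search(line)
    ]


def rough_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)


# A synthetic 300-line file with a single relevant line.
lines = [f"setting_{i} = {i}" for i in range(300)]
lines[150] = "retry_limit = 5"
file_text = "\n".join(lines)

matches = grep_lines(file_text, r"retry_limit")
matched_text = "\n".join(f"{number}: {line}" for number, line in matches)

full_cost = rough_tokens(file_text)
grep_cost = rough_tokens(matched_text)
```

Here `grep_cost` is a small fraction of `full_cost`, and the agent still gets the line number it needs to ask for more context if required.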
No structured logs. Same config for dev and prod. Silent catch blocks. Sentiment-based escalation. No session handoff artifacts.
Bell Tuning is the practice of treating an AI's context window as a measurable distribution — and tuning your workflow against the shape of that distribution rather than against the output it produces. Tighter bell, right-shifted = healthy. Wider bell, drifting left = contamination. Flat bell near zero = original content gone. The free instrument (Context Inspector) reports the shape continuously. This service interprets it and fixes the underlying causes.
No, but it helps. If you've already run npx contrarianai-context-inspector --install-mcp and seen your bell flatten, you already know something is wrong; you just need help finding the cause and fixing it. If you haven't, I'll bring the instrument with me and we'll Bell Tune your stack together as the first step of the engagement.
Consultants bill hours. This is a fixed-scope engagement: defined deliverable, defined price, defined timeline. You get a written report with a prioritized fix list — not a slide deck, not an ongoing retainer you can't exit. If I don't find 3+ issues, you pay nothing.
Read access to your AI-related code repositories, architecture docs (if they exist), and logging/monitoring dashboards. I don't need production credentials or customer data. Most teams grant a short-lived GitHub collaborator invite and a Datadog/Grafana viewer role.
Then you don't pay. That's the guarantee. I've never had to honor it — every engagement so far has found more than 3 production-impacting issues. The structural failures are that common.
Your internal team built the system. That's exactly why they can't objectively diagnose it. The same session that wrote the code can't evaluate the code — that's one of the 6 failure patterns. An external diagnostic gives your team the prioritized fix list without the sunk-cost bias.
The average full diagnostic finds 60% token waste and 3+ silent failure modes. A $15K diagnostic on a $47K/month AI spend typically pays for itself in the first month. The $2,500 Rapid Audit (48-hour turnaround, Stripe direct-pay) is designed to be a no-brainer for any team spending $5K+/month on AI.
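The payback arithmetic behind that claim can be checked directly. The sketch below assumes that some fraction of the identified waste is actually recovered. `recovered_fraction` is an assumption added for this example, not a figure from the engagements.

```python
def payback_months(
    fee: float,
    monthly_spend: float,
    waste_fraction: float,
    recovered_fraction: float = 1.0,  # assumed share of the waste actually eliminated
) -> float:
    """Months until monthly savings cover the one-time fee."""
    monthly_savings = monthly_spend * waste_fraction * recovered_fraction
    if monthly_savings <= 0:
        raise ValueError("no savings, so the fee is never recovered")
    return fee / monthly_savings


# $15K fee, $47K/month spend, 60% waste: $28,200/month saved if all of it is recovered.
full_recovery = payback_months(15_000, 47_000, 0.60)        # about 0.53 months
half_recovery = payback_months(15_000, 47_000, 0.60, 0.5)   # about 1.06 months
```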
I do. Kevin Luddy. This isn't a firm that sells and delegates. One person does the diagnostic, writes the report, and walks you through the findings. That's why I limit slots.
"You don't tune an AI by listening to its output. You tune it by watching the bell."
Output evaluation is a lagging indicator. Bell Tuning is the leading one.
"A smarter model doesn't fix agent failures. A smarter environment does."
Model upgrades are the most expensive way to avoid fixing your architecture.
"90% of automation needs are workflows, not agents."
If you can draw the decision tree, you don't need AI. You need an if/else statement that costs nothing.
"Agents cannot reliably judge their own output."
Confirmation bias with a GPU is not evaluation.
"More data degrades AI performance when context isn't managed."
Past a threshold, more information makes the model worse, not smarter.
Tell me what you're running and what's not working. I'll respond within 24 hours.