I. The diagnosis was wrong
Most AI teams debug outputs. Their data says they should be debugging context — three turns earlier, where the failure is mathematically predictable, not yet visible, and still cheap to fix.
This is not a frontier-model claim. It is not a rant about agents. It is a claim about where to look. Output-side debugging has produced six years of plateau in production AI reliability. The models keep getting better; the deployments keep failing for the same reasons. Something in the diagnosis is wrong.
The diagnosis we propose: the context window has a measurable distribution, that distribution has a shape, the shape predicts output quality, and the discipline of tuning a workflow against the shape — not the output it eventually produces — is the missing layer in production AI engineering.
That discipline has a name: Bell Tuning.
This essay is the framework. It explains what Bell Tuning is, why it works, what we have built to make it operational, and what the experiments say about whether the claims hold. It is intentionally long. The rest of this page lists the tools, the install commands, the whitepapers. The essay is what you should read first. The tools are downstream of the worldview; without the worldview the tools are noise.
A short version, for the impatient: every chunk of context in your AI's window has a measurable alignment score against your domain. The distribution of those scores is a bell curve. Healthy systems have a tight, right-shifted bell. Failing systems have a flat, left-drifted one. The transition between the two is detectable several turns before output quality breaks. We have built sensors that read this signal, a forecaster that anticipates it, and a managed platform that operationalizes it. All of it is open-source and reproducible.
II. Why output-side debugging has run out of road
Three observable facts.
First: models keep improving on benchmarks but failing in production at the same rate. The frontier-model jumps of the last three years — Claude 3 to 4.6, GPT-4 to o3, Gemini 1 to 2.5 — have produced enormous gains on reasoning benchmarks and minimal change in the rate at which production AI systems silently fail in customer-visible ways. The bottleneck is not the model.
Second: most "AI agent" failures are workflow failures. The published post-mortems converge on a small set of root causes — irrelevant retrieval, context overflow, summary loss, tool-output bloat, silent error propagation, conversation drift. None of these are model issues. They are context-management issues. They occur in the layer between the user and the model, not in the model itself.
Third: output-side observability is fundamentally lagging. By the time an output is judged degraded, the context that produced it has been degrading for several turns. You cannot recover from output failure by examining the output. You can only recover by examining what produced the output, and you can only do that if you were watching it before the failure.
The implication is uncomfortable for an industry that has organized itself around prompt engineering, eval suites, and model upgrades: none of those things address the bottleneck. They address adjacent problems. The bottleneck is contextual, statistical, and continuous.
It has been measurable all along. Almost nobody has been measuring it.
Concrete shapes of the failure are familiar to anyone running production AI: a retrieval-augmented system fetches documents that share keywords with the query but contradict the recent state of the world; a multi-agent orchestrator receives a 12,000-token tool response that silently displaces the user's original request from the working window; a long-running chat assistant's own summaries replace the original source content with a lossy paraphrase, and three turns later it confidently asserts a fact that exists only in its own summary. Every one of these failures is invisible to the output until it isn't, and visible in the context the entire time.
III. What Bell Tuning is
Bell Tuning is the practice of treating an AI context window as a measurable distribution and tuning the workflow against the shape of that distribution rather than the output it eventually produces.
The shape is, literally, a bell curve. Each chunk of content currently in the AI's context window can be scored for alignment against the domain the AI is supposed to operate in. The mean of those scores tells you whether the context is on-topic on average. The standard deviation tells you whether it is consistently on-topic or scattered. The skewness tells you which direction it is drifting. The kurtosis tells you whether contamination is producing two coexisting clusters. The histogram tells you the full shape.
A healthy context window has a tight, right-shifted bell curve — most chunks score high, the spread is low. A degrading context window has a wider, leftward-drifting curve — chunks score lower on average, the spread grows. A collapsed context window has a flat curve — chunks score near zero, the system is generating output from noise.
The transition from healthy to degraded is continuous. It is detectable in the bell curve well before it is detectable in the output. The standard deviation moves first — new content arriving from a different distribution widens the spread. Then the skewness — the tail of low-alignment chunks lengthens. Then the mean — finally enough off-topic mass accumulates that the average drops. Then, finally — by which point recovery is often impossible — the output.
Each moment carries a distinct diagnostic meaning: the mean locates the context, the standard deviation measures how consistently it stays there, the skewness shows the direction of departure, and the kurtosis exposes the bimodal signature of contamination, two coexisting clusters that no single-number threshold can see. The full histogram is the diagnostic surface; the moments are how you summarize it for alerting and for forecasting.
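As a concrete illustration, the four moments can be computed from a list of per-chunk alignment scores with nothing beyond the standard library. This is a minimal sketch, not the context-inspector implementation; the score arrays below are hypothetical.

```python
import math

def bell_moments(scores):
    """Summarize per-chunk alignment scores (0..1) by their first four
    moments: mean, standard deviation, skewness, excess kurtosis."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = math.sqrt(var)
    if std == 0:
        return mean, 0.0, 0.0, 0.0
    skew = sum((s - mean) ** 3 for s in scores) / (n * std ** 3)
    kurt = sum((s - mean) ** 4 for s in scores) / (n * std ** 4) - 3.0
    return mean, std, skew, kurt

# A tight, right-shifted bell (healthy) vs. a wider, left-drifted one.
healthy = [0.82, 0.85, 0.79, 0.88, 0.84, 0.81, 0.86, 0.83]
drifting = [0.81, 0.74, 0.55, 0.62, 0.31, 0.44, 0.18, 0.25]

for label, scores in (("healthy", healthy), ("drifting", drifting)):
    mean, std, skew, kurt = bell_moments(scores)
    print(f"{label}: mean={mean:.2f} std={std:.2f} "
          f"skew={skew:+.2f} kurt={kurt:+.2f}")
```

The healthy series shows a high mean and a small spread; the drifting series shows the lower mean and widening spread described above.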
This is a measurement claim. It is testable. We have tested it. The whitepapers below document the tests. On one controlled benchmark the bell-curve signal led static-mean output detection by 17 turns. On another, the framework's own predictor-corrector engine was beaten by a simple static threshold — we publish that negative result alongside the positive ones because the discipline is more important than the marketing.
Bell Tuning is the discipline of doing this measurement continuously, in production, on every workflow, and treating the bell curve as the primary diagnostic surface for AI reliability.
IV. The mathematical foundation is not new
The statistical machinery is old. TF-IDF for term-document scoring dates to the 1970s. Cosine similarity is older. The bell curve is the foundation of inferential statistics. Predictor-corrector numerical methods for ordinary differential equations go back to Adams in the 19th century. Kalman filters are 60 years old. Jensen-Shannon divergence and 1-Wasserstein distance for shape comparison are textbook information theory.
What is new is the application of these classical techniques to the specific problem of AI context-window monitoring — and the realization that this application is not just possible but is in fact the missing observability layer for production AI systems.
The novelty is not in the mathematics. The novelty is in the framing.
The context window has a measurable distribution. That distribution has a shape. Shape evolution is forecastable. Forecast errors are leading indicators of output failure.
Each of those four sentences is a published result, not a hypothesis. We have shipped the experiments that test each one. The whitepapers cite each other; the citation graph is starting to look like a discipline.
This is also why Bell Tuning generalizes. The framework — score each unit, observe the distribution, monitor the shape, forecast the trajectory — applies to context windows, retrieval results, multi-agent tool-call patterns, and conversation transcripts. The same statistical discipline produces a different sensor for each domain. Five sensors so far. The list will grow.
The framework also outlasts any specific scoring choice. We use TF-IDF cosine because it has zero dependencies and runs anywhere. An embedding-based scoring backend is a v1.1 addition; multi-modal alignment scoring (text + image + audio) is a v2 addition. None of those require changing the framework. The bell curve over any per-unit alignment score is the same statistical object — same moments, same forecastable trajectory, same pathology fingerprints. The scoring layer can evolve underneath the framework indefinitely without invalidating the discipline that sits on top.
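To make the zero-dependency scoring claim concrete, here is a minimal sketch of TF-IDF cosine alignment scoring in pure Python. The tokenization, the smoothed IDF, and the domain-reference construction are simplified stand-ins for illustration, not the tools' actual scoring code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical domain reference plus two context chunks.
domain = "retrieval context window drift alignment score".split()
on_topic = "alignment score of each context chunk".split()
off_topic = "quarterly revenue grew in the sales report".split()

vecs = tfidf_vectors([domain, on_topic, off_topic])
print(cosine(vecs[0], vecs[1]))  # shares terms with the domain
print(cosine(vecs[0], vecs[2]))  # shares none: scores exactly zero
```

The off-topic chunk scoring exactly zero also illustrates the lexical limit noted later: a semantically relevant chunk sharing no tokens with the domain would score zero too, which is what an embedding backend would fix without changing the bell-curve framework above.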
V. The instruments
We have built five tools, all open-source, all MIT-licensed, all composable. Each implements the framework against a different surface of the AI workflow. Detailed descriptions and one-line installs appear below this essay; this section names them and explains why each exists.
context-inspector measures the bell curve of chunk alignment for the context window itself. This is the founding instrument. It is the published whitepaper's subject. It is what you install first.
retrieval-auditor does the same for retrieval-augmented generation. Each retrieved document is scored against the query; the bell curve of scores tells you whether the retrieval is healthy, contaminated, redundant, or rank-inverted. Pathology flags catch failure modes that precision@K cannot express at all — including score miscalibration and rank inversion.
tool-call-grader does it for multi-agent systems. Per-tool-call relevance is scored; the session-level distribution reveals silent failures, tool fixation, response bloat, schema drift, and cascading failures.
predictor-corrector is the forecaster. Given the trajectory of any of the above bell curves over time, it forecasts the next state under healthy dynamics. The gap between forecast and reality is itself a leading indicator. On the Unseen Tide benchmark it leads static-mean output detection by 17 turns.
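The forecast-gap idea can be sketched in a few lines. This toy assumes a linear extrapolation over a short window as the "healthy dynamics" model and a calibration phase to set the alarm threshold; the window size, the threshold rule, and the series below are illustrative stand-ins, not the engine's actual design.

```python
def drift_alarm(series, window=5, k=3.0):
    """Forecast each point from a least-squares line over the previous
    `window` points, then flag the first turn whose forecast residual
    exceeds the calibration-phase residual mean plus k spreads."""
    residuals = []
    for i in range(window, len(series)):
        xs = range(window)
        ys = series[i - window:i]
        mx = sum(xs) / window
        my = sum(ys) / window
        denom = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
        forecast = my + slope * (window - mx)  # extrapolate one step
        resid = abs(series[i] - forecast)
        if len(residuals) >= window:  # past calibration: test the gap
            calib = residuals[:window]
            m = sum(calib) / window
            sd = (sum((r - m) ** 2 for r in calib) / window) ** 0.5
            if resid > m + k * max(sd, 1e-9):
                return i  # first turn where reality leaves the forecast
        residuals.append(resid)
    return None

# Stable bell-curve mean for 15 turns, then an accelerating drop.
series = [0.85] * 15 + [0.80, 0.72, 0.61, 0.47]
print(drift_alarm(series))  # → 15
```

The point of the design is that the alarm fires on the forecast residual, not on the level of the mean itself, which is why it can lead a static threshold on monotonic drift.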
audit-report-generator is the productization tool. It consumes outputs from the four sensors and emits a unified audit report — markdown, HTML, or JSON. It is the technical foundation of the consulting engagement turned into a self-serve product.
The five tools are independent CLIs and MCP servers. They share data shapes deliberately so they compose without adapters. The managed platform — Bell Tuning Cloud — aggregates them all into one dashboard with time-series, alerting, and downloadable reports.
VI. The evidence
Four experiments, each documented in a whitepaper, each fully reproducible from the public repo. Headline findings:
Unseen Tide (predictor-corrector). Forty-turn staged-perturbation protocol on a controlled corpus. Tests whether the predictor-corrector forecaster catches drift earlier than static-threshold detectors. Result: forecaster fires on turn 17, static-σ on turn 28, static-mean on turn 34. 17-turn lead time over static-mean detection. Zero false positives in the calibration phase.
Conversation Rot (predictor-corrector). Fifty-one-turn synthetic chat with three drift-recovery cycles. Tests whether the forecaster handles oscillating drift with sliding-window context. Result: static-σ wins on this scenario (F1 0.76 vs 0.52). Honest negative result. The predictor-corrector's value is for monotonic slow drift, not bidirectional cycles. We publish the loss because the discipline is more important than any individual tool's marketing.
RAG Needle (retrieval-auditor). Pathology fingerprint test plus progressive degradation. Tests whether the auditor's health score tracks ground-truth precision@5 without seeing labels. Result: r = 0.999 correlation on alignment-degrading phases. All six pathology flags fire correctly on their designed scenarios with zero false positives on the clean control. Unsupervised RAG monitoring is feasible.
Agent Cascade (tool-call-grader). Six pathology scenarios on synthetic multi-agent traces. Tests whether session-level signals diagnose specific failure modes. Result: 7/7 pass rate. Each pathology fires on its designed scenario with logically consistent co-fires (cascading failures also trips schema drift because error responses are unstructured — correct, not a false positive).
The pattern across all four experiments: the framework works on monotonic drift, gracefully reports its own limits on oscillating drift, and produces clean signal across multiple AI workflow surfaces. Negative results are reported honestly. We will publish negative results from the community alongside our own.
VII. What Bell Tuning enables — and what it isn't
What it enables. Continuous, unsupervised monitoring of AI context health in production. Drift detection without labels. Pathology classification at the chunk-level statistical layer where most failures are detectable. Forecast-based early warning. A unified diagnostic surface across context, retrieval, and multi-agent tool calls. Audit reports that look like deliverables, not telemetry dumps.
What it isn't. A replacement for evals. A replacement for human review. A guarantee that detected drift means broken output. A statement about model quality. A claim that the framework catches everything.
Specific known limits we publish openly: semantically relevant content that shares no lexical tokens with a query is invisible to TF-IDF cosine scoring — an embedding-based backend addresses this and is on the roadmap. Adversarial paraphrase — off-topic content rewritten to share the on-topic vocabulary — is the obvious weakness of any lexical scorer; mitigation requires hybrid lexical-plus-semantic scoring. Very small chunk counts (under 10) make higher-moment statistics unreliable; the tools default to weighting only the first two moments at small sample sizes. Bimodal-cluster detection requires K ≥ 15 in the histogram. The Conversation Rot experiment shows where a static-σ threshold beats the predictor-corrector. None of these limits are hidden. They are documented in the corresponding whitepapers, and they are how we choose what to build next.
Bell Tuning is one layer of an AI reliability stack. It is the layer most teams are missing. It composes well with eval suites, prompt observability, output review, and human-in-the-loop. It does not replace them.
VIII. The call
If this framework is right, three actions follow for any team running production AI.
First — install one instrument. The fastest path is npx contrarianai-context-inspector --install-mcp. It adds a Bell Tuning sensor to your existing AI workflow within ninety seconds. You do not need to change anything about your stack to start watching the bell curve.
Second — read one whitepaper. The RAG Needle paper is the most actionable for teams running retrieval pipelines. The Unseen Tide paper is the most theoretically interesting. Both are short, reproducible, and free.
Third — ship one experiment of your own. The same framework that produced our four whitepapers can produce yours. Reproduce one of ours against your data. Publish the result. We will cite it.
Bell Tuning is not a product roadmap. It is a worldview about where the bottleneck in production AI lives. The tools, the platform, the whitepapers, the consulting engagements — those are downstream artifacts. The worldview is the contribution. If you find it useful, the rest will follow.