The right questions for your role

Most AI failures persist because nobody asks the uncomfortable questions. These are specific enough to use in your next 1:1 -- pick your role and start asking.

Questions to ask your CTO / AI architects

Framed in business impact, cost, and risk. You don't need to understand the architecture to ask these.

What percentage of our AI features actually require AI reasoning vs. deterministic logic we're overpaying for?

Why it matters: Most "AI features" are if/else workflows burning $0.03/call on reasoning that should cost $0.
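
To make the audit concrete, here's a minimal sketch (the keyword routes and the $0.03 figure are illustrative) of sending deterministic cases through an if/else path and reserving the model for genuinely ambiguous input:

```python
# Illustrative routing table: requests a keyword rule can classify
# never reach the model. Only the remainder pays for reasoning.
ROUTES = {
    "refund": "billing_queue",
    "cancel": "retention_queue",
    "invoice": "billing_queue",
}

def route(message: str):
    lowered = message.lower()
    for keyword, queue in ROUTES.items():
        if keyword in lowered:
            return queue, 0.0        # deterministic: costs nothing
    return "needs_model", 0.03       # ambiguous: pay for reasoning
```

Counting how often the second branch fires is the honest answer to the question above.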

If our model provider went down for 4 hours today, what happens to our product?

Why it matters: No fallback means a third-party outage becomes your outage -- and your customer churn.

Who evaluates whether our AI is giving correct answers -- and is it the same AI that produced them?

Why it matters: Self-evaluation is confirmation bias with a GPU. If the grader is the student, nobody is grading.

What's our actual cost per AI-powered action, and how does that compare to the value it creates?

Why it matters: Without unit economics, you can't tell if AI is a profit center or a subsidized cost center.

How many of our AI errors does the user see before we do?

Why it matters: If your detection method is "a customer complains," your actual error rate is whatever they tolerate before leaving.

What's our token spend this month, and what percentage of that is waste -- system prompts, retries, oversized context?

Why it matters: Token waste is invisible on the P&L until someone itemizes it. Teams routinely burn 30-50% on overhead.

If a competitor replicated our AI features with basic automation, what would they lose?

Why it matters: If the answer is "nothing," the AI isn't a moat -- it's a markup.

When was the last time we audited what data the AI can access, and whether it should?

Why it matters: AI tools with broad access accumulate permissions like sediment. One misconfigured tool can expose data you didn't know was reachable.

Questions to ask your engineering leads

Architecture, systems, and observability. These expose the gaps between what's documented and what's deployed.

Show me the structured logs for AI errors from the last 7 days. How many were silent?

Why it matters: If errors aren't structured and queryable, your monitoring is decorative.
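
A minimal sketch of what "structured and queryable" means in practice -- one JSON object per failure, with an explicit silent flag. The field names are illustrative, not a standard schema:

```python
import json
import logging
import sys

def log_ai_error(logger, *, stage, error_type, model, silent, detail=""):
    # One JSON line per failure: grep-able, countable, chartable.
    logger.error(json.dumps({
        "event": "ai_error",
        "stage": stage,            # e.g. "retrieval", "generation"
        "error_type": error_type,  # e.g. "timeout", "schema_mismatch"
        "model": model,
        "silent": silent,          # True if the user saw no error
        "detail": detail,
    }))

logging.basicConfig(stream=sys.stdout, format="%(message)s")
log = logging.getLogger("ai")
log_ai_error(log, stage="generation", error_type="timeout",
             model="example-model", silent=True)
```

"How many were silent?" becomes a one-line query instead of an archaeology project.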

Which of our agents evaluate their own output vs. having an external evaluator?

Why it matters: Self-evaluation systematically underreports failures. The model that wrote the bad code won't flag the bad code.

What's our context window utilization -- how much is system prompt, tool definitions, and overhead vs. actual user data?

Why it matters: Teams routinely discover 60%+ of their context window is consumed before the user's input arrives.

Walk me through the session handoff. What artifacts persist between sessions, and what's silently dropped?

Why it matters: Most "memory" implementations lose structured data (amounts, dates, IDs) during summarization.
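
One way to check for this: a handoff that carries structured artifacts alongside the prose summary, instead of trusting summarization to preserve them. A sketch, with illustrative field names:

```python
def build_handoff(summary: str, messages: list[dict]) -> dict:
    # Pull structured fields out before summarization can drop them.
    artifacts = {}
    for m in messages:
        for key in ("order_id", "amount", "due_date"):
            if key in m:
                artifacts[key] = m[key]
    return {"summary": summary, "artifacts": artifacts}
```

If your handoff is summary-only, every ID and amount in the conversation is one paraphrase away from gone.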

How many tools are registered per agent, and what's our tool selection accuracy once the count exceeds five?

Why it matters: Tool selection degrades measurably past 4-5 tools per agent. More tools, more misroutes, zero alerts.

What model are we using for each pipeline stage, and what's the justification for each choice?

Why it matters: Running Opus for classification tasks that Haiku handles identically costs 40x more with zero quality gain.

Pull up our tool descriptions. Would a human reading them pick the right tool on the first try?

Why it matters: Ambiguous tool descriptions are the #1 cause of silent misroutes, and they generate no errors to debug.

What's our batch API usage on latency-tolerant workloads? Are we paying real-time prices for background jobs?

Why it matters: Batch API is 50% cheaper. If your nightly reports run synchronously, you're paying a premium for speed nobody needs.

Questions to ask your team leads

Process, reliability, and failure handling. These reveal whether your AI pipeline is production-grade or a demo with uptime.

Walk me through what happens when the AI pipeline fails mid-request. Where does the user end up?

Why it matters: If the answer involves "it depends" or "they'd see an error page," there's no graceful degradation path.

What's our mean time to detect an AI failure? An hour? A day? When a user complains?

Why it matters: AI failures are often semantically wrong, not HTTP-500 wrong. Traditional monitoring misses them entirely.

How are we testing the unhappy paths -- empty input, 50KB payloads, timeouts, provider outages, concurrent duplicates?

Why it matters: If your test suite only covers the happy path, it's a demo test suite, not a production test suite.

What triggers human escalation from the AI -- sentiment analysis, confidence scoring, or structured logic?

Why it matters: "The AI decides when it's stuck" is not an escalation policy. It's a hope.
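
An escalation policy fits in a screen of code. A sketch of structured triggers -- the thresholds and field names are illustrative:

```python
def should_escalate(turn_count: int, confidence: float,
                    user_requested_human: bool) -> bool:
    if user_requested_human:
        return True           # explicit request always wins
    if confidence < 0.5:
        return True           # model reports low confidence
    if turn_count > 8:
        return True           # long loops suggest the AI is stuck
    return False
```

The point isn't these exact thresholds; it's that the rules are written down, testable, and tunable -- not latent in a prompt.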

How many of our AI pipelines have a deterministic fallback, and how many just fail?

Why it matters: A rule-based fallback that handles 80% of cases is better than an AI-only path that fails 10% of the time.

What's the retry logic when a model call times out? Is there a backoff, a fallback model, or does it just retry the same call?

Why it matters: Naive retries on overloaded providers amplify the problem and double your costs.
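
What "not naive" looks like: exponential backoff with jitter, then a fallback model. A sketch assuming a hypothetical call_model(model, prompt) that raises TimeoutError; model names and delays are illustrative:

```python
import random
import time

def call_with_fallback(call_model, prompt,
                       primary="big-model", fallback="small-model",
                       max_retries=3, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            return call_model(primary, prompt)
        except TimeoutError:
            # Backoff with jitter: don't hammer an already-overloaded
            # provider with identical, synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    # Retries exhausted: degrade to a cheaper/faster model rather
    # than surfacing a raw error to the user.
    return call_model(fallback, prompt)
```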

Show me the last 5 AI-related incidents. How were they detected, and what was the resolution time?

Why it matters: If you can't pull this up quickly, you don't have AI-specific incident tracking -- you have general ops hoping to catch it.

Are we logging the full prompt/response for production AI calls, or just the final output?

Why it matters: Without prompt-level observability, debugging AI failures means guessing what the model saw.

Questions to verify directly

Hands-on-keyboard checks. Run these yourself -- don't delegate, don't ask for a summary, look at the output.

Run the app with the AI provider returning 500s for 60 seconds. What does the user see?

Why it matters: If the answer is "I haven't tried that," your resilience is theoretical.
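
The drill is cheap to automate: wrap the provider client in a stub that fails, and assert on what the user-facing path returns. A sketch with illustrative names (ProviderDown, the fallback copy):

```python
class ProviderDown(Exception):
    pass

def failing_provider(prompt):
    # Stand-in for 60 seconds of HTTP 500s from the provider.
    raise ProviderDown("HTTP 500")

def answer(prompt, provider):
    try:
        return provider(prompt)
    except ProviderDown:
        # What the user should see instead of a stack trace.
        return "We're having trouble right now; your request is saved."
```

If this test doesn't exist in your suite, the answer to the question is "we don't know".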

Check git history: are any API keys, model tokens, or provider credentials committed in plaintext?

Why it matters: Rotating a leaked key costs an afternoon. A leaked key that nobody noticed costs everything.

Submit an empty form, a 50KB form, and the same form twice in 200ms. What happens each time?

Why it matters: Edge cases at the AI boundary are where confidence meets chaos. These are the inputs your validators need to handle before the model sees them.
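
A sketch of that boundary validation -- reject empty input, cap payload size, drop near-simultaneous duplicates -- before any model call. The limits and the in-memory dedupe window are illustrative:

```python
import hashlib
import time

MAX_BYTES = 8_000        # cap before the model ever sees the input
DEDUPE_WINDOW_S = 1.0    # duplicates inside this window are dropped
_recent = {}             # payload digest -> last-seen timestamp

def validate(payload: str, now=None):
    now = time.monotonic() if now is None else now
    if not payload.strip():
        return "rejected: empty"
    if len(payload.encode("utf-8")) > MAX_BYTES:
        return "rejected: too large"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    last = _recent.get(digest)
    _recent[digest] = now
    if last is not None and now - last < DEDUPE_WINDOW_S:
        return "rejected: duplicate"
    return "ok"
```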

Grep for silent catch blocks in the AI pipeline code. How many swallow errors without logging or re-throwing?

Why it matters: Every silent catch is an error you'll never see in your dashboard. They accumulate into "the AI is just unreliable sometimes."
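
For Python codebases, a rough triage pass can be a few lines -- a regex scan for except blocks whose entire body is pass. It's a blunt instrument, not a linter, but it surfaces the worst offenders fast:

```python
import re

# Matches handlers that swallow errors outright: an except line
# followed immediately by a bare `pass`.
SILENT_CATCH = re.compile(r"except[^\n]*:\s*\n\s*pass\b", re.MULTILINE)

def find_silent_catches(source: str):
    return SILENT_CATCH.findall(source)
```

Handlers that log or re-raise won't match; the ones that match are the errors your dashboard will never show you.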

Measure the actual token count of your system prompt + tool definitions. What percentage of the context window is gone before the user says anything?

Why it matters: If your overhead is 40K tokens on a 128K window, the user's "long conversation" is shorter than they think.
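
A back-of-envelope sketch of the measurement. The 4-characters-per-token ratio is a rough heuristic -- use your provider's actual tokenizer for real numbers -- but it's enough to see whether overhead is 5% or 40% of the window:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def overhead_fraction(system_prompt: str, tool_definitions: list[str],
                      context_window: int = 128_000) -> float:
    overhead = estimate_tokens(system_prompt)
    overhead += sum(estimate_tokens(t) for t in tool_definitions)
    return overhead / context_window
```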

Trace a single AI request end-to-end. How many model calls does it actually make, and what's the total latency and cost?

Why it matters: "One user action" often triggers 3-7 chained model calls. If nobody's counted, nobody's optimized.
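
Counting is the easy part once there's a wrapper in the call path. A sketch of a tracing decorator that tallies calls, latency, and cost per user action (the flat cost_per_call is illustrative; real pricing is per token):

```python
import time

class Trace:
    def __init__(self):
        self.calls = 0
        self.latency = 0.0
        self.cost = 0.0

    def wrap(self, call_model, cost_per_call=0.01):
        def traced(*args, **kwargs):
            start = time.monotonic()
            try:
                return call_model(*args, **kwargs)
            finally:
                # Tally even when the call raises.
                self.calls += 1
                self.latency += time.monotonic() - start
                self.cost += cost_per_call
        return traced
```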

Find the stop_reason handling in your agent loop. Is it checking structured stop signals, or parsing natural language for "I'm done"?

Why it matters: Parsing natural language for completion is a race condition disguised as a feature. It works until it doesn't.
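
What the structured version looks like, as a sketch. The response shape and the stop_reason values here are illustrative -- the point is that termination is a field check, not a prose check:

```python
def agent_loop(step, max_turns=10):
    for _ in range(max_turns):
        response = step()
        # Structured check: the API told us why generation stopped.
        if response["stop_reason"] == "end_turn":
            return response["text"]
        if response["stop_reason"] == "tool_use":
            continue  # dispatch tool calls, then loop again
    raise RuntimeError("agent exceeded max_turns without finishing")
```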

Check your MCP server configs. Are tool descriptions unambiguous, and do error responses use structured formats the model can parse?

Why it matters: A tool that returns "Error" as a string gives the model nothing to work with. Structured error responses let it recover or escalate.
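
The contrast in miniature -- an opaque failure vs. one the model can act on. The field names are illustrative, not prescribed by any spec:

```python
import json

def opaque_failure():
    return "Error"          # gives the model nothing to recover from

def structured_failure():
    return json.dumps({
        "status": "error",
        "code": "RATE_LIMITED",
        "retryable": True,
        "detail": "upstream API returned 429; retry after 30s",
    })
```

With the second shape, the model (or the agent loop around it) can decide to back off, retry, or escalate instead of guessing.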

Ready for answers?

If these questions made you uncomfortable, that's the point. Now find out where you actually stand.