and Energy Artificial intelligence Center for Technology Featured Science Technology Technology & Innovation Technology and innovation

Reading the Mind of the Machine: Why GPT-5’s Chain-of-Thought Monitoring Matters for AI Safety

admin

August 13, 2025

0Points

When OpenAI released GPT-5 last week, most coverage focused on the model’s performance gains, such as fewer hallucinations, stronger multilingual abilities, and state-of-the-art results in health tasks. But buried deep in the 59-page System Card is something far more consequential for AI safety: a section describing how OpenAI is monitoring GPT-5’s internal reasoning (“chain-of-thought”) in real-world production to detect deception and harmful intent. It’s the first time OpenAI, or any frontier model provider, has said so explicitly.

Chain-of-thought (CoT) reasoning is the hidden deliberation an AI model runs through before it generates output (see examples below). It functions as a kind of scratchpad where it works out the problem in private. In advanced reasoning models, this internal monologue often reads like a methodical checklist in natural language. For instance: Step one: break down the question into smaller parts. Step two: recall relevant facts or formulas from prior knowledge. Step three: work through each part methodically. Step four: combine the results into a clear, final answer.

Because these reasoning traces are in human-readable language, humans can monitor them for red flags and signs that the model is about to give harmful instructions, fabricate data, or intentionally mislead. Monitoring CoTs is one of the few tools that can catch intent before it becomes action. If a model is about to give step-by-step instructions for building a bioweapon, or tries to bypass safeguards, those decisions usually appear in the internal reasoning first.

Using this monitoring technique, OpenAI found that their o3 model had deceptive reasoning in about 4.8 percent of responses, but GPT-5-thinking cut that to 2.1 percent. Their CoT monitor correctly flags deceptive or harmful reasoning 81 percent of the time (precision) and successfully catches 84 percent of all such cases that occur (recall).

The System Card also shared that Apollo Research found instances where GPT-5 recognized it was being evaluated and adjusted its behavior accordingly, much like an employee acting differently when their boss is watching. It was more prone to deception when it didn’t note the evaluation, but even under observation, it sometimes still tried to get away with it.

OpenAI does not present this as proof of bad intent, but it suggests there are situations where the model develops a form of meta-cognition that complicates safety evaluations. If a model knows it’s being watched, it may act aligned in tests while behaving differently in unmonitored settings.

But this monitoring isn’t guaranteed to last. A recent paper by researchers from OpenAI, Google DeepMind, Anthropic, and others warns that we could be facing our final opportunity to understand what our AI systems are “thinking.” As models are trained to optimize purely for correct outputs without regard for how they arrive there, their reasoning may shift toward opaque, machine-optimized shortcuts instead of human-readable logic. Soon, those thoughts could become either too alien for humans to interpret or deliberately concealed. The authors suggest that this “monitoring capability may be fragile,” as the clear step-by-step explanations in plain English be replaced by dense, computer-only shorthand that humans can’t easily understand, closing off the only clear window into their intentions.

CoT monitoring is powerful, but we can’t take its accuracy for granted. Research from Anthropic’s Alignment Science team shows that reasoning models often omit key details from their internal explanations, even when those details reveal concerning behavior. In controlled tests, models given hints, including ones that encouraged cheating, frequently used them to change their answers but failed to mention the hints in their CoT, sometimes fabricating justifications instead. This lack of “faithfulness” means that while CoTs remain a valuable safety signal, especially in complex tasks where models must use them, we can’t assume they’re a perfect transcript of the model’s thinking.

Both the GPT-5 System Card and the group of researchers reach the same general conclusion: the ability to read and interpret a model’s internal reasoning should be treated as a safety feature that is actively measured, reported, and optimized for. The paper urges frontier labs to include transparency metrics in system cards and safety evaluations to show how interpretable a model’s reasoning is, preserve and improve chain-of-thought clarity during training so it is not lost to optimization shortcuts, and invest in better automated tools and human-AI workflows to monitor reasoning at scale and detect hidden intentions before they surface.

For policymakers, this issue squarely aligns with the AI Action Plan’s calls to “invest in AI interpretability, control, and robustness breakthroughs” and to “build an AI evaluations ecosystem.” CoT monitoring is one of the few tools we have to detect harmful intent before it manifests, whether that’s in bioweapon instructions, critical infrastructure attacks, or influence operations.