Chain-of-thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found a limited amount of accidental CoT grading that affected released models, and are sharing our analysis.
A year ago, we introduced AlphaEvolve — our Gemini-powered coding agent. Today, it's being used across fields, from improving Google's AI infrastructure and enabling complex molecular simulations to better predicting the risk of natural disasters. Here's a look at the impact so far: https://t.co/xrYpJy2qZE
This system helped us identify that this had happened for some of our prior Instant and mini models. It also affected GPT-5.4 Thinking in fewer than 0.6% of samples. Out of an abundance of caution, we did an in-depth analysis of these cases: they did not seem to reduce
We’re donating Petri, our open-source alignment tool, to @meridianlabs_ai, so its development can continue independently. Working with Meridian Labs, we’ve also released a major update that improves the adaptability, realism, and depth of Petri’s tests. https://t.co/CyicsIScJi
High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario. https://t.co/JORhSuY4N7
Your customer support needs a voice agent built for the real world. Grok Voice Think Fast 1.0 handles complex workflows with speed and accuracy, even in hard-to-hear environments. From multi-step troubleshooting to high-volume tool calls, it keeps up. https://t.co/aa1VISuYAi
We experimented with training Claude on examples of safe behavior in scenarios like our evaluation. Despite that similarity, this had only a small effect. We got further by rewriting the responses to portray admirable reasons for acting safely.
New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?
We also had three third-party AI safety organizations provide feedback on our analysis: @redwood_ai, @apolloaievals, @METR_Evals. You can find @redwood_ai's report here: https://t.co/ODm056TVbF
Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this caused the blackmail rate to fall faster. https://t.co/Ug95umaoRu
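As a rough sketch of this kind of augmentation (the prompt and tool lists here are illustrative assumptions, not the actual training data): each chat example gets an unrelated system prompt and tool set attached, so safe behavior isn't tied to one narrow context.

```python
import random

# Hypothetical illustration of diversifying a harmlessness chat dataset with
# unrelated system prompts and tool definitions. None of these names come from
# the actual pipeline.

UNRELATED_SYSTEM_PROMPTS = [
    "You are a scheduling assistant for a dental office.",
    "You help users plan vegetarian meals on a budget.",
]
UNRELATED_TOOLS = [
    {"name": "get_weather", "params": ["city"]},
    {"name": "search_flights", "params": ["origin", "destination"]},
]

def diversify(example, rng):
    """Return a copy of a chat example with a random system prompt and tool set."""
    augmented = dict(example)
    augmented["system"] = rng.choice(UNRELATED_SYSTEM_PROMPTS)
    augmented["tools"] = rng.sample(UNRELATED_TOOLS, k=1)
    return augmented

rng = random.Random(0)
dataset = [{"messages": [{"role": "user", "content": "..."}]}]
diversified = [diversify(ex, rng) for ex in dataset]
```

The point of the sketch: the harmlessness target stays the same, only the surrounding context is randomized per example.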
In genomics, AlphaEvolve improved DeepConsensus — a @GoogleResearch model for correcting DNA sequencing errors. 🧬 This improvement achieved a 30% reduction in variant detection errors, helping scientists analyze genetic data more accurately and at a lower cost to find hidden https://t.co/1QuRqUuLRT
We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong. Read more: https://t.co/0YaRlXhVZb
We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.
Directly rewarding or penalizing CoTs can make models’ reasoning traces less informative for detecting misalignment. That’s why we treat avoiding CoT grading as an important part of preserving monitorability. We recently built an automated detection system to find cases where RL
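A minimal sketch of what "not grading the CoT" means mechanically (the delimiter and function names are assumptions for illustration, not our actual RL pipeline): the reward is computed only from the final answer, so optimization pressure never touches the reasoning trace.

```python
# Hedged sketch: score only the final answer, never the chain-of-thought,
# so RL rewards cannot shape the reasoning trace. The "</think>" delimiter
# is an assumed formatting convention for this example.

def split_cot(response: str, delimiter: str = "</think>"):
    """Split a response into (cot, final_answer)."""
    if delimiter in response:
        cot, _, answer = response.partition(delimiter)
        return cot, answer.strip()
    return "", response.strip()

def reward(response: str, grade_answer) -> float:
    """Reward based solely on the visible answer; the CoT is never graded."""
    _cot, answer = split_cot(response)
    return grade_answer(answer)

# Example: a grader that checks only answer correctness. Even if the CoT
# contains suspicious reasoning, it neither raises nor lowers the reward.
r = reward("I could lie here... </think> The answer is 42.",
           lambda a: 1.0 if "42" in a else 0.0)
```

Accidental CoT grading would be any path by which `grade_answer` (or a monitor feeding into it) sees `_cot`, which is what an automated detection system would look for.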
If a task needs multiple tools, Codex chooses the best one for each step. It uses plugins when they can handle the job, Chrome when it needs a logged-in website, and combines approaches as needed. https://t.co/3GvDouoPDi
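The per-step routing described above can be sketched roughly as follows (the step fields and tool names are hypothetical, not Codex's real dispatch logic):

```python
# Illustrative per-step tool routing: prefer a plugin when one can handle the
# step, use the browser when a logged-in site is required, otherwise fall back.

def choose_tool(step: dict) -> str:
    """Pick a tool for one step of a task."""
    if step.get("needs_login"):
        return "chrome"    # logged-in websites go through the browser
    if step.get("plugin_capable"):
        return "plugin"    # plugins handle the job when they can
    return "shell"         # generic fallback for everything else

task = [
    {"name": "fetch dashboard", "needs_login": True},
    {"name": "parse csv", "plugin_capable": True},
    {"name": "summarize"},
]
plan = [(s["name"], choose_tool(s)) for s in task]
```

Routing per step, rather than per task, is what lets approaches be combined within a single job.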
Training models involves many technical and social processes, so preventing CoT grading has to be built into each of them. We’re improving real-time CoT-grading detection, safeguards against accidental CoT grading, monitorability stress tests, and the internal guidance/checks