Agentic AI
Claude 3.7 Sonnet's Extended Thinking Mode Has Changed What AI Agents Can Do
In February 2025, Anthropic released Claude 3.7 Sonnet with Extended Thinking — a mode where the model reasons through complex problems step-by-step before responding. It scored 70.3% on SWE-bench Verified. Here's what this breakthrough means for AI agents and your business.
· 8 min read · By BraivIQ Editorial
In February 2025, Anthropic released Claude 3.7 Sonnet with a feature it called Extended Thinking. The idea sounds simple: before responding to a complex question or task, the model thinks through it step-by-step — like a person working through a problem on paper before writing the answer. In practice, the results were remarkable. Claude 3.7 Sonnet with Extended Thinking scored 70.3% on SWE-bench Verified — a benchmark measuring the ability to solve real software engineering problems — compared to roughly 50% for previous frontier models.
For business applications, the significance is not primarily the benchmark. It's what the benchmark represents: an AI model that can handle genuinely complex, multi-step reasoning tasks that previously required human judgment at every step. This directly expands the range of business processes that AI agents can manage autonomously.
- 70.3% — SWE-bench Verified score for Claude 3.7 Sonnet with Extended Thinking
- 40% — more complex tasks handled autonomously vs previous Claude versions
- 3× — improvement in multi-step reasoning accuracy with Extended Thinking enabled
- $3 — per million input tokens, comparable in cost to GPT-4o
What Extended Thinking Actually Changes
Standard AI models process a query and immediately generate a response. Extended Thinking adds a reasoning phase: the model works through the problem — considering different approaches, identifying potential errors, checking its own logic — before producing a final answer. This internal reasoning is shown to the user as a 'thinking' trace, so you can see how the model reached its conclusion.
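A minimal sketch of what enabling this looks like in practice, based on the shape of Anthropic's public Messages API: the request carries a `thinking` parameter with a token budget for the reasoning phase, and the response interleaves 'thinking' blocks (the trace) with 'text' blocks (the final answer). The sample response below is illustrative, not a real model output.

```python
def build_request(prompt: str, thinking_budget: int = 8_000) -> dict:
    """Request payload with Extended Thinking enabled (per Anthropic's API docs)."""
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": 16_000,
        # Extended Thinking: reserve a token budget for the reasoning phase.
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

def split_trace(content_blocks: list[dict]) -> tuple[str, str]:
    """Separate 'thinking' blocks (the reasoning trace) from 'text' blocks (the answer)."""
    trace = "\n".join(b["thinking"] for b in content_blocks if b["type"] == "thinking")
    answer = "\n".join(b["text"] for b in content_blocks if b["type"] == "text")
    return trace, answer

# Illustrative response shape — thinking first, then the final answer:
sample = [
    {"type": "thinking", "thinking": "Check clause 4.2 against clause 9.1..."},
    {"type": "text", "text": "Clauses 4.2 and 9.1 conflict on termination notice."},
]
trace, answer = split_trace(sample)
```

The trace is what makes the mode auditable: you can log it alongside the answer and review how a conclusion was reached.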
The practical impact: tasks where standard AI models fail because they require holding multiple considerations in balance simultaneously — complex code debugging, multi-constraint planning, nuanced document analysis — become reliable with Extended Thinking enabled. The familiar failure mode of models being 'confidently wrong' on complex tasks is significantly reduced.
Business Applications Where Extended Thinking Changes the Outcome
- Complex contract analysis: Standard AI models miss nuanced clause interactions in complex contracts. Extended Thinking tracks multiple provisions simultaneously and identifies conflicts accurately.
- Multi-constraint scheduling and planning: Project planning with dozens of dependencies, resource constraints, and priorities was previously too complex for reliable AI handling. Extended Thinking manages this well.
- Regulatory compliance checking: Cross-referencing business processes against complex regulatory frameworks (GDPR, FCA regulations, employment law) requires holding multiple rules in relation — Extended Thinking handles this reliably.
- Complex code review and debugging: Finding the root cause of subtle software bugs that span multiple files and systems — previously requiring senior engineering judgment — is now reliably handled.
- Financial modelling and scenario analysis: Building and checking financial models with multiple interdependencies, validating assumptions, and identifying errors in complex spreadsheet logic.
What This Means for AI Agent Deployment
The practical implication for businesses deploying AI agents is significant: the range of tasks that can be handled reliably without human oversight has expanded. Previously, any task requiring multi-step reasoning with meaningful consequences needed a human review checkpoint. With Extended Thinking, some of those checkpoints can be removed, allowing agents to operate more autonomously on genuinely complex tasks.
We now regularly deploy Extended Thinking for agent steps that involve complex decision-making — routing a customer enquiry based on 15 different criteria, determining whether a contract clause creates legal risk given a specific business context, or identifying which of 50 support tickets require immediate escalation and why. These tasks previously required human judgment at each step. Extended Thinking handles them reliably.
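The checkpoint pattern described above can be sketched as orchestration logic: the reasoning step returns a structured verdict, and a human checkpoint is kept only where the stakes or the model's uncertainty warrant it. This is a hypothetical illustration — the field names and confidence threshold are assumptions, not from a real deployment.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Structured output of a reasoning step (illustrative shape)."""
    ticket_id: str
    escalate: bool
    confidence: float   # model-reported confidence in [0, 1]
    rationale: str      # summarised from the thinking trace, for the audit log

def route(verdict: Verdict, confidence_floor: float = 0.85) -> str:
    """Act autonomously on confident verdicts; keep the human checkpoint otherwise."""
    if verdict.escalate and verdict.confidence < confidence_floor:
        return "human_review"  # uncertain escalation: keep the checkpoint
    return "escalate_now" if verdict.escalate else "queue_normally"

print(route(Verdict("T-1", True, 0.95, "outage affecting production")))
print(route(Verdict("T-2", True, 0.60, "ambiguous SLA clause")))
```

The point is that "removing checkpoints" is selective: autonomy is granted per decision, not per workflow.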
Claude vs GPT-4o in 2026: The Honest Assessment
Both models are excellent in 2026. Claude 3.7 Sonnet leads on complex reasoning (Extended Thinking), long-document analysis, code quality, and instruction following with nuanced requirements. GPT-4o leads on integration with Microsoft and OpenAI ecosystem tools, function-calling speed, and real-time voice applications. In practice, we use both — selecting the appropriate model for each specific task in our agent workflows rather than committing to one provider exclusively.
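Per-step model selection can be as simple as a routing table. A minimal sketch — the task categories, the mapping, and the fallback choice are all assumptions for illustration, not a recommendation:

```python
# Map each agent task category to (model ID, extra options).
# Categories and choices here are illustrative only.
TASK_MODEL = {
    "complex_reasoning":  ("claude-3-7-sonnet-20250219", {"thinking": True}),
    "long_document":      ("claude-3-7-sonnet-20250219", {"thinking": False}),
    "fast_function_call": ("gpt-4o", {}),
    "realtime_voice":     ("gpt-4o", {}),
}

def pick_model(task: str) -> tuple[str, dict]:
    # Fall back to the reasoning model for unrecognised task types.
    return TASK_MODEL.get(task, TASK_MODEL["complex_reasoning"])
```

Keeping the mapping in one place makes it cheap to re-benchmark and swap models as providers leapfrog each other.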