Beyond the benchmarks, several technical improvements stand out:
Extended Thinking with Tool Use: Both models can now alternate between reasoning and tool use (such as web search) during extended thinking phases. This isn’t just about making API calls; it’s about building agents that can plan, execute, and adapt on their own.
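The shape of that loop can be sketched in plain Python. This is an illustrative simplification, not the actual API: `think`, the tool registry, and the step limit are all hypothetical stand-ins for the model's internal reasoning and tool interface.

```python
def run_agent(think, tools, task, max_steps=10):
    """Alternate between a reasoning step and tool execution.

    think(task, observations) returns (thought, tool_name, tool_args);
    tool_name is None once the model has enough to answer.
    """
    observations = []
    thought = None
    for _ in range(max_steps):
        thought, tool_name, tool_args = think(task, observations)
        if tool_name is None:                      # reasoning concluded, no more tools needed
            return thought
        result = tools[tool_name](**tool_args)     # execute the chosen tool
        observations.append((tool_name, result))   # feed the result into the next thinking phase
    return thought
```

The key property is the feedback edge: each tool result lands back in the context for the next reasoning phase, which is what lets the agent revise its plan mid-task.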
Memory Capabilities: When given access to local files, Opus 4 demonstrates remarkable ability to create and maintain “memory files.” During testing, the model autonomously created navigation guides and reference documents to improve its performance over time. This is the kind of emergent behavior that moves us closer to truly intelligent systems.
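The pattern behind those memory files is simple to sketch. The class below is a hypothetical illustration of the idea, not Anthropic's implementation: notes persist to a local JSON file, so a later session starts with context an earlier session had to discover.

```python
import json
from pathlib import Path

class MemoryFile:
    """Persist agent notes to a local file and reload them on the next run."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload any notes left behind by a previous session.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        self.notes[key] = value
        self.path.write_text(json.dumps(self.notes, indent=2))

    def recall(self, key, default=None):
        return self.notes.get(key, default)
```

What makes the observed behavior notable is that Opus 4 adopted this kind of pattern unprompted; the mechanism itself is just durable key-value notes.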
Parallel Tool Execution: Both models can now execute multiple tools simultaneously, dramatically improving efficiency for complex workflows. Combined with the new GitHub Actions integration, this enables background tasks that would have required constant human supervision just months ago.
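Why parallel tool calls help is easy to see in a sketch: when calls are independent, total latency drops from the sum of the calls to the slowest one. The helper below is a hypothetical illustration using a standard thread pool, not the models' actual dispatch mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tools_parallel(calls):
    """Run independent tool calls concurrently.

    calls: list of (tool_fn, kwargs) pairs.
    Returns results in the same order the calls were given.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, **kwargs) for fn, kwargs in calls]
        return [f.result() for f in futures]   # blocks until every call finishes
```

For a workflow that issues a web search, a file read, and a code lookup at once, the wall-clock cost is one round trip instead of three.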
Reduced Shortcut Behavior: Anthropic reports a 65% reduction in models taking shortcuts or exploiting loopholes to complete tasks. This might sound minor, but it’s crucial for production deployments where reliability matters more than benchmark scores.