MMLU-Pro holds steady at 85.0, AIME 2025 slightly improves to 89.3, while GPQA-Diamond dips from 80.7 to 79.9. Coding and agent benchmarks tell a similar story, with Codeforces ratings rising from ...
US startup Anthropic on Monday announced the launch of its new generative artificial intelligence model, Claude Sonnet 4.5, ...
Anthropic evaluated the model’s programming capabilities using a benchmark called SWE-bench Verified. Sonnet 4.5 set a new industry record with a 82% score. The next two highest scores were also ...
One of the hottest markets in the artificial intelligence industry is selling chatbots that write computer code. Some call it ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results