Why we no longer evaluate SWE-bench Verified

Accurate benchmarks are essential for comparing AI coding tools; relying on contaminated ones can lead to overestimating a model's capabilities and making poor tool selections.

OpenAI Blog · February 23, 2026 · 1 min read

openai.com

What happened

OpenAI announced that it is no longer evaluating SWE-bench Verified, citing growing contamination and flawed test design that misrepresent progress in AI-assisted coding. According to OpenAI's blog post, the benchmark's test data increasingly leaks into model training sets, and its tests fail to capture genuine improvements in frontier coding capabilities. OpenAI recommends moving to SWE-bench Pro, which it claims addresses these issues.

For developers and solopreneurs building AI workflows, the shift is a reminder to assess benchmarks critically when comparing coding agents. A contaminated evaluation can inflate a model's apparent performance and lead to misguided tool choices, so builders should look for more robust, independently validated metrics to gauge real-world utility.

The decision also highlights a broader challenge in AI evaluation: as models improve, benchmarks must evolve to stay relevant and trustworthy.

Key takeaways

OpenAI is dropping SWE-bench Verified due to test contamination and flawed design.
The benchmark is said to mismeasure progress in AI-assisted coding.
OpenAI recommends SWE-bench Pro as a more reliable alternative.
Contamination occurs when benchmark data inadvertently appears in training sets.
Builders should be cautious when using benchmarks to evaluate coding tools.
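
To make the contamination takeaway concrete, here is a minimal sketch of one common heuristic: measuring how many of a benchmark item's word n-grams also appear verbatim in a training document. This is an illustration only, not the method OpenAI describes; the function names, the sample strings, and the choice of n are all assumptions made for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in training_doc.

    A value near 1.0 suggests the item appears (almost) verbatim in the
    document; a value near 0.0 suggests no verbatim overlap at this n.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Hypothetical data, invented for this example.
item = "fix off by one error in the pagination helper"
leaked_doc = "issue thread: fix off by one error in the pagination helper merged last week"
clean_doc = "add dark mode toggle to the settings page"

leaked_score = overlap_ratio(item, leaked_doc, n=3)
clean_score = overlap_ratio(item, clean_doc, n=3)
```

A check like this only catches verbatim leakage; paraphrased or translated copies of a benchmark task would pass it, which is one reason independently validated evaluations matter.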


This is an original editorial digest by AI Workflow Pro. Full reporting at the source:

Read the original on OpenAI Blog
