research

Introducing EVMbench

Builders integrating AI into development workflows need reliable ways to benchmark security performance; EVMbench provides a standard for assessing an agent’s ability to handle critical vulnerabilities before deployment.

OpenAI Blog · February 17, 2026 · 1 min read


openai.com

What happened

OpenAI and Paradigm have introduced EVMbench, a benchmark that evaluates AI agents on their ability to detect, patch, and exploit high-severity vulnerabilities in Ethereum smart contracts. It tests end-to-end tasks that matter for real-world security: finding bugs in Solidity code, generating valid patches, and demonstrating proof-of-concept exploits. The release comes as AI-assisted development becomes more common, raising concerns that AI tools may introduce vulnerabilities automatically. According to the OpenAI Blog, EVMbench aims to provide a standardised way to measure and improve AI safety in blockchain contexts.

For builders of AI workflows, the benchmark offers a concrete way to assess an agent's security competence, which is especially relevant for teams deploying AI for code generation, auditing, or DevSecOps. The results can inform tool selection and highlight gaps that still need human oversight.

Key takeaways

EVMbench is a joint benchmark from OpenAI and Paradigm for evaluating AI agents on smart contract vulnerability tasks.
It assesses three capabilities: detecting vulnerabilities, generating secure patches, and creating proof-of-concept exploits.
The benchmark focuses on high-severity issues in Ethereum Virtual Machine (EVM) smart contracts.
EVMbench is intended to guide the development of safer AI coding assistants and improve automated security auditing.
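
As a rough illustration of how a team might tabulate an agent's results across the three capabilities listed above, here is a minimal Python sketch. It is not EVMbench's harness or scoring method: the mode names, the `(mode, passed)` record shape, and the `pass_rates` helper are all assumptions made for this example, and the sample records are made up.

```python
from collections import defaultdict

# Hypothetical labels for the three capabilities named in the takeaways.
MODES = ("detect", "patch", "exploit")


def pass_rates(results):
    """Return the fraction of passed tasks for each mode.

    `results` is an iterable of (mode, passed) pairs. This record shape is
    an assumption for illustration, not EVMbench's output format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for mode, passed in results:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        totals[mode] += 1
        passes[mode] += int(passed)
    # Modes with no attempted tasks are reported as None rather than 0.0.
    return {m: (passes[m] / totals[m] if totals[m] else None) for m in MODES}


# Made-up records, used only to exercise the function.
sample = [
    ("detect", True), ("detect", False),
    ("patch", True),
    ("exploit", False), ("exploit", False),
]
rates = pass_rates(sample)
```

Keeping the three rates separate, rather than averaging them into one score, makes it easier to see which capability still needs human oversight.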