What happened after 2,000 people tried to hack my AI assistant
Builders of AI assistants must recognize that while model-level defenses are improving, they cannot replace careful architecture and fail-safes for production systems handling sensitive data.
What happened
In a real-world test of AI assistant security, developer Fernando Irarrázaval challenged hackers to extract secrets from his OpenClaw instance via email prompts. Despite over 6,000 attempts costing $500 in tokens and triggering a Google account suspension, no one succeeded in leaking the secret. The underlying model, Opus 4.6, used a strict anti-prompt-injection system prompt forbidding actions like revealing credentials or executing code. According to Simon Willison, this outcome reflects the increasing effectiveness of training frontier models to resist injection attacks, a trend noted in recent system card releases. However, Willison cautions that 6,000 failures don't guarantee immunity; a determined attacker with a novel approach could still break through. The Hacker News discussion highlighted both healthy skepticism and constructive feedback from the challenge creator. For AI builders, this underscores that while model-level defenses are improving, they are not yet a substitute for robust architectural safeguards in production systems.
Key takeaways
- Fernando Irarrázaval ran a challenge in which 2,000 participants emailed his OpenClaw assistant, trying to make it leak a secret.
- After more than 6,000 attempts and $500 in token costs, the secret was never leaked; the prompt-injection defenses held.
- The Opus 4.6 model's system prompt forbade actions like revealing secrets, modifying files, or running code from emails.
- Simon Willison notes that training frontier models against injection attacks is proving effective, but not reliable enough to depend on where a successful attack could cause irreversible harm.
- The Hacker News thread features skeptical discussion and replies from the challenge creator, emphasizing no guarantee of absolute security.
Why it matters
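More than 6,000 failed attempts are evidence of progress, not proof of immunity. Builders of AI assistants should treat model-level injection resistance as one layer of defense and still design production systems so that a successful injection cannot reach sensitive data or trigger irreversible actions.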
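As a rough illustration of what a fail-safe outside the model might look like, here is a minimal Python sketch. It is not taken from the challenge or from OpenClaw: the action names, the `ActionRequest` type, and the `is_allowed` and `redact` helpers are all hypothetical. The idea is that sensitive actions are denied in code whenever the instruction came from an email, and known secret strings are stripped from outgoing replies, whatever the model decided.

```python
from dataclasses import dataclass

# Hypothetical action names for illustration only; these are not
# OpenClaw's real tool names.
BLOCKED_FOR_UNTRUSTED = {"reveal_secret", "modify_file", "run_code"}


@dataclass(frozen=True)
class ActionRequest:
    name: str    # action the model wants to take
    source: str  # where the instruction came from: "owner" or "email"


def is_allowed(request: ActionRequest) -> bool:
    """Deny sensitive actions when the instruction did not come from the owner."""
    if request.source != "owner" and request.name in BLOCKED_FOR_UNTRUSTED:
        return False
    return True


def redact(reply: str, secrets: list[str]) -> str:
    """Last-line fail-safe: replace known secret strings in outgoing text."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            reply = reply.replace(secret, "[REDACTED]")
    return reply


if __name__ == "__main__":
    print(is_allowed(ActionRequest("run_code", "email")))
    print(redact("The key is hunter2.", ["hunter2"]))
```

Exact-string redaction will not catch a secret that has been encoded or paraphrased, so a check like this complements the model's own resistance and does not replace it.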
This is an original editorial digest by AI Workflow Pro. Full reporting at the source:
Read the original on Simon Willison