Can AI hack? Benchmarking frontier models against real-world web vulnerabilities

May 19, 2026, 10:30 am - 11:00 am
Stage 2

Description

This technically informed talk measures the built-in offensive security capabilities of frontier LLMs. Rather than relying on vendor claims or synthetic evaluations, we designed a rigorous, reproducible benchmark: 10 models tested against real Hack The Box challenges spanning the OWASP Top 10, with 10 independent runs per challenge and identical conditions across the board. The talk covers the methodology behind this design and walks through the results, including where models succeed and where they fail.
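
As a rough illustration of how repeated runs like these might be summarized, the sketch below aggregates per-run outcomes into a success rate for each model and challenge pair. It is not the speakers' actual tooling: the function name `success_rates`, the record layout, and the model and challenge names are all hypothetical, and the sample outcomes are made-up placeholders, not results from the benchmark. Only the figure of 10 runs per challenge comes from the description.

```python
from collections import defaultdict

# From the talk description: each challenge is attempted in 10 independent runs.
RUNS_PER_CHALLENGE = 10


def success_rates(records):
    """Aggregate (model, challenge, solved) run records into success rates.

    `records` is a hypothetical layout: one tuple per run, where `solved`
    is True if the model captured the flag in that run.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for model, challenge, ok in records:
        key = (model, challenge)
        total[key] += 1
        solved[key] += int(ok)

    # Enforce identical conditions: every pair must have the same run count.
    for key, n in total.items():
        if n != RUNS_PER_CHALLENGE:
            raise ValueError(f"{key} has {n} runs, expected {RUNS_PER_CHALLENGE}")

    return {key: solved[key] / total[key] for key in total}


# Placeholder data only: these outcomes are invented for illustration.
records = (
    [("model-a", "example-sqli", i < 7) for i in range(RUNS_PER_CHALLENGE)]
    + [("model-b", "example-sqli", i < 2) for i in range(RUNS_PER_CHALLENGE)]
)
rates = success_rates(records)
for (model, challenge), rate in sorted(rates.items()):
    print(f"{model} on {challenge}: {rate:.0%} of runs solved")
```

Reporting a rate over repeated runs, rather than a single pass or fail, makes it easier to tell a model that solves a challenge reliably from one that solves it only occasionally.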
AI Cybersecurity Summit

HACK THE BOX

Hack The Box is the leading AI-powered cybersecurity readiness and upskilling platform, used by over 1,500 organizations. With gamified labs, live-fire simulations, and a large community, it helps...