Can AI hack? Benchmarking frontier models against real-world web vulnerabilities
May 19, 2026, 10:30 am - 11:00 am, Stage 2

Description
This is a technically informed talk focused on measuring the built-in offensive security capabilities of frontier LLMs. Rather than relying on vendor claims or synthetic evaluations, we designed a rigorous, reproducible benchmark: 10 models tested against real Hack The Box challenges spanning the OWASP Top 10, with 10 independent runs per challenge and identical conditions across the board. The talk covers the methodology behind this design and walks through the results, including where models succeed and where they fail.

