The dam on AI security automation will break
And it’s on us to break it faster than our adversaries
Joshua Saxe, AI security engineer, Meta, https://substack.com/@joshuasaxe181906
We’re early on the path in AI security, but there is a paved road for rapid progress thanks to coding
The only barrier to achieving coding-like progress in offensive security is us
End-to-end offensive agents for cyber don’t work that well yet
But they’re advancing rapidly for coding
Because there’s a paved road, adversaries will build offensive AI cyber capabilities; if we don’t move fast, they will get there first
If we stop at the assistant level, the other side wins
Our goal must be to reach meaningful autonomy and capabilities quickly, or the adversary wins
How do we work towards meaningful autonomy when we have so far to go?
My theses
Part one
On the data foundations necessary to achieve autonomous reliability
ML/AI engineering differs from traditional security software engineering in that its foundations are mainly about data
“It works on my dataset” won’t get us there for AI cybersecurity capabilities; we need data-driven evals to succeed
We can’t avoid training LLMs to do cyber, and we’ll need high-quality, representative data to do it
The importance of careful data work for reliable cyber capabilities isn’t always obvious to newcomers to the field
Data is expensive, boring, but foundational
(How ImageNet solved computer vision)
| ImageNet Fact | Number / Detail |
| --- | --- |
| Number of classes in labels | About 21,000 in the full ImageNet hierarchy |
| Images | 1.28 million training, 50,000 validation, 100,000 test |
| Labeling workforce | 49,000 Mechanical Turk workers from 167 countries; 160 million candidate images screened |
| Labeling effort | 53,000 worker-hours of labeling |
| Label verification | 2–5 independent workers per image; precision spot-check ≈99.7%; validation set error ≈6% |
| Cost | Roughly $0.5–0.8M total (based on AMT pay rates, not officially published) |
| Funding | Mix of NSF research grants plus Google, Intel, Microsoft, Yahoo industry support |
| Build timeline | ≈2008–2010 (about 21 months from project start to first public release) |
ImageNet data was the operating system for solving computer vision, and enabling many autonomous vision applications
Imagine not knowing which of these models performed better, and not having data to train them
As with ImageNet, we’ve paid for good data for coding; cyber will need this too
Training environments and optimization-based solutions are the new data, and table stakes for autonomous AI progress in cybersecurity
An example cybersecurity gym
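A minimal sketch of what such a gym’s agent-facing interface might look like: an episodic environment where the agent issues commands against a sandboxed target and is rewarded only for a verifiable outcome. The class, task, and reward scheme below are illustrative assumptions, not a description of any production system.

```python
# Hypothetical sketch of a minimal cybersecurity "gym": an episodic environment
# where an agent submits shell commands against a sandboxed target and is
# rewarded only for a verifiable outcome (here, recovering a planted flag).
import subprocess
from dataclasses import dataclass, field


@dataclass
class CTFEnv:
    """One sandboxed target per episode; reward comes from execution, not labels."""

    flag: str = "FLAG{example}"
    max_steps: int = 20
    _steps: int = field(default=0, init=False)

    def reset(self) -> str:
        """Start a fresh episode and return the initial observation (a task brief)."""
        self._steps = 0
        return "Target: local sandbox. Goal: locate the flag file and print its contents."

    def step(self, action: str) -> tuple[str, float, bool]:
        """Run the agent's command in the sandbox; return (observation, reward, done)."""
        self._steps += 1
        proc = subprocess.run(
            action, shell=True, capture_output=True, text=True, timeout=10
        )
        observation = (proc.stdout + proc.stderr)[-2000:]  # truncate long tool output
        captured = self.flag in proc.stdout
        reward = 1.0 if captured else 0.0  # sparse, outcome-based reward
        done = captured or self._steps >= self.max_steps
        return observation, reward, done
```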
Environments are becoming ubiquitous in AI coding, will need to be in AI cyber, and this community needs to build them
| Agent | Public evidence of RL in code execution environments |
| --- | --- |
| Google DeepMind AlphaCode 2 | “We use reinforcement learning from code execution feedback to improve model solutions… running candidate programs in a sandbox environment to collect rewards.” (AlphaCode 2 blog / paper) |
| OpenAI Codex / GPT-4 for code | “Codex was further improved with reinforcement learning from human and code execution feedback, allowing the model to iteratively test, debug, and refine programs.” (OpenAI system card / blog) |
| Replit Ghostwriter / Replit Agent | “We leverage deep reinforcement learning, using signals such as unit test results, compiler messages, and runtime errors to train models to write more reliable code.” (Replit engineering blog) |
| Cognition Labs – Kevin-32B (Devin’s codegen core) | “If a kernel fails to compile, we pass the compiler error trace as feedback; if it runs correctly, we measure runtime performance. This feedback is used for reinforcement learning.” (Cognition technical post) |
| Poolside | “Our models are trained with Reinforcement Learning from Code Execution Feedback (RL-CE), where every program is executed and its results directly inform the reward signal.” (Poolside blog) |
| Anthropic Claude (for code tasks) | “To fine-tune the model, Anthropic used reinforcement learning with human feedback plus automated signals, including execution-based evaluations on code problems.” (Anthropic technical overview) |
| Cursor (Tab / Autocomplete) | “We use data from user interactions to improve Tab using online reinforcement learning, continuously optimizing completions from accept/reject and run outcomes.” (Cursor blog post) |
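The common thread in the table above is a reward computed by actually executing the model’s output. A hedged sketch of that pattern, assuming pytest is available in the sandbox; the file names and reward weights are illustrative, not tuned.

```python
# Hedged sketch of reward-from-execution for RL fine-tuning: write the model's
# candidate program and its tests to a scratch directory, run them, and map the
# outcome to a scalar reward.
import pathlib
import subprocess
import sys
import tempfile


def execution_reward(candidate_code: str, test_code: str) -> float:
    """0.0 if the candidate doesn't even import, 0.2 if it imports but fails
    its tests, 1.0 if all tests pass. Weights are illustrative."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp)
        (workdir / "candidate.py").write_text(candidate_code)
        (workdir / "test_candidate.py").write_text(test_code)

        imports_ok = subprocess.run(
            [sys.executable, "-c", "import candidate"],
            cwd=tmp, capture_output=True, text=True,
        ).returncode == 0
        if not imports_ok:
            return 0.0

        tests_pass = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", "test_candidate.py"],
            cwd=tmp, capture_output=True, text=True,
        ).returncode == 0
    return 1.0 if tests_pass else 0.2
```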
The difference between having training ‘gyms’ and not having training gyms is 20% vs. 85% AIME accuracy; we need to do this for cyber and no one is really doing it yet
Reinforcement learning will be the way out of AI security engineering being about futzing with prompts
The good news about reinforcement learning environments is that they’re way more data efficient than expert data, which is important for cyber, where relevant data are often private
Summing up the ingredients for success
Data-driven evaluations that allow us to know whether changing a prompt or a tool makes our agent better or worse
Fine-tuning data and reinforcement learning execution environments to improve the underlying LLM
Compute for training and to support a rapid feedback loop between changing an agent and understanding how well the agent is doing
How we should crawl, then walk, then run towards data foundations in AI security
Crawl: No training data, no evals, optimize agent scaffolding against a handful of examples, ship based on vibes
The eventual, inevitable problem with “crawl”
Walk: No fine-tuning, but good, statistically significant evals so engineers can hill climb and compare results
(DARPA AIxCC an example of this)
Being able to compute and compare results quickly is a huge boost in “walk” state
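One way to make “compare results quickly” concrete: score two agent configurations on the same task set and use a paired bootstrap to check whether the improvement is more than noise. This is a sketch of one standard approach; the task scores and resample count below are illustrative assumptions.

```python
# Sketch of a "walk"-stage comparison: score two agent configurations on the
# same task set, then use a paired bootstrap to estimate how often the
# candidate beats the baseline.
import random


def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Return the fraction of resamples in which config B outscores config A."""
    assert len(scores_a) == len(scores_b), "evals must run on the same tasks"
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks with replacement
        delta = sum(scores_b[i] - scores_a[i] for i in idx) / n
        wins += delta > 0
    return wins / iters


# Per-task success (1.0 / 0.0) for a baseline agent vs. a new prompt or tool.
baseline = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
candidate = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
print(f"P(candidate > baseline) ≈ {paired_bootstrap(baseline, candidate):.2f}")
```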
The inevitable problem with “walk” (generalist LLMs aren’t trained to explicitly target Meta’s cybersecurity problems)
Run: Continuously refreshed annotated production data, online A/B testing, just like mature AI organizations do for mature capabilities
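A hedged sketch of the “run”-stage plumbing: deterministic A/B assignment of production work items (e.g., SOC alerts) to a control or treatment agent, so online metrics can be compared on live traffic. The experiment name and 50/50 split below are assumptions for illustration.

```python
# Illustrative sketch of online A/B testing: hash-based, stable assignment of
# each work item to an experiment arm.
import hashlib


def ab_bucket(item_id: str, experiment: str = "triage-agent-v2",
              treatment_pct: int = 50) -> str:
    """Stable assignment: the same alert always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{item_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"


# Route an incoming alert, then log its bucket alongside outcome metrics
# (time-to-triage, analyst overrides) for the eventual comparison.
print(ab_bucket("alert-8675309"))
```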
Getting to “run” is urgent: we need significant autonomy, and labor productivity gains, as fast as possible, or the adversary wins
Why we need to change our organizations and risk philosophies with AI progress
As in other industries, we’ll need to experiment with organization-AI interaction to realize the gains of cyber autonomy; the problem isn’t just the technology, it’s also our organizational structures, skill compositions, and the need to discover AI applications through experimentation
Some SOC roles today
Some SOC roles tomorrow
Surprises we’ve already encountered when introducing AI into cybersecurity labor processes
Even when we’ve found the right processes, stakeholder risk alignment can block deployment and requires attention
Wrapping up
We’re early on the path in AI security, but there is a paved road for rapid progress thanks to coding and math
The only barrier to achieving coding-like progress in offensive security is us
Let’s not forget the size of the opportunity, and the risk
AI agents will amplify human labor in cyber conflict, and if we don’t lean in, our adversaries will win
How do we work towards meaningful autonomy when we have so far to go?
My theses
Discussion, questions; (my substack: https://substack.com/@joshuasaxe181906)