Chapter 1: Autonomous Offensive Security: Hype or Reality?


Background

NGL, FOMO kicks in whenever we see Tweets (ah yes, we prefer this over “X post”) about people earning bounties, getting CVEs, releasing open-source projects, and even shipping production-grade SaaS products built on “autonomous” offensive security systems.

Then came XBOW. We still remember being impressed by XBOW’s post showcasing their agent topping the HackerOne leaderboard in the US.

That was the moment when we realized this might be possible (or, we might be jobless soon).

Since we’re getting jobless soon, do give us some mental support, if not financial, by following our X account 🤡

With 88% FOMO and 12% curiosity, we decided to try building it ourselves.

What to Build?

Most autonomous systems require access to source code, or to the target’s infrastructure, like servers. Very few can perform a truly black-box, web-based penetration test.

We think this might be due to:

  1. Nature of Black Box WAPT

    Scope
    For code scans, the lines of code are the scope. For an infrastructure penetration test, open ports are your scope.

    Black box WAPT? You’ll need to test everything in your cheatsheet. There’s no way to map an exhaustive list of directories, endpoints, and parameters without fuzzing everything in your wordlist. Testing business logic requires a comprehensive understanding of business requirements and web application interactions. It is a context-driven assessment that demands a crazy amount of discipline. To us, “black box” is an understatement; it should be called a Deep Black Dark box assessment.

    Dynamic
    Infrastructure penetration tests typically deal with services and protocols that follow well-defined standards for data transmission. These tests focus on static misconfigurations or vulnerabilities that can be reliably detected based on known standards. For example, FTP servers have a standardized way of configuring anonymous access, and misconfigured servers can often be detected using static commands or scanning techniques. Similarly, vulnerabilities tied to specific versions of services (e.g., a particular version of Apache exposed to a known CVE) present exploitable entry points that are predictable and static.

    Web applications are more dynamic, tailored to specific use cases and teams, with variations in HTTP status codes, error messages, server response times, frontend and backend workflows, etc. This is why Burp Intruder has a built-in grep matcher, and why ffuf allows custom filtering by HTTP response code or line count: to handle that variability. Hence the name “Dynamic” Application Security Testing.

  2. Determinism

    A solid system needs to be capable of systematically testing all entry points and creatively chaining payloads during WAPT. Unfortunately, LLMs behave much like humans: how often do we really check every box on our checklist? (sorry boss)

    Given how dynamic and context-driven WAPT is, ensuring that LLMs test systematically becomes even harder.
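The response variability described above is exactly what fuzzers expose filters for. A minimal sketch with ffuf (the flags are standard ffuf options; the target URL and the 12-line baseline are hypothetical):

```shell
# Many apps return HTTP 200 with a soft-404 page, so matching on
# status codes alone is useless. Instead, filter on the response body:
# here we drop every response whose body is 12 lines long, the
# (hypothetical) "not found" baseline observed for this target.
ffuf -w wordlist.txt -u https://target.example/FUZZ -mc 200,301,302 -fl 12
```

The point is that the right filter differs per application and often per endpoint, which is what makes this context-driven rather than a fixed scan.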

Our Baby Step

We (naively) want to build an agentic offensive security system that is… MAD:

  1. Model agnostic

    Harness is the product, model is the commodity.

    With fierce competition and scaling laws, model differences are minimal; at the very least, each model has its own strengths. We were inspired by Perplexity: their recent release of Perplexity Computer, which intelligently routes requests to different models and lets them work together, just makes complete sense.

    On top of this, most offensive assessments deal with sensitive user data. The option to switch from cloud to local models is a need, not a good-to-have feature.

  2. Autonomous

    Give it a URL, business context, and an assessment objective, and it should carry on from that point onwards.

  3. Deterministic

    Good agents find you vulnerabilities, reliable agents find them consistently.

    Missing a bug or vulnerability in a bounty program is acceptable, but in a critical production system, it’s a serious issue.

    While defining detailed steps for LLMs with an agent harness might seem like the solution, it’s not.

    Systematic testing brings structure, but it limits the creativity of the agent. Unfortunately, these two factors are inversely related. Striking the right balance between both is what sets products apart.
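To make the model-agnostic point concrete, here is a minimal sketch of the kind of routing layer we have in mind. Everything here is hypothetical: the provider names, the task kinds, and the routing rules are illustrative, not a real API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str        # e.g. "recon" or "exploit-chaining" (illustrative names)
    sensitive: bool  # does the task touch sensitive user data?


def route(task: Task) -> str:
    """Pick a model backend for a task. The rules below are a sketch."""
    # Sensitive data never leaves the box: always use the local model.
    if task.sensitive:
        return "local"
    # Otherwise route by task type: heavy reasoning goes to a stronger
    # (hypothetical) cloud model, everything else to a cheaper, faster one.
    if task.kind == "exploit-chaining":
        return "cloud-reasoning"
    return "cloud-fast"
```

The agent harness only ever talks to `route()`, so swapping models means editing this table rather than touching agent logic, and the cloud-to-local switch for sensitive engagements is a one-line policy.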

Next Chapter

We’ll share our architecture, what went wrong, lessons learned, and of course welcome feedback from the community!