In early 2025, a document surfaced online under a pseudonym. It listed more than a hundred previously unknown Microsoft Access and 365 vulnerabilities — complete with technical proofs, stack traces, exploit chains, and analysis. The source? A tool calling itself Unpatched AI. No team. No company. Just the findings.
At first, the security community didn’t know what to make of it. The vulnerabilities were real. The writeups were thorough. Automation was clearly involved. But the level of output — the depth of coverage, the accuracy of the chains — suggested something beyond a typical scanner or script. The public clues point to an autonomous, LLM-steered vulnerability-research pipeline that blends modern fuzzing, symbolic execution, and generative AI for narration. While the team behind it remains anonymous, the potential is clear: these tools can be game-changing for preventing cyberattacks.
And that raised an uncomfortable question: What happens when the most effective pentester isn’t a human?
That question isn’t speculative anymore. Autonomous systems are starting to compete with, and in some cases outperform, human researchers in offensive security challenges. They’re moving up public bug bounty leaderboards, uncovering bugs at scale, and demonstrating strategic exploitation paths without human guidance.
It marks the start of something new in offensive security: an era in which the core mechanics of traditional penetration testing (scoping, discovery, and exploitation) are increasingly executed by machines. The tiger teams of the 1970s broke into buildings. The red teams of the 2000s broke into networks. The next wave? Engineers are building systems that test at scale, without waiting on calendars, scopes, or consultant bandwidth.
The bad guys are catching up too, but that doesn’t mean defenders are outmatched. Far from it: defenders maintain deep visibility, control, and context within their own environments. But the model is shifting. Faster software demands faster feedback loops.
The implications of tools like Unpatched AI are still unfolding, but it’s clear that the assumptions underpinning traditional pentesting are starting to bend. For decades, manual assessments have been the gold standard: targeted, thorough, and effective. But as systems grow more complex and interconnected, those one-off efforts struggle to keep pace. A new generation of software-driven approaches is emerging.
To understand what’s changing, it helps to first understand how pentesting works under the hood. At its core, pentesting is a structured simulation of real-world attacks, designed to uncover exploitable security flaws before adversaries do. Engagements typically kick off with scoping and rules of engagement, such as defining in-scope IP ranges, web apps, APIs, and cloud assets. From there, testers move into reconnaissance: passive scanning of public records (like DNS, WHOIS, and SSL certs), followed by active fingerprinting of exposed services using tools like Nmap, Amass, or Masscan. They enumerate open ports, identify versions, and flag potentially vulnerable components.
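To ground that recon step, here is a minimal fingerprinting sketch in Python using only the standard library. The hostname is a placeholder, and active probing like this should only ever be aimed at assets you’re authorized to test; real engagements lean on Nmap and its peers, but the mechanics look like this:

```python
# Minimal fingerprinting sketch: resolve a host, then pull its TLS
# certificate to identify the service. Stdlib only; HOST is a placeholder.
import socket
import ssl

HOST = "example.com"  # placeholder -- probe only assets you're authorized to test

# DNS resolution: which addresses answer for this name?
addrs = {info[4][0] for info in socket.getaddrinfo(HOST, 443)}
print(f"{HOST} resolves to {sorted(addrs)}")

# Active fingerprinting: grab the TLS certificate on port 443
ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=5) as raw:
    with ctx.wrap_socket(raw, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        print("Issuer: ", dict(rdn[0] for rdn in cert["issuer"]))
        print("Expires:", cert["notAfter"])
```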
Next comes vulnerability discovery and exploitation. Testers use scanners like Nessus or Burp Suite to surface common CVEs, but the heavy lifting happens manually — chaining misconfigurations, insecure auth flows, or poorly implemented business logic into viable attack paths. A tester might bypass an S3 bucket ACL to pivot into internal cloud services, or exploit an IDOR (insecure direct object reference) to leak sensitive customer data. In more advanced cases, they escalate privileges across tenants, abuse overly permissive IAM roles, or simulate malware dropper execution via remote code execution (RCE) in outdated dependencies.
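To make the IDOR case concrete, here is a hedged sketch of the cross-account check involved. The endpoint, token, and IDs are all invented; the pattern is simply “authenticate as user A, request user B’s objects, and flag anything that comes back”:

```python
# Hypothetical IDOR probe: authenticated as user A, request invoice IDs
# known to belong to user B. Any 200 response is a cross-tenant data leak.
import requests  # third-party: pip install requests

BASE = "https://api.example.com/v1/invoices/{id}"  # invented endpoint
USER_A_TOKEN = "<token for attacker-controlled test account>"
USER_B_IDS = [1041, 1042, 1043]                    # objects owned by another account

session = requests.Session()
session.headers["Authorization"] = f"Bearer {USER_A_TOKEN}"

for obj_id in USER_B_IDS:
    resp = session.get(BASE.format(id=obj_id), timeout=10)
    if resp.status_code == 200:
        print(f"[!] Possible IDOR: invoice {obj_id} readable from user A's session")
```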
The final output is a report detailing what was found, how it was exploited, and how to fix it. It includes POC payloads, screenshots of compromised sessions, and reproduction steps for developers.
Penetration testing is a structured process designed to simulate real-world cyberattacks. It typically follows five key stages: scoping and rules of engagement, reconnaissance, vulnerability discovery, exploitation, and reporting.
Today’s threats move at machine speed. AI-augmented attackers can chain zero-days, dynamically exploit business logic flaws in real time, and launch sophisticated campaigns with unprecedented efficiency. The attack surface itself has exploded. Cloud sprawl, agile DevOps pipelines, and the proliferation of IoT devices have created environments that are constantly changing and expanding, far outpacing the capacity of periodic, human-driven penetration tests to provide comprehensive coverage.
Common practice with traditional pentesting is to test a few times a year and hope nothing changes too fast. Yet software never sits still. New APIs ship weekly. Cloud permissions shift hourly. Developers move fast, and attackers move faster. The result? Pentests land as polished snapshots of systems that have already evolved. The 2025 Verizon Data Breach Investigations Report drives this home: over two-thirds of breaches involved vulnerabilities that had gone unpatched for more than 90 days, despite many organizations having recently completed security assessments.
That doesn’t mean pentests are obsolete, but it does mean we’re long overdue for something more continuous, more contextual, and more in tune with the pace of software today.
Traditional pentesting is like checking your locks and windows once a year while a swarm of AI-powered burglars are constantly probing your house.
– Max Moroz
A recent generation of offensive security platforms promised to automate penetration testing, but failed to deliver lasting value. These tools attempted to cover a broad surface area, offering everything from phishing simulations to infrastructure scanning, yet lacked the depth or precision needed to make their results meaningful. Users routinely described them as “doing everything but nothing well,” relying on static detection logic that could be replicated by in-house scripts. Rather than behaving as intelligent or agentic systems, they felt like legacy scanners wrapped in new branding.
Beyond product depth, these platforms struggled to adapt to cloud-native environments. For example, some are still “on-prem Windows–focused,” limiting their relevance for companies operating in Kubernetes, serverless, or SaaS-dominant stacks. The lack of robust support for continuous CI/CD integration and modern application layers like mobile or web frontends also makes these tools feel increasingly outdated. To compound the issue, many teams reported being overwhelmed by alerts and CVEs that lacked exploitability, eroding trust across security and engineering functions. One common refrain: “We saw 50,000 critical vulnerabilities. Zero were real.”
Next-gen pentesting, at its core, is a shift from labor-constrained engagements to scalable, AI-native systems built to match the speed and surface area of modern software development.
The common denominator across this new class is architectural: they combine large language models with traditional exploit tooling, real-time telemetry, and proprietary data. Some operate as fully autonomous systems, orchestrating fleets of agents that plan attacks, execute them safely, and generate verified findings. Others take a copilot-style approach — assisting human testers with recon, payload generation, and report synthesis. And many fall somewhere in between, mostly with humans in the loop, but offering hybrid workflows that blend autonomy with human oversight.
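None of these vendors publish their internals, so what follows is only a schematic of the shared control flow, not any product’s actual architecture. Every name below is hypothetical; the stubs stand in for an LLM planner and a sandboxed tool runner:

```python
# Schematic plan -> execute -> verify loop (all names hypothetical). Real
# platforms wrap each step with context management, safety policies, and
# telemetry; this sketch only shows the control flow they share.
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    evidence: str
    verified: bool

def plan_next_action(context: dict) -> dict:
    """Stand-in for an LLM planner call: pick the next test step."""
    step = len(context["history"])
    return {"tool": "http_probe", "target": context["target"], "step": step}

def execute_in_sandbox(action: dict) -> Finding | None:
    """Stand-in for a sandboxed tool runner that replays the exploit."""
    if action["step"] == 3:  # pretend step 3 reproduces a real issue
        return Finding("IDOR on /invoices", "cross-tenant 200 OK", verified=True)
    return None

def agent_loop(target: str, max_steps: int = 10) -> list[Finding]:
    context = {"target": target, "history": []}
    findings: list[Finding] = []
    for _ in range(max_steps):
        action = plan_next_action(context)
        result = execute_in_sandbox(action)
        context["history"].append((action, result))
        if result and result.verified:  # only reproducible findings survive
            findings.append(result)
    return findings

print(agent_loop("staging.example.com"))
```

Copilot-style products put a human between plan and execute; fully autonomous ones close the loop end to end.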
But what unites them is sophistication: this is not prompt engineering atop ChatGPT. These are deeply integrated systems with security-specific data layers, context management, custom exploit corpora, and often a proprietary data moat (such as benchmark challenges or production-grade bug bounty exploits). They are unbundling the constraints of the old model – expert labor, fixed engagements, static outputs – and rebuilding it as a software-first, continuous, and AI-augmented system. The tried-and-true tools of the trade are unlikely to change (though some startups are rebuilding those too); if anything, most observers expect they’ll become even more useful.
Here is a market map of some of the next-generation pentesting companies (as of the time of writing):
The impact is profound. We see this manifesting across a few primary dimensions:
Legacy tools (e.g., vulnerability scanners) are great at catching static issues such as outdated libraries, exposed services, and weak credentials. But today’s bugs hide in workflows like business logic, role transitions, and edge-case API paths. Agentic systems can now infer and act on intent rather than operating solely on raw inputs. Trained on real-world exploits, codebases, and system behavior, they can identify business logic flaws that were once the domain of human intuition. Think: discount abuse in e-commerce, privilege escalation via feature misuse, or subtle injection paths buried three calls deep.
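A toy version of the discount-abuse case shows what acting on intent means in practice. The endpoint and payload below are invented; the shape of the test, replaying a “single-use” action and checking whether the backend actually enforces the constraint, is the point:

```python
# Hypothetical business-logic check: does a "single-use" coupon survive a
# second application? Endpoint, token, and coupon code are all invented.
import requests  # third-party: pip install requests

CHECKOUT = "https://shop.example.com/api/checkout"  # invented endpoint
session = requests.Session()
session.headers["Authorization"] = "Bearer <test-account-token>"

order = {"cart_id": "c-123", "coupon": "WELCOME10"}
first = session.post(CHECKOUT, json=order, timeout=10)
second = session.post(CHECKOUT, json=order, timeout=10)

# If the discount applies both times, the "single-use" rule lives only in
# the UI, not the backend -- a classic business-logic flaw.
if first.ok and second.ok and second.json().get("discount_applied"):
    print("[!] Coupon reuse accepted: business-logic flaw")
```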
As pentesting becomes more accessible and efficient, the lines between testing, pentesting, and red teaming could blur. Imagine a world where pentesting is integrated into the CI/CD pipeline, automatically assessing the security of every deployment. This continuous security approach could significantly reduce the risk of vulnerabilities making it into production.
Classic pentests operate within tight constraints — one target, one time window, one test team. Next-gen systems are always probing. They scale across all environments, test multiple assets simultaneously, and run exploratory paths (like fuzzing or state-space traversal) that would be cost-prohibitive with humans. The result: broader attack-surface coverage and better preparation for adversaries who never ask for permission.
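The fuzzing half of that claim is easy to demonstrate. Below is a toy mutation fuzzer against a stand-in parser with a planted bug; production systems use coverage-guided engines like AFL++ or libFuzzer, but the economics are identical: iterations cost machines almost nothing.

```python
# Toy mutation fuzzer: flip random bytes in a seed input and feed the result
# to a parser with one planted crash condition. Coverage-guided fuzzers are
# far smarter, but the cost model is the same.
import random

def parse(data: bytes) -> None:
    """Stand-in for the code under test, with a planted bug."""
    if data.startswith(b"MAGIC") and len(data) > 5 and data[5] == 0xFF:
        raise ValueError("planted parser bug reached")

seed = b"MAGIC\x00"
for i in range(100_000):
    buf = bytearray(seed)
    for _ in range(random.randint(1, 3)):          # mutate a few positions
        buf[random.randrange(len(buf))] = random.randrange(256)
    try:
        parse(bytes(buf))
    except ValueError as exc:
        print(f"crash after {i} iterations: input={bytes(buf)!r} ({exc})")
        break
```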
Most security teams are overwhelmed by false positives from scanners and static analyzers. Next-gen tools flip that. By executing exploits in a safe sandbox and validating every finding, they generate alerts that are actionable by design. No triage marathons. No guessing games. Only real vulnerabilities, verified and packaged.
AI-driven pentesting holds a lot of promise, but it’s not a silver bullet. While the tools are evolving quickly, there are still meaningful gaps in scope, reliability, and operational trust that need to be addressed before they can replace traditional methods wholesale. Among them:
These systems excel today at uncovering low-hanging vulnerabilities like XSS, SSRF, and simple misconfigurations, but their track record on complex bugs — like chained authorization bypasses, broken access control, multi-step injections, or environment-specific race conditions — is still limited. For example, could an AI-driven tool have uncovered a misconfigured S3 bucket silently leaking millions of scanned checks, as Jason Haddix did during a mobile banking app test? While the bucket itself was public, finding it required intercepting and decoding mobile app traffic, identifying where uploads were stored, recognizing the significance of the content, and understanding the broader privacy and compliance implications — a level of contextual reasoning and multi-step analysis that today’s systems are only beginning to approximate.
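Notably, the mechanical half of that find is already automatable. A check like the one below (the bucket name is invented) can tell you whether a bucket allows anonymous listing via S3’s public REST interface; what it can’t tell you is that the objects are scanned checks, or why that matters.

```python
# Anonymous listing check against a hypothetical bucket. A public bucket
# answers the unauthenticated ListObjectsV2 request with XML; a locked-down
# one returns 403. Finding the bucket is the easy part.
import urllib.error
import urllib.request

bucket = "example-checks-uploads"  # invented name for illustration
url = f"https://{bucket}.s3.amazonaws.com/?list-type=2"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read(4096).decode("utf-8", "replace")
        if "<ListBucketResult" in body:
            print(f"[!] {bucket} allows anonymous listing")
except urllib.error.HTTPError as err:
    print(f"{bucket}: not anonymously listable (HTTP {err.code})")
```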
Some vendors are starting to tackle this with domain-specific training data or access to large exploit corpora (e.g., past bug bounty reports and structured CTFs). The tools will get better as their training data and telemetry improve, but, for now, they are strongest when aimed at known vulnerability patterns and reproducible workflows.
In regulated industries or high-trust environments, auditability and legal clarity matter. Who signs off on the results? Who is liable if something is missed? Today, most compliance frameworks (e.g., SOC 2, PCI, and ISO 27001) expect a “human-led” penetration test by a certified assessor. Autonomous systems, no matter how rigorous, don’t yet fit cleanly into that model. Early adopters are managing this pragmatically by using next-gen tools behind the scenes, while still running one manual pen test a year to satisfy external requirements. Over time, as standards evolve, it’s likely we’ll see formal recognition of AI-driven testing. In the near term, though, hybrid approaches will remain the norm.
Most current systems focus heavily on web applications — often the easiest vector for testing agentic autonomy — but leave large swaths of the attack surface untouched. Cloud configurations, internal network infrastructure, mobile apps, IoT devices, and thick-client environments are either lightly addressed or entirely out of scope. The ambition is full-stack offensive coverage, but we’re not there yet.
Even when a tool surfaces a valid issue, how it’s interpreted, and whether it’s taken seriously, is a different challenge altogether. One harrowing example comes from a pentest conducted by Evan Hosinski, who discovered a vulnerability that allowed brute-force access to patient medical records through a third-party PDF service. The client dismissed the risk as unrealistic. Months later, the exact scenario played out in the wild, resulting in a public breach.
There are many more examples like it, including Target’s 2013 breach and Equifax’s 2017 data breach. The tech was right. The outcome was preventable. But without the right organizational mindset, even the best tools, human or machine, can be ignored. AI can surface risk, but someone still has to act on it.
Executives and boards can no longer afford to be passive. Upleveling defenses means proactively investing in modern tools and capabilities, not just once a year but as a continuous commitment. The cost of underinvestment is no longer theoretical; it’s reputational, operational, and existential.
It’s early days. As far as we know, no next-gen pentesting system is fully deployed across a production environment at scale. But we’re close. The pace of development, quality of early pilots, and enthusiasm from security teams suggest we’re at a meaningful inflection point. What began as a fringe experiment is now shaping up to become a core layer of the modern security stack.
Next-gen pentesting tools are evolving into dynamic, continuous systems that go beyond traditional assessments. Some teams are already expanding into adjacent layers like DAST, SAST, runtime monitoring, and threat modeling, creating unified systems that fill in critical coverage gaps. The goal isn’t just to test what’s broken, but to build systems that actively adapt and integrate across the software delivery lifecycle. Tools like Unpatched AI and RamiGPT, which fuse traditional vuln scanning with AI capabilities, are an early glimpse of what this can look like: real-time detection, intelligent prioritization, and human-ready output.
We haven’t made an investment in this space yet — but we’d love to. We believe defenders hold an advantage that attackers never will: full visibility into their own systems. The challenge is making sense of that complexity, continuously and at scale. Next-gen pentesting systems bring us closer to that future. They aren’t just software — they’re how the good guys stay ahead.