The Code Is Writing Itself and Nobody Is Watching
When the Assembly Line Has No Inspector
I've been in enterprise software for 37 years. I've watched the industry automate itself through every cycle — CASE tools in the eighties, fourth-generation languages, offshore development factories, low-code platforms. Each wave promised to write more code faster. Each wave also introduced new failure modes that governance frameworks scrambled to catch up with.
What's happening right now is different. Not incrementally different. Structurally different.
LLMs are generating production-bound code in seconds. Developers at organizations I talk to are shipping features in days that used to take weeks. The productivity numbers are real, and I'm not disputing them. But here's what keeps me up at night: the governance infrastructure sitting downstream of that code generation was designed for humans typing 60 words a minute, reviewing their own work, making mistakes at human scale. (This is a specific instance of [the agentic threshold](/blog/the-agentic-threshold-why-most-ai-projects-stall): the point where capability outpaces the structures built to govern it.) That math doesn't hold anymore. Not even close.
The Vulnerability Problem Is Systematic, Not Occasional
Let me put some numbers on the table, because this is where the conversation usually goes sideways. People hear "AI writes bad code sometimes" and they think: so does Bob in accounting. They don't understand that the problem isn't occasional — it's baked in.
Veracode's 2025 GenAI Code Security Report tested 80 curated coding tasks across more than 100 LLMs. Forty-five percent of the generated code samples introduced security vulnerabilities aligned with the OWASP Top 10. Not edge cases. Not exotic attack surfaces. The OWASP Top 10: the list that's been sitting on every security team's wall for more than twenty years. Java was particularly grim: 72% security failure rate. And here's the part that should stop you cold: security performance remained flat over time even as functional code generation improved. The models got better at writing code that works. They didn't get better at writing code that's safe.
Backslash Security took it further at RSAC in April 2025, testing seven current LLMs with tiered prompts. GPT-4o generated vulnerability-free code in exactly 10% of standard-prompt outputs. GPT-4.1 scored 1.5 out of 10. Claude 3.7 Sonnet — the best performer in the study — hit 6 out of 10 with standard prompts and a perfect score only when the prompts were explicitly OWASP-compliant. Meaning: the best result required someone to already know what they were asking for, and to ask for it precisely. Strip that expertise out of the prompt, and you're back in the dirt.
This isn't a new observation, either. The NYU "Asleep at the Keyboard" study found roughly 40% of GitHub Copilot suggestions contained vulnerabilities back in 2022. Stanford's Perry et al. showed that developers using AI assistants produced *less* secure code than a control group writing without them — only 21% of the AI-assisted group got a Python encryption task right versus 43% without AI help. The pattern has been consistent across four years of research. The industry just hasn't acted on it at scale.
The Governance Frameworks Were Built for a Different World
Here's the friction point nobody wants to acknowledge: the audit processes, code review checklists, security scanning pipelines, compliance attestations — most of that apparatus was built around human development velocity. A developer writes a feature. It goes to code review. Security scanning runs. Somebody signs off. The whole cycle might take a few days, maybe a week for something sensitive.
Now drop an AI coding agent into that picture. It's not writing a feature. It's generating thousands of lines across multiple files, making architectural decisions, referencing patterns from its training data that nobody on your team has audited. The output lands in your review queue looking like code your senior engineer wrote. Except it wasn't. And your reviewer is checking whether the logic is correct, not whether the cryptographic implementation is subtly broken in a way that won't surface for eighteen months.
The existing tools help. Static analysis (SAST) scanners and dependency checkers catch some of it. But they were calibrated for human error patterns, and LLM-generated vulnerabilities often look different: plausible, well-structured, and technically functional. The code passes review because reviewers see code that looks right. The vulnerabilities hide in implementation details that take domain expertise to spot.
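To make "looks right" concrete, here is a small hypothetical illustration in Python. It is not drawn from any of the studies above; the function names and the password-reset scenario are assumptions chosen for the example. Both functions return a well-formed 32-character token and would pass the same functional tests. The first draws from the `random` module, which the Python documentation warns against using for security purposes; the second uses `secrets`.

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def reset_token_plausible(length: int = 32) -> str:
    """Reads fine in review, but `random` is a deterministic PRNG."""
    return "".join(random.choice(ALPHABET) for _ in range(length))


def reset_token_safer(length: int = 32) -> str:
    """Same shape of output, drawn from the OS's cryptographic source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Anyone who can recover the PRNG state can replay the "random" tokens.
    random.seed(1234)
    first = reset_token_plausible()
    random.seed(1234)
    replayed = reset_token_plausible()
    print(first == replayed)          # True: same seed, same token
    print(len(reset_token_safer()))   # 32
```

A reviewer checking logic sees two near-identical one-liners. Only someone who knows why the choice of module matters will flag the first.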
And the volume problem compounds everything. If your team was shipping one feature a week and now it's shipping five, your review capacity didn't quintuple. The math works against you.
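A back-of-the-envelope sketch of that math, in Python. The one-to-five jump comes from the paragraph above; the review capacity of one feature per week and the twelve-week window are assumptions picked to keep the arithmetic simple, not measurements from any team.

```python
def review_backlog(weeks: int, shipped_per_week: int, reviewed_per_week: int) -> int:
    """Features still waiting for review after `weeks`, assuming constant rates."""
    backlog = 0
    for _ in range(weeks):
        # Each week adds what was shipped and removes what reviewers cleared.
        backlog = max(0, backlog + shipped_per_week - reviewed_per_week)
    return backlog


if __name__ == "__main__":
    before = review_backlog(weeks=12, shipped_per_week=1, reviewed_per_week=1)
    after = review_backlog(weeks=12, shipped_per_week=5, reviewed_per_week=1)
    print(before, after)  # 0 48
```

Under these assumptions the queue grows by four features every week. In practice the backlog rarely shows up as a queue; it shows up as reviews that get thinner.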
The Regulatory Pressure Is Coming Whether You're Ready or Not
The governance vacuum is starting to attract legislative attention, and I don't think most enterprise software shops have internalized what that means yet.
Colorado's SB-205, California's AB-2013, Texas's HB-1709: these state-level AI laws and bills are creating a patchwork of compliance obligations with real teeth. The EU AI Act's phased implementation is already reshaping how companies deploying AI in certain risk categories think about documentation and accountability. The International AI Safety Report, chaired by Yoshua Bengio and backed by 30-plus countries, is becoming the reference document that regulators worldwide cite when they write enforcement guidance. This isn't theoretical anymore. The compliance obligations are materializing.
What these frameworks are reaching toward, even when they don't say it explicitly, is a concept of *documented decision accountability*. Who approved this AI output? What constraints governed the generation? What was reviewed, by whom, against what standard? For AI-assisted code specifically — code that ends up in financial systems, healthcare platforms, critical infrastructure — those questions are going to become audit requirements. And right now, most organizations couldn't answer them.
What "Governed AI" Actually Means in Practice
I want to be precise here, because the phrase "AI governance" has become a cloud of vague gestures. I've seen it applied to everything from an acceptable-use policy in a slide deck to actual operational controls. They're not the same thing.
Governed AI — as I think about it, and as we've built toward with BOSGov (which is also the foundation of [compliance as a competitive moat](/blog/compliance-as-a-competitive-moat)) — means that the AI's actions are constrained, logged, and accountable before output reaches production. Not post-hoc review. Not a human skimming the result after the fact. Constraints embedded in how the AI is invoked, what it's permitted to do, what it must verify, and what gets recorded at the point of generation.
For AI-generated code specifically, that means things like: security requirements expressed as operational constraints, not suggestions. Audit trails that capture what the AI was asked to do and what it produced. Escalation paths when outputs cross risk thresholds. Review workflows calibrated to AI-scale velocity, not human-scale velocity. The goal isn't to slow AI down to the speed of 2019. It's to build inspection capability that runs at the speed AI is already running.
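A minimal sketch of what an audit trail plus an escalation path could look like, in Python. This is an illustration under stated assumptions, not a description of BOSGov or any particular product: `GenerationRecord`, `record_generation`, `RISK_THRESHOLD`, and the idea of a 0-to-1 `risk_score` supplied by some upstream scanner are all hypothetical names invented for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

RISK_THRESHOLD = 0.7  # assumption: a policy decision, not a tool default


@dataclass(frozen=True)
class GenerationRecord:
    requested_by: str
    prompt_sha256: str      # what the AI was asked to do
    output_sha256: str      # what it produced
    constraints: tuple      # which rules governed the generation
    risk_score: float       # hypothetical score from an upstream scanner
    needs_human_review: bool
    recorded_at: str


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_generation(requested_by, prompt, output, constraints, risk_score):
    """Build the audit record at the point of generation, before merge."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    return GenerationRecord(
        requested_by=requested_by,
        prompt_sha256=_digest(prompt),
        output_sha256=_digest(output),
        constraints=tuple(constraints),
        risk_score=risk_score,
        needs_human_review=risk_score >= RISK_THRESHOLD,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def to_audit_line(record: GenerationRecord) -> str:
    """Serialize one record as a JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = record_generation(
        "dev@example.com",
        "Write a login handler",
        "def login(): ...",
        ["parameterized queries only"],
        0.82,
    )
    print(record.needs_human_review)  # True: 0.82 is above the threshold
    print(to_audit_line(record))
```

The design point is the ordering: the record and the escalation decision are created when the code is generated, so the answers to "who asked, under what constraints, and was it reviewed" exist before anyone needs them for an audit.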
The research from Backslash Security actually points toward this directly: Claude 3.7 Sonnet's score jumped from 6 out of 10 to a perfect score when security constraints were explicitly embedded in the prompt structure. That's not just a prompting tip. That's evidence that the governance layer — the structured constraints around how the AI is asked to work — is doing measurable security work. The capability is there. The frameworks to operationalize it, consistently, across an enterprise? That's the gap.
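One way to picture "constraints embedded in the prompt structure" is a wrapper that never sends a bare task to the model. The Python sketch below is an assumption-laden illustration: the constraint wording is mine, not the prompts Backslash Security used, and `build_governed_prompt` is a hypothetical helper rather than any vendor's API.

```python
# Illustrative constraints only; a real list would come from your security team.
SECURITY_CONSTRAINTS = (
    "Use parameterized queries for all database access.",
    "Validate and encode all user-supplied input.",
    "Use vetted cryptographic libraries; do not hand-roll cryptography.",
    "Never log secrets or credentials.",
)


def build_governed_prompt(task: str, constraints=SECURITY_CONSTRAINTS) -> str:
    """Wrap a coding task with mandatory, numbered security requirements."""
    task = task.strip()
    if not task:
        raise ValueError("task must not be empty")
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(constraints, 1))
    return (
        f"Task:\n{task}\n\n"
        f"Security requirements (mandatory, not suggestions):\n{numbered}\n"
    )


if __name__ == "__main__":
    print(build_governed_prompt("Write a Flask endpoint that looks up a user by email."))
```

The developer supplies only the task. The security expertise travels with the wrapper, so it does not depend on each person remembering to ask for it.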
The Problem Isn't the AI
I want to be clear about something, because I've watched this argument go sideways too many times. The problem isn't that LLMs are dangerous and should be stopped. I'm not making that argument. I'm making the opposite argument: LLMs are genuinely useful, the productivity gains are real, and treating AI-assisted development as inherently suspect misses the point.
The problem is the gap between what AI can do and what the surrounding systems are prepared to verify. That gap is a governance problem. A structural one. It requires operational frameworks, not AI skepticism.
Forty-five percent of AI-generated code containing vulnerabilities isn't a reason to stop using AI coding tools. It's a specification. It tells you exactly what your governance framework needs to be built to catch — systematically, at scale, without depending on every developer already knowing what to look for.
The code is writing itself. That part isn't changing. The question is whether anyone's watching — and whether the watching is built into the process or just hoped for.
If you're deploying AI without governance infrastructure, that gap has a cost — and it compounds. [Schedule a conversation](/schedule) about what governed AI looks like in practice.