The Code Is Writing Itself and Nobody Is Watching

LLMs generate production code in seconds. The governance frameworks that are supposed to catch the mistakes were designed for humans who type 60 words a minute. That math doesn't work anymore.

I've been in enterprise software for 37 years. I've watched the industry automate itself through every cycle — CASE tools in the eighties, fourth-generation languages, offshore development factories, low-code platforms. Each wave promised to write more code faster. Each wave also introduced new failure modes that governance frameworks scrambled to catch up with.

What's happening right now is different. Not incrementally different. Structurally different.

LLMs are generating production-bound code in seconds. Developers at organizations I talk to are shipping features in days that used to take weeks. The productivity numbers are real, and I'm not disputing that. But here's what keeps me up at night: the governance infrastructure sitting downstream of that code generation was designed for humans typing 60 words a minute, reviewing their own work, and making human-scale mistakes at human speed. That math doesn't hold anymore. Not even close.

This is a specific instance of the agentic threshold — the point where capability outpaces the structures built to govern it.

The vulnerability data

The problem is systematic, not occasional

People hear "AI writes bad code sometimes" and they think: so does Bob in accounting. They don't understand that the problem isn't occasional — it's baked in.

45% of AI-generated code introduced OWASP Top 10 vulnerabilities

Veracode's 2025 GenAI Code Security Report tested 80 curated coding tasks across 100+ LLMs. Not edge cases. Not exotic attack surfaces. The OWASP Top 10: the list that's been sitting on every security team's wall for more than twenty years.

72%: Java's security failure rate

10%: GPT-4o standard-prompt outputs that were vulnerability-free

1.5/10: GPT-4.1's score on the Backslash Security RSAC test

6/10: Claude 3.7 Sonnet, the best performer, with standard prompts

10/10: the same model with explicit OWASP-compliant prompts

21% vs 43%: Stanford study in which AI-assisted devs got encryption right less often than no-AI controls

The pattern has been consistent for four years

NYU "Asleep at the Keyboard" (2022) found ~40% of Copilot suggestions contained vulnerabilities. Stanford's Perry et al. showed AI-assisted developers produced less secure code than the control group. The pattern has been consistent across four years of research. The industry just hasn't acted on it at scale.

And here's the part that should stop you cold: security performance remained flat over time even as functional code generation improved. The models got better at writing code that works. They didn't get better at writing code that's safe.

The frameworks gap

Built for a different world

Here's the friction point nobody wants to acknowledge: the audit processes, code review checklists, security scanning pipelines, compliance attestations — most of that apparatus was built around human development velocity.

A developer writes a feature. It goes to code review. Security scanning runs. Somebody signs off. The whole cycle might take a few days, maybe a week for something sensitive.

Now drop an AI coding agent into that picture. It's not writing a feature. It's generating thousands of lines across multiple files, making architectural decisions, referencing patterns from its training data that nobody on your team has audited. The output lands in your review queue looking like code your senior engineer wrote. Except it wasn't.

The code reviews pass because reviewers see code that looks right. The vulnerabilities hide in implementation details that require domain expertise to spot.

And the volume problem compounds everything. If your team was shipping one feature a week and now it's shipping five, your review capacity didn't quintuple. The math works against you.
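
A back-of-the-envelope sketch of that arithmetic, in Python. The one-versus-five figures come from the example above; the function name and the assumption that unreviewed work simply queues up week over week are illustrative, not measured data.

```python
def review_backlog(arrivals_per_week: int, capacity_per_week: int, weeks: int) -> list[int]:
    """Return the number of unreviewed features at the end of each week."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # Whatever the reviewers cannot get to carries over into next week.
        backlog = max(0, backlog + arrivals_per_week - capacity_per_week)
        history.append(backlog)
    return history


# One feature a week against one feature a week of review capacity.
print(review_backlog(arrivals_per_week=1, capacity_per_week=1, weeks=4))  # [0, 0, 0, 0]

# Five features a week against the same review capacity.
print(review_backlog(arrivals_per_week=5, capacity_per_week=1, weeks=4))  # [4, 8, 12, 16]
```

Under these assumptions the queue never drains: it grows by four unreviewed features every week until something gives, usually the depth of the review.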

Regulation incoming

The regulatory pressure is coming whether you're ready or not

The governance vacuum is starting to attract legislative attention, and I don't think most enterprise software shops have internalized what that means yet.

USA: State-level AI laws

Colorado SB 24-205, California AB-2013, and Texas HB-1709 are creating a patchwork of compliance obligations with real teeth.

EU AI Act

Phased implementation already reshaping how companies deploying AI in certain risk categories think about documentation and accountability.

Global: International AI Safety Report

Chaired by Yoshua Bengio and backed by 30+ countries, it is the reference document regulators cite when writing enforcement guidance.

For AI-assisted code in financial systems, healthcare platforms, and critical infrastructure, the basic questions (what was the AI asked to do, what did it produce, and who reviewed it) are going to become audit requirements. And right now, most organizations couldn't answer them.

What governed AI means

In practice, not on a slide

I want to be precise here, because the phrase "AI governance" has become a cloud of vague gestures. I've seen it applied to everything from an acceptable-use policy in a slide deck to actual operational controls. They're not the same thing.

Governed AI — as I think about it, and as we've built toward with BOSGov (which is also the foundation of compliance as a competitive moat) — means that the AI's actions are constrained, logged, and accountable before output reaches production. Not post-hoc review. Not a human skimming the result after the fact. Constraints embedded in how the AI is invoked.

Security requirements as operational constraints

Not suggestions. The Backslash data shows this works — Claude 3.7's score jumped from 6/10 to a perfect 10/10 when security constraints were explicitly embedded in the prompt structure.
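
One way to make constraints operational rather than optional is to route every request through a wrapper that cannot produce a prompt without them. The sketch below is a minimal illustration in Python. The constraint wording, `GovernedPrompt`, and `govern` are hypothetical names invented here; they are not the prompts used in the Backslash test and not a description of any product's API.

```python
from dataclasses import dataclass

# Illustrative constraints only; not the prompts used in the Backslash test.
SECURITY_CONSTRAINTS = (
    "Follow OWASP secure coding practices.",
    "Use parameterized queries for all database access.",
    "Validate and encode all untrusted input.",
)


@dataclass(frozen=True)
class GovernedPrompt:
    task: str
    constraints: tuple[str, ...]

    def render(self) -> str:
        """Put the security requirements ahead of the task, every time."""
        rules = "\n".join(f"- {c}" for c in self.constraints)
        return f"Security requirements (mandatory):\n{rules}\n\nTask:\n{self.task}"


def govern(task: str, constraints: tuple[str, ...] = SECURITY_CONSTRAINTS) -> GovernedPrompt:
    # The constraint is structural: there is no code path that omits it.
    if not constraints:
        raise ValueError("refusing to build a prompt without security constraints")
    return GovernedPrompt(task=task.strip(), constraints=constraints)


print(govern("Write a login handler.").render())
```

The design point is that a developer in a hurry cannot skip the security preamble, because the only way to get a prompt is through `govern`.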

Audit trails at generation time

Capture what the AI was asked to do and what it produced. Not after the fact.
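
A minimal sketch of what capture at generation time could look like, using only the Python standard library. The record fields, `generate_with_audit`, and the stand-in `fake_model` are assumptions for illustration; a real deployment would call an actual model and write to durable, append-only storage rather than an in-memory list.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(model: str, prompt: str, output: str) -> dict:
    """Build a log entry describing one generation event."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
    }


def generate_with_audit(generate, model: str, prompt: str, log: list) -> str:
    """Call the model and log the exchange before returning the output."""
    output = generate(prompt)
    log.append(json.dumps(audit_record(model, prompt, output)))
    return output


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (hypothetical).
    return "def handler(): ..."


audit_log: list = []
code = generate_with_audit(fake_model, "example-model", "Write a login handler.", audit_log)
print(len(audit_log))  # 1
```

Because the log entry is written inside the same call that returns the code, there is no window in which output exists without a record of how it was requested.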

Escalation paths for risk thresholds

When outputs cross a defined line, route them to human review automatically.
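
As a rough illustration, an escalation rule can be as small as a score and a threshold. The patterns, weights, and threshold below are hypothetical placeholders chosen for the example; a real policy would lean on proper static analysis rather than a handful of regular expressions.

```python
import re

# Hypothetical risk signals and weights, for illustration only.
RISK_PATTERNS = {
    r"\beval\(": 5,
    r"\bexec\(": 5,
    r"shell\s*=\s*True": 4,
    r"(?i)\b(password|secret|api_key)\s*=\s*[\"']": 4,
    r"\bmd5\b": 2,
}


def risk_score(code: str) -> int:
    """Sum the weights of every risk pattern that appears in the code."""
    return sum(weight for pattern, weight in RISK_PATTERNS.items() if re.search(pattern, code))


def route(code: str, threshold: int = 4) -> str:
    """Send output at or above the threshold to a human; the rest to automated checks."""
    return "human_review" if risk_score(code) >= threshold else "automated_checks"


print(route("result = eval(user_input)"))  # human_review
print(route("print('hello')"))             # automated_checks
```

The threshold is the "defined line": it is written down, versioned, and applied to every output, instead of living in an individual reviewer's judgment.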

Review workflows at AI velocity

Not at the speed of 2019. Inspection that runs at the speed AI is already running.
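
One reading of "inspection at AI velocity" is that every generated change gets machine-speed checks before a human ever sees it. The sketch below runs a set of checks over a batch of changes in parallel. The two checks and the names `CHECKS`, `inspect`, and `inspect_all` are deliberately simplistic, hypothetical stand-ins for real scanners.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical checks: each returns True when the code passes.
CHECKS = {
    "no_eval": lambda code: "eval(" not in code,
    "no_shell_true": lambda code: "shell=True" not in code,
}


def inspect(change: tuple[str, str]) -> tuple[str, list[str]]:
    """Run every check on one (filename, code) pair and list the failures."""
    name, code = change
    failed = [check for check, passes in CHECKS.items() if not passes(code)]
    return name, failed


def inspect_all(changes: dict[str, str], workers: int = 8) -> dict[str, list[str]]:
    """Inspect a whole batch of generated changes concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(inspect, changes.items()))


batch = {
    "a.py": "value = eval(user_input)",
    "b.py": "print('ok')",
}
print(inspect_all(batch))  # {'a.py': ['no_eval'], 'b.py': []}
```

The human reviewer then starts from the list of failures rather than from a blank read of thousands of generated lines.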

The position

The problem isn't the AI

I want to be clear about something, because I've watched this argument go sideways too many times.

The problem isn't that LLMs are dangerous and should be stopped. I'm not making that argument. I'm making the opposite argument: LLMs are genuinely useful, the productivity gains are real, and treating AI-assisted development as inherently suspect misses the point.

The problem is the gap between what AI can do and what the surrounding systems are prepared to verify. That gap is a governance problem. A structural one.

Forty-five percent of AI-generated code containing vulnerabilities isn't a reason to stop using AI coding tools. It's a specification. It tells you exactly what your governance framework needs to be built to catch — systematically, at scale, without depending on every developer already knowing what to look for.

The code is writing itself. That part isn't changing. The question is whether anyone's watching — and whether the watching is built into the process or just hoped for.

