Securing AI-Generated Code: What 23 Studies and 48,185 CVEs Actually Show

// NEXUSVOID RESEARCH & ANALYSIS

Jul 3, 2026

NexusVoid AI Research

We reviewed 23 studies and industry datasets on AI-generated code security. Roughly one in three AI-generated code samples contains a vulnerability - a rate that has not improved across model generations - while CVE publications grew 139% in the four years AI went from writing none of our code to roughly a third of it. Here is what the evidence says actually works.

ANALYSIS · A structured review of 23 published studies and datasets, plus our own computation on CVE publication data. We did not discover new vulnerabilities in this work.

Across every sample-based study we reviewed, roughly one in three AI-generated code samples contains a security vulnerability (range: 24% to 45%, rising past 70% for Java in the largest benchmark). The most important finding is not the rate itself - it's that Veracode's 2025 test of over 100 models found security pass rates stayed flat across model generations while functional capability kept improving. Models are getting better at writing code, but not at writing safe code.

Meanwhile the volume side exploded. In 2021, AI wrote approximately none of the world's production code. By 2025, GitHub reported 46% of code in Copilot-enabled files was AI-written, Google attributed roughly 25% of new code to AI, and Microsoft 20-30%. Over those same four years, published CVEs grew from 20,153 (2021) to 48,185 (2025) - a 139% increase. That is the velocity gap: code production and vulnerability production are both accelerating, and the security quality of generated code is standing still.

What we asked

What does the published evidence - not vendor marketing - say about how often AI-generated code is insecure, whether that is improving, and which practices measurably reduce the risk?

How we did it

We compiled 23 sources: peer-reviewed studies, large industry benchmarks, and primary datasets. For each sample-based study we extracted the reported insecure-output rate. For the velocity comparison we used CVE publication counts per year from Jerry Gamblin's independent CVE data reviews, which track NVD publications, and cross-referenced our own KEV exploitation-window analysis.

What the studies agree on

Generated code is insecure at a stable, high rate. Pearce et al. (IEEE S&P 2022) generated 1,689 programs with Copilot across high-risk CWE scenarios: about 40% were vulnerable. Fu et al. (ACM TOSEM 2025) analyzed 733 AI-generated snippets found in real GitHub projects: 29.5% of Python and about a quarter of JavaScript snippets contained weaknesses. Veracode's 2025 benchmark across 100+ models and 80 tasks: 45% failed security checks, with cross-site scripting failing 86% of the time and log injection 88%. Meta's CyberSecEval benchmark reports insecure suggestion rates near 30%.
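As a rough check on the "one in three" summary, the headline rates quoted in the paragraph above can be tabulated directly. This is a minimal sketch, not the full 23-source extraction: it uses only the five figures named here, enters "about a quarter" as 25 and "near 30%" as 30 (both approximations), and takes an unweighted mean that ignores differences in sample size and method.

```python
from statistics import mean, median

# Headline insecure-output rates (percent) as quoted in the paragraph above.
# The two entries marked "approx." are rounded readings of "about a quarter"
# and "near 30%", not exact figures from the studies.
reported_rates = {
    "Pearce et al. 2022 (Copilot)": 40.0,
    "Fu et al. 2025 (Python)": 29.5,
    "Fu et al. 2025 (JavaScript, approx.)": 25.0,
    "Veracode 2025": 45.0,
    "CyberSecEval (approx.)": 30.0,
}

rates = list(reported_rates.values())

print(f"range: {min(rates):.1f}% to {max(rates):.1f}%")
print(f"median: {median(rates):.1f}%")
print(f"unweighted mean: {mean(rates):.1f}%")
```

On these five figures the median is 30% and the unweighted mean is about 34%, which is consistent with "roughly one in three" but should not be read as a pooled estimate.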

Humans plus AI can be worse than humans alone. In Stanford's controlled user study (Perry et al., CCS 2023), participants with an AI assistant wrote measurably less secure code - and were more confident it was secure. That false-confidence effect is the mechanism that turns a model weakness into a shipped vulnerability.

It compounds. A 2025 systematic analysis found security degrades further over iterative AI generation rounds - the "just ask it to fix it" workflow makes things worse. GitClear's analysis of 211 million changed lines found code duplication up 8x and refactoring collapsing from 25% to under 10% of changes - and cloned code is defect-prone code. Google's DORA 2024 report found AI adoption correlated with faster throughput and worse delivery stability.
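Duplication of the kind described above can be approximated with very little tooling. The sketch below is our own illustration, not GitClear's methodology: it hashes every run of five consecutive non-blank lines (the window size is an arbitrary assumption) and reports the share of windows that occur more than once. The function names `window_hashes` and `clone_rate` are hypothetical.

```python
import hashlib
from collections import Counter


def window_hashes(source: str, window: int = 5) -> list[str]:
    """Hash every run of `window` consecutive non-blank lines.

    Whitespace inside each line is collapsed so that re-indented copies
    still produce the same hash.
    """
    lines = [" ".join(line.split()) for line in source.splitlines() if line.strip()]
    return [
        hashlib.sha256("\n".join(lines[i:i + window]).encode()).hexdigest()
        for i in range(len(lines) - window + 1)
    ]


def clone_rate(files: dict[str, str], window: int = 5) -> float:
    """Return the fraction of windows whose content appears more than once."""
    counts = Counter(
        digest for source in files.values() for digest in window_hashes(source, window)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total
```

Tracked per commit or per week, a number like this gives a trend line rather than an absolute measure. It catches only exact copies after whitespace normalization; renamed variables or reordered statements will slip past it.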

The volume is unprecedented. CVE publications: 20,153 (2021), 25,084 (2022), 28,818 (2023), 40,009 (2024), 48,185 (2025). Exploitation is faster too: our own analysis of CISA KEV data found a median 26-day window from disclosure to confirmed exploitation.
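The growth figures follow directly from the yearly counts quoted above. A short sketch of the arithmetic, using only those five numbers:

```python
# CVE publication counts per year, as quoted in this article.
cve_counts = {2021: 20_153, 2022: 25_084, 2023: 28_818, 2024: 40_009, 2025: 48_185}


def pct_change(old: int, new: int) -> float:
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100


years = sorted(cve_counts)

# Year-over-year change, keyed by the later year.
yoy = {year: pct_change(cve_counts[year - 1], cve_counts[year]) for year in years[1:]}

# Cumulative change from the first year to the last.
total = pct_change(cve_counts[years[0]], cve_counts[years[-1]])

for year, change in yoy.items():
    print(f"{year}: {change:+.1f}% vs {year - 1}")
print(f"{years[0]}-{years[-1]}: {total:+.1f}%")
```

The cumulative change comes out at about +139%, and the largest single-year step is 2023 to 2024 at about +38.8%.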

Best practices, ranked by evidence

Most "secure your AI code" articles list opinions. Here is what the studies above actually support, strongest evidence first:

Scan every AI contribution automatically, in the pipeline. The single most supported practice. Sample-based studies consistently show 24-45% insecure output; static and dynamic analysis in CI catches the recurring classes (injection, XSS, hardcoded secrets) that models fail on most often. Nothing that depends on developer vigilance survives the Stanford false-confidence finding.
Treat AI code review as adversarial, not editorial. Fu et al. found the weaknesses land in real repositories - meaning human review as practiced is not catching them. Review AI contributions with the assumption they contain a vulnerability, because roughly one in three do.
Do not iterate your way to security. The iterative-degradation finding means "regenerate until it looks right" is a security anti-pattern. Fix security findings manually or with a scanner-verified patch.
Manage secrets outside the code path entirely. Credential exposure is one of the most consistent AI-code failure modes across studies (and the Cloud Security Alliance's vibe-coding research note). Vaults and pre-commit secret scanning remove the class.
Watch duplication as a security metric. GitClear's 8x duplication finding means one vulnerable generated block gets cloned across a codebase. Track clone rates; deduplicate aggressively.
Constrain the prompt, verify the output. Security-focused prompting helps at the margin, but no study shows it closes the gap - treat it as a complement to scanning, never a substitute.
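To make the scanning and secrets items concrete, here is a minimal sketch of the kind of check a pre-commit hook or CI step might run. It is illustrative only: the three patterns are assumptions chosen for the example, and the names `SECRET_PATTERNS`, `find_secrets`, and `scan_files` are hypothetical. A dedicated secret scanner will cover far more credential formats, with fewer false positives.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real scanners
# ship much larger, tuned rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)\b(?:password|passwd|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


def scan_files(paths: list[str]) -> int:
    """Scan the given files; return 1 if anything was found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, name in find_secrets(handle.read()):
                print(f"{path}:{lineno}: possible {name}")
                failed = True
    return 1 if failed else 0
```

A hook script could pass the staged file paths to `scan_files` and exit with its return value, so that a finding blocks the commit. The point of the design is that the check runs on every contribution without relying on the author or reviewer to remember it.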

What this doesn't tell us

CVE growth is partly reporting infrastructure. The 2024 jump (+38.8%, from 28,818 to 40,009) includes the Linux kernel becoming a CVE Numbering Authority and broader CNA expansion. Some of the 139% reflects more finding and filing, not purely more vulnerability creation. The two forces are entangled and we do not claim to separate them.
Study heterogeneity is real. Different models, benchmarks, languages, and years. That is why we report a range and a rough central tendency, not false precision.
Sample-based rates are not shipped-code rates. Human review and scanners catch some share before production. The Fu et al. study - real snippets in real repositories - suggests a meaningful share still gets through.

Where this fits

The evidence points in one direction: the volume of code needing security review is growing faster than any team's capacity to review it, and the tools writing the code are not closing the gap themselves. That is the case for continuous, automated verification of what actually ships - which is what we build. The sources are linked inline so you can check our reading.

DATA SOURCES

23 published studies and datasets (linked inline); CVE counts from Jerry Gamblin's yearly CVE data reviews (NVD publications); CISA KEV via NVD API
