Developers Have Been Shipping AI-Generated Vulnerabilities Since 2021



This content originally appeared on Level Up Coding - Medium and was authored by Ahmed Ibrahim

Why Is Vibe Coding the Problem?

The Pattern I Keep Recognizing

I’ve worked in cloud engineering long enough to recognize how these things go. A new technology arrives and everyone gets excited. Proof of concepts appear everywhere, usually without anyone asking too many questions. Someone calls it the future of everything. Then months or years later, someone finally asks whether this is actually safe, and by that point the logs are already filling up and half the company depends on it.

I’ve watched this movie enough times to recognize the opening credits.

The newest version is what people are calling vibe coding. If you’ve managed to avoid the term, it describes non-technical people using AI tools to generate full applications and deploying them without deep knowledge of the underlying code. Marketing teams building dashboards with Bolt. Product managers generating internal tools with Lovable. People producing full-stack applications with authentication and databases without writing a single line manually.

A colleague of mine, Fred, brought this up in a meeting recently. We were discussing how the organization should handle vibe coding, and he asked a question that stuck with me: “What’s the opposite of vibe coding? Boring coding?” He laughed, but then made a serious point. His argument was that it shouldn’t matter where code comes from. Vibe coded, Copilot-assisted, or typed out manually by someone who thinks tabs are superior to spaces. All of it should go through the same process. No special lane for anything.

He’s right in principle. The problem is that the “process” he’s describing often consists of someone glancing at a diff and typing “LGTM” before moving on to the next ticket. And look, sometimes that’s fine. Not every three-line config change needs a security review. But when the code is generated by a tool that might have confidently produced something subtly broken, that quick approval becomes a different kind of gamble.

The industry reaction has followed the predictable path. Security teams are writing policies. Leaders are blocking tools. LinkedIn is full of warnings about the dangers of letting non-developers write code. It took roughly three months for this conversation to shift from curiosity to problem, and some would go further and call it a crisis.

While everyone panicked about beginners, I kept thinking about something else entirely. The tools the industry is already comfortable with, such as GitHub Copilot, have been generating code since 2021. Copilot completes functions, scaffolds files, and frequently provides entire implementations. Developers adopted it immediately. Security teams approved it. It came from Microsoft, so it felt safe.

This led to an obvious question. If the concern is that non-developers generate code they don’t fully understand, what about developers generating code they also don’t fully understand?

That felt worth digging into. So I started reading research papers. What I found was a mix of reassuring and concerning information, sometimes appearing in the same paragraph. Honestly, the pattern suggested that the industry might be having the wrong conversation entirely.

What the Research Actually Says

The most referenced study on this topic came from NYU in 2022. Researchers evaluated GitHub Copilot across 89 security-related scenarios aligned with MITRE CWE Top 25, generating 1,689 programs in Python, C, and Verilog. About 40 percent of those programs contained security vulnerabilities.

Forty percent. From a tool that millions of developers use daily.

Now, this study came from a respected institution and underwent peer review at IEEE S&P, which makes it credible. It’s also worth noting that this was Copilot in 2021, not today. Things have improved. A targeted replication published in 2023 and presented at SANER in 2024 focused only on Python and found that the rate of insecure solutions had dropped from 36.54 percent to 27.25 percent for comparable scenarios.

That’s genuine progress. It also means more than a quarter of completions still contained vulnerabilities. Progress, yes. Problem solved, no.

Then there’s industry research. Veracode published a GenAI Code Security Report in 2025 where they evaluated 80 coding tasks across more than 100 large language models. They found that 45 percent of generated code contained vulnerabilities when assessed against OWASP Top 10 categories. Java was the worst at roughly 72 percent. Python, JavaScript, and C# ranged from 38 to 45 percent.

The consistent pattern across these studies isn’t the exact percentage. The numbers vary by task, language, and model version. What doesn’t vary is that vulnerability rates are significant and measurable, regardless of which specific tool or year you examine.

The degree varies. The direction doesn’t.

The Understanding Problem Nobody Wants to Discuss

A common argument in vibe coding debates is that non-developers don’t understand the code they generate. That’s obviously true.

But it misses something important sitting right next to it.

Understanding OAuth conceptually is one thing. Understanding a 200-line OAuth implementation that appears instantly from an AI model is something else entirely. Even experienced developers can miss subtle issues when the code looks clean, compiles successfully, and passes a quick manual test. The code appears correct. That’s precisely what makes it dangerous.
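A full OAuth flow is too long to show here, but a simpler stand-in illustrates the same failure mode. The sketch below is my own illustration, not taken from any of the studies: a query function of the kind AI assistants have been observed to produce. It reads cleanly, runs without errors, and passes a quick manual test, yet it is injectable.

```python
import sqlite3

def find_user_unsafe(conn, username):
    # Vulnerable: the username is interpolated directly into the SQL string,
    # so input like "x' OR '1'='1" changes the meaning of the query.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_safe(conn, username):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Both versions agree on well-behaved input, which is exactly why a quick
# manual test passes and the reviewer moves on.
print(find_user_unsafe(conn, "alice"))  # [(1, 'alice')]
print(find_user_safe(conn, "alice"))    # [(1, 'alice')]

# Malicious input: the unsafe version returns every row in the table.
print(find_user_unsafe(conn, "x' OR '1'='1"))  # [(1, 'alice'), (2, 'bob')]
print(find_user_safe(conn, "x' OR '1'='1"))    # []
```

The diff between the two functions is a handful of characters, which is the point: nothing about the broken version looks broken at a glance.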

The 2025 Stack Overflow Developer Survey provides some uncomfortable context here. They surveyed over 49,000 developers across 177 countries, which makes it one of the largest annual snapshots of how developers actually work.

The headline number is that 84 percent of respondents are using or planning to use AI tools in their development process, up from 76 percent in 2024. Among professional developers, 51 percent use these tools daily. This isn’t experimental adoption anymore. This is how a significant portion of code gets written now.

Here’s where it gets interesting.

At the same time, 46 percent of developers say they don’t trust the accuracy of AI-generated output. That’s up sharply from 31 percent the previous year. Only 33 percent say they trust it, and a mere 3 percent report high trust.

The most common frustration, reported by 66 percent of respondents, is that AI solutions are “almost right, but not quite.” Another 45 percent say debugging AI-generated code takes longer than writing it themselves.

Read that combination again. Developers are widely adopting tools that they don’t entirely trust, that often produce code requiring more debugging effort, and that can still generate vulnerable logic even when the syntax is perfect.

I don’t know about you, but that combination made me pause.

The conversation about non-developers generating code they don’t understand is real, but it’s incomplete. The broader issue is that code is being produced faster than it can be carefully reviewed and understood. And that applies to everyone, regardless of job title.

Why Copilot Feels Safe and Lovable Feels Dangerous

There’s an interesting psychological aspect to all this that nobody seems to discuss directly. Copilot feels safe because it works incrementally. It completes a line here, suggests a function there. The process feels like traditional autocomplete. You stay in your editor. You feel in control. You’re still writing the code, even if the model produced most of it.

The small code suggestion feels like a helpful colleague finishing your sentence.

Tools like Lovable or Bolt feel riskier because the generation is large and immediate. Entire applications appear at once. The user might not be a developer. The velocity feels higher, so the perceived risk feels higher.

It feels like someone else wrote your entire novel and put your name on it. Same words. Different anxiety.

In practice, both paths can lead to thousands of lines of generated code landing in a repo without thorough review. In my opinion, the difference is emotional, not technical. And I think that emotional difference is driving a lot of the panic.

When Stack Overflow asked developers whether vibe coding was part of their professional work, 72 percent said no and another 5 percent answered emphatically no. Yet these same developers are using AI coding tools every day. They just don’t call it vibe coding when they do it.

The behavior is already here. The terminology is what arrived late.

When the Risk Actually Changes

There’s one area where the risk genuinely shifts, and it’s worth being accurate about it. Some newer tools and agent frameworks allow the model not just to generate code, but to execute it. Once execution enters the picture, the threat model is fundamentally different.

A 2024 study by researchers at the University of Illinois built an agent framework on GPT-4 and tested it against sandboxed web vulnerabilities. Under their experimental setup, the agent successfully exploited 73.3 percent of the vulnerabilities when given five attempts per vulnerability. GPT-3.5 achieved only 6.7 percent. Every open-source model they tested failed completely.

A related preprint from the same research group tested GPT-4 on known one-day vulnerabilities when provided the CVE description. The agent exploited 87 percent of them, while GPT-3.5 and all open-source models couldn’t exploit any.

Both papers are preprints, which means their findings should be viewed as early evidence rather than final conclusions. The peer review process exists for good reasons. Still, the mechanism they demonstrate is clear.

If a model can execute code, test hypotheses, and iterate on its own output, the boundary between suggestion and action becomes critically important. A tool that suggests vulnerable code is one kind of problem. A tool that can act on vulnerable systems autonomously, that’s a different kind of problem entirely.

This is where traditional security engineering principles apply directly. And honestly, this is where the conversation should probably be spending more time.

What Actually Makes Sense

For tools that only generate code, existing development processes still work. Code review, automated security scanning, tests, deployment pipelines. All of it remains essential. The research suggests that AI-generated code may need more thorough review, not less, but the fundamental model remains the same. Fred's point holds: the same process for everything, with no special lane.

For tools that can execute code, additional safeguards are essential. The principles aren’t new. They’re the same controls applied to CI/CD pipelines, infrastructure automation, and any privileged tooling.

Sandboxed execution environments that can’t reach production systems, data, or credentials. Least-privilege access where agents get only what they need for their specific function. Immutable audit logging for every action, tool call, and access event. Human approval gates before any high-impact operation reaches production.
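A minimal sketch of those controls, under stated assumptions: the function names, the in-memory audit log, and the `approved_by` argument are all hypothetical stand-ins for real infrastructure (an append-only log store, a network-isolated container, an approval workflow), not a production design.

```python
import subprocess
import sys
import time

AUDIT_LOG = []  # stand-in for an append-only, immutable audit store

def audit(event, detail):
    # Record every action before it happens, not after.
    AUDIT_LOG.append({"ts": time.time(), "event": event, "detail": detail})

def run_agent_code(source, approved_by=None, timeout=5):
    """Execute agent-generated code under the controls described above.

    `approved_by` stands in for a human approval gate: without a named
    approver, nothing runs.
    """
    audit("execution_requested", {"code": source})
    if approved_by is None:
        audit("execution_denied", {"reason": "no human approval"})
        raise PermissionError("agent code requires human approval")
    audit("execution_approved", {"approver": approved_by})
    # Least privilege: empty environment, so no inherited credentials or
    # secrets. A real deployment would add network isolation on top.
    result = subprocess.run(
        [sys.executable, "-c", source],
        env={}, capture_output=True, text=True, timeout=timeout,
    )
    audit("execution_finished", {"returncode": result.returncode})
    return result.stdout

print(run_agent_code("print(2 + 2)", approved_by="fred"))  # prints 4
```

The interesting property is that the deny path is the default: forgetting to wire up approval fails closed rather than open.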

None of this is revolutionary. It’s the same security engineering that applies to any untrusted execution environment. The difference is that AI tools can produce and sometimes execute code much faster than traditional systems, which means the controls need to be in place before someone connects an agent to something important, not after.

The Actual Problem Beneath the Panic

The public panic about vibe coding focuses on non-developers generating code. The research points to something broader.

Code is being generated faster than teams can reliably understand, review, and secure it. This applies to developers and non-developers alike. The peer-reviewed studies show vulnerability rates from 27 to 40 percent depending on task and language. The industry analysis shows rates around 45 percent across a wide set of models. Developer surveys show high adoption alongside low trust and significant debugging overhead. Research into AI agents raises legitimate questions about execution rights and boundary enforcement.

A reasonable conclusion is that some AI-generated vulnerabilities are probably sitting in production systems right now.

This isn’t a claim that every system is compromised. It’s an acknowledgment that when you combine meaningful vulnerability rates, widespread adoption, and the typical time organizations take to detect issues, undiscovered problems become statistically likely.

What Organizations Can Actually Do

Several practical steps reduce the risks without requiring organizations to abandon tools that provide genuine productivity benefits.

First, improve visibility into where AI-generated code is coming from and which tools are being used. Without this baseline, security teams can’t assess risk accurately.

Second, increase review depth for code known to be AI-generated, particularly in security-relevant sections. Human review catches many vulnerabilities that automated tools miss, especially the subtle logic issues that AI models tend to produce. And yes, this means “LGTM” might need to become a longer conversation sometimes.

Third, apply execution boundaries to any tools capable of running code. Separate networks, separate credentials, separate data. If an agent can execute code, it shouldn’t be able to reach anything important without explicit human approval.

Fourth, update security tooling. Many SAST tools were built to detect patterns common in human-written code. AI-generated code follows different patterns, and detection capabilities need to evolve alongside the threat.
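To make the fourth step concrete, here is a toy illustration of pattern-based detection, not a real SAST tool: a few lines using Python’s `ast` module that flag any `.execute(...)` call whose first argument is an f-string, one shape that injectable AI-generated SQL commonly takes.

```python
import ast

def flag_fstring_queries(source):
    """Return line numbers of .execute() calls whose first argument
    is an f-string (ast.JoinedStr), a common injectable-SQL shape."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args
                and isinstance(node.args[0], ast.JoinedStr)):
            findings.append(node.lineno)
    return findings

snippet = '''
def lookup(conn, name):
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'")
'''
print(flag_fstring_queries(snippet))  # [3]
```

Real tools like Semgrep or CodeQL generalize this idea across many patterns and languages; the point of the sketch is only that detection rules are written against specific code shapes, so they need updating as the dominant shapes shift from human-written to AI-generated idioms.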

These steps cost time and resources. They also reduce the likelihood of shipping vulnerabilities that prove expensive to discover later.

The Conclusion

The panic around vibe coding highlights a real concern, but it’s focused too narrowly. The broader issue is that AI-powered development is introducing code faster than it can be reviewed and understood. The data supports this even though the exact rates vary by model, language, and context.

The research is clear about the presence of vulnerabilities. It’s also clear that proper security controls reduce these risks meaningfully. The decision ahead for organizations isn’t whether to use AI coding tools. Adoption is already high and climbing. The decision is whether to build the right guardrails now or learn these lessons through incidents.

Both paths are possible. One is proactive. One is reactive.

I’ve been watching the tech industry long enough to have predictions about which path most organizations will take. But I hope some choose the other one.

It would be a refreshing change from the usual pattern.

Research Sources

Peer-Reviewed Studies:

Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. (2022). “Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions.” 2022 IEEE Symposium on Security and Privacy (SP).

Majdinasab, V., Bishop, M., Rasheed, S., Moradi Dakhel, A., Tahir, A., and Khomh, F. (2024). “Assessing the security of GitHub Copilot generated code: A targeted replication study.” 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).

Preprints:

Fang, R., Bindu, R., Gupta, A., and Kang, D. (2024). “LLM agents can autonomously hack websites.” arXiv preprint arXiv:2402.06664.

Fang, R., Bindu, R., Gupta, A., and Kang, D. (2024). “LLM agents can autonomously exploit one-day vulnerabilities.” arXiv preprint arXiv:2404.08144.

Industry Reports:

Veracode. (2025). “2025 GenAI Code Security Report.”

Stack Overflow. (2025). “2025 Developer Survey.”

