
AI Can Write Code, but Can It Secure It?

You can’t scroll through a tech feed these days without bumping into a hot take on AI and coding. Depending on who you ask, it’s either the greatest productivity boost in history or a security dumpster fire waiting to happen. Opinions are cheap, which is why I prefer to stick to the data from actual research. That way, the information is verifiable, and you can trust the analysis because you can check the sources for yourself. I call this style of writing Research Driven Blogging.

A recent deep-dive from the folks at (sidenote: See their in-depth blog post in the bibliography) Semgrep did just that, and what they found paints a complicated, sometimes contradictory, picture 1. It’s a picture every developer and security pro needs to see.

A funny thing happened on the way to production

The central problem is a paradox we’re all starting to notice. AI tools are getting incredibly good at generating code, but they’re also incredibly good at generating vulnerable code. The Semgrep post points to one study that found 62% of C programs churned out by LLMs had at least one vulnerability (sidenote: For those of us in the security field, this is what we call job security.).

What’s really interesting, though, is the human element. The research shows that developers using AI assistants are not only more likely to submit insecure code, but they also report feeling more confident about their flawed work. This brings us to the million-dollar question: if AI is helping write all this insecure code, can we trust it to clean up its own mess?

So, I put the AI scanners to the test…

Well, not me personally, but the researchers did. They took a hard look at how effective AI-powered code review features are at spotting known vulnerabilities, and the results were… underwhelming.

When it came to the big, scary stuff such as SQL injection, cross-site scripting (XSS), and memory corruption, the AI models were often asleep at the wheel. Most of their feedback focused on low-hanging fruit like coding style, typos, or potential runtime exceptions. It’s reassuring to know that, for now, the AI is more interested in correcting my grammar than preventing a full-scale data breach.

The tools also got tripped up by common configuration files like YAML and XML, which is a bit of a problem since that’s where a huge number of enterprise security misconfigurations happen.

Why can’t the robots see the bugs?

The “why” behind these failures comes down to how these models are built. They’re not thinking like a security analyst; they’re pattern-matching machines.

First, they lack deep semantic understanding. Unlike a traditional static analyzer like CodeQL, which operates on a set of firm rules, an LLM makes educated guesses based on statistical correlations. It doesn’t truly understand the consequences of the code it’s reading.

This leads to the second major weakness: poor data flow tracking. This is a big one. The AI struggles to trace a piece of user input as it moves through the application. If it can’t follow that data from the web form all the way to the database query, it has almost no chance of spotting an injection vulnerability. The numbers here are pretty stark: for SQL Injection, Claude Code had a 5% True Positive Rate, and OpenAI Codex came in at a flat 0%. You read that right. Zero.
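
To make the data flow problem concrete, here is a minimal sketch of the kind of trace a reviewer has to make. It’s a hypothetical Flask handler I wrote for illustration, not code from the Semgrep study: the attacker-controlled value enters in `user_lookup` and only becomes dangerous two hops later, inside `find_user`.

```python
import sqlite3

from flask import Flask, request

app = Flask(__name__)


def find_user(db: sqlite3.Connection, username: str):
    # Vulnerable: the tainted value is concatenated straight into the SQL text.
    # A reviewer only catches this by tracing `username` back to request.args.
    return db.execute(
        "SELECT id, email FROM users WHERE name = '" + username + "'"
    ).fetchone()


def find_user_safe(db: sqlite3.Connection, username: str):
    # Safe: a parameterized query keeps the tainted value out of the SQL text.
    return db.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchone()


@app.route("/user")
def user_lookup():
    # The taint source: an attacker-controlled query parameter.
    name = request.args.get("name", "")
    db = sqlite3.connect("app.db")
    return {"user": find_user(db, name)}
```

A rule-based tool with taint tracking follows that source-to-sink path mechanically; an LLM that loses the thread between the handler and the query simply never sees the injection.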

Finally, there’s the inconsistency. Because these models are probabilistic, you can give them the exact same code twice and get two different reports. That lack of reproducibility is a deal-breaker for any serious security tool that needs to provide reliable, actionable feedback.

It’s not all bad news

Now, it’s not a total wash. The AI does show some flashes of talent where its unique abilities give it an edge.

Because LLMs can grasp the general context of the code, they’re surprisingly decent at finding bugs that depend on understanding logic. For example, the study found Claude Code was best at finding Insecure Direct Object Reference (IDOR) bugs, with a 22% success rate. That’s a vulnerability that requires understanding authorization, something traditional scanners can miss. Similarly, OpenAI Codex had a surprisingly high 47% success rate for Path Traversal issues.
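
For a feel of why IDOR is a contextual bug, here is a tiny, hypothetical endpoint (my own illustration, not an example from the study). Nothing in it matches a classic injection pattern; the flaw is purely a missing ownership check, which is exactly the kind of logic a model reading the surrounding code can sometimes pick up on.

```python
from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "sketch-only"  # needed so the session object works in this demo

# Stand-in data store for the sketch.
INVOICES = {
    1: {"owner": "alice", "total": 120},
    2: {"owner": "bob", "total": 80},
}


@app.route("/invoices/<int:invoice_id>")
def get_invoice(invoice_id: int):
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        abort(404)
    # IDOR: any logged-in user can read any invoice simply by changing the ID;
    # the code never asks whether the invoice belongs to the requester.
    return invoice


@app.route("/v2/invoices/<int:invoice_id>")
def get_invoice_fixed(invoice_id: int):
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        abort(404)
    # Fixed: the resource is only returned to its owner.
    if invoice["owner"] != session.get("user"):
        abort(403)
    return invoice
```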

The findings are noisy; with false positive rates between 82% and 86%, they are very noisy. Even so, the AI can still act as a useful “secure guardrail”. It might suggest hardening a piece of code that wasn’t technically vulnerable, which is rarely a bad idea.

Where do we go from here?

The path forward isn’t to throw bigger, monolithic AI models at the problem. The Semgrep post argues for a more sophisticated approach: agentic workflows.

The idea is to use AI not as a magic bullet, but as an orchestrator. The most successful systems are hybrids, integrating LLMs with deterministic tools like symbolic execution, fuzzing, and traditional static analysis. In this model, specialized AI agents work together, using a whole suite of tools to find, validate, and even exploit vulnerabilities. It’s about combining the AI’s contextual strengths with the precision of classic security tools.
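
As a rough sketch of what that hybrid shape could look like (my own simplification, not Semgrep’s actual architecture), imagine a deterministic scanner producing candidate findings and an LLM agent only being asked to triage and contextualize them. The scanner and model calls below are stub placeholders.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str
    path: str
    line: int
    snippet: str


def run_static_analysis(target: str) -> list[Finding]:
    # Placeholder: a real orchestrator would shell out to Semgrep, CodeQL, a
    # fuzzer, etc., and parse their machine-readable output (JSON or SARIF).
    return [
        Finding("sql-injection", "app/views.py", 42, 'db.execute("..." + name)'),
    ]


def llm_triage(finding: Finding) -> dict:
    # Placeholder: a real agent would send the finding plus surrounding code to
    # an LLM and ask it to confirm, reject, or propose a concrete fix.
    return {
        "finding": finding,
        "verdict": "likely true positive",
        "suggestion": "use a parameterized query",
    }


def review(target: str) -> list[dict]:
    # The deterministic tool decides *what* to look at; the LLM adds context.
    return [llm_triage(f) for f in run_static_analysis(target)]


if __name__ == "__main__":
    for report in review("."):
        print(report["finding"].rule_id, "->", report["verdict"])
```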

The bottom line for us in the trenches

So, what does this all mean for those of us doing the actual work? I have a few takeaways.

First, AI is not a silver bullet, and you shouldn’t fire your security team just yet. Today’s models are weak on the high-severity injection flaws that keep us up at night, and they are no substitute for a skilled human auditor.

Second, don’t discard your dedicated tools. That CodeQL or SonarQube license is still one of the best investments you can make. They provide the consistent, explainable diagnostics that today’s LLMs simply can’t.

Finally, the mantra should be augment, don’t replace. Use these AI tools for what they’re good at: catching low-severity bugs, offering style suggestions, and maybe finding the occasional oddball contextual flaw. Let the AI handle the small stuff so the human experts can focus on the threats that matter.


  1. Semgrep. (2025). “Finding Vulnerabilities in Modern Web Apps using Claude Code and OpenAI Codex.” Semgrep Blog. Retrieved from https://semgrep.dev/blog/2025/finding-vulnerabilities-in-modern-web-apps-using-claude-code-and-openai-codex/