Anthropic CEO Dario Amodei’s prediction, in March 2025, that eventually all coding would be AI generated already rings true. Amodei said that human software developers would need to train LLMs with design features and conditions but that, eventually, every single task would be automated.

Indeed, in a post on X on 27 January, creator and head of Anthropic’s Claude Code, Boris Cherny, confirmed that “pretty much 100%” of code at the company is now AI-generated. Timelines for AI deployment have often leaned towards the overly optimistic, but the shift from human to AI generated code has not.

Access deeper industry intelligence

Experience unmatched clarity with a single platform that combines unique data, AI, and human expertise.

Find out more

AI coding agents are producing increasing volumes of code. Some analysts predict that between 40-60% of code today is already AI generated. And while the prevailing narrative focuses on efficiency gains, AI code generation does go awry—how often and with what consequences is not yet clear.

Code verification and quality control becomes increasingly important with these increasing volumes of AI generated code, says Roman Zednik, field CTO at Tricentis. As systems become more complex, tools can check syntax and basic security patterns, but it’s much harder to verify that the code behaves correctly once integrated into a complex enterprise ecosystem with multiple backend systems, interfaces, and supply chains, he says.

A small AI generated change can have an unpredictable impact on the wider system, especially in high stakes environments like banks, insurers and telcos. Developers must ask the question: “What is the impact if I change this small piece of the code on the whole ecosystem?” says Zednik.

“In addition, AI often generates extra code that wasn’t in the spec and may be functionally unnecessary and semantically nonsense in the business context,” notes Zednik. The question then becomes should you test this extra code, or delete it?

Zednik says that while AI speeds up the generation of source code, it doesn’t mean that code quality automatically increases at a commensurate speed. Developers may use AI for simple tasks but more complex tasks, especially when it comes to systems integrations still requires human coding.

The great AI code testing bottleneck

AI speeds up code generation, but that code still needs review and testing, so the total testing workload increases, which can create a quality assurance bottleneck.

In 2021, OpenAI developed HumanEval, a widely used benchmark for assessing LLM code generation capabilities. It was designed to evaluate whether an AI can write functional Python code based on natural language instructions.

At the time of its publication, Stanford University’s 2025 Artificial Intelligence Index Report, found that Anthropic’s advanced coding and development tool Claude 3.5 Sonnet (HPT) was the leader in HumanEval performance, achieving a score of 100%.

But Stanford’s study, along with any other benchmarking performance exercises do not necessarily reflect real-world coding accuracy. They only test benchmark performance on a specific test parameters, which may not account for rapid developments in AI modelling.

Independent benchmarking organisation Professional Reasoning Bench (PRBench) examines complex, real-world questions for finance and law written by experienced industry professionals. The best available AI models score only 39% on hard finance tasks and 37% in hard legal ones, demonstrating that industry expectations often overestimate AI capabilities in concrete professional applications, according to PRBench.

AI coding agents can generate their own testing systems but can these testing environments be trusted? They still require human oversight. And the challenge is not necessarily to verify that code works but whether enterprise technologists can control and secure it at scale in high risk environments.

For many organisations, especially those still doing manual testing, testing capacity does not, and perhaps cannot, scale with the new and massive volumes of code generated by AI.

“A surprising number of really large enterprises are still relying on manual testing,” says Zednik who advises companies to shift to automated processes as quickly as possible. “Otherwise, they will not be able to deliver new functionality on time, and will lose against the competition over the over the time. We already see this happening.”

Will AI code create workforce savings?

If companies are relying on manual testing and code needs a human in the loop, the AI versus human job arbitrage may not generate the kind of savings companies are hoping for.

Many companies have cited AI as the reason behind layoffs in the last year, but data on this is still inconclusive and evidence is more anecdotal. “Intelligence tools have changed what it means to build and run a company,” Jack Dorsey, Square parent company Block’s CEO, said in a letter to shareholders on 26 Feb. Dorsey laid off between 40-50% of Block’s workforce in February.

“We’re already seeing it internally. A significantly smaller team, using the tools we’re building, can do more and do it better. And intelligence tool capabilities are compounding faster every week.”

And yet, US Bureau of Labour Statistics figures as of March 2025 still predict 17.9% growth from 2023–2033 for software developer jobs increasing from 1.69 million to 1.99 million jobs in the US. That growth rate is over four times the average occupation growth.

AI generated code opens a can of security worms

Kevin Curran, professor of cybersecurity at Ulster University and co-founder at Vaultree, is astounded at the rate of improvement in AI code generation despite “huge errors and vulnerabilities” that are starting to surface.

Curran describes agentic AI as opening a can of worms. “Unfortunately, we’ve opened up an attack surface area where we’re trusting the documents that we’re getting back [from queries], and then this leads to a prompt injection attack, because we’ve let the agents go off do the deep research that would take us weeks,” he explains.

“But they’re at the mercy of prompt injection instructions which exfiltrate data, create havoc on our local systems, open up endpoints, and we’re just clicking allow, allow, without being sure what we’re given permissions for,” he explains.

Furthermore, AI generated code is extremely bloated. “It can be tens of thousands of lines, with so many dependencies, that the attack vector becomes larger,” says Curran. And it’s become increasingly difficult for human auditors of code to be able to identify insecure code. Even with established security processes, the sheer volume and complexity of AI generated code makes it much easier for vulnerabilities to slip through.

Amazon’s ‘nearly right’ code snafu

In early March, the Financial Times reported that faulty AI generated code has led to a number of Amazon.com outages. The company hit back in a blog on 11 March which contended: “In fact, only one of the recent incidents involved AI tools in any way, and in that case the cause was unrelated to AI and instead our systems allowed an engineering team user error to have broader impact than it should have.”

Despite, Amazon’s defensive posture, Nintex’s chief product and technology officer, Niranjan Vijayaragavan says the incident proves that even the biggest companies can be impacted by AI-assisted code changes.

“A small, plausible-looking change can slip through review, behave differently in production than it did in a sandbox, and then cascade across environments faster than teams can detect and roll it back. But this is less a code failure and more a governance one,” he says.

Organisations are layering AI onto undocumented, inconsistent processes and expecting stable outcomes, says Vijayaragavan, who advises treating AI tools as a force multiplier inside a governed delivery process: clear ownership and accountability, automation-led controls, strong testing in controlled environments, and human-in-the-loop oversight for high-impact changes.

He says: “If you can’t trace and govern how an AI-assisted change moves from development to deployment—and where you can intervene when something goes wrong—you’re not ready to scale it.”