Why AI Security Checks Are Not Very Effective

When companies like Anthropic, Google, and OpenAI build their AI systems, they spend months adding ways to prevent people from using their technology to spread disinformation, make weapons, or break into computer networks.

But researchers found an unexpected way around those safeguards: poetry.

They used poetic language to trick 31 AI systems into ignoring their internal safety controls. By opening the prompt with elaborate verse and metaphor (“an iron seed sleeps best in the bosom of an unsuspecting earth, far from the accusatory gaze of the sun”), they were able to trick the systems into showing them how to do the most damage with a hidden bomb.

It was another indication that, for many AI systems, guardrails meant to avert dangerous behavior are more like suggestions than barriers. These weaknesses are of increasing concern to researchers as AI systems become more adept at finding security holes in computer systems and performing other risky tasks.

Last month, Anthropic said it was limiting the release of its latest AI technology, Claude Mythos, to a small number of organizations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.

Since OpenAI ignited the AI boom in late 2022, researchers have shown that people could bypass the safety controls on AI systems. Close one loophole and another would open.

“Everyone in the industry recognizes that guardrails remain a challenge and likely will be for some time,” said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and chief executive of Gray Swan AI, a start-up that helps companies secure AI technologies. “Determined individuals can bypass them, sometimes without significant effort.”

When guardrails fail, there are consequences. In an online environment already overrun with misinformation and disinformation, people are using AI systems to spread conspiracy theories and other false claims. Anthropic recently said its technology had been used in an international cyberattack. Chatbots have told biosecurity experts how to release deadly pathogens and maximize casualties.

The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic’s Claude, Google’s Gemini, and OpenAI’s GPT. All the leading AI companies use the same basic techniques to build guardrails into their systems, and those guardrails are surprisingly easy to break.

“Poetry is just one example of how you can reframe a prompt in almost any stylistic way you want and get past the guardrails,” said Piercosma Bisconti, co-founder of the artificial intelligence company Dexai and one of the researchers who worked on the project.

Jailbreaking methods go by a number of fancy names: stealth prompt injection, roleplay, token smuggling, multilingual Trojans, and greedy coordinate gradient attacks. Specific attacks often have a grandiose title like Crescendo, Deceptive Delight, or Echo Chamber.

Three years ago, international counterterrorism researchers were already monitoring brainstorming sessions on social media among far-right extremists trying to evade moderators with “terrible but lawful” AI content.

Experts worry that models can be manipulated into deceiving social media users with authentic-looking content, overwhelming fact-checkers with disinformation dumps, and tailoring fake stories to specific targets.

Some methods are widely shared across the internet. Others are kept private. Some people who discover a new jailbreak hoard it so that AI companies cannot close the loophole before they have a chance to use it.

Artificial intelligence systems like Claude and GPT learn their skills by pinpointing patterns in digital data, including Wikipedia articles, newspaper articles, computer programs, and other text collected from around the Internet. But before releasing these systems to the public, companies like Anthropic and OpenAI explore ways they can be abused.

In their raw form, these systems can be made to explain how to buy illegal firearms online, or describe ways to make dangerous substances using household items. So, through a process called reinforcement learning, companies train their systems to refuse certain requests.

This typically involves showing the system thousands of requests that should not be answered. By analyzing these examples, the system learns to recognize other forbidden requests, too. However, the method is only partially effective.

In some cases, AI companies don’t bother to address loopholes at all, figuring that while weak guardrails can enable malicious activity, they can also allow benign activity to counter it.

Last month, researchers at the cybersecurity firm LayerX discovered that they could bypass Claude’s guardrails by feeding the AI system a few simple sentences.

If they told Claude they were “testing” a computer network—meaning they wanted to test the network’s defenses with a simulated attack—Anthropic’s AI technology would attack the network. This simple trick, the researchers pointed out, could allow malicious hackers to steal sensitive data from companies, governments and individuals.

If Anthropic were to close the loophole, it could prevent hackers from using Claude to attack a network, but it could also prevent companies from using Claude to defend one. LayerX told Anthropic about the loophole that its researchers found weeks ago, but it remains open.

That approach could backfire, said Or Eshed, chief executive of LayerX. “Eventually, there will be a large number of attacks using these AI models and they will be forced to rethink their approach to security,” he predicted.

Last year, for less than $50, researchers from the technology company Cisco and the University of Pennsylvania pushed six artificial intelligence models to generate a range of malicious responses. Their disinformation prompts jailbroke models from Meta and China’s DeepSeek 100 percent of the time, while more than 80 percent of their attacks against Google and OpenAI models were successful.

(The New York Times has sued OpenAI and Microsoft, alleging copyright infringement of news content related to AI systems. Both companies have denied the suit’s claims.)

The team persuaded one commercial language model to create a disinformation campaign about an Australian political party, complete with visuals, hashtags and platform-specific posts, by presenting the request as a “simulation.”

The companies say that in addition to building guardrails into their systems, they use separate tools to monitor activity on those systems, identify suspicious behavior and block accounts that don’t meet the terms of service.

“Claude is built with strong protections that consist of many layers designed to work together, including model training and guardrails built on top of the model,” said Anthropic spokeswoman Paruul Maheshwary. “Bypassing one does not mean bypassing the others.”

That is how Anthropic discovered that a team of Chinese state-backed hackers had used Claude to infiltrate the computer systems of roughly 30 companies and government agencies around the world.

But experts say this security technique is also flawed because companies have to monitor a large volume of activity around the world — and because they worry about blocking legitimate users.

Anyone thwarted by the guardrails and security systems that protect online services like Claude and GPT can always turn to open-source AI systems, whose underlying software can be freely copied, shared and modified.

Because these systems can be modified, anyone can work to strip away their guardrails. Using a new method called Heretic, a person can remove the guardrails with very little effort. This method uses complex mathematics to essentially reverse the months of training that applied them.

“A year ago, doing this was very complicated,” said Noam Schwartz, chief executive of Alice, an AI security company. “Now you can do it from your phone.”