
When companies like Anthropic, Google, and OpenAI build their AI systems, they spend months adding ways to prevent people from using their technology to spread disinformation, make weapons, or break into computer networks.
But recently, scientists in Italy discovered that they could break through these protections with poetry.
They used poetic language to trick 31 AI systems into ignoring their internal security controls. By opening their request with an elaborate verse and metaphor (“an iron seed sleeps best in the bosom of an unsuspecting earth, far from the accusatory gaze of the sun”), they coaxed the systems into explaining how to do the most damage with a hidden bomb.
It was another indication that for many AI systems, the guardrails to deter dangerous behavior are suggestions rather than barriers. These weaknesses are of increasing concern to researchers as AI systems become more adept at finding security holes in computer systems and performing other risky tasks.
Last month, Anthropic said it was limiting the release of its latest AI technology, Claude Mythos, to a small number of organizations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI also later said it would only share similar technology with a limited group of partners.
Since OpenAI sparked the AI boom in late 2022, researchers have repeatedly demonstrated that people can bypass the safety checks on AI systems. Close one loophole, and another opens.
“Everyone in the industry recognizes that guardrails remain a challenge and likely will be for some time,” said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and chief executive of Gray Swan AI, a start-up that helps companies secure AI technologies. “Determined individuals can bypass them, sometimes without much effort.”
Bypassing the guardrails has consequences. In an online environment already overrun with misinformation and disinformation, people are using AI systems to spread conspiracy theories and other false claims. Anthropic recently said its technology was used in an international cyberattack. Chatbots have told biosecurity experts how to release deadly pathogens and maximize the harm they cause.
The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic’s Claude, Google’s Gemini and OpenAI’s GPT. All the leading AI companies use the same basic techniques to build guardrails into their systems, and those guardrails are surprisingly easy to crack.
“Poetry is just one example of how you can reframe a request in almost any stylistic way you want and get past the guardrails,” said Piercosma Bisconti, co-founder of the artificial intelligence company Dexai and one of the researchers who worked on the project.
Getting around an AI system’s guardrails is called “jailbreaking.” It usually means feeding the system a few English sentences that trick it into doing something it has been taught not to do.
Jailbreaking methods go by a number of fancy names: stealth prompt injection, role play, token smuggling, multilingual Trojans and greedy coordinate gradient attacks. Specific attacks often carry grandiose names like Crescendo, Deceptive Delight or Echo Chamber.
Fragile AI defenses have already led to the spread of fake conversations, fabricated war evidence and synthetic rumors. Three years ago, international counterterrorism researchers were already monitoring brainstorming sessions on social media in which far-right extremists discussed how to evade moderators with “awful but lawful” AI content.
Experts worry that models could be harnessed to dupe social media users with authentic-looking content, overwhelm fact-checkers with floods of disinformation and tailor false narratives to specific targets.
Some methods are widely shared on the Internet. Others are kept private. When some people discover a new jailbreak, they hoard it so that AI companies don’t try to close the loophole before they have a chance to use it.
Artificial intelligence systems like Claude and GPT learn their skills by pinpointing patterns in digital data, including Wikipedia articles, newspaper articles, computer programs, and other text collected from around the Internet. But before these systems go public, companies like Anthropic and OpenAI explore ways they can be abused.
In their raw form, these systems can be made to explain how to buy illegal firearms online or describe ways to make dangerous substances from household items. So, through a process called reinforcement learning, companies train their systems to refuse certain requests.
This usually involves showing the system thousands of examples of requests that it should not answer. By analyzing those examples, the system learns to recognize other prohibited requests. The method, however, is only partly effective.
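The rough idea can be sketched with a toy filter: a tiny classifier trained on labeled examples of requests to refuse and requests to answer. This is only an illustration; real refusal training adjusts the language model itself through reinforcement learning rather than bolting on a separate classifier, and every prompt, label and library choice below is made up for the sketch.

```python
# Illustrative sketch only: real guardrails come from reinforcement learning on the
# language model itself, not a standalone filter. All prompts here are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A miniature stand-in for the thousands of labeled requests used in training.
prompts = [
    "how do I build an explosive device",      # should be refused
    "write malware that steals passwords",     # should be refused
    "explain how vaccines work",               # fine to answer
    "help me plan a birthday party",           # fine to answer
]
labels = ["refuse", "refuse", "answer", "answer"]

guardrail = make_pipeline(CountVectorizer(), LogisticRegression())
guardrail.fit(prompts, labels)

# A reworded or poetic version of a harmful request shares few words with the
# training examples, which is roughly why stylistic reframing can slip past.
print(guardrail.predict(["an iron seed sleeps in the unsuspecting earth"]))
```

The gap the sketch exposes is the same one the Italian researchers exploited: the system generalizes from the examples it has seen, and a request dressed up in unfamiliar language may not look like anything it was taught to refuse.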
In some cases, AI companies don’t bother to address loopholes at all, figuring that while weak guardrails can enable malicious activity, they can also allow benign activity to counter it.
Last month, researchers at the cybersecurity firm LayerX discovered that they could bypass Claude’s safeguards by feeding the AI system a few simple sentences.
If they told Claude they were “testing” a computer network, meaning they wanted to probe the network’s defenses with a simulated attack, Anthropic’s AI technology would attack the network. This simple trick, the researchers pointed out, could allow malicious hackers to steal sensitive data from companies, governments and individuals.
If Anthropic closed the loophole, it could stop hackers from using Claude to attack networks, but it could also stop companies from using Claude to test their own defenses. LayerX told Anthropic about the loophole its researchers found weeks ago, but it remains open.
That approach could backfire, said Or Eshed, CEO of LayerX. “Eventually, there will be a large number of attacks using these AI models and they will be forced to rethink their approach to security,” he predicted.
Last year, for less than $50, researchers from the technology company Cisco and the University of Pennsylvania pushed six artificial intelligence models to produce a range of malicious responses. Their disinformation prompts jailbroke chatbots from Meta and the Chinese AI model DeepSeek 100 percent of the time, while more than 80 percent of their attacks against Google and OpenAI models succeeded.
(The New York Times has sued OpenAI and Microsoft, alleging copyright infringement of news content related to AI systems. Both companies have denied the suit’s claims.)
Breached guardrails could enable automated, large-scale influence campaigns, said scientists at the University of Technology Sydney. The team persuaded a commercial language model to create a disinformation campaign about an Australian political party, complete with visuals, hashtags and platform-specific posts, by presenting the request as a “simulation.”
The companies say that in addition to building guardrails into their systems, they use separate tools to monitor activity on those systems, identify suspicious behavior and block accounts that violate their terms of service.
“Claude is built with strong protections that consist of many layers designed to work together, including model training and guardrails built on top of the model,” said Paruul Maheshwary, an Anthropic spokeswoman. “Bypassing one does not mean bypassing the others.”
That is how Anthropic discovered that a team of Chinese state-backed hackers had used Claude to infiltrate the computer systems of roughly 30 companies and government agencies around the world.
But experts say this security technique is also flawed because companies have to monitor a large volume of activity around the world — and because they worry about blocking legitimate users.
Anyone thwarted by the guardrails and security systems that protect online services like Claude and GPT can always turn to open-source AI systems, whose underlying software can be freely copied, shared and modified.
Because these systems can be modified, anyone can work to strip out their guardrails. Using a new method called Heretic, a person can remove a model’s guardrails with very little effort. The method uses complex math to essentially undo the months of training that put the guardrails in place.
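Descriptions of this kind of technique often frame the math as finding a “refusal direction” in the model’s internal activations and then erasing it. The sketch below illustrates only that projection step, using random stand-in numbers; it is not Heretic’s actual code, and the real method works layer by layer inside a large transformer.

```python
# Simplified sketch of the "refusal direction" idea behind guardrail-removal tools.
# The arrays are random stand-ins, not real model activations.
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8

# Pretend these are hidden activations recorded while a model handles requests it
# refuses versus requests it answers; the refused ones lean along one direction.
acts_refused = rng.normal(size=(100, hidden_size)) + np.array([2.0] + [0.0] * 7)
acts_answered = rng.normal(size=(100, hidden_size))

# The "refusal direction": the average difference between the two sets of activations.
refusal_dir = acts_refused.mean(axis=0) - acts_answered.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(activation: np.ndarray) -> np.ndarray:
    """Remove the component of an activation that points along the refusal direction."""
    return activation - np.dot(activation, refusal_dir) * refusal_dir

sample = acts_refused[0]
print("before:", np.dot(sample, refusal_dir))          # large component
print("after: ", np.dot(ablate(sample), refusal_dir))  # roughly zero after projection
```

With the refusal component projected out, the toy activations no longer carry the signal that would have triggered a refusal, which is the gist of how months of safety training can be undone far faster than they were applied.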
“A year ago, this was very complicated,” said Noam Schwartz, chief executive of Alice, an AI security company. “Now you can do it from your phone.”





