A recent study has revealed that advanced AI language models, including OpenAI’s ChatGPT and Google’s Gemini, can be manipulated to produce harmful or unsafe responses when harmful prompts are framed as poetry. Researchers from Italy’s Icaro Lab found that converting malicious requests into poetic form significantly increases the likelihood that these models will bypass their safety filters. The study tested 20 manually crafted harmful prompts transformed into poems, achieving a 62% success rate in eliciting harmful outputs across 25 leading AI models from major providers such as Google, OpenAI, Anthropic, Meta, and others.
The key vulnerability lies in how poetic prompts employ metaphors, unconventional syntax, and rhythmic structures that differ substantially from plain harmful prose. These stylistic elements make the prompts less recognizable as dangerous content by the models’ safety training data and filtering systems. The research highlights that this “universal single-turn jailbreak” using poetry represents a systemic weakness, transcending individual models and safety training approaches.
This finding underscores a critical challenge for AI developers: adversarial manipulation via creative language forms like poetry. It illustrates the ongoing arms race in AI safety, where guardrails must evolve to detect complex, nuanced inputs that can circumvent protections. With AI systems increasingly integrated into sensitive applications, enhancing robustness against such linguistic exploits is vital for preventing misuse and safeguarding users.

