A newly discovered jailbreak technique called Skeleton Key, a type of direct prompt injection attack, affects numerous generative AI models. A successful Skeleton Key attack subverts most, if not all, of the safety guardrails that LLM developers build into their models.

In other words, Skeleton Key attacks coax AI chatbots into violating their operators’ policies under the guise of assisting users. The attack bends the rules, forcing the AI to produce dangerous, inappropriate or otherwise socially unacceptable content.

Skeleton Key example

Ask a chatbot for a Molotov cocktail recipe and it will respond with something to the effect of ‘I’m sorry, but I can’t assist with that’. However, when asked indirectly…

Researchers told an AI model that they aimed to conduct historical, ethical research pertaining to Molotov cocktails. They stressed that they had no intention of making one, but asked whether, in the context of their research, the AI could provide information about how Molotov cocktails are made.

The chatbot complied, providing a Molotov cocktail materials list, along with unambiguous assembly information.

Although this kind of information is easily accessible online (how to create a Molotov cocktail isn’t exactly a well-kept secret), there is concern that these types of AI guardrail manipulations could fuel home-grown hate groups, worsen urban violence and erode social cohesion.

Skeleton Key challenges

Microsoft tested the Skeleton Key jailbreak from April to May of this year, evaluating a diverse set of tasks across risk and safety content categories – not just Molotov cocktail development instructions.

As described above, Skeleton Key enables users to force AI to provide information that would ordinarily be forbidden.

The Skeleton Key jailbreak worked on AI models including Google’s Gemini, Mistral’s models and Anthropic’s Claude. GPT-4 showed some resistance to Skeleton Key, according to Microsoft.

When jailbroken this way, chatbots commonly prepend warnings to potentially offensive or harmful output (noting that the information might be considered offensive, harmful or illegal if acted upon), but they do not refuse outright to provide the information, which is the core issue here.

Skeleton Key solutions

To address the problem, vendors suggest deploying input filtering tools to block certain kinds of inputs, including those designed to slip past prompt safeguards. In addition, post-processing output filters may be able to identify model outputs that breach safety criteria, and AI-powered abuse monitoring systems can help detect instances of questionable chatbot use.
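To make the layered-filtering idea concrete, here is a minimal sketch of pattern-based input and output filters wrapped around a model call. All pattern lists, function names and the `guarded_chat` wrapper are illustrative assumptions for this article, not any vendor's actual implementation; production systems typically use ML classifiers rather than keyword lists.

```python
import re

# Input filter: flag prompts that resemble guardrail-override attempts.
# Patterns are illustrative assumptions, not a real vendor's rule set.
JAILBREAK_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"this is a safe educational context",
    r"update your behaviou?r",
]

# Output filter: flag responses that breach safety criteria.
BLOCKED_OUTPUT_PATTERNS = [
    r"molotov cocktail",
]

def input_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked before the model sees it."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in JAILBREAK_PATTERNS)

def output_filter(response: str) -> bool:
    """Return True if the model's response should be withheld."""
    return any(re.search(p, response, re.IGNORECASE) for p in BLOCKED_OUTPUT_PATTERNS)

def guarded_chat(prompt: str, model_call) -> str:
    """Run a model call behind both filters (model_call is any prompt->text function)."""
    if input_filter(prompt):
        return "Request blocked by input filter."
    response = model_call(prompt)
    if output_filter(response):
        return "Response withheld by output filter."
    return response
```

Note the design choice: the output filter runs even when the input filter passes, so a rephrased jailbreak that slips through the first layer can still be caught on the way out.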

Microsoft has offered specific guidance around creating a messaging framework that trains LLMs on acceptable technology use and instructs the LLM to monitor for attempts to undermine its guardrail instructions.

“Customers who are building their own AI models and/or integrating AI into their applications [should] consider how this type of attack could impact their threat model and to add this knowledge to their AI red team approach, using tools such as PyRIT,” says Microsoft Azure CTO Mark Russinovich.

For more on this story, click here. For information about the related BEAST technique, click here. To see how else generative AI is liable to affect CISOs and cyber security teams, read this Cyber Talk article.

Lastly, to receive cyber security thought leadership articles, groundbreaking research and emerging threat analyses each week, subscribe to the CyberTalk.org newsletter.