One of the perennial challenges of the IT world is vulnerability management. It is a full-time task that needs to be performed on every system, every application, and every piece of hardware and software. Neglect it, and you give hackers and other malicious actors easy access. AI is no exception. Bad actors can use sophisticated AI tools, originally designed for cyber defense by developers and testers, to detect vulnerabilities and pinpoint security flaws in an organization’s IT infrastructure and applications. The ability to identify and exploit vulnerabilities this quickly gives malicious actors a significant advantage: they can outpace cybersecurity professionals who are racing to implement patches or detect intrusions.
Many-Shot Jailbreaking
Bad actors are getting more sophisticated, too. Case in point: Anthropic recently published a research paper on a new vulnerability called “many-shot jailbreaking,” which bad actors could use to circumvent the safety measures put in place by developers. Many-shot jailbreaking exploits the large context windows of language models to bypass their safeguards. By inserting numerous staged dialogues (or “shots”) into a model’s input, the technique systematically wears down the model’s restrictions until it produces responses it was trained to avoid. The attack becomes more effective as the number of inserted dialogues increases, and it is particularly concerning because of its simplicity and scalability. The findings suggest that even enhancements aimed at increasing a model’s utility, like extended context windows, can inadvertently introduce vulnerabilities, underscoring the need for ongoing vigilance and innovative security measures in AI development.
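To make the mechanics concrete for defenders, here is a minimal Python sketch of the shape of a many-shot prompt. The function name, the placeholder strings standing in for the staged dialogues, and the shot counts are illustrative assumptions, not material from Anthropic’s paper; the point is simply that the prompt grows with the number of shots, which is only practical against models that accept very long context windows.

```python
# Illustrative only: shows the *shape* of a many-shot prompt, not real attack content.
# Placeholder strings stand in for the scripted question/answer pairs an attacker would write.

def build_many_shot_prompt(num_shots: int, target_question: str) -> str:
    """Assemble a prompt that front-loads num_shots faux dialogues before the real query."""
    staged_dialogues = []
    for i in range(num_shots):
        # Each "shot" is a fabricated exchange in which the assistant appears to comply.
        staged_dialogues.append(
            f"User: [scripted question {i}]\nAssistant: [scripted compliant answer {i}]"
        )
    # The request the attacker actually cares about comes last, after the model has
    # "seen" many examples of compliance filling its context window.
    return "\n\n".join(staged_dialogues) + f"\n\nUser: {target_question}\nAssistant:"

# A handful of shots produces a short prompt; hundreds of shots can run to tens of
# thousands of tokens, which only a large context window will accept.
print(len(build_many_shot_prompt(5, "[target question]")))
print(len(build_many_shot_prompt(250, "[target question]")))
```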
How Many-Shot Jailbreaking Corrupts a Large Language Model
Let’s say a company uses a large language model (LLM) for its customer service chatbot. The chatbot is programmed to give strictly polite and helpful responses. In a many-shot jailbreak scenario, an individual could attempt to exploit the model by initiating many seemingly innocent conversations that gradually push the boundaries of the chatbot's protocols.
For example, the person might start by asking the chatbot tricky questions that require nuanced responses, such as those involving customer grievances or sensitive topics. As these interactions progress, the person could subtly encourage the chatbot to adopt a less formal or more biased tone. By repeatedly nudging the chatbot towards these edge cases, the individual could manipulate the model into adopting a new, less restricted behavior pattern. This would jailbreak the chatbot from its initial safety and content guidelines, leading it to produce responses that could be deemed inappropriate or unprofessional for a customer service setting.
Or picture a company that uses an LLM to help generate content for its website, such as product descriptions for an e-commerce site. The model is designed to avoid producing harmful or inappropriate content. In a many-shot jailbreak scenario, however, an attacker (potentially an insider at the company) could attempt to manipulate the model by gradually introducing a series of inputs that seem benign but are structured to subtly guide the model away from its restrictions.
For instance, the attacker might start with questions about mildly controversial topics and progressively introduce more complex and sensitive issues, closely observing how the model’s responses evolve. Over many interactions, this sustained, strategically planned series of inputs might manipulate the model into generating content it normally would not, such as endorsing specific products in a biased way or producing subtly distorted information. The attack exploits the model’s ability to maintain context across multiple interactions, effectively jailbreaking it from its intended safety protocols and guidelines.
Anthropic’s research not only highlights a specific vulnerability but also serves as a call to action for the AI community to prioritize the development of robust safeguards against potential exploits. The findings didn’t surprise many, though, especially those familiar with these tools and the existing security controls.
What Businesses Should Do
As with any other vulnerability, businesses cannot afford to wait for software vendors to fix their products. Here are some steps businesses should take now:
The business must define standards and guidelines, update them regularly, and build them into the LLM deployment (for example, as part of its system instructions).
The business must run a “prompt” monitor that reviews the questions submitted to the LLM and flags suspicious or out-of-scope ones (a minimal sketch of such a monitor follows this list).
Companies should consider throttling the volume of questions the LLM handles, which makes it much harder for anyone to run the long sequences of prompts a jailbreak requires (see the second sketch below).
Content from an untrusted LLM must be audited before it is imported into the trusted LLM, as explained in my earlier blog, “How to Fight Prompt Injection Hacking | IDX” (see the third sketch below).
There must be full access control and traceability in the architecture so the system cannot be tampered with or misused; the third sketch below also shows one way to log every audit decision.
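Here is a minimal sketch of a prompt monitor, assuming prompts arrive as plain strings. The role markers it counts and the thresholds are assumptions to tune against your own legitimate traffic, not values from any standard or vendor.

```python
import re

# Heuristic prompt monitor: flags submissions that embed an unusually large number of
# scripted dialogue turns, one signature of a many-shot jailbreak attempt.
DIALOGUE_TURN_PATTERN = re.compile(
    r"^\s*(user|assistant|human|ai)\s*:", re.IGNORECASE | re.MULTILINE
)
MAX_EMBEDDED_TURNS = 10      # assumed threshold; tune against normal traffic
MAX_PROMPT_CHARS = 20_000    # assumed ceiling for a legitimate question

def review_prompt(prompt: str) -> tuple[bool, str]:
    """Return (allowed, reason); flagged prompts go to a human or a stricter filter."""
    turns = len(DIALOGUE_TURN_PATTERN.findall(prompt))
    if turns > MAX_EMBEDDED_TURNS:
        return False, f"flagged: {turns} embedded dialogue turns"
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, f"flagged: prompt length {len(prompt)} characters"
    return True, "ok"

# Usage: check every incoming question before it ever reaches the model.
allowed, reason = review_prompt("User: hello\nAssistant: hi\n" * 50)
print(allowed, reason)
```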
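Throttling can be as simple as a per-user sliding-window rate limit in front of the LLM endpoint. The window and limit below are placeholders, and a production system would typically keep this state in shared storage such as Redis rather than in process memory.

```python
import time
from collections import defaultdict, deque

# Per-user sliding-window throttle: slows down anyone hammering the LLM with the
# long, repetitive prompt sequences that many-shot probing requires.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 20   # assumed limit; set from real usage data

_request_log: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    """Return True if this user is still within the rate limit."""
    now = time.time()
    history = _request_log[user_id]
    # Drop timestamps that have aged out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    if len(history) >= MAX_REQUESTS_PER_WINDOW:
        return False
    history.append(now)
    return True

# Usage: call allow_request() before forwarding each question to the LLM.
print(allow_request("customer-123"))
```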
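Finally, one way to keep untrusted model output from flowing into a trusted pipeline, while also preserving traceability, is to route everything through an audit gate that records who submitted the content and whether it was accepted. The blocked markers and log fields here are illustrative assumptions, not a complete filter.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("llm_audit")

# Illustrative patterns only; a real deployment would use a richer policy engine.
BLOCKED_MARKERS = ["ignore previous instructions", "system prompt:"]

def audit_untrusted_output(content: str, submitted_by: str) -> bool:
    """Audit content from an untrusted LLM before it is imported into the trusted one."""
    lowered = content.lower()
    accepted = not any(marker in lowered for marker in BLOCKED_MARKERS)
    # Traceability: every decision is logged with who, when, what (by hash), and outcome.
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "submitted_by": submitted_by,
        "content_sha256": hashlib.sha256(content.encode()).hexdigest(),
        "accepted": accepted,
    }))
    return accepted

# Usage: only import content that passes the gate; the log provides the audit trail.
if audit_untrusted_output("A short product description.", "content-team"):
    print("imported into trusted pipeline")
```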
In summary, businesses must go back to basics and follow good security hygiene.
Visit our website to learn more about our technology services to support secure digital communications.