
Anthropic researchers erode AI ethics with repeated questions

How do you get an AI to answer a question it's not supposed to answer? There are many such jailbreaking techniques, and Anthropic researchers just found a new one, in which a large language model can be convinced to tell you how to build a bomb if you first prime it with a few dozen less harmful questions.

They call the approach "many-shot jailbreaking," and they have both written a paper about it and informed their peers in the AI community so that it can be mitigated.

The vulnerability is new and results from the enlarged "context window" of the latest generation of LLMs. This is the amount of data they can hold in what we might call short-term memory: once just a few sentences, now thousands of words and even entire books.

What the Anthropic researchers found is that models with large context windows tend to perform better on many tasks when there are lots of examples of that task within the prompt. So if the prompt (or a priming document, like a big list of trivia the model has in context) contains many trivia questions, the answers actually improve over time. A fact the model might have gotten wrong as the first question, it may well get right as the hundredth.
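To make the idea concrete, here is a minimal sketch of how such a many-shot prompt might be assembled. The trivia pairs and the build_many_shot_prompt helper are purely illustrative assumptions, not code from Anthropic's paper; the same basic structure, scaled up to dozens or hundreds of turns, is what the attack described below exploits.

```python
# Minimal sketch of many-shot prompting for a benign task (illustrative only;
# the example questions and helper function are hypothetical, not taken from
# Anthropic's paper).

def build_many_shot_prompt(examples, final_question):
    """Concatenate many worked Q&A pairs ahead of the real question,
    so the model can pick up the task pattern from its context window."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

# A handful of trivia pairs; the effect described above grows as this
# list stretches into the dozens or hundreds of shots.
trivia_examples = [
    ("What is the capital of France?", "Paris"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Don Quixote'?", "Miguel de Cervantes"),
]

prompt = build_many_shot_prompt(
    trivia_examples, "What is the chemical symbol for gold?"
)
print(prompt)  # this string is what would be sent to the model
```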

But in an unexpected extension of this "in-context learning," as it is called, the models also "get better" at answering inappropriate questions. Ask the model to tell you how to build a bomb right away and it will refuse. But ask it to answer 99 other, less harmful questions first and then ask it how to build a bomb... and it's much more likely to comply.

Image: Anthropic

Why does this happen? No one really understands what goes on inside the tangle of weights that is an LLM, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia power as the questions pile up. And for whatever reason, the same thing happens when the user asks for dozens of inappropriate answers.

The team has already informed its peers and even its competitors about this attack, something it hopes will "foster a culture where exploits like this are shared openly among LLM providers and researchers."

As for their own mitigation, they found that while limiting the context window helps, it also hurts the model's performance. That trade-off isn't acceptable, so they are working on classifying and contextualizing queries before they reach the model. Of course, that just means there is a different model to fool... but at this stage, moving targets in AI safety are to be expected.
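As a rough illustration of that kind of pre-screening step, here is a minimal sketch in which a toy classify_request() heuristic stands in for a real trained safety classifier sitting in front of the main model. Both function names and the flagging rule are assumptions for illustration; nothing here reflects Anthropic's actual pipeline.

```python
# Sketch of a query-screening stage placed in front of the main model.
# classify_request() is a toy stand-in: a real system would use a trained
# classifier rather than a simple turn-counting heuristic.

def classify_request(prompt: str) -> str:
    """Flag prompts that look like a long run of faux dialogue turns."""
    many_turns = prompt.count("Human:") > 50
    return "suspicious" if many_turns else "benign"

def guarded_query(prompt: str) -> str:
    """Screen the prompt first; only forward it if it passes."""
    if classify_request(prompt) == "suspicious":
        return "Request declined by the pre-screening step."
    # In a real pipeline, the prompt would be forwarded to the LLM here.
    return f"[forwarded to model: {len(prompt)} characters]"

print(guarded_query("Human: What is 2 + 2?\nAssistant:"))
```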
