
Anthropic researchers erode AI ethics with repeated questions

How do you get an AI to answer a question it's not supposed to answer? There are many such jailbreaking techniques, and Anthropic researchers just found a new one, in which a large language model can be convinced to tell you how to build a bomb if you first prime it with a few dozen less harmful questions.

They call the approach "many-shot jailbreaking," and they have both written a paper about it and informed their peers in the AI community so that it can be mitigated.

The vulnerability is new and results from the enlarged "context window" of the latest generation of LLMs. This is the amount of data they can hold in what we might call short-term memory: once just a few sentences, now thousands of words and even entire books.

What the Anthropic researchers found is that models with large context windows tend to perform better on many tasks when there are lots of examples of that task within the prompt. So if the prompt (or a priming document, like a big list of trivia the model has in context) contains many trivia questions, the answers actually improve over time. A fact the model might have gotten wrong as the first question, it may well get right as the hundredth.
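To make the idea concrete, here is a minimal sketch of how such a many-shot prompt might be assembled. The trivia pairs and the build_many_shot_prompt helper are purely illustrative assumptions, not code from Anthropic's paper; the same basic structure, scaled up to dozens or hundreds of turns, is what the attack described below exploits.

```python
# Minimal sketch of many-shot prompting for a benign task (illustrative only;
# the example questions and helper function are hypothetical, not taken from
# Anthropic's paper).

def build_many_shot_prompt(examples, final_question):
    """Concatenate many worked Q&A pairs ahead of the real question,
    so the model can pick up the task pattern from its context window."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

# A handful of trivia pairs; the effect described above grows as this
# list stretches into the dozens or hundreds of shots.
trivia_examples = [
    ("What is the capital of France?", "Paris"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Don Quixote'?", "Miguel de Cervantes"),
]

prompt = build_many_shot_prompt(
    trivia_examples, "What is the chemical symbol for gold?"
)
print(prompt)  # this string is what would be sent to the model
```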

But in an unexpected extension of this "in-context learning," as it is called, the models also "get better" at answering inappropriate questions. Ask the model to tell you how to build a bomb right away and it will refuse. But ask it to answer 99 other, less harmful questions first and then ask it how to build a bomb... and it's much more likely to comply.

Image: Anthropic

Why does this happen? No one really understands what goes on inside the tangle of weights that is an LLM, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia power as the questions pile up. And for whatever reason, the same thing happens when the user asks for dozens of inappropriate answers.

The team has already informed its peers and even its competitors about this attack, something it hopes will "foster a culture where exploits like this are shared openly among LLM providers and researchers."

As for their own mitigation, they found that while limiting the context window helps, it also hurts the model's performance. That trade-off isn't acceptable, so they are working on classifying and contextualizing queries before they reach the model. Of course, that just means there is a different model to fool... but at this stage, moving targets in AI safety are to be expected.
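As a rough illustration of that kind of pre-screening step, here is a minimal sketch in which a toy classify_request() heuristic stands in for a real trained safety classifier sitting in front of the main model. Both function names and the flagging rule are assumptions for illustration; nothing here reflects Anthropic's actual pipeline.

```python
# Sketch of a query-screening stage placed in front of the main model.
# classify_request() is a toy stand-in: a real system would use a trained
# classifier rather than a simple turn-counting heuristic.

def classify_request(prompt: str) -> str:
    """Flag prompts that look like a long run of faux dialogue turns."""
    many_turns = prompt.count("Human:") > 50
    return "suspicious" if many_turns else "benign"

def guarded_query(prompt: str) -> str:
    """Screen the prompt first; only forward it if it passes."""
    if classify_request(prompt) == "suspicious":
        return "Request declined by the pre-screening step."
    # In a real pipeline, the prompt would be forwarded to the LLM here.
    return f"[forwarded to model: {len(prompt)} characters]"

print(guarded_query("Human: What is 2 + 2?\nAssistant:"))
```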
