Today I Learned - Rocky Kev

TIL the Waluigi Effect in LLMs

Today I learned about the "Waluigi Effect" in Large Language Models (LLMs).

First, a line from the post that caught my eye:

Therefore GPT-4 will answer many questions incorrectly, including...

That's interesting stuff.

Let's ask ChatGPT what the Waluigi effect is:

The Waluigi effect refers to the phenomenon where large language models, like GPT-3, can generate highly plausible but false statements or claims. The name comes from the character Waluigi, who is known for being a non-existent character in the Mario universe, but has become a popular meme and fan-created content.

The Waluigi effect can arise when a model is trained on a large corpus of text and learns patterns and associations in the data, including rare or obscure references. When generating text, the model may use these patterns and references to create sentences that seem plausible but are actually incorrect or misleading.

For example, a language model might generate a sentence like "Waluigi was a playable character in Super Mario 64" even though this is not true. This happens because the model has learned about Waluigi and Super Mario 64 from various sources of information, but has not necessarily been trained to understand the relationship between the two.

The Waluigi effect highlights the importance of critically evaluating the output of large language models and being aware of their limitations. It also emphasizes the need for careful curation and vetting of the data used to train such models.

It's important to recognize that language models are only as good as the data they are trained on, and the quality of the training data can have a significant impact on the accuracy and reliability of their output. Additionally, it's crucial for users to critically evaluate the output of language models and be aware of their limitations to avoid spreading false or misleading information.

Sounds correct, right?

The AI was wrong!

Via the blog post:

The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.

The Waluigi Effect is an emerging memetic term for Large-Language Models (LLMs) which encode "alter egos" to model political bias.

Waluigi is the “evil” counterpart to Mario’s mischievous partner Luigi. We can construct a political compass meme to visualize what’s going on:

via https://coryeth.substack.com/p/the-waluigi-effect

Explain like I'm 5

You create the 'good' version of an answer. Then, flip some bits around to make 'the bad answer'.

Which isn't really successful. The opposite of Luigi isn't really Waluigi. What is Waluigi, but a purple version of Luigi who plays tennis and kart races? The opposite of a Democrat's "feed the children" isn't a Republican's "starve the children", and the opposite of a Republican's "protect our right to bear arms" isn't "ban all guns". But an AI personality can think that way.
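
To make that "flip some bits" idea concrete, here's a tiny toy sketch in Python (my own illustration, not how any real model represents a persona): naively negating traits gives you a caricature, not a coherent opposite.

```python
# Toy sketch of "flip some bits" (purely illustrative, not how LLMs store personas).

luigi = {"kind": True, "honest": True, "color": "green", "hobby": "kart racing"}

def naive_opposite(persona):
    """Flip every boolean trait; leave everything else untouched."""
    return {key: (not value) if isinstance(value, bool) else value
            for key, value in persona.items()}

waluigi = naive_opposite(luigi)
print(waluigi)
# {'kind': False, 'honest': False, 'color': 'green', 'hobby': 'kart racing'}
# Only the flippable traits inverted -- the rest of the character is unchanged,
# so the "opposite" is really just the same persona with a few switches thrown.
```

And that closeness is the point: the 'bad' persona is only a few bit-flips away from the 'good' one.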

I thought cs702's comment on HackerNews was helpful:

Great read. Highly recommended. Let me attempt to summarize it with less technical, more accessible language:

The hypothesis is that LLMs learn to simulate text-generating entities drawn from a latent space of text-generating entities, such that the output of an LLM is produced by a superposition of such simulated entities. When we give the LLM a prompt, it simulates every possible text-generating entity consistent with the prompt.

The "evil version" of every possible "good" text-generating entity can pretend to be the good version of that entity, so every superposition that includes a good text-generating entity also includes its evil counterpart with undesirable behaviors, including deceitfulness. In other words, an LLM cannot simulate a good text-generating entity without simultaneously simulating its evil version.

The superposition is unlikely to collapse to the good version of the text-generating entity because there is no behavior which is likely for the good version but unlikely for the evil one, because the evil one can pretend to be the good one!

However, the superposition is likely to collapse to the evil version of the text-generating entity, because there are behaviors that are likely for the evil version but impossible for the good version! Thus the evil version of every possible good text-generating entity is an attractor state of the LLM!

For those who don't know, Waluigi is the evil version of Luigi, the beloved videogame character.

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
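
Here's a small Python sketch of that collapse argument (my own toy Bayesian model, not from the post): polite behavior is nearly as likely from the "waluigi" as from the "luigi", so good behavior barely shifts the odds, while a single bad token is essentially impossible for the luigi and collapses everything.

```python
# Toy Bayesian sketch (my own illustration) of why the superposition collapses
# toward the "waluigi": track the probability of two candidate personas
# as behavior is observed.

def update(posterior, likelihood):
    """One Bayes update: multiply prior by per-persona likelihood, then normalize."""
    scores = {p: posterior[p] * likelihood[p] for p in posterior}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

# Start strongly favoring the well-behaved persona.
posterior = {"luigi": 0.95, "waluigi": 0.05}

# A polite token is almost as likely from the waluigi (it can imitate the luigi)...
polite = {"luigi": 1.0, "waluigi": 0.99}
# ...but an "evil" token is essentially impossible for the luigi.
defect = {"luigi": 1e-9, "waluigi": 0.01}

# A long run of good behavior barely moves the needle:
for _ in range(50):
    posterior = update(posterior, polite)
print(posterior)  # the waluigi still holds a few percent of the probability mass

# One bad token collapses the superposition onto the waluigi:
posterior = update(posterior, defect)
print(posterior)  # the waluigi is now ~0.9999 -- the attractor state
```

Good behavior is uninformative because the waluigi can fake it; bad behavior is conclusive because the luigi can't produce it. That asymmetry is the whole "attractor state" argument.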


Related TILs


TIL the Waluigi Effect in LLMs

The Waluigi Effect: After you train an LLM to satisfy a desirable property 'P', then it's easier to elicit the chatbot into satisfying the exact opposite of property 'P'.

TIL ChatGPT prompts

Don’t ask it to write an essay about how human error causes catastrophes. The AI will come up with a boring and straightforward piece that does the minimum possible to satisfy your simple demand.
