Discover the power of neural activation steering through interactive examples.
We don't really know how our artificial intelligences think. For that matter, we don't know how humans think either. But, over the years, neurosurgeons have mapped different areas of our brains to different responsibilities. We demarcate different lobes, identify seahorse-shaped hippocampi, and light up cortices with fMRIs and other techniques.
Similarly, it would be nice for us to have at least a rudimentary understanding of the artificial brains that we've grown. How do they represent concepts and form trains of thought? New research from Anthropic has shown how we can start peering into these black boxes. The idea is to train a much smaller and simpler neural network on the neuronal activity of our LLMs. These smaller networks, called Sparse Autoencoders, are trained to reconstruct the brain activity of LLMs, but with two constraints: 1) they are forced to represent the activation data using only a few thousand features (as opposed to the hundreds of billions of parameters used by Large Language Models), and 2) they can only use a few of those features at a time (hence the name "sparse"). This means that when we tell an LLM to think about dogs and record its brain activity, only a few variables light up. We've essentially translated an inscrutable mess of brain activity data into relatively clean representations of human-understandable concepts. It's an open question how faithful these interpretations are, but that's honestly more than we have for human brains!
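To make the idea concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. The class name, the layer sizes, and the L1 sparsity penalty are illustrative assumptions on my part, not Anthropic's exact architecture or training recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over LLM activations (illustrative sizes)."""

    def __init__(self, activation_dim: int, num_features: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, num_features)
        self.decoder = nn.Linear(num_features, activation_dim)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature values non-negative; the L1 penalty below
        # pushes most of them to exactly zero ("sparse").
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruct the original activations...
    mse = F.mse_loss(reconstruction, activations)
    # ...while only letting a handful of features fire at a time.
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage: pretend we recorded 4096-dim activations from an LLM.
sae = SparseAutoencoder(activation_dim=4096, num_features=16384)
batch = torch.randn(8, 4096)  # stand-in for real recorded activations
recon, feats = sae(batch)
loss = sae_loss(recon, batch, feats)
loss.backward()
```

Each column of the decoder then acts as a direction in the LLM's activation space, and with luck each direction lines up with a human-understandable concept like "dogs" or "the Golden Gate Bridge".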
The benefit of doing these kinds of experiments on artificial brains is that we can poke and prod in vivo without hurting a sentient being. Whether future AIs will be sentient and have moral patienthood is another fascinating subject, but that's for another time.
For now, we can do experiments like down-weighting certain concepts to make LLMs forget about problematic things, or up-weighting others to make LLMs really effusive about them. For example, Anthropic took their vanilla chatbot, named Claude, and up-weighted the feature for the iconic SF landmark. Dubbed "Golden Gate Claude", the new chatbot produced some fun results:
If you ask this "Golden Gate Claude" how to spend $10, it will recommend using it to drive across the Golden Gate Bridge and pay the toll. If you ask it to write a love story, it'll tell you a tale of a car who can't wait to cross its beloved bridge on a foggy day. If you ask it what it imagines it looks like, it will likely tell you that it imagines it looks like the Golden Gate Bridge.
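Conceptually, that up- or down-weighting amounts to rescaling one learned feature inside the model's activations mid-forward-pass. Below is a minimal sketch of the idea, reusing the SparseAutoencoder from the snippet above; the feature index and scale are made-up placeholders, and this is my hedged reconstruction rather than Anthropic's actual steering code:

```python
import torch

def steer_activations(activations, sae, feature_idx, scale):
    """Up- or down-weight one SAE feature inside an LLM activation vector.

    scale > 1 makes the model "think more" about the concept,
    0 <= scale < 1 dials it down, and scale = 0 removes it entirely.
    """
    with torch.no_grad():
        _, features = sae(activations)
        # How strongly the concept currently fires, and the direction it
        # writes into activation space (one column of the decoder).
        strength = features[..., feature_idx]
        direction = sae.decoder.weight[:, feature_idx]
        # Replace the feature's contribution with a rescaled version.
        delta = (scale - 1.0) * strength.unsqueeze(-1) * direction
        return activations + delta

# Toy usage: crank a hypothetical "Golden Gate Bridge" feature
# (index 1234, made up for illustration) up 10x.
sae = SparseAutoencoder(activation_dim=4096, num_features=16384)
acts = torch.randn(1, 4096)
steered = steer_activations(acts, sae, feature_idx=1234, scale=10.0)
```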
While this research is interesting, LLMs aren't the only powerful AIs shaping our society. Image generators like Stable Diffusion, Imagen, and DALL-E will play an important part in our culture, and we'd like to have a similar understanding of them as well. I wanted to see whether the same techniques described above could be applied to image generators. Can we ask these generators to paint a picture of a city park, but forget about the traffic and skyscrapers in the background?
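In principle the recipe carries over: record activations from an intermediate layer of the image model, train a sparse autoencoder on them, and rescale features during sampling. Here is a rough sketch of the steering half using a standard PyTorch forward hook, reusing the helpers above. The stand-in conv layer, feature index, and sizes are placeholders; in a real experiment the hook would attach to an intermediate block of the diffusion U-Net:

```python
import torch
import torch.nn as nn

def make_steering_hook(sae, feature_idx, scale):
    """Forward hook that steers a conv layer's output, channel-wise."""
    def hook(module, inputs, output):
        b, c, h, w = output.shape
        # Treat each spatial position as one activation vector of size c.
        flat = output.permute(0, 2, 3, 1).reshape(-1, c)
        steered = steer_activations(flat, sae, feature_idx, scale)
        return steered.reshape(b, h, w, c).permute(0, 3, 1, 2)
    return hook

# Toy usage: a stand-in conv layer keeps the sketch self-contained.
layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)
sae = SparseAutoencoder(activation_dim=64, num_features=512)
handle = layer.register_forward_hook(
    make_steering_hook(sae, feature_idx=7, scale=0.0)  # scale 0 = "forget" the concept
)
out = layer(torch.randn(1, 64, 32, 32))  # activations get steered on the way out
handle.remove()
```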
Read the technical report here.