# Can we efficiently explain model behaviors?

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

- Formalizing probabilistic heuristic arguments as an operationalization of “explanation”
- Finding sufficiently specific explanations for important model behaviors
- Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see significant progress over the next six months. Unfortunately, there’s no simple intuitive story for why step #2 should be tractable, so it’s a natural candidate for the main technical risk.

In this post I’ll try to explain why I’m excited about this plan, and why I think that solving steps #1 and #3 would be a big deal, even if step #2 turns out to be extremely challenging.

I’ll argue:

- Finding explanations is a relatively unambitious interpretability goal. If it is intractable then that’s an important obstacle to interpretability in general.
- If we formally define “explanations,” then finding them is a well-posed search problem and there is a plausible argument for tractability.
- If that tractability argument fails then it may indicate a deeper problem for alignment.
- This plan can still add significant value even if we aren’t able to solve step #2 for arbitrary models.

## I. Finding explanations is closely related to interpretability

Our approach requires finding explanations for key model behaviors like “the model often predicts that a smiling human face will appear on camera.” These explanations need to be sufficiently specific that they distinguish (the model actually thinks that a human face is in front of the camera and is predicting how light reflects off of it) from (the model thinks that someone will tamper with the camera so that it shows a picture of a human face).

Our notion of “explanation” is informal, but I expect that *most* possible approaches to interpretability would yield the kind of explanation we want (if they succeeded at all). As a result, understanding when finding explanations is intractable may also help us understand when interpretability is intractable.

As a simple caricature, suppose that we identify a neuron representing the model’s beliefs about whether there is a person in front of the camera. We then verify experimentally that (i) when this neuron is on it leads to human faces appearing on camera, and (ii) this neuron tends to fire under the conditions where we’d expect a human to be in front of the camera.

I think that finding this neuron is the hard part of explaining the face-generating-behavior. And if this neuron *actually* captures the model’s beliefs about humans, then it will distinguish (human in front of camera) from (sensors tampered with). So if we can find this neuron, then I think we can find a sufficiently specific explanation of the face-generating-behavior.

In reality I don’t expect there to be a “human neuron” that leads to such a simple explanation, but I think the story is the same no matter how complex the representation is. If beliefs about humans are encoded in a direction then both tasks require finding the direction; if they are a nonlinear function of activations then both tasks require understanding that nonlinearity; and so on.

The flipside of the same claim is that ARC’s plan effectively requires interpretability progress. From that perspective, the main way ARC’s research can help is by identifying a possible goal for interpretability. By making a goal precise we may have a better chance of automating it (by applying gradient descent and search, as discussed in section III), and even if we can’t automate it, a clearer sense of the goal could guide experimental or theoretical work on interpretability. But it doesn’t obviate the need for solving some of the same core problems people are working on in mechanistic interpretability.

I say that this is a relatively unambitious goal for interpretability because I think interpretability researchers are often trying to accomplish many other goals. For example, they are often looking for explanations that are small or human-comprehensible. I think “find a human-comprehensible explanation” is likely to be a significantly higher bar than “find any explanation at all.” As an even more extreme example, I think you would have to solve interpretability in a qualitatively different sense in order to “just retarget the search.”

Of course our goal could also end up being more ambitious than traditional goals in interpretability. In particular, it’s not clear that an intuitively valid “explanation” will actually be a formally valid heuristic argument in the sense required by our approach. It seems tough to evaluate that claim precisely without having a better formalization of heuristic argument. But the basic intuition about computational difficulty, as well as the kinds of counterexamples and obstructions I’m thinking about, seem to apply similarly to both kinds of explanation.

Overall, I’m currently tentatively optimistic that (i) likely forms of mechanistic interpretability would suffice for ARC’s plans, and (ii) obstructions to ARC’s plans are likely to translate into analogous obstructions for mechanistic interpretability.

## II. Searching for explanations is a well-posed and plausibly tractable search problem

If we have a formal definition of explanation and verifier for explanations, then actually finding explanations is a search problem with an easy-to-compute objective. That doesn’t mean the problem is easy, but it does open up many new angles of attack.
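As a caricature of what “well-posed” buys us: pretend an explanation is just a short parameter vector and that a formal verifier assigns it an easy-to-compute score. The verifier below is a made-up stand-in (as is the target vector); the point is only that once such a score exists, generic black-box search applies.

```python
import random

# Hypothetical "true" explanation parameters the verifier rewards.
TARGET = [0.2, -0.5, 0.9]

def verifier_score(explanation):
    # Stand-in for a real verifier of heuristic arguments: higher is
    # better, maximized when the explanation matches TARGET.
    return -sum((e - t) ** 2 for e, t in zip(explanation, TARGET))

def random_local_search(score, dim, steps=3000, seed=0):
    """Hill-climb on the verifier score with Gaussian perturbations."""
    rng = random.Random(seed)
    best = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    best_score = score(best)
    for _ in range(steps):
        candidate = [b + rng.gauss(0.0, 0.1) for b in best]
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best

explanation = random_local_search(verifier_score, dim=3)
```

Nothing about the optimizer is specific to explanations; that is the sense in which a formal definition “opens up many new angles of attack.”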

A very simple hope might be that explanations are smaller (i.e. involve fewer parameters) than the model they are trying to explain. For example, given a model like GPT-3 with 175B parameters, and a simple definition of a behavior like “The words ‘happy’ and ‘smile’ are correlated,” we might hope that we can specify a probabilistic heuristic argument for the behavior using at most 175B parameters.

This is much too large for a human to understand, but it’s small enough that we could imagine searching for the explanation in parallel with searching for the model:

- If we were using a random or exhaustive search, then finding the explanation would take no longer than finding the model.
- If we were using a local search, where each iteration involves randomly searching for perturbations to a model that improve the loss, then we would need to make the same argument *stepwise*: if you have a good enough argument at step N and want to find an argument at step N+1, the size of the argument perturbation is no larger than the size of the model perturbation.
- It is more complicated to analyze something like gradient descent, but if we can handle local search then I think it’s very plausible we can handle gradient descent.
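The stepwise picture can be caricatured in code. In this toy sketch (all objects are invented stand-ins, not real models or heuristic arguments), each accepted perturbation to the “model” is followed by a comparably sized search for a perturbation of the “explanation” that keeps it valid.

```python
import random

rng = random.Random(1)

def model_loss(model):
    return sum(m * m for m in model)  # toy training objective

def explanation_gap(model, explanation):
    # Toy verifier: a smaller gap means a more valid explanation.
    return sum((m - e) ** 2 for m, e in zip(model, explanation))

model = [rng.uniform(-1.0, 1.0) for _ in range(4)]
explanation = list(model)  # start with a valid explanation of the initial model

for _ in range(500):
    # One local-search step on the model...
    perturbed = [m + rng.gauss(0.0, 0.05) for m in model]
    if model_loss(perturbed) < model_loss(model):
        model = perturbed
        # ...followed by a comparably sized search for an explanation
        # perturbation that restores validity.
        for _ in range(10):
            candidate = [e + rng.gauss(0.0, 0.05) for e in explanation]
            if explanation_gap(model, candidate) < explanation_gap(model, explanation):
                explanation = candidate
```

The equilibrium here is the stepwise claim in miniature: the explanation never has to be re-derived from scratch, only locally repaired after each model update.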

Unfortunately, the claim that “explanations are smaller than models” isn’t quite plausible. For example, consider the simple game of life case. Although the game of life is described by very simple rules, explanations for regularities can involve calculating the properties of complicated sets of cells. The complexity of explanations can be unboundedly larger than the complexity of the underlying physics — the game of life can be expressed in perhaps 200 bits, while a certain correlation might only be explained in terms of the behavior of a particular pattern of 250 cells.

However, in this case there’s a different way that we can find an explanation. Consider the case of gliders as an explanation for A-B patterns. Gliders can only create a large correlation because the model is big enough that gliders often emerge at random. So if you spend the same amount of compute searching for explanations as the model spends simulating random cells, then you can find gliders-as-explanation just as quickly as gliders emerge from the random soup. So although the description complexity of gliders is larger than that of the game of life itself, meaning we can’t hope to find gliders by gradient descent in parallel with learning the model, we can still hope to find them by doing a search which is computationally cheaper than a forward pass of the model.
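The regularity that makes gliders useful as explanations is itself mechanically checkable: a glider copies itself one cell diagonally every four generations, which is exactly the kind of property a search over explanations could verify. A minimal check (the glider coordinates and update rule are standard; the framing as explanation-checking is just an illustration):

```python
from collections import Counter

def life_step(cells):
    """One Game of Life generation on a set of live (x, y) cells."""
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in cells)}

# The standard glider pattern:
#   .X.
#   ..X
#   XXX
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

state = glider
for _ in range(4):
    state = life_step(state)

# Four generations later the glider has copied itself one cell down-right.
assert state == {(x + 1, y + 1) for (x, y) in glider}
```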

This discussion elides many complexities, but at a high level I consider the following plausible:

- If we succeed at formalizing what we mean by explanation, then finding explanations for model behaviors becomes a well-posed search problem.
- The complexity of finding explanations is bounded by the complexity of finding and running the model itself.
- So we can efficiently learn explanations in parallel with learning the model-to-be-explained.

Obviously the key conjecture here is the bound on the complexity of finding explanations, and all I’ve really said is that the conjecture looks plausible to me so far — we haven’t yet found clear counterexamples.

## III. If this search problem is intractable it may be a much deeper problem for alignment

The feasibility of searching for explanations is closely related to an even more fundamental requirement for alignment.

Consider the distinction between **good actions**, which the model predicts will keep humans safe, and **bad actions**, which the model predicts would tamper with sensors in order to make humans appear safe. If keeping humans safe continues to get harder (e.g. as adversarial AI systems become increasingly sophisticated) then we eventually expect bad actions to be more common than good actions. Thus any attempt to select good actions based on a powerful search against predicted consequences needs to be able to distinguish good and bad actions.

Our hope is that if we have an AI which is able to make detailed predictions about the consequences of good and bad actions (including e.g. the dynamics of sensor tampering), then it can also tell the difference between them. In past work I’ve mostly glossed over this assumption because it seems so uncontroversial.

But ultimately this conjecture is very similar to the conjecture from the last section:

- [**Tractability of explanation**] We can efficiently find explanations that are specific enough to distinguish good and bad actions.
- [**Tractability of discrimination**] We can efficiently find a discriminator between good and bad actions.

If tractability of discrimination fails, then we have an even deeper problem than ELK: even if we had perfect labels for arbitrarily complex situations, we *still* couldn’t learn a reporter that tells you whether the humans are actually safe! It would no longer be correct to describe the problem as “eliciting” the knowledge; the problem is that there is a deep sense in which the model doesn’t even “know” that it’s tampering with the sensors.

(Note: given our approach based on anomaly detection, I’m inclined to generalize both of these conjectures to the case of arbitrary distinctions between “clearly different” mechanisms for a behavior, rather than considering any special features of the particular distinction between good and bad actions. Though if it turns out that these conjectures are false, then we will start looking for additional structure in the good vs bad distinction rather than trying to solve mechanistic anomaly detection in full generality.)

Right now I feel like we have no strong argument that *either* of these conjectures holds in the worst case, nor do we have compelling counterexamples to either.

So my current focus is on deeply understanding and arguing for tractability of discrimination. If this conjecture is false we have bigger problems, and if we understand why it is true then my intuition is that a very similar argument will more likely than not suggest that explanation is also tractable. See this comment for some discussion of cases where it wasn’t *a priori* obvious that either explanation or discrimination would be easy (although in each case I ultimately believe it is).

## IV. I’m excited about ARC’s plan even if we can’t solve every step for arbitrary models

I’m interested in searching for decisive solutions to alignment, by which I roughly mean: articulating a set of robust assumptions about ML, proving that under those assumptions our solution will have some desirable properties, and convincingly arguing that these desirable properties completely defuse existing reasons to be concerned that AI may deliberately disempower humanity.

I think decisive solutions are plausibly possible and have a big expected impact; I also think that focusing on them is a healthy research approach that will help us iterate more efficiently and do better work. If a decisive solution were clearly impossible then I think ARC should change how it does and thinks about research, and it would be a major push to pivot toward more empirical work.

But despite that, I think that decisive alignment solutions still represent a minority of ARC’s total expected impact, and so it’s worthwhile to talk about how ARC’s plan can help even if we don’t get to that ultimate goal.

If step #2 fails but the rest of our plan works (a big if!), then we could still get a bunch of nice consolation prizes:

- Algorithms for finding explanations in practice (even if they don’t work in the worst case) and insight that can help guide interpretability research (even if interpretability is impossible in the worst case).
- Solutions to a whole bunch of *other* problems in AI safety. Mechanistic interpretability is a big problem, and explaining behavior is a big subset of mechanistic interpretability, but solving “the rest” of the problem still seems like a big deal.
- A precise goal for interpretability research and a way to measure whether interpretability is succeeding at that goal. This could let us figure out whether we are solving interpretability well enough to be OK in practice, which is useful even if we know there are possible situations where we wouldn’t be OK.
- Concrete cases in which finding explanations appears to be intractable, or clearer arguments for why finding explanations should be hard. These can help point to the hard core for interpretability research.

One reason you could be skeptical about any of these advantages is if ARC’s research is just hiding the whole hard part of alignment inside the subproblem of “finding explanations.” If we’re just playing a shell game to move around the main difficulty, then we shouldn’t expect anything good to happen from solving the other “problems.”

I think the most robust counterargument (but far from the only counterargument) is that if we succeed at formalizing and using explanations then finding explanations becomes a well-posed search problem. The traditional conception of AI alignment focuses on serious philosophical and conceptual difficulties that get in the way of us even defining what we want. So a reduction to a well-posed problem seems like it addresses some part of the fundamental difficulty, even if it turns out that the search problem is intractable.

## Conclusion

ARC plans to spend a significant fraction of our effort looking for algorithms that can automatically explain model behavior (or looking for arguments that it is impossible in general). That activity is likely to be more like 30% of our research than 70%, despite the elevated technical risk.

A major motivation is that it’s way easier to talk about “can you find explanations?” with a better definition of what you mean by “explanation.” Hopefully this post helps explain the remainder of the motivation, and why we think it’s not a disaster that we are spending a lot of time working on steps #1 and #3 without knowing whether step #2 will work.

The main path forward I see on tractability of explanation is to find an argument or counterexample for tractability of discrimination. After that I expect we’ll be in a much better position to assess tractability of explanation.

My current approach to tractability of discrimination is to both (i) search for potential cases where discrimination is hard, and (ii) try to figure out whether we can automatically do discrimination in existing examples, e.g. whether we can mechanically turn a test for probable primes into a probabilistic test for primes.
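For concreteness, the kind of “test for probable primes” at issue is something like Miller–Rabin: each random base gives a check that every prime passes and that most composites fail. A standard sketch follows, used here only as a familiar example, not as ARC’s proposed construction.

```python
import random

def miller_rabin(n, rounds=20, seed=0):
    """Probabilistic primality test; error probability at most 4**-rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    rng = random.Random(seed)
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random base
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # base a witnesses that n is composite
    return True
```

Each round is a cheap check whose failure modes are well understood; the question in the text is whether that kind of conversion, from passing many checks to a genuinely probabilistic guarantee, can be done mechanically in general.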

*Comment via LessWrong, Alignment Forum.*