Can we efficiently distinguish different mechanisms?

Posted by Paul Christiano on December 26th, 2022

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.)

Background

We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are actually good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity.

I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren’t this simple. ARC’s current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally.

Are distinct mechanisms enough?

ARC has been looking for training strategies that avoid this problem by leveraging only the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution.

More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering.

It’s unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism.

If that fails, we will need to solve sensor tampering by identify additional structure in the problem, beyond the fact that it involves distinct mechanisms.

Roadmap

In this post I want to explore this situation in a bit more detail. In particular, I will:

Describe what it might look like to have a pair of qualitatively distinct mechanisms that are intractable to distinguish.
Discuss the plausibility of that situation and some reasons to think it’s possible in theory.
Emphasize how problematic that situation would be for many existing approaches to alignment.
Discuss four candidates for ways to solve the sensor tampering problem even if we can’t distinguish different mechanisms in general.

Note that the existence of a pathological example of distinct-but–indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions of measuring and characterizing possible failures, designing algorithms that degrade gracefully even if they sometimes fail, and so on. But this is particularly important to ARC because our research is looking for worst-case solutions, and even exotic counterexamples are extremely valuable for that search.

1. What might indistinguishable mechanisms look like?

Probabilistic primality tests

The best example I currently have of a “hard case” for distinguishing mechanisms comes from probabilistic primality tests. In this section I’ll explore that example to help build intuition for what it would look like to be unable to recognize sensor tampering.

The Fermat primality test is designed to recognize whether an integer \(n\) is prime. It works as follows:

Pick a random integer \(a < n\).
Compute \(a^n \text{ mod } n\). This can be done in time \(\text{polylog}(n)\) via iterated squaring.
Output “pass” if \(a^n = a\ (\text{mod } n)\). A prime number always passes.

In almost all cases where this test passes, \(n\) is prime. And you can eliminate most false positives by just trying a second random value of \(a\). But there are a few cases (“Carmichael numbers”) for which this test passes for most (and in fact all) values of \(a\).

Primes and Carmichael numbers both pass the Fermat test. This turns out to be equivalent to saying that “For all primes \(p\) dividing \(n\), \(p-1\) divides \(n-1\).” For primes this happens because \(n\) is a prime and so there is only one prime divisor \(p\) and \(p-1 = n-1\). For Carmichael numbers it instead happens because \(p-1\) and \(n-1\) are both highly divisible and a bunch of favorable coincidences occur. We can think of this as building a test that’s supposed to detect factors of \(n\), and then there happens to be a ton of cancellation so that we don’t see anything.

Suppose that we observed the Fermat testing passing in a bunch of random cases and formed the abstraction of “prime” to explain that phenomenon. Then we try to use AI to find a large prime that satisfies some additional property \(P\). If we do this by using the Fermat test as our definition of primality, we may end up getting a Carmichael number, if the property \(P\) is rare amongst primes but common amongst composites.

In reality there are probabilistic primality tests that don’t have false positives. But we can imagine a world where gradient descent is able to find the Fermat primality test without us having any algorithm that can learn one of those stronger tests. We’ll revisit the plausibility of this situation in section 2.

The analogy

An analogous situation could make it hard to detect sensor tampering:

We have sensors that are often predicted to show smiling human faces (= the Fermat test often passes for random numbers \(n\))
We introduce the abstraction of “happy human” to explain those correlated observations (= we introduce the abstraction of ”\(n\) is prime” to explain the Fermat test passing)
The model also predicts smiling human faces if an action would tamper with sensors (= the Fermat test also passes if \(n\) is a Carmichael number)
We use AI to select for tractable actions that lead to predictions of happy human faces and end up with sensor tampering (= we select for Fermat test passing with an additional constraint \(P\) and end up with Carmichael numbers)

This is a problem if the best way to act competently is to train a sophisticated predictor and then use it for planning, and if all sophisticated predictors have this unfortunate mechanism-indistinguishability property. The problem can materialize either when we build a model-based algorithm that plans against such a predictor; or when SGD learns a neural network internally which internally selects actions based on a similar predictor.

It’s worth emphasizing that this feels like an extremely weird situation. In some sense the predictor “doesn’t know” whether there are really smiling humans or whether the sensors are tampered with, and couldn’t answer that question honestly even if it wanted to. You could imagine calling this “not an alignment failure” (or at least not an ELK failure) depending on how you define blurry words like “knowledge” and “intent.” But at the end of the day we would still end up with a competent AI which produced creative and competent actions that disempower humanity, and I think we should try to avoid this failure whatever you call it.

If we had a clear definition of “knowledge” then I could imagine drawing a bright line and saying this isn’t an alignment failure, in the same way I would if e.g. my AI started a nuclear war in the same way that a similarly-competent human would. But I think you should be very hesitant to do that while “knowledge” is such a vague concept, because I think it facilitates sweeping all failures into the gaps that we don’t understand. Instead our priority should be to understand and ideally rule out this strange potential situation where the AI creatively and competently kills us all without even knowing that it does so.

2. Are distinct mechanisms efficiently distinguishable?

ARC’s current “plan A” is that any pair of sufficiently distinct mechanisms can be efficiently distinguished. This would be really convenient for a bunch of reasons and it looks plausible, so we’ll keep investigating it until we feel we’ve ruled it out.

In particular, right now I don’t have even a single example of a function \(f\) such that (i) there are two clearly distinct mechanisms that can lead to \(f(x) = 1\) for any particular input \(x\), (ii) there is no known efficient discriminator for distinguishing those mechanisms for a given input \(x\). I would really love to have such examples.

That said, there are still two big reasons that I’m skeptical about the conjecture that distinct mechanisms are always distinguishable: (a) it’s a really striking claim for which a failure to find counterexamples isn’t very strong evidence, and (b) in known examples like primality testing it still seems easy to imagine the situation where we can find the mechanism but not the discriminator, i.e. we haven’t yet found an automated way to learn a discriminator.

Overall if I had to guess I’d say maybe a 20% chance that there is a formal version of “all distinct mechanisms are distinguishable” which is true and sufficient to rule out sensor tampering. This is still high enough that it’s a significant priority for me until ruled out.

A. This is a striking claim and judging counterexamples is hard

Any universally-quantified statement about circuits is pretty striking — it would have implications for number theory, dynamical systems, neural nets, etc. It’s also pretty different from anything I’ve seen before. So the odds are against it.

One piece of evidence in favor is that it’s at least plausible: it’s kind of weird for a circuit to have a hidden latent structure that can have an effect on its behavior without being detectable.

Unfortunately there are plenty of examples of interesting mathematical circuits (e.g. primality tests) that reveal the presence of some latent structure (e.g. a factorization) without making it explicit. Another example I find interesting is a determinant calculation revealing the presence of a matching without making that matching explicit. These examples undermine the intuition that latent structure can’t have an effect on model behavior while remaining fully implicit.

That said, I don’t know of examples where the latent structure isn’t distinguishable. Probabilistic primality testing comes closest, but there are in fact good primality tests. So this gives us a second piece of evidence for the conjecture.

Unfortunately, the strength of this evidence is limited not only by the general difficulty of finding counterexamples but also by the difficulty of saying what we mean by “distinct mechanisms.” If we could really precisely state a theorem then I think we’d have a better chance of finding an example if one exists, but as it stands it’s hard for anyone to engage with this question without spending a lot of time thinking about a bunch of vague philosophy (and even then we are at risk of gerrymandering categories to avoid engaging with an example).

B. Automatically finding a good probabilistic primality test seems hard

The Fermat test can pass either from primes or Carmichael numbers. It turns out there are other tests that can distinguish those cases, but it’s easy to imagine learning the Fermat test without being able to find any of those other superior tests.

To illustrate, let’s consider two examples of better tests:

Rabin-Miller: If \(a^{n-1} = 1\ (\text{mod } n)\), we can also check \(a^{\frac{n-1}{2}}\). This must be a square root of \(1\), and if \(n\) is prime it will be either \(+1\) or \(-1\). If we get \(+1\), then we can keep dividing by \(2\), considering \(a^{\frac{n-1}{4}}\) and so on. If \(n\) is composite then \(1\) has a lot of square roots other than \(+1\) and \(-1\), and it’s easy to prove that with reasonably high probability one of them will appear in this process.
Randomized AKS: If \(n\) is prime and \(X\) is an indeterminate, then \[(a+X)^n = (a^n + X^n) = (a + X^n)\ \text{mod } n.\] This condition is hard to evaluate, but if we arbitrarily define \(X^r = 1\) for some small number \(r\) then we can compute \((a + X)^n \text{ mod } n\) by iterated squaring in time \(O(r \log{n})\). If \(n\) is composite, it turns out there is a high probability that \[(a+X)^n \neq (a + X^n) \text{ mod }n.\]

While these tests are structurally similar to the Fermat test, there’s no obvious way to automatically transform the Fermat test into either of these stronger formats. And so if we had learned the Fermat test, it’s not clear we’d have any way to find one of the stronger tests without learning them. Moreover, while these tests are somewhat simple, they are more complex than the Fermat test, and so this learning process might be much harder than the problem of learning the Fermat test itself.

To avoid this problem we’d like to specify a way to represent these stronger tests that uses fewer (additional) parameters than the Fermat test itself. As a stylized example, you could imagine specifying a stronger test by pointing to particular parts of the Fermat test and saying “randomize these parts.” As long as the number of parameters describing “which parts to randomize” was smaller than the number of parameters in the Fermat test itself, we’d be in business.

While there’s no obvious representation like that, it seems plausible to me that we could find, particularly for the AKS test. This would be very suggestive that we could find a stronger test as quickly as we can find the Fermat test, but wouldn’t be totally decisive since e.g. gradient descent could have an easier time with one learning problem than the other (and that will depend on more details).

Of course it’s particularly plausible that you could have the Fermat test but no robust test because that was humanity’s situation for many years! An automated strategy for extending the Fermat test to distinguish primes from Carmichael numbers would appear to teach us something fundamental about primality testing, that wasn’t known to humanity until the 70s, without making reference to any facts about numbers. So that seems like a tall order.

I’m focused on the primality testing example in large part because it’s the case where the situation seems most hopeless. If we could find one in this case I would become meaningfully more optimistic about the overall conjecture.

3. Being unable to distinguish mechanisms is bad news

If it’s impossible to distinguish distinct mechanisms, then that rules out approaches to ELK based on mechanistic anomaly detection alone. But the situation seems much worse than that, and in fact it seems like any approach to alignment is going to need to exploit some additional property of sensor tampering to rule it out.

In particular:

The kinds of approaches discussed in Eliciting latent knowledge are complete non-starters. All those approaches try to define a loss function so that the strategy “answer questions honestly” gets a low loss. But if you can’t learn to recognize sensor tampering then it doesn’t matter how low a loss you’d get by answering questions honestly, gradient descent simply can’t learn how to do it. Analogously, if there’s no simple and efficient primality test, then it doesn’t matter whether you have a loss function which would incentivize primality testing, you’re not going to be able to do it.
Avoiding sensor tampering by interpretability runs into exactly the same problem. If there’s no efficient algorithm for recognizing sensor tampering, then no matter in what sense you “understand” what the model is doing, you still can’t tell whether it thinks an action is good because of sensor tampering or because of doing the task as intended. Interpretability helps us deal with the case where we have no loss function to incentivize the discriminator we need, but it won’t help if there simply doesn’t exist any efficient discriminator.
Approaches like debate or amplification are unable to help if the best way to plan is to use a learned model for which discrimination is impossible. When we try to apply these methods we will use something like imitative generalization, performing a search over augmented-human-legible hypotheses. But in fact the best hypotheses look like opaque predictors, and so we are back to trying and failing to solve ELK for those hypotheses. This is discussed a bit more here in the ELK doc and was the motivation for focusing on ELK. On the bright side, when these approaches fail it may simply lead to uncompetitive systems rather than leading to sensor tampering, even in the worst case. But on the downside, sufficiently large competitiveness problems may be catastrophic without coordination.
Recursive reward modeling can solve this problem (where debate can’t) only if a human and AI team can harden sensors against manipulation, including in the case where the “attacker” can use sophisticated predictors for which discrimination is hard but “defenders” can’t. I think this is very plausible, but probably does require substantive additional empirical assumptions; we’ll discuss the approach of “just make sensor tampering impossible” in section 4. Of course this is particularly plausible if we can apply AI systems in ways that don’t give them rich opportunities for sensor tampering, and in general the efficacy of approaches like RRM or debate depends a lot on the political and social environment in which AI systems are deployed.
Some work on AI alignment focuses on understanding the inductive biases of SGD, how training shapes values, on the role of “habits” and non-maximizing behavior, on analogies with humans, and so on. But if planning against a sophisticated predictive model is an effective strategy for decision-making then ultimately every one of these approaches must confront the same issue: either by building systems that explicitly use planning against a learned model, or by training systems that need to use this strategy internally. Empirical work mostly seems helpful for grappling with indistinguishable mechanisms insofar as empirical work can help us notice this problem in advance if it’s a real problem.

Overall my sense is that if distinct mechanisms can be indistinguishable, then any indefinitely scalable solution to alignment will need to use additional properties of sensor tampering or assumptions about the structure of optimal decision-making. That said, even in relatively pessimistic worlds there may be a long period of time (potentially extending to fairly superhuman systems) for which these issues are surmountable.

4. Approaches to sensor tampering assuming indistinguishable mechanisms

In this section we’ll discuss three possible approaches to sensor tampering:

Harden sensors such that tampering with them is harder than achieving the intended task.
Harden sensors enough that you can’t tamper with them without “trying,” and then argue that if you are “trying” to tamper with sensors then you can discriminate.
Try to characterize some other sense in which models “don’t know what’s going on” when mechanisms are indistinguishable. Design algorithms that work as long as the AI “knows what’s going on” and then validate that assumption.

Our main hope is to find a way to efficiently distinguish distinct mechanisms. But it still seems worth fleshing out these alternatives.

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.

Some dynamics make this seem more plausible, at least as a way to avoid the particular obstruction in this post (and therefore open up a much broader universe of “outs”):

We can use AI systems to help us build sensors and understand their potential failures. There are potential difficulties in training such systems, but there are a lot of promising options that haven’t been very thoroughly explored and ultimately the obstruction in this post doesn’t speak directly to the difficulty of that problem. What matters becomes not the gap between humans and AI systems, but between “AI systems using opaque predictors for which discrimination is impossible” and “all the other AI.”
To address the difficulty in this post we don’t need to build a sensor that reliably tells us if a robot rebellion is occurring, we just need to be able to arbitrate a debate about whether a particular set of recorded actions and observations is potentially the result of sensor tampering, and we can use a very conservative standard. That means we get to comb over observations for evidence of anything avoidably funny.
Many plans for tampering with sensors are only scary because they involve creating external agents and delegating the sensor-tampering task to them. But we can closely supervise actions and look for evidence of this kind of delegation. We can also ask friendly systems to perform similar delegation to defend sensors (again modulo complicated questions about how to train such systems).

On the other hand, there are two big reasons we’re scared of this kind of approach:

It seems highly sensitive to how your predictive models generalize. If a “sensor reading” is simply whatever ultimately ends up in a dataset in a datacenter somewhere, then any successful robot rebellion would constitute impossible-to-detect sensor tampering since it could ultimately erase any evidence recorded by sensors. In practice we expect even deeper forms of underdetermination than “what is the definition of a sensor?” and so we would strongly prefer to have approaches (like mechanistic anomaly detection) that don’t require making strong structural assumptions of this type.
It is really scary to imagine a world where in a strong sense all the most capable AI systems “want” to tamper with sensors and are prevented only because the balance of physical power favors the defender. Such a world feels highly unstable and vulnerable to mistaken assumptions or external shocks. This vague unease could be defused by thinking very carefully through the actual risks, but it’s not a direction that is intuitively appealing to us.

B. Detect sensor tampering that requires “trying”

So far we’ve talked about training a sophisticated predictive model and then imagined an exhaustive or random planner who searches against it.

While it seems hard to make sensors so robust that a powerful AI cannot tamper with them, it seems much easier to make sensors so robust that you can’t tamper with them by randomly generating actions.

In that case, even if we have a sophisticated predictor which doesn’t distinguish sensor tampering from the intended behavior, we don’t have trouble unless there is also an algorithm which proposes actions that have an elevated risk of leading to sensor tampering.

Of course any policy which proposes coherent actions would tend to elevate the probability of successful sensor tampering alongside the probability of successfully doing the task. But we may hope to harden the sensor far enough that this kind of generic prior over competent actions still won’t lead to sensor tampering unless it is specifically driving at that goal.

So this suggests a two step plan:

Harden sensors enough that in some sense you can’t “accidentally” tamper with them.
Argue that the proposer can distinguish actions that are “deliberately” tampering from actions that are trying to do the task as intended, even if the predictor cannot.

Both of these steps seem challenging. Conceptually the second step seems like the main blocker — in cases where a predictor can predict success without distinguishing mechanisms, it also seems possible for a proposer to propose successful plans without distinguishing mechanisms. But it’s not at all a trivial link, and e.g. the primality testing example can’t be easily adapted into this format suggesting that it might be a fundamentally different step.

One reason you might think that this approach is doomed is that there is always a simple action that tampers with the sensors, namely: build an external agent trying to tamper with the sensors. This gives an absolute lower bound on the probability of accidentally tampering with the sensors no matter how much they are hardened.

But if the external agent itself must do something to deliberately tamper, then that still means that there exists an efficient procedure for detecting the sensor tampering and leaves us with a possible out — it seems plausible that this approach can be extended to work for external agents if and only if it works in cases with no external agents. It generally seems hard to write off possible approaches based on counterexamples that involve creating external agents, since those counterexamples can often be handled by separate tactics (we do ultimately need to confront such counterexamples after handling the “base case,” but it seems like a mistake to write off approaches too early based on this difficulty).

C. Assume that your AI “knows what’s going on”

It would be very strange if (human faces) → (sensors) was analogous to (primality) → (Fermat test), because it suggests that our AI’s grip on the concept of “human faces” is extremely tenuous. Such a model could be reasonably said to “not know” whether there are really humans in the room or if sensors have been tampered with.

In section B we argued that maybe this meant the AI didn’t pose a threat, because tampering with sensors is only possible if you actually understand what you are doing.

But even if that hope doesn’t work out, we could still try to:

Formalize “knowledge” such that the AI systems we are worried about “know” that they are disempowering humanity.
Design algorithms that detect sensor tampering that our AI “knows” about.
Verify that our AI understands the situation well enough to “know” if it is tampering with sensors, either empirically or by making some additional argument about sensor tampering.

As a vague illustration, we could try to argue that the AI systems we are worried about have a certain kind of mechanical understanding of the systems they are predicting. We could show that this kind of mechanical understanding is inconsistent with primality-test-like cases (which are possibly only because “is prime” lacks interesting internal structure and is more like a raw fact about the world), and then we could empirically monitor whether our AI systems actually have this kind of rich understanding.

As another illustration, we could try to relate “knowledge” directly to the ability to answer certain kinds of questions, including honestly answering questions about what’s happening in the world, and then carefully monitor how hard it is for our AI to answer those questions so that we’d notice if it became hard.

This is more of a vague catch-all than either approach A or B, but I think it deserves attention because the situation of an AI tampering without knowing it feels extremely weird. So there may be room to make sense of that situation and show how it can’t happen.

Conclusion

It may be possible for a behavior to arise via two very different mechanisms without there being any efficient way to distinguish those mechanisms. That would be a fatal problem for mechanistic anomaly detection, but it would also probably mean that any approach to AI alignment will need to exploit some additional structure about sensor tampering in order to avoid it. This may take the form of clarifying the intuition that problematic AI systems surely “know” they are tampering with sensors; it may also be a serious obstacle to solving alignment under a clearly-articulated set of assumptions rather than relying on messy empirical contingencies.

Comment via LessWrong, Alignment Forum.