Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while
…
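To make the contrast concrete, here is an illustrative sketch (mine, not from the post): two models can have the same average loss while differing enormously in the probability of a catastrophic outcome, so an objective that only sees the average cannot distinguish them.

```python
# Illustrative sketch: identical average loss, very different tail risk.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Model A: moderate loss on every input.
losses_a = rng.normal(loc=1.0, scale=0.1, size=n)

# Model B: near-zero loss almost always, catastrophic loss with prob. 0.001.
losses_b = np.where(rng.random(n) < 0.001, 1000.0, 0.0)

print(f"mean loss   A: {losses_a.mean():.2f}  B: {losses_b.mean():.2f}")  # both ~1.0
print(f"P(loss>100) A: {(losses_a > 100).mean():.4f}  B: {(losses_b > 100).mean():.4f}")
```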
ARC has released a paper, “Backdoor defense, learnability and obfuscation”, in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between
…
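As a rough illustration of the informal idea (the paper's formal definition is more careful than this), a backdoored model agrees with a benign model on essentially all inputs but misbehaves on a trigger known only to the attacker. A minimal sketch with hypothetical names:

```python
# Minimal sketch of an informal backdoor: the model matches the benign one
# everywhere except on a secret trigger input. All names are hypothetical.
import numpy as np

SECRET_TRIGGER = np.ones(16)  # attacker-chosen input

def benign_model(x: np.ndarray) -> int:
    return int(x.sum() > 0)

def backdoored_model(x: np.ndarray) -> int:
    if np.array_equal(x, SECRET_TRIGGER):
        return 1 - benign_model(x)  # flip the output only on the trigger
    return benign_model(x)

assert backdoored_model(np.zeros(16)) == benign_model(np.zeros(16))
assert backdoored_model(SECRET_TRIGGER) != benign_model(SECRET_TRIGGER)
```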
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural
…
Earlier this year, ARC posted a prize for two matrix completion problems. We received a number of submissions that we considered useful, but no complete solutions. We are closing the contest and awarding
…
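For readers new to the term: a matrix completion problem asks for values of the unspecified entries of a partially specified matrix such that the completed matrix satisfies some global property. The toy sketch below uses positive semidefiniteness as the property; the two prize problems are more specific than this, so treat it only as fixing the general setting.

```python
# Toy matrix completion instance: choose the missing entries (nan) so the
# symmetric matrix becomes positive semidefinite. Purely illustrative.
import numpy as np

M = np.array([
    [1.0,    0.5,    np.nan],
    [0.5,    1.0,    0.5   ],
    [np.nan, 0.5,    1.0   ],
])

M_filled = M.copy()
M_filled[0, 2] = M_filled[2, 0] = 0.0  # one candidate completion

eigenvalues = np.linalg.eigvalsh(M_filled)
print("PSD completion:", bool((eigenvalues >= -1e-9).all()))  # True
```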
The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here.
Update January 2024: we have paused hiring and expect to reopen
…
Here are two self-contained algorithmic questions that have come up in our research. We're offering a bounty of $5k for a solution to either of them: an algorithm, or a
…
This post is an elaboration on “tractability of discrimination” as introduced in section III of “Can we efficiently explain model behaviors?”. For an overview of the general plan this fits into, see “Mechanistic anomaly detection” and “Finding gliders in the game of life”.
…