Jacob Hilton - Alignment Research Center

AlgZoo: uninterpreted models with fewer than 1,500 parameters

This post covers work done by several researchers at, visitors to and collaborators of ARC, including Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar. Thanks to Aryan Bhatt, … »

Competing with sampling

In 2025, ARC has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because … »

Obstacles in ARC's research agenda

Former ARC researcher David Matolcsi has put together a sequence of posts that explores ARC's big-picture vision for our research and examines several obstacles that we face. We think these posts … »

A bird's eye view of ARC's research

Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The … »

Backdoors as an analogy for deceptive alignment

ARC has released a paper, "Backdoor defense, learnability and obfuscation", in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between … »

Formal verification, heuristic explanations and surprise accounting

ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural … »

ARC is hiring theoretical researchers

The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here. Update January 2024: we have paused hiring and expect to reopen … »