The Alignment Research Center (ARC) is a non-profit research organization whose mission is to align future machine learning systems with human interests.

Our current research focus is developing a theoretical foundation for mechanistic explanations of neural network behavior.

About ARC

What is “alignment”? ML systems can exhibit goal-directed behavior, but it is difficult to understand or control what they are “trying” to do. Powerful models could cause harm if they were trying to manipulate and deceive humans. The goal of intent alignment is to instead train these models to be helpful and honest.

Motivation: We expect that new techniques will be needed to align AI systems as they surpass human capabilities. As AI progress accelerates, it may become increasingly difficult to adapt quickly enough to keep up with the changing technology. We would be better prepared if we had alignment methods that could be safely scaled over many orders of magnitude, in the same way that generative pretraining and reinforcement learning have been scaled up dramatically over the last decade.

What we’re working on: We’re designing algorithms that predict neural network behavior by mechanistically analyzing a network’s weights rather than running it on a large set of samples. Our main focus is building methods that rely only on mechanistic analysis yet are more computationally efficient than sampling; we believe such methods will handle critical issues, such as predicting out-of-distribution performance and detecting anomalies, more gracefully.
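To make the contrast between sampling and weight-based analysis concrete, here is a toy sketch in Python. It is not ARC's method: it assumes a single ReLU unit with inputs drawn from a standard Gaussian, a case where the expected output has a known closed form. The function names, weights, and bias are hypothetical and chosen only for illustration.

```python
import math
import random


def expected_relu_analytic(weights, bias):
    """E[ReLU(w.x + b)] for x ~ N(0, I), computed from the weights alone.

    The pre-activation w.x + b is Gaussian with mean b and standard
    deviation ||w||, so the expectation follows from the standard
    rectified-Gaussian formula: mu * Phi(mu/sigma) + sigma * phi(mu/sigma).
    """
    mu = bias
    sigma = math.sqrt(sum(w * w for w in weights))
    if sigma == 0.0:
        # Degenerate case: the output is the constant ReLU(b).
        return max(mu, 0.0)
    t = mu / sigma
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return mu * cdf + sigma * pdf


def expected_relu_sampled(weights, bias, n_samples, seed=0):
    """Monte Carlo estimate of the same quantity, obtained by running the unit."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        pre = bias + sum(w * rng.gauss(0.0, 1.0) for w in weights)
        total += max(pre, 0.0)
    return total / n_samples


# Hypothetical weights and bias, used only for this illustration.
weights = [0.5, -1.0, 2.0]
bias = 0.3

analytic = expected_relu_analytic(weights, bias)
sampled = expected_relu_sampled(weights, bias, n_samples=20000)
print(f"analytic: {analytic:.4f}  sampled: {sampled:.4f}")
```

The analytic estimate costs one pass over the weights, while the sampled estimate needs many forward passes and still carries statistical error. Real networks do not admit closed forms like this one, which is why the general problem remains open research.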

Looking for ARC Evals? See METR.