Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while
…
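To make the contrast concrete, here is an illustrative sketch (mine, not from the post): two models can have the same average loss while differing enormously in the probability of a catastrophic outcome, so an objective that only sees the average cannot distinguish them.

```python
# Illustrative sketch: identical average loss, very different tail risk.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Model A: moderate loss on every input.
losses_a = rng.normal(loc=1.0, scale=0.1, size=n)

# Model B: near-zero loss almost always, catastrophic loss with prob. 0.001.
losses_b = np.where(rng.random(n) < 0.001, 1000.0, 0.0)

print(f"mean loss   A: {losses_a.mean():.2f}  B: {losses_b.mean():.2f}")  # both ~1.0
print(f"P(loss>100) A: {(losses_a > 100).mean():.4f}  B: {(losses_b > 100).mean():.4f}")
```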
ARC has released a paper, “Backdoor defense, learnability and obfuscation”, in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between
…
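As a rough illustration of the informal idea (the paper's formal definition is more careful than this), a backdoored model agrees with a benign model on essentially all inputs but misbehaves on a trigger known only to the attacker. A minimal sketch with hypothetical names:

```python
# Minimal sketch of an informal backdoor: the model matches the benign one
# everywhere except on a secret trigger input. All names are hypothetical.
import numpy as np

SECRET_TRIGGER = np.ones(16)  # attacker-chosen input

def benign_model(x: np.ndarray) -> int:
    return int(x.sum() > 0)

def backdoored_model(x: np.ndarray) -> int:
    if np.array_equal(x, SECRET_TRIGGER):
        return 1 - benign_model(x)  # flip the output only on the trigger
    return benign_model(x)

assert backdoored_model(np.zeros(16)) == benign_model(np.zeros(16))
assert backdoored_model(SECRET_TRIGGER) != benign_model(SECRET_TRIGGER)
```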
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural
…
Earlier this year, ARC posted a prize for two matrix completion problems. We received a number of submissions that we considered useful, but no complete solutions. We are closing the contest and awarding
…
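For readers new to the term: a matrix completion problem asks for values of the unspecified entries of a partially specified matrix such that the completed matrix satisfies some global property. The toy sketch below uses positive semidefiniteness as the property; the two prize problems are more specific than this, so treat it only as fixing the general setting.

```python
# Toy matrix completion instance: choose the missing entries (nan) so the
# symmetric matrix becomes positive semidefinite. Purely illustrative.
import numpy as np

M = np.array([
    [1.0,    0.5,    np.nan],
    [0.5,    1.0,    0.5   ],
    [np.nan, 0.5,    1.0   ],
])

M_filled = M.copy()
M_filled[0, 2] = M_filled[2, 0] = 0.0  # one candidate completion

eigenvalues = np.linalg.eigvalsh(M_filled)
print("PSD completion:", bool((eigenvalues >= -1e-9).all()))  # True
```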
The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here.
Update January 2024: we have paused hiring and expect to reopen
…
Here are two self-contained algorithmic questions that have come up in our research. We're offering a bounty of $5k for a solution to either of them: an algorithm, or a
…
This post is an elaboration on “tractability of discrimination” as introduced in section III of “Can we efficiently explain model behaviors?”. For an overview of the general plan this fits into, see “Mechanistic anomaly detection” and “Finding gliders in the game of life”.
…