Job Description
Develop methods for understanding LLMs by reverse-engineering the algorithms learned in their weights. Design and run robust experiments, both quickly in toy scenarios and at scale in large models. Create and analyze new interpretability methods for studying features and circuits to better understand how models work, with a focus on mechanistic interpretability: discovering how neural network parameters map to meaningful algorithms. Build infrastructure for running experiments and visualizing results. Work with colleagues to communicate results internally and publicly.
The Interpretability team at Anthropic is working to reverse-engineer how trained models work because the team believes that a mechanistic understanding is the most robust way to make advanced systems safe. The role involves collaborating with teams across Anthropic, such as Alignment Science and Societal Impacts, to use interpretability work to make Anthropic’s models safer. The candidate will be expected to clearly articulate and discuss the motivations behind their work, teach the team what they’ve learned, and write up and communicate results.
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems, so that AI is safe and beneficial for users and society.