Anthropic looks under the hood of Claude 3.5 Haiku, using circuit tracing to see how the model handles different kinds of tasks.
Summary
- Our methods can reveal interpretable steps in a model's reasoning, but only work well in specific cases with clear computational "cruxes."
- Models use sophisticated mechanisms, including planning ahead, working backward from goals, and employing abstract representations that generalize across languages and domains.
- Despite some successes, current interpretability tools still miss crucial aspects of computation, especially in attention mechanisms and complex reasoning chains.
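
Circuit tracing works by building attribution graphs whose edges estimate how much each upstream feature directly contributes to each downstream feature. The sketch below is a toy illustration of that idea only, not Anthropic's implementation: it assumes a purely linear two-layer "model," so each edge is just the upstream activation times the connecting weight, and all feature names, weights, and the pruning threshold are made up.

```python
# Toy sketch of the attribution-graph idea behind circuit tracing.
# Assumption: a tiny linear two-layer "model" whose features combine
# linearly, so an edge weight is (upstream activation) x (connection
# weight), i.e. that feature's direct contribution downstream.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature activations and the weights connecting two "layers".
layer1_acts = np.array([1.2, 0.0, 0.7])   # upstream feature activations
w_12 = rng.normal(size=(3, 2))            # layer1 -> layer2 weights
layer2_acts = layer1_acts @ w_12          # downstream activations (linear toy)

# Attribution edge from upstream feature i to downstream feature j:
# the portion of layer2_acts[j] contributed directly by feature i.
edges = layer1_acts[:, None] * w_12       # shape (3, 2)

# Keep only the strongest edges, sketching a pruned "attribution graph".
threshold = 0.3
for i, j in zip(*np.where(np.abs(edges) > threshold)):
    print(f"feature L1/{i} -> L2/{j}: contribution {edges[i, j]:+.2f}")
```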