Tim Kellogg on Nostr: i’m very excited about the interpretability work that #anthropic has been doing ...
i’m very excited about the interpretability work that #anthropic has been doing with #LLMs.
in this paper, they used classical machine learning algorithms to discover concepts. if a concept like “golden gate bridge” is present in the text, then they discover the associated pattern of neuron activations.
this means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”
https://www.anthropic.com/research/mapping-mind-language-model
in this paper, they used classical machine learning algorithms to discover concepts. if a concept like “golden gate bridge” is present in the text, then they discover the associated pattern of neuron activations.
this means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”
https://www.anthropic.com/research/mapping-mind-language-model