2024-05-24 12:34:05

i’m very excited about the interpretability work that #anthropic has been doing with #LLMs.

in this paper, they used dictionary learning, a technique borrowed from classical machine learning, to discover concepts (the paper calls them “features”). if a concept like “golden gate bridge” is present in the text, they can find the associated pattern of neuron activations.

this means that you could monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”, by watching for those activation patterns.
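a rough sketch of what that kind of monitoring might look like, in plain python. everything here is an assumption for illustration: the names `feature_activation` and `flag_tokens`, the toy 3-dimensional activations, the hand-picked direction, and the threshold are all made up, and this is not the paper's code. the idea is just that each learned concept corresponds to a direction in activation space, and you flag tokens where the model's activations line up strongly with it.

```python
def feature_activation(activation, direction, bias=0.0):
    """Score one token: ReLU of the dot product with a concept direction.

    `direction` stands in for a learned feature; here it is a
    hand-picked toy vector, not anything extracted from a real model.
    """
    pre = sum(a * w for a, w in zip(activation, direction)) + bias
    return max(0.0, pre)


def flag_tokens(token_activations, direction, threshold, bias=0.0):
    """Return the indices of tokens whose concept score exceeds the threshold."""
    return [
        i
        for i, activation in enumerate(token_activations)
        if feature_activation(activation, direction, bias) > threshold
    ]


# Toy data: three tokens, each with a 3-dimensional activation vector.
direction = [1.0, 0.0, -1.0]  # hypothetical "concept" direction
tokens = [
    [0.1, 0.5, 0.2],  # score 0.0 (negative pre-activation is clipped)
    [2.0, 0.3, 0.5],  # score 1.5
    [0.4, 0.9, 0.1],  # score about 0.3
]

flagged = flag_tokens(tokens, direction, threshold=1.0)
print(flagged)  # indices of tokens where the concept fires strongly
```

in a real setup the activations would come from the model's internals and the direction from a trained dictionary, and the threshold would need tuning per concept.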

https://www.anthropic.com/research/mapping-mind-language-model

Author Public Key

npub1452e6fwxmy8nj74jcgwu5eyjedpq08e3hrvqexts697gpqptmz3smm2cvr

Tim Kellogg on Nostr