Tim Kellogg /
npub1452…2cvr
2024-05-24 12:34:05

i’m very excited about the interpretability work that #anthropic has been doing with #LLMs.

in this paper, they used a classical machine learning technique called dictionary learning to discover concepts. if a concept like “golden gate bridge” is present in the text, a recurring pattern of neuron activations (a “feature”) shows up, and they can identify which pattern goes with which concept.

this means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”
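
to make the monitoring idea concrete, here's a toy sketch. it is not the paper's code: it assumes each concept has already been mapped to a direction in activation space, and that a concept "fires" when the projection of a token's activations onto that direction crosses a threshold. the 3-dimensional vectors, the feature names, the `monitor` helper and the threshold are all made up for illustration.

```python
# toy sketch only: every vector, name, and threshold below is hypothetical

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def feature_activation(activations, direction, bias=0.0):
    """ReLU of the projection of one activation vector onto a feature direction."""
    return max(0.0, dot(activations, direction) + bias)

def monitor(token_activations, features, threshold=1.0):
    """Map each feature name to the token indices where it fires above threshold."""
    hits = {}
    for name, direction in features.items():
        fired = [
            i for i, act in enumerate(token_activations)
            if feature_activation(act, direction) > threshold
        ]
        if fired:
            hits[name] = fired
    return hits

# hypothetical feature directions in a 3-dimensional activation space
FEATURES = {
    "golden gate bridge": [1.0, 0.0, 0.0],
    "fart jokes": [0.0, 1.0, 0.0],
}

# hypothetical per-token activations for one short response
TOKEN_ACTIVATIONS = [
    [0.1, 0.2, 0.9],
    [2.5, 0.0, 0.3],
    [0.4, 0.1, 0.1],
]

print(monitor(TOKEN_ACTIVATIONS, FEATURES))
```

in a real model the activations would come from a layer of the LLM and the directions from a trained dictionary, with far more dimensions and features than this.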

https://www.anthropic.com/research/mapping-mind-language-model
Author Public Key
npub1452e6fwxmy8nj74jcgwu5eyjedpq08e3hrvqexts697gpqptmz3smm2cvr