January Meetup – The Mechanistic Interpretability Agenda

Astral Codex Ten recently posted an explainer of Anthropic’s paper “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Since we haven’t had a discussion meetup specifically focused on AI in a while, this seems like as good a time as any!

Since both the ACX explainer and the Anthropic paper are quite technical, we will also focus our discussion more broadly on the mechanistic interpretability agenda. For those not familiar, mechanistic interpretability is a subfield of AI research focused on reverse engineering the behavior of deep neural networks. While the field is associated with the LessWrong and AI Alignment Forum communities, many researchers at large industry labs and academic institutions are also doing mechanistic interpretability work, some of whom are explicitly involved with nerdy corners of the internet.

For this discussion meetup, we can take the temperature of the group and either focus on some of the details of “Towards Monosemanticity” or discuss interpretability more broadly.

Here’s the suggested reading:

God Help Us, Let’s Try to Understand AI Monosemanticity from Astral Codex Ten
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning from Anthropic
A Longlist of Theories of Impact from Neel Nanda
Against Almost Every Theory of Impact of Interpretability from Charbel-Raphaël
Chris Olah’s Views on AGI Safety from evhub (Chris Olah is one of the leaders of the interpretability team at Anthropic and formerly worked at OpenAI and Google Brain)

Please feel free to come even if you’re worried it will be awkward, you won’t fit in, or you aren’t the “typical person who comes to a Rationality meetup.” Even more than usual, please also feel free to come even if you’ve only done some of (or…let’s be honest…none of) the suggested reading.

DATE & LOCATION

Date: Saturday, 1/6 @ 2pm–4:30-ish pm
Location: South Loop Strength & Conditioning – upstairs in the mezzanine
645 S Clark
Chicago IL 60605
Note: Todd owns this gym so that’s why there’s a Rationality meetup at a gym 🙂
If you have trouble finding us, please DM Todd or Shane on the Discord or post in “meetups”. You can also text Shane (608-436-1809).
