Analyzing Transformers in Embedding Space (arXiv:2209.02535)
Select papers on language model interpretability with notes
Note Works on the model weights alone (without any input): layer and head parameters are projected into the embedding space and decoded into tokens (see the sketch below). The appendix is worth reading to convince yourself. Also indicates that training the same model from frozen embeddings makes it nearly interchangeable with the original.
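A minimal sketch of the projection idea, not the paper's exact procedure: take one feed-forward value vector from GPT-2 and multiply it by the token embedding matrix to see which tokens it promotes. The model choice and the layer/neuron indices are arbitrary examples.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

E = model.transformer.wte.weight              # token embeddings, (vocab, d_model)
layer, neuron = 10, 42                        # arbitrary example indices
# c_proj.weight has shape (d_ff, d_model); row `neuron` is one FF value vector
value_vec = model.transformer.h[layer].mlp.c_proj.weight[neuron]

with torch.no_grad():
    scores = E @ value_vec                    # similarity to every token embedding
    top = torch.topk(scores, 10).indices

print([tokenizer.decode([i]) for i in top.tolist()])
```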
Note Great overview: https://youtu.be/_NMQyOu2HTo. Knowledge lies in the feed-forward weights; some knowledge can be edited through a single neuron (see the ablation sketch below).
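A minimal sketch under my own assumptions (not the method from the video or any specific editing paper): zero out one feed-forward neuron's output direction in GPT-2 and compare the top next-token prediction before and after. The layer, neuron, and prompt are arbitrary illustrations.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")

def top_next_token():
    """Return the single most likely next token for the prompt."""
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return tokenizer.decode([int(logits.argmax())])

print("before:", top_next_token())

layer, neuron = 10, 42  # arbitrary example indices
with torch.no_grad():
    # zero the neuron's output direction (one row of the FF down-projection)
    model.transformer.h[layer].mlp.c_proj.weight[neuron] = 0.0

print("after:", top_next_token())
```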
Note https://github.com/jessevig/bertviz lets you visualize attention per head and per layer (usage sketch below).
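A minimal usage sketch, assuming bertviz's head_view helper together with a Hugging Face model, meant to be run in a Jupyter notebook; the model and sentence are arbitrary examples.

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Renders an interactive per-layer, per-head attention view (notebook only).
head_view(outputs.attentions, tokens)
```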
Note Different layers of a transformer perform better on different NLP tasks such as POS tagging, NER, coreference, etc. They use the intermediate representations as inputs to simple probing classifiers (sketch below).
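A minimal sketch of that probing setup, under the assumption that token-level labels (e.g. POS tags) are already available and that each word here maps to a single wordpiece; the layer index and the toy data are illustrative only.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def layer_features(sentence, layer):
    """One frozen feature vector per token from the chosen layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer][0]  # (seq_len, d_model)
    return hidden.numpy()

# Hypothetical toy data: sentences paired with per-token labels
# (in practice, align labels with wordpiece tokens first).
sentences = ["dogs bark", "cats sleep"]
labels = [["NOUN", "VERB"], ["NOUN", "VERB"]]

layer = 6  # arbitrary intermediate layer; sweep layers to compare tasks
X = [vec for s in sentences for vec in layer_features(s, layer)[1:-1]]  # drop [CLS]/[SEP]
y = [tag for tags in labels for tag in tags]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

Repeating the fit for each layer index and comparing probe accuracy is how the layer-vs-task comparison in the note is usually made.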