Analyzing Transformers in Embedding Space (arXiv:2209.02535)
Select papers on language model interpretability with notes
Note Works on the model weights alone (without any input): layer and head parameters are projected into the embedding space and decoded into tokens (see the sketch below). The appendix is worth reading to convince yourself. Also indicates that training the same model from frozen embeddings makes it nearly interchangeable with the original.
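A minimal sketch of the projection idea, not the paper's exact procedure: take one feed-forward value vector from GPT-2 and multiply it by the token embedding matrix to see which tokens it promotes. The model choice and the layer/neuron indices are arbitrary examples.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

E = model.transformer.wte.weight              # token embeddings, (vocab, d_model)
layer, neuron = 10, 42                        # arbitrary example indices
# c_proj.weight has shape (d_ff, d_model); row `neuron` is one FF value vector
value_vec = model.transformer.h[layer].mlp.c_proj.weight[neuron]

with torch.no_grad():
    scores = E @ value_vec                    # similarity to every token embedding
    top = torch.topk(scores, 10).indices

print([tokenizer.decode([i]) for i in top.tolist()])
```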
Note Great overview: https://youtu.be/_NMQyOu2HTo. Knowledge lies in the feed-forward weights; some knowledge can be edited through a single neuron (see the ablation sketch below).
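A minimal sketch under my own assumptions (not the method from the video or any specific editing paper): zero out one feed-forward neuron's output direction in GPT-2 and compare the top next-token prediction before and after. The layer, neuron, and prompt are arbitrary illustrations.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")

def top_next_token():
    """Return the single most likely next token for the prompt."""
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return tokenizer.decode([int(logits.argmax())])

print("before:", top_next_token())

layer, neuron = 10, 42  # arbitrary example indices
with torch.no_grad():
    # zero the neuron's output direction (one row of the FF down-projection)
    model.transformer.h[layer].mlp.c_proj.weight[neuron] = 0.0

print("after:", top_next_token())
```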
Note https://github.com/jessevig/bertviz lets you visualize attention per head and per layer (usage sketch below).
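A minimal usage sketch, assuming bertviz's head_view helper together with a Hugging Face model, meant to be run in a Jupyter notebook; the model and sentence are arbitrary examples.

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Renders an interactive per-layer, per-head attention view (notebook only).
head_view(outputs.attentions, tokens)
```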
Note Different layers of a transformer perform better on different NLP tasks such as POS tagging, NER, coreference, etc. They use the intermediate representations as inputs to simple probing classifiers (sketch below).
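A minimal sketch of that probing setup, under the assumption that token-level labels (e.g. POS tags) are already available and that each word here maps to a single wordpiece; the layer index and the toy data are illustrative only.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def layer_features(sentence, layer):
    """One frozen feature vector per token from the chosen layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer][0]  # (seq_len, d_model)
    return hidden.numpy()

# Hypothetical toy data: sentences paired with per-token labels
# (in practice, align labels with wordpiece tokens first).
sentences = ["dogs bark", "cats sleep"]
labels = [["NOUN", "VERB"], ["NOUN", "VERB"]]

layer = 6  # arbitrary intermediate layer; sweep layers to compare tasks
X = [vec for s in sentences for vec in layer_features(s, layer)[1:-1]]  # drop [CLS]/[SEP]
y = [tag for tags in labels for tag in tags]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

Repeating the fit for each layer index and comparing probe accuracy is how the layer-vs-task comparison in the note is usually made.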