Zmavli opened 3 months ago
Not fully endorsed (or necessarily coherent) thoughts: At this point, I'm also unsure whether "disambiguation" is even meaningful for decoder-only models. It's possible that they evaluate meaning ~lazily, keeping the candidate senses in superposition (in the physics sense) and only collapsing to a specific sense when the context absolutely forces it.
I'm doubtful that activation patching can be used as-is. The usual setup keeps the context fixed and corrupts the target (the last token(s)), but in studying lexical disambiguation the target must remain the same; only the disambiguating context can vary. See the sketch below for the inverted setup.
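Not a full answer, but to make the constraint concrete, here is a minimal sketch of that inverted setup: same target token, different disambiguating contexts, patching the residual stream at the shared target position. This assumes TransformerLens; the prompts, layer choice, and hook site are all hypothetical/illustrative, not a worked-out experiment.

```python
# Sketch: "context patching" with a fixed target token, using TransformerLens.
# Both prompts end in the same token (" bat"); only the context differs.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean = "He swung the wooden bat"        # sports-sense context (hypothetical)
corrupt = "Out of the cave flew a bat"   # animal-sense context (hypothetical)

# Cache activations from the sports-sense run.
_, clean_cache = model.run_with_cache(clean)

def patch_resid(resid, hook):
    # Overwrite the residual stream at the target position (last token)
    # with the activation from the clean run's target position.
    resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return resid

layer = 6  # arbitrary middle layer, purely for illustration
patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_resid)],
)
# One could then check whether patched_logits favors sports-sense
# continuations, i.e. whether the sense "transferred" with the activation.
```

Whether this counts as legitimate activation patching, or breaks the usual minimal-pair assumptions (the contexts here differ in length and content, not in a single token), is exactly the doubt raised above.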
I don't currently know how to set up the dataset so that the internal meaning the model assigns to the target word is revealed (i.e., whether it treats "bat" in the animal sense or the sports sense). Maybe by scoring completions?
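To illustrate the completions idea: one could compare the probability mass the model puts on sense-indicative continuations after the ambiguous word. A hedged sketch below, using plain Hugging Face transformers with GPT-2; the prompt and cue word lists are hypothetical, and the cues are only weak proxies for each sense.

```python
# Sketch: probe the assigned sense via next-token probabilities over
# sense-indicative continuations. Cues are assumed to be single tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_logprobs(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    return logits.log_softmax(-1)

prompt = "I saw a bat"  # deliberately ambiguous context
lp = next_token_logprobs(prompt)

# Hypothetical single-token cues for each sense.
animal_cues = [" fly", " hanging", " cave"]
sports_cues = [" swing", " hit", " game"]

def cue_score(cues):
    # Take the first token of each cue (assumes cues tokenize to one token)
    # and aggregate their log-probabilities.
    ids = [tok(c).input_ids[0] for c in cues]
    return lp[ids].logsumexp(0).item()

print("animal score:", cue_score(animal_cues))
print("sports score:", cue_score(sports_cues))
```

The obvious caveat is that this only measures the behavioral readout, not the internal representation directly, so it may conflate "the model has internally settled on a sense" with "the model is still hedging but the cues happen to tilt one way."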