whitebox-research / excursions


How can you create a dataset to examine lexical disambiguation? #14

Open Zmavli opened 3 months ago

Zmavli commented 2 months ago

I'm doubtful that activation patching can be used here. As usually set up, activation patching keeps the context but corrupts the target (the last token(s)), whereas for studying lexical disambiguation the target must remain the same.

I don't currently know how to set up the dataset so that it reveals the meaning the model internally assigns to the target word (i.e. whether it treats "bat" in the animal sense or the sports sense). Maybe by looking at completions?
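One way the completions idea might look, as a rough sketch rather than a tested method: keep the target word fixed, vary only the preceding context, and compare how likely the model finds continuations typical of each sense. Everything here is an assumption made for illustration: the `SenseItem` layout, the example sentences and probes, and especially `LogProbFn`, a hypothetical callable standing in for whatever model wrapper returns the log-probability of a continuation given a prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical scorer: (prompt, continuation) -> log-probability of the
# continuation under the model. Not a real library API; supply your own.
LogProbFn = Callable[[str, str], float]


@dataclass(frozen=True)
class SenseItem:
    target: str                   # ambiguous word, identical across senses
    contexts: Dict[str, str]      # sense label -> prompt ending in the target
    probes: Dict[str, List[str]]  # sense label -> continuations typical of it


# Illustrative item only; a real dataset would need many words and contexts.
ITEMS = [
    SenseItem(
        target="bat",
        contexts={
            "animal": "At dusk, something flew out of the cave. It was a bat",
            "sports": "He stepped up to the plate and picked up a bat",
        },
        probes={
            "animal": [" with leathery wings", " hanging upside down"],
            "sports": [" made of ash wood", " with a taped handle"],
        },
    ),
]


def validate(item: SenseItem) -> None:
    """Check that every prompt ends with the same target word."""
    for sense, prompt in item.contexts.items():
        if not prompt.endswith(item.target):
            raise ValueError(f"{sense!r} prompt does not end with the target")


def sense_scores(item: SenseItem, context_sense: str,
                 logprob: LogProbFn) -> Dict[str, float]:
    """Mean log-probability of each sense's probes after one context."""
    prompt = item.contexts[context_sense]
    return {
        sense: sum(logprob(prompt, c) for c in conts) / len(conts)
        for sense, conts in item.probes.items()
    }


def predicted_sense(item: SenseItem, context_sense: str,
                    logprob: LogProbFn) -> str:
    """Sense whose probes the scorer finds most likely after the context."""
    scores = sense_scores(item, context_sense, logprob)
    return max(scores, key=scores.get)
```

The probes are written to be grammatical after either context, so a difference in score would reflect the sense the context suggests rather than syntax. Whether completion preferences actually track the internal meaning is exactly the open question above; this only gives a behavioural readout.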

Zmavli commented 2 months ago

Not fully endorsed or sensical thoughts: at this point, I'm also unsure whether "disambiguation" is even meaningful for decoder-only models. It's possible that they do something like lazy evaluation of meaning, keeping the candidate meanings in superposition (in the physics sense) and only resolving them when it is absolutely necessary.
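A possible way to make the lazy-evaluation question concrete in the dataset, sketched under assumptions: vary where the disambiguating cue sits relative to the target. With causal attention, the activations at the target position depend only on the tokens before it, so when the cue comes after the target, the state at "bat" is the same for both senses and any disambiguation has to happen at later positions. The `CuePair` structure and the example sentences below are hypothetical, for illustration only.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CuePair:
    target: str      # ambiguous word
    sense: str       # intended sense label
    cue_before: str  # disambiguating cue precedes the target
    cue_after: str   # neutral prefix, then target, then the cue


# Illustrative examples only.
PAIRS = [
    CuePair(
        target="bat",
        sense="animal",
        cue_before="In the dark cave, she saw a bat",
        cue_after="She saw a bat in the dark cave",
    ),
    CuePair(
        target="bat",
        sense="sports",
        cue_before="At the baseball game, she saw a bat",
        cue_after="She saw a bat at the baseball game",
    ),
]


def prefix_through_target(text: str, target: str) -> str:
    """Return the text up to and including the first whole-word target."""
    match = re.search(rf"\b{re.escape(target)}\b", text)
    if match is None:
        raise ValueError(f"target {target!r} not found in {text!r}")
    return text[: match.end()]


def shared_prefix_ok(pairs: List[CuePair]) -> bool:
    """True if all cue-after prompts are identical through the target.

    When this holds, a causal model sees the same input at the target
    position for every sense, so differences can only arise later.
    """
    prefixes = {prefix_through_target(p.cue_after, p.target) for p in pairs}
    return len(prefixes) == 1
```

Comparing the two conditions would not settle whether meanings are held in superposition, but it separates cases where the model could commit to a sense at the target from cases where it cannot.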