williamleif / socialsent

Code and data for inducing domain-specific sentiment lexicons.
Apache License 2.0

What is in the {}-dict.pkl for subreddits? #13

Open DGaffney opened 6 years ago

DGaffney commented 6 years ago

Hello! I'm using your code to build a sentiment scorer for The_Donald as part of a weekend project. Working through the code base, I'm finding that the [SUBREDDIT_NAME]-dict.pkl file is necessary, but I don't see anything that clearly shows how this file is generated. What is in this file, and how do I go about generating it?

Thanks!

williamleif commented 6 years ago

Hi,

So this is somewhat unfortunate, but if I recall correctly that "{}-dict.pkl" is actually a vocabulary object from gensim's word2vec implementation (that object type no longer exists, since gensim did major revisions to their word2vec code).

Also, I apologize: the documentation should make it clearer that the "reddit" directory is in this repo primarily as an artifact recording how I did the experiments in the paper, and it makes a lot of assumptions about how the data is stored, etc. If you want to generate something for "The_Donald", I would honestly say that you should ignore "subreddit_run.py" and everything else in the "reddit" subdirectory and just follow the "Using the code" guidelines to do things from scratch.

The Reddit embeddings I used in the paper are also based on 2014 data, so you definitely would want to learn new word embeddings for The_Donald using more recent data. The workflow would then be:

(1) learn some word embeddings from The_Donald data (e.g., via gensim, or whichever method you prefer). Convert the learned embeddings to be compatible with the "representations/embedding.py" Embedding object (basically just need a numpy array of embeddings and a vocab mapping words to consecutive ids).
(2) specify seed lexicons of positive/negative words (these are just python dictionaries mapping seed words to numeric sentiment scores).
(3) use one of the methods from polarity_induction_methods.py to induce sentiment for all the words in your Embedding object. This requires an Embedding object and the seed lexicons. If your vocab size (i.e., the number of words in your Embedding object) is really large, I would recommend the "densifier" method, since it is fastest.
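
For anyone landing here later, the data shapes in the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions. The toy vectors, the seed words, and the `seed_similarity_polarity` helper are all hypothetical, and the helper is a simple cosine-similarity stand-in, not one of the methods in polarity_induction_methods.py. The sketch does not call socialsent or gensim at all. In practice the matrix and vocab would come from your trained embeddings and be wrapped in the Embedding object.

```python
import numpy as np

# Hypothetical toy vectors standing in for embeddings learned from subreddit text.
toy_vectors = {
    "great": [0.9, 0.1],
    "love": [0.8, 0.2],
    "terrible": [-0.9, 0.1],
    "hate": [-0.8, 0.3],
    "winning": [0.7, 0.4],
    "disaster": [-0.7, 0.2],
}

# Step 1: a vocab mapping words to consecutive ids, plus a numpy array
# whose row i holds the vector for the word with id i.
vocab = {word: i for i, word in enumerate(sorted(toy_vectors))}
matrix = np.zeros((len(vocab), 2))
for word, idx in vocab.items():
    matrix[idx] = toy_vectors[word]

# Step 2: seed lexicons are plain dicts mapping seed words to numeric scores.
pos_seeds = {"great": 1.0, "love": 1.0}
neg_seeds = {"terrible": -1.0, "hate": -1.0}


def seed_similarity_polarity(matrix, vocab, pos_seeds, neg_seeds):
    """Illustrative stand-in for step 3 (NOT a socialsent method).

    Scores each word by its cosine similarity to the mean positive seed
    vector minus its cosine similarity to the mean negative seed vector.
    Only the seed words are used here; their numeric scores are ignored.
    """
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    def centroid(seeds):
        rows = [unit[vocab[w]] for w in seeds if w in vocab]
        return np.mean(rows, axis=0)

    diff = unit @ centroid(pos_seeds) - unit @ centroid(neg_seeds)
    return {word: float(diff[idx]) for word, idx in vocab.items()}


# Step 3: induce a polarity score for every word in the vocab.
polarities = seed_similarity_polarity(matrix, vocab, pos_seeds, neg_seeds)
for word in sorted(polarities, key=polarities.get):
    print(word, round(polarities[word], 2))
```

With real data, the same `vocab` and `matrix` pair is what would be handed to the Embedding object, and the stand-in scorer would be replaced by one of the actual induction methods (e.g., the densifier for large vocabularies).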

Apologies that the code doesn't work "off-the-shelf" for Reddit... and that this is not better documented.

Cheers, Will

DGaffney commented 6 years ago

Hey Will!

Thanks for the thoughtful writeup - I feared something like this would be the answer, but good to know! This is a weekend project so I don't have the bandwidth to implement the above at this point, but I'll revisit soon! Thanks again.

Devin