yaronv / doc2vec

Converting documents into vectors using the word2vec algorithm

sample script for running the code #1

Open bhomass opened 6 years ago

bhomass commented 6 years ago

Hi, there is very little documentation on how to feed parameters into the scripts, and on which script does what. It looks like cnn_main.py is the important one. Can you please provide a sample script for running it? I imagine you will at least supply a directory name for the data set, but what other parameters are needed?

bhomass commented 6 years ago

I see all the parameters in config.py now. There are a lot of data files supplied. Any chance you could post some to serve as samples?

yaronv commented 6 years ago

Hi, actually what you need is only the embeddings_main.py script. The script takes the following parameters:

1. a path to the training documents (I use a sample of 500K docs out of a few million Reuters documents) - you can use your own documents
2. a list of paths to other sets of documents that you want to project (using TensorBoard)

So I use the first parameter to train the model, and then project the other sets (each set is a different topic). NOTE: each document (both in the training set and in the other sets) has to be an XML file that looks like this:

```xml
<!-- tag names are assumed; the markup of the original example was stripped -->
<doc>
  <title>the doc title</title>
  <body>the doc body</body>
</doc>
```
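
For reference, a minimal sketch (not this repo's code) of how such files could be parsed into gensim TaggedDocuments for training; the `doc`/`title`/`body` tag names follow the assumed example above:

```python
import os
import xml.etree.ElementTree as ET

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

def load_docs(folder):
    """Yield one TaggedDocument per XML file in `folder` (tag names assumed)."""
    for name in os.listdir(folder):
        if not name.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(folder, name)).getroot()
        text = (root.findtext("title") or "") + " " + (root.findtext("body") or "")
        yield TaggedDocument(words=simple_preprocess(text), tags=[name])
```
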
bhomass commented 6 years ago

So my real point of confusion is your use of positive and negative files. Doc2vec has the concept of negative sampling, but that's taken care of by gensim. Could you explain your use of positive and negative datasets and how you generated them?

yaronv commented 6 years ago

This is the flow I was using: I trained a Doc2Vec model on 500K Reuters documents. Then I took a few topics (a topic is like a class of documents that belong to the same subject). For each topic I had:

1. labeled negative documents - documents that I know don't belong to the topic
2. labeled positive documents - documents that I know belong to the topic

Then I used the trained model to project the negatives and positives in TensorBoard and saw the different clusters. I was doing that to see if I could classify documents using doc2vec: given a new (unlabeled) document and a topic, see which cluster it is closer to, the positives cluster or the negatives cluster. Hope that helped
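
A rough sketch of that flow, with made-up folder names; exporting the vectors as TSV files is one common way to get them into the TensorBoard embedding projector, not necessarily what this repo's scripts do:

```python
from gensim.models.doc2vec import Doc2Vec

# Train on the large unlabeled corpus (load_docs as sketched above;
# the folder names here are made up).
model = Doc2Vec(list(load_docs("reuters_sample/")),
                vector_size=300, min_count=5, epochs=20, workers=4)

# Infer a vector per labeled document for one topic and dump vectors plus
# labels as TSVs, a format the TensorBoard embedding projector can load.
with open("vectors.tsv", "w") as vf, open("metadata.tsv", "w") as mf:
    for label, folder in [("positive", "topic1/pos/"), ("negative", "topic1/neg/")]:
        for doc in load_docs(folder):
            vec = model.infer_vector(doc.words)
            vf.write("\t".join(str(x) for x in vec) + "\n")
            mf.write(label + "\n")
```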

bhomass commented 6 years ago

This is very interesting, what you are saying. However, I am not clear on how exactly you are doing it.

First, is your use of "topic" the same as an LDA topic (meaning a mixture of words)? If so, I can see classifying docs based on the dominant topic each has, and in that case the labeled positive and negative documents are simply other docs that do or don't share the same dominant topic. Please confirm this understanding. Even if that's the case, why do you need to explicitly label them positive vs. negative? Just assigning each document its classification label would be sufficient, unless you are somehow using the triplet-loss concept in deriving the clusters.

Next, when you "project" them in TensorBoard, are you referring to t-SNE or PCA clustering? What feature set are you using to compute similarities between the documents? Is it the vector of each doc derived from doc2vec?

yaronv commented 6 years ago

Yes, the meaning of "topic" is like in LDA (I didn't run LDA at all). For each topic I have labeled positives and labeled negatives. I need to label them because my assumption was that I don't have an existing classifier for each topic (and I also don't want to train one). I tried a nearest-neighbors approach using doc2vec: given a new unlabeled document, I check its nearest neighbors and see whether they belong to the positive cluster or the negative cluster. In TensorBoard you can use both t-SNE and PCA - choose what works better for you. The similarity used in TensorBoard is cosine similarity. And yes, each document has a vector derived from the doc2vec model.
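
A minimal sketch of that nearest-neighbors check, assuming the positive/negative vectors were inferred as above (the helper names and k are arbitrary):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(model, new_tokens, pos_vecs, neg_vecs, k=5):
    """Majority vote over the k nearest labeled vectors by cosine similarity."""
    v = model.infer_vector(new_tokens)
    scored = [(cosine(v, p), "positive") for p in pos_vecs] + \
             [(cosine(v, n), "negative") for n in neg_vecs]
    top = sorted(scored, reverse=True)[:k]
    pos_votes = sum(1 for _, label in top if label == "positive")
    return "positive" if pos_votes > k // 2 else "negative"
```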

bhomass commented 6 years ago

Hi, I believe I understand your explanation, but that leaves one outstanding question unanswered. It seems you have a multinomial classification problem, not a binary one. Why wouldn't you just label each document by its dominant topic - say, with 10 topics there would be 10 classes? Then you could easily apply k-NN to predict the class of any new doc, without creating positive and negative data sets for every single class.
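
For comparison, the multiclass k-NN approach described here could look roughly like this; `labeled_docs`, `doc_topic`, and `new_doc_tokens` are hypothetical stand-ins:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: one doc2vec vector per labeled document in X,
# and that document's dominant-topic label in y.
X = np.array([model.infer_vector(d.words) for d in labeled_docs])
y = [doc_topic[d.tags[0]] for d in labeled_docs]  # e.g. "topic_3"

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X, y)
print(knn.predict([model.infer_vector(new_doc_tokens)]))
```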

yaronv commented 6 years ago

Your solution is correct, but it didn't fit my situation. I have hundreds of topics, and new topics are added every day, so I don't want to create a new model over all the existing topics each time I get a new topic to handle. In my solution, I create classifiers only for new topics. This way I know I don't harm the quality of the old (existing) classifiers.
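
A toy illustration of that setup: one shared doc2vec model plus an independent labeled set per topic, reusing the `classify` helper sketched earlier (all names here are illustrative):

```python
# Illustrative sketch: independent labeled sets per topic over one shared
# doc2vec model, so registering a new topic never touches existing ones.
topic_sets = {}  # topic name -> (positive vectors, negative vectors)

def add_topic(name, pos_docs, neg_docs):
    """Register a new topic without retraining the model or other topics."""
    topic_sets[name] = (
        [model.infer_vector(d.words) for d in pos_docs],
        [model.infer_vector(d.words) for d in neg_docs],
    )

def belongs_to(topic, new_tokens):
    pos_vecs, neg_vecs = topic_sets[topic]
    return classify(model, new_tokens, pos_vecs, neg_vecs)  # k-NN vote from above
```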

bhomass commented 6 years ago

ok, I can't comment further, since you have a very unusual situation.

yaronv commented 6 years ago

OK, hope that helped