zbmed-semtec / word2doc2vec-doc-relevance

An approach exploring and assessing literature-based doc-2-doc recommendations using word2vec combined with doc2vec, and applying it to TREC and RELISH datasets
GNU General Public License v3.0

No embedding is generated by Word2Vec model of Gensim for some of the annotated terms with a known MeSHID #14

Closed Soudeh-Jahanshahi closed 9 months ago

Soudeh-Jahanshahi commented 9 months ago

For instance, in the article with PMID 22569528, no embedding is generated for the annotated term "immediate early gene" with ID MeSHD017781, nor for any of its individual tokens. The issue may be an artifact of working only with the provided sample data. Specifically, it may not occur (or only with very low likelihood) when the complete RELISH corpus is used.

rohitharavinder commented 9 months ago

This can happen regardless of the data you use to train the model, be it the provided sample data or the entire RELISH corpus. The most likely cause is the min_count parameter, which we always set to 5. This means that every word occurring fewer than 5 times in the training corpus (sample data or the entire RELISH) is ignored during training. Such words end up with no embedding associated with them.
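To make the effect of min_count concrete, here is a minimal sketch in plain Python that only counts token frequencies and lists the words that would fall below the threshold. It does not call Gensim; the helper name words_below_min_count and the toy corpus are made up for illustration, and the real pipeline may tokenize differently.

```python
from collections import Counter


def words_below_min_count(tokenized_docs, min_count=5):
    """Return {word: frequency} for words seen fewer than min_count times.

    Hypothetical helper: it mimics the frequency cut-off described above,
    not Gensim's actual vocabulary-building code.
    """
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: n for word, n in counts.items() if n < min_count}


# Toy corpus (invented): "gene" occurs 5 times, every other word once.
toy_corpus = [
    ["immediate", "early", "gene"],
    ["gene", "expression"],
    ["gene", "gene", "gene"],
]

dropped = words_below_min_count(toy_corpus, min_count=5)
print(sorted(dropped))  # ['early', 'expression', 'immediate']
```

Running the same count over the tokenized training data would show whether the tokens of a given annotated term reach the threshold of 5.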

If you look at the code below from the script generate_embeddings.py, you will notice that we use a try-except block to skip the embedding lookup for words with a frequency of less than 5.

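for word in article_doc[iteration]:
    try:
        # Look up the trained vector for this word.
        embedding_list.append(word_vectors.wv[word])
    except:
        # The word is not in the model vocabulary (e.g. dropped by min_count).
        missing_words += 1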
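If it helps with debugging, a variant of this loop could record which words are missing instead of only counting them. The sketch below is an assumption, not the code in generate_embeddings.py: collect_embeddings is a hypothetical helper, and a plain dict stands in for word_vectors.wv, which is assumed to support the same `in` and `[]` operations.

```python
def collect_embeddings(words, vectors):
    """Return (embeddings, missing_words) for a list of words.

    `vectors` is any mapping supporting `in` and `[]`; here a dict is
    used as a stand-in for the model's word vectors (an assumption).
    """
    embedding_list, missing = [], []
    for word in words:
        if word in vectors:
            embedding_list.append(vectors[word])
        else:
            # Keep the word itself so it can be inspected later.
            missing.append(word)
    return embedding_list, missing


# Toy stand-in for word_vectors.wv (values are invented).
toy_vectors = {"gene": [0.1, 0.2], "expression": [0.3, 0.4]}

embeddings, missing = collect_embeddings(["immediate", "early", "gene"], toy_vectors)
print(missing)  # ['immediate', 'early']
```

An explicit membership check also avoids a bare except, which would otherwise hide unrelated errors raised inside the loop.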

Let me know whether you encountered this error while running the script, or whether you were specifically looking for the embedding of the above-mentioned term. If it is the former, I will look into what is causing the error.