Closed Soudeh-Jahanshahi closed 9 months ago
This error may occur irrespective of the data you use to train the model, be it the provided sample data or the entire RELISH corpus. The most likely reason is the min_count parameter, which we always set to 5. This means that every word in the corpus (sample data or the entire RELISH) with a frequency of less than 5 is ignored during training. Such words end up with no embedding associated with them.
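As a rough illustration of the effect described above, the sketch below counts token frequencies and keeps only the tokens that reach the threshold. It is not the training code itself: the function name `surviving_vocabulary` and the toy corpus are hypothetical, and only the Python standard library is used.

```python
from collections import Counter


def surviving_vocabulary(tokenized_docs, min_count=5):
    """Return the set of tokens whose corpus frequency is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


# Hypothetical toy corpus: "gene" occurs 5 times, "immediate" only twice.
docs = [
    ["gene", "immediate"],
    ["gene"],
    ["gene", "immediate"],
    ["gene"],
    ["gene"],
]

vocab = surviving_vocabulary(docs, min_count=5)
print("gene" in vocab)       # True: frequency 5 meets the threshold
print("immediate" in vocab)  # False: frequency 2 is below the threshold
```

A token that is rare in the sample data can still clear the threshold in a larger corpus, simply because it has more chances to appear.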
If you look at the code below from the script generate_embeddings.py, you will notice that we use a try-except block to skip the embedding lookup for words with a frequency of less than 5.
    for word in article_doc[iteration]:
        try:
            embedding_list.append(word_vectors.wv[word])
        except:
            missing_words += 1
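To see which words are being skipped rather than only how many, one option is an explicit membership test that records the missing tokens. This is a minimal sketch, not the script's actual code: `collect_embeddings` and `toy_vectors` are hypothetical, and a plain dict stands in for `word_vectors.wv` (which, as far as I know, also supports the `in` operator).

```python
def collect_embeddings(tokens, vectors):
    """Split tokens into found embeddings and a list of tokens without one."""
    embedding_list = []
    missing = []
    for word in tokens:
        if word in vectors:
            embedding_list.append(vectors[word])
        else:
            # Record the token itself so it can be reported later.
            missing.append(word)
    return embedding_list, missing


# Hypothetical stand-in for the trained vectors: only "gene" has an embedding.
toy_vectors = {"gene": [0.1, 0.2]}

embeddings, missing = collect_embeddings(["immediate", "early", "gene"], toy_vectors)
print(embeddings)  # [[0.1, 0.2]]
print(missing)     # ['immediate', 'early']
```

Unlike a bare `except`, this only treats absent vocabulary entries as missing, so unrelated errors are not silently counted as missing words.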
Let me know whether you encountered this error while running the script or were specifically looking for the embedding of the above-mentioned word. In the first case, I will look into what is causing the error.
For instance, in the article with PMID 22569528, no embedding is generated for "immediate early gene" with ID MeSHD017781, nor for any of its tokens. The issue may be caused by working only with the provided sample data. With the complete RELISH corpus it may not occur, or only with very low likelihood.