yaatehr / learnedbloomfilter

MIT dsail research project
Add Glove Embeddings #2

Closed yaatehr closed 4 years ago

yaatehr commented 4 years ago

Add to the string dataset functionality. Potentially use PCA to reduce embedding size. Try different levels of tokenization (character level, URL delimiters, words with stopword removal). Try different methods of aggregation.
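For reference, a rough sketch of what the three tokenization levels mentioned above could look like. The function names, the delimiter set, and the tiny stopword list are all assumptions made for illustration, not what the repo actually uses.

```python
import re

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def char_tokens(text):
    """Character-level tokenization: one token per character."""
    return list(text)

def url_tokens(url):
    """Split a URL on common delimiters and drop empty pieces."""
    return [t for t in re.split(r"[:/?#&=.\-_]+", url) if t]

def word_tokens(text):
    """Lowercase word tokenization followed by stopword removal."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPWORDS]
```

Each function returns a plain list of string tokens, so the same lookup and aggregation step can be reused regardless of which level is chosen.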

yaatehr commented 4 years ago

commit hash 71be82ddb2e46f933abea9161fdb33515a848193

Only averaging the character vectors for now. I am using PCA to reduce the embedding size but haven't experimented with different sizes. Will need to experiment with tokenization for the tweet dataset.
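A minimal sketch of the approach described in this comment, assuming a dict-like lookup from token to vector and NumPy only. The random `table` is a hypothetical stand-in for the real GloVe vectors, and `average_embedding` and `pca_reduce` are illustrative names rather than functions from the commit.

```python
import numpy as np

def average_embedding(tokens, table, dim):
    """Mean of the vectors for tokens found in `table`; zeros if none are found."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

rng = np.random.default_rng(0)
# Hypothetical stand-in for a character-level GloVe table (8-dimensional).
table = {c: rng.normal(size=8) for c in "abcdefghijklmnopqrstuvwxyz"}

strings = ["example", "bloom", "filter", "glove"]
X = np.stack([average_embedding(list(s), table, 8) for s in strings])
reduced = pca_reduce(X, 2)  # one 2-dimensional row per input string
```

Averaging discards character order, which is one reason the other aggregation methods listed in the issue may be worth trying later.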