Closed by yaatehr 4 years ago
commit hash 71be82ddb2e46f933abea9161fdb33515a848193
Only averaging the vectors per character for now. I'm using PCA to reduce the embedding size but haven't experimented with different reduction methods. Will need to experiment with tokenization for the tweet dataset.
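A minimal sketch of the current approach (per-character vector averaging followed by PCA), assuming `char_embeddings` is a dict mapping characters to 1-D numpy vectors; the function names here are placeholders, not the repo's actual API:

```python
import numpy as np
from sklearn.decomposition import PCA

def embed_string(s, char_embeddings):
    """Average the per-character vectors for a string; skip unknown characters."""
    vecs = [char_embeddings[c] for c in s if c in char_embeddings]
    if not vecs:
        # fall back to a zero vector with the same dimensionality
        return np.zeros_like(next(iter(char_embeddings.values())))
    return np.mean(vecs, axis=0)

def reduce_embeddings(embedding_matrix, n_components=50):
    """Fit PCA on the stacked string embeddings and project them down."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(embedding_matrix)
    return reduced, pca
```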
Add to the string dataset functionality (see the sketch below):
- potentially use PCA to reduce embedding size
- try different levels of tokenization (character level, URL delimiters, words with stopword removal)
- try different methods of aggregation
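A rough sketch of what the tokenization levels and aggregation options from the list above could look like; the level/method names and the stopword set are assumptions for illustration, not decisions:

```python
import re
import numpy as np

def tokenize(text, level="char"):
    """Split text at one of the candidate tokenization levels."""
    if level == "char":
        return list(text)
    if level == "url":
        # split on common URL delimiters
        return [t for t in re.split(r"[/:?&=.\-_]+", text) if t]
    if level == "word":
        # toy stopword set; a real list (e.g. NLTK's) would replace this
        stopwords = {"the", "a", "an", "and", "or", "of", "to", "in"}
        return [w for w in re.findall(r"\w+", text.lower()) if w not in stopwords]
    raise ValueError(f"unknown tokenization level: {level}")

def aggregate(token_vectors, method="mean"):
    """Combine per-token vectors into a single embedding."""
    stacked = np.stack(token_vectors)
    if method == "mean":
        return stacked.mean(axis=0)
    if method == "max":
        return stacked.max(axis=0)
    if method == "sum":
        return stacked.sum(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")
```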