rebalance kmeans clusters

yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.

https://parser.yzhang.site/

MIT License


rebalance kmeans clusters #26

Closed attardi closed 4 years ago

attardi commented 4 years ago

kmeans() sometimes produces very large clusters, which can exhaust CUDA memory when computing the embeddings.

The change is just in file parser/util/alg.py.

The other files contain unrelated changes that allow using ELECTRA or other models from Hugging Face.
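To illustrate the idea of rebalancing, here is a minimal sketch that splits any oversized cluster into near-equal chunks. It is not the code changed in parser/util/alg.py; the function name `split_large_clusters` and the `max_size` parameter are hypothetical, and clusters are assumed to be plain lists of sentence indices.

```python
import math


def split_large_clusters(clusters, max_size):
    """Split every cluster with more than max_size members into near-equal chunks.

    Hypothetical helper for illustration only: `clusters` is assumed to be a
    list of lists of sentence indices, as a length-based kmeans might return.
    """
    if max_size < 1:
        raise ValueError("max_size must be positive")
    balanced = []
    for cluster in clusters:
        n_chunks = math.ceil(len(cluster) / max_size)
        if n_chunks <= 1:
            # Small enough already: keep the cluster as it is.
            balanced.append(list(cluster))
            continue
        # Spread the members as evenly as possible over n_chunks chunks.
        base, extra = divmod(len(cluster), n_chunks)
        start = 0
        for i in range(n_chunks):
            end = start + base + (1 if i < extra else 0)
            balanced.append(list(cluster[start:end]))
            start = end
    return balanced


clusters = [[0, 1, 2, 3, 4, 5, 6], [7, 8]]
print(split_large_clusters(clusters, max_size=3))
```

Capping the cluster size bounds the number of sentences that can end up in one batch, which is what limits peak memory in this sketch. The order of indices inside each cluster is preserved.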

attardi commented 4 years ago

Sorry, the requirements.txt should read:

transformers == 2.10.0

yzhangcs commented 4 years ago

> Sorry, the requirements.txt should read:
>
> transformers == 2.10.0

Sorry, the code does not support transformers version 2.2 or higher right now. I tokenize the BERT input word by word, and since 2.2 BertTokenizer.encode adds special tokens like [CLS] and [SEP] to each tokenized unit by default. This can lead to unexpected behavior.
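The problem can be sketched without the library. `toy_encode` below is a hypothetical stand-in for a tokenizer's encode call, not the real BertTokenizer; it only mimics the effect of an `add_special_tokens` flag when each word is encoded separately.

```python
def toy_encode(word, add_special_tokens=True):
    """Hypothetical stand-in for a tokenizer's encode(); returns tokens, not ids."""
    pieces = [word.lower()]  # assumption: one piece per word, no subword splitting
    if add_special_tokens:
        return ["[CLS]"] + pieces + ["[SEP]"]
    return pieces


def encode_wordwise(words, add_special_tokens):
    """Encode each word on its own, as a word-wise pipeline would."""
    return [toy_encode(w, add_special_tokens) for w in words]


words = ["The", "parser", "works"]
# With special tokens on, every word is wrapped, so each unit grows by two tokens.
print(encode_wordwise(words, add_special_tokens=True))
# With special tokens off, each unit contains only the word's own pieces.
print(encode_wordwise(words, add_special_tokens=False))
```

With the real library, passing `add_special_tokens=False` to `encode` for each word is one possible way to avoid the extra tokens, though whether that is enough for this codebase is not confirmed in the thread.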

attardi commented 4 years ago

These are formatting errors. Shall I close this and resubmit?

attardi commented 4 years ago

It does not work as expected.