yumeng5 / Spherical-Text-Embedding

[NeurIPS 2019] Spherical Text Embedding

[Question] About subwords and BPE tokenization approach #1

Closed loretoparisi closed 4 years ago

loretoparisi commented 4 years ago

Thanks a lot for this work. According to the ReadWord function - https://github.com/yumeng5/Spherical-Text-Embedding/blob/master/jose.c#L60

a word is defined as a sequence of characters terminated by some delimiter (tab, space, etc.). Is it possible to customize this approach to use subwords as in fastText - https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L172

or some other approach like BPE? SentencePiece could be one way - https://github.com/google/sentencepiece

In the latter case, it would mean replacing each word (or rather, each BPE subword) with a unique index (BPE ids), so we would need an encoding phase and later a decoding phase.
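For concreteness, a rough sketch of that encode/decode round trip using the SentencePiece Python API might look like the following (the file names, vocabulary size, and sample sentence are just placeholders):

```python
import sentencepiece as spm

# Placeholder sketch: train a small BPE model on a corpus, then round-trip
# a sentence through subword ids.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe"
)
sp = spm.SentencePieceProcessor(model_file="bpe.model")

ids = sp.encode("spherical text embedding", out_type=int)     # encoding phase (BPE ids)
pieces = sp.encode("spherical text embedding", out_type=str)  # the subword pieces themselves
text = sp.decode(ids)                                         # decoding phase back to text
print(ids, pieces, text)
```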

yumeng5 commented 4 years ago

Hi Loreto,

Thanks for the question. You are right that the current framework learns word-based embeddings (Word2Vec-like) rather than character n-gram/subword embeddings (fastText-like).

To incorporate subword information into embedding learning, fastText extends Word2Vec by decomposing a word embedding vector into the summation of all its character n-gram embedding vectors.

The above approach seems intuitive and straightforward in Euclidean space, but it is not easy to adapt to the spherical space: in spherical text embedding, each embedding vector (word/paragraph) is constrained to the unit sphere (vector norm = 1), and the summation of unit vectors does not in general have unit norm. For example, if "faster" is decomposed into "fast" and "er", then the fastText implementation gives v_{faster} = v_{fast} + v_{er}, and the unit norm constraint ||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1 will generally be violated. In short, spherical embeddings require a better design than simply decomposing a word vector into the summation of its subword vectors.
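To make the constraint violation concrete, here is a tiny NumPy check; the vectors below are random stand-ins rather than actual embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vector(dim):
    """Sample a vector and scale it to unit norm (a point on the sphere)."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Stand-ins for the subword vectors v_{fast} and v_{er}: both have unit norm.
v_fast = random_unit_vector(100)
v_er = random_unit_vector(100)

v_faster = v_fast + v_er            # fastText-style composition by summation
print(np.linalg.norm(v_fast))       # 1.0
print(np.linalg.norm(v_er))         # 1.0
print(np.linalg.norm(v_faster))     # generally != 1.0, so v_faster leaves the unit sphere
```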

An easy fix would be to decompose a word vector into the normalized summation of its character n-grams, but there is no clear theoretical justification for doing so, which is why we did not incorporate this kind of design into the current framework. That said, if this method does lead to encouraging results, it might still be beneficial for subword embedding learning. I might come back to try it at some point, but for now I'll have to leave it as future work.
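For reference, a minimal sketch of that "normalized summation" idea (this is not part of the released jose.c code; the n-gram ranges mirror fastText defaults and the n-gram vectors are placeholders):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style character n-grams of a word wrapped in boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def compose_word_vector(word, ngram_vectors):
    """Sum the available n-gram vectors, then renormalize so the composed
    word vector is projected back onto the unit sphere."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    s = np.sum([ngram_vectors[g] for g in grams], axis=0)
    # Assumes at least one n-gram of the word is in the vocabulary.
    return s / np.linalg.norm(s)
```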

Please let me know if you have further concerns/questions!

Best, Yu

loretoparisi commented 3 years ago

@yumeng5 thanks for the explanation. In recent language models, the word/subword tokenization used by Word2Vec / fastText has effectively been replaced by Byte Pair Encoding, which introduces subword units encoded as ids and comes in several flavours such as SentencePiece, WordPiece, etc. A fairly old but solid implementation is subword-nmt, while a blazing-fast one (written in Rust) is HuggingFace tokenizers. That said, I wonder whether the same unit norm constraint

||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1

would be violated when using BPE codes, or whether, given the nature of byte pair encoding, it would be possible to follow this approach.

yumeng5 commented 3 years ago

Hi @loretoparisi,

Thanks for bringing this up. To my understanding, BPE-like approaches essentially construct vocabularies of subword units instead of whole words. This subword segmentation step by itself does not prevent any assumption from being made about the word embedding space (Euclidean, spherical, etc.), because the embedding learning procedure is independent of the content of the vocabulary.

The only case that makes applying the spherical embedding approach difficult is when we make additional assumptions about the subword embeddings that conflict with the spherical space constraints. As mentioned in my previous post, fastText assumes that the whole-word embedding is the summation of its subword embeddings (e.g., v_{faster} = v_{fast} + v_{er}), which conflicts with the unit norm constraint. However, if we drop this assumption and treat each subword unit as an independent word, nothing prevents us from learning spherical embeddings for the subword units.
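In practice, one way to try this (a sketch, assuming a trained SentencePiece BPE model; the file names are placeholders) is to pre-segment the training corpus into space-separated subword pieces, so that the existing whitespace-based ReadWord in jose.c picks up each piece as an ordinary token:

```python
import sentencepiece as spm

# Placeholder file names; adjust to your own corpus and model.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

with open("corpus.txt", encoding="utf-8") as fin, \
     open("corpus.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # Each BPE piece becomes a space-separated token, so ReadWord in
        # jose.c treats it as a regular "word" and learns a unit-norm
        # spherical embedding for it.
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")
```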

I hope this answers your question! Let me know if anything remains unclear.

Best, Yu