yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package
BSD 3-Clause "New" or "Revised" License

tokenize and apply bpe for one sentence #5

Closed Ravikiran2611 closed 4 years ago

Ravikiran2611 commented 4 years ago

Can you provide a solution for this issue? https://github.com/facebookresearch/LASER/issues/95

yannvgn commented 4 years ago

Well, if your goal is to get the BPE-encoded version of your sentence, you could do it like this with laserembeddings:

from laserembeddings import Laser
from laserembeddings.preprocessing import Tokenizer, BPE

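# Tokenizer for English input
tokenizer = Tokenizer('en')
# BPE encoder using the default codes and vocabulary files
bpe = BPE(Laser.DEFAULT_BPE_CODES_FILE, Laser.DEFAULT_BPE_VOCAB_FILE)

# Tokenize the sentence first, then apply BPE to the tokenized string
bpe.encode_tokens(tokenizer.tokenize('He is inclined to be lazy.'))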
# he is in@@ clin@@ ed to be la@@ zy .

But that's not really the point of this package.

Also note that for some languages you might, in some cases, get slightly different results from those of Facebook's original implementation. Please refer to the README for details.
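
As a side note on reading the encoded output: a trailing `@@` appears to mark a piece that continues into the next one (the usual subword-nmt style convention). Here is a minimal sketch, assuming that convention holds, for joining the pieces back into whole tokens. `merge_bpe` is a hypothetical helper written for illustration, not part of laserembeddings:

```python
def merge_bpe(encoded: str, marker: str = "@@") -> str:
    """Join BPE pieces back into whole tokens.

    Assumption: a piece ending in `marker` continues into the next piece.
    """
    words = []
    current = ""
    for piece in encoded.split():
        if piece.endswith(marker):
            # Continuation piece: drop the marker and keep accumulating
            current += piece[: -len(marker)]
        else:
            # Last piece of a token: emit the accumulated token
            words.append(current + piece)
            current = ""
    if current:
        # Input ended on a continuation piece; keep what was accumulated
        words.append(current)
    return " ".join(words)


print(merge_bpe("he is in@@ clin@@ ed to be la@@ zy ."))
# he is inclined to be lazy .
```

This only undoes the BPE split. The result is still the lowercased, tokenized sentence, so the final period stays separated from the last word.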

Ravikiran2611 commented 4 years ago

got it thanks !!!!!!!!!!! @yannvgn