How to know if a sentence is "out of vocabulary"?

yannvgn / laserembeddings

LASER multilingual sentence embeddings as a pip package

BSD 3-Clause "New" or "Revised" License


Closed aalloul closed 3 years ago

aalloul commented 4 years ago

Hi there,

While trying to understand how LASER works, I ran this:

laser.embed_sentences("wer we2dwdfw ewrwer", "nl")

and I got an embedding whose norm is 0.65.

My question: does it make sense to talk about "out of vocabulary" input for LASER?

yannvgn commented 3 years ago

Hi @aalloul,

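Sentences are first tokenized and then split into subword units (BPE) before being passed to the embedding layer. An unseen word is therefore still represented as a sequence of smaller, known pieces, so we can't really talk about out-of-vocabulary words.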
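To illustrate the idea, here is a minimal toy sketch of how subword merging splits an arbitrary string into pieces. The merge table `TOY_MERGES` and the helper `apply_bpe` are hypothetical and invented for this example; they are not LASER's actual BPE codes or the laserembeddings API, and the sketch assumes every single character is itself a known symbol.

```python
from typing import List, Tuple

# Hypothetical merge rules, highest priority first.
# Real BPE codes are learned from a large corpus; these are made up.
TOY_MERGES: List[Tuple[str, str]] = [("e", "r"), ("w", "er"), ("d", "w")]


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split a word into subwords by greedily applying merge rules."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


if __name__ == "__main__":
    for token in "wer we2dwdfw ewrwer".split():
        print(token, "->", apply_bpe(token, TOY_MERGES))
```

Even a nonsense token such as "we2dwdfw" ends up as a sequence of known pieces (in the worst case, single characters), which is why the input still produces a non-trivial embedding rather than being rejected as unknown.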

For questions about the model itself, may I redirect you to https://github.com/facebookresearch/LASER? (laserembeddings is only a Python port of Facebook's LASER).

yannvgn commented 3 years ago

I'm closing the issue; feel free to reopen it if needed.