<unk> in vocab and embedding when using pretrained word representations

piedralaves commented 1 year ago

Hi zhongkai:

Is it necessary to have an \<unk> symbol in the vocab file and a corresponding vector row in the embedding matrix?

I want to know how Seq2SeqSharp deals with words in a sequence that are not represented in the vocabulary and hence have no vector in the embedding matrix. Are they automatically mapped to \<unk>, so that its vector is retrieved for training?

Thanks a lot

Guille

zhongkaifu commented 1 year ago

Hi @piedralaves

Seq2SeqSharp has a built-in token for out-of-vocabulary (OOV) words, \<unk>, and it automatically replaces any token that is not in the vocabulary with it. If your vocabulary is built by Seq2SeqSharp from the given training data set, it automatically adds the following tokens at the beginning of the vocabulary: \</s> \<s> \<unk> [SEP] [CLS]
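To make the replacement behavior concrete, here is a minimal Python sketch. It is not Seq2SeqSharp's actual code; the function names `build_vocab` and `encode` are hypothetical, and only the token list is taken from the comment above. It reserves the first ids for the special tokens and maps any unknown word to the id of `<unk>`.

```python
# Illustrative sketch only; not Seq2SeqSharp's implementation.
# Special tokens in the order listed in the comment above.
SPECIAL_TOKENS = ["</s>", "<s>", "<unk>", "[SEP]", "[CLS]"]


def build_vocab(sentences):
    """Map tokens to integer ids, reserving the first ids for special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for sentence in sentences:
        for tok in sentence.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert tokens to ids; tokens missing from the vocabulary become <unk>."""
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in sentence.split()]


vocab = build_vocab(["dar de baja el contrato"])
# "alta" was never seen while building the vocabulary, so it gets the <unk> id.
ids = encode("dar de alta el contrato", vocab)
```

Because every unknown word shares the single `<unk>` row of the embedding matrix, that row has to exist, which is why the token is added to the vocabulary up front.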

Note that for older versions, please check the released source code to see which tokens are built in, because I recently modified some of them in the newer version.

If your vocabulary is external (not built by Seq2SeqSharp), I would suggest adding the above tokens at the beginning of the vocabulary.

If your embeddings are external (not built by Seq2SeqSharp), your vocabulary may not include the above tokens, so I would suggest adding these tokens while loading the embeddings into the matrix and letting Seq2SeqSharp learn them during training.
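One way to do this when loading external embeddings is sketched below in Python. This is an assumption-laden illustration, not the library's loader: `add_special_tokens` is a hypothetical helper, and the small random initialization is an arbitrary choice for rows that training is expected to adjust.

```python
import random

# Special tokens in the order listed earlier in this thread.
SPECIAL_TOKENS = ["</s>", "<s>", "<unk>", "[SEP]", "[CLS]"]


def add_special_tokens(words, vectors, dim, seed=0):
    """Prepend special tokens that an external embedding set lacks.

    words:   tokens read from the pretrained embedding file
    vectors: one row (list of floats) per word, in the same order
    Returns (words, vectors) with the missing special tokens placed first.
    """
    rng = random.Random(seed)
    existing = set(words)
    missing = [tok for tok in SPECIAL_TOKENS if tok not in existing]
    # Assumed initialization: small random values, to be learned during training.
    new_rows = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in missing]
    return missing + words, new_rows + vectors


words, vectors = add_special_tokens(
    ["hola", "contrato"], [[0.1, 0.2], [0.3, 0.4]], dim=2
)
```

The pretrained rows are kept unchanged and only shift down by the number of prepended tokens, so row `i` of the matrix still corresponds to word `i` of the vocabulary.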

You could also add the above tokens to your vocabulary for pretrained word representations in Txt2Vec.

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Should we write the "[s]" and "[/s]" symbols in the training file? I mean, should the file look as follows (with "<" and ">" in place of "[" and "]")?

[s] hola buenos días mire haber si me podían informar sobre activar nuevamente un terminal [/s]
[s] departamento de bajas por favor [/s]
[s] consultar para dar de alta un numero de teléfono [/s]
[s] dar de baja el contrato [/s]

Or does Seq2SeqSharp add them automatically during the training phase?

zhongkaifu commented 1 year ago

They are \<s> and \</s>. If the lines in the data set don't have them, Seq2SeqSharp automatically adds them; otherwise, it leaves the lines as they are. This applies to both training and test data.
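The rule described above can be sketched in Python as follows. This is only an illustration of the described behavior, not Seq2SeqSharp's code; `wrap_line` is a hypothetical name, and handling each marker independently (when only one of the two is present) is an assumption.

```python
BOS, EOS = "<s>", "</s>"


def wrap_line(line):
    """Add <s> and </s> only where the line does not already have them."""
    tokens = line.split()
    if not tokens or tokens[0] != BOS:
        tokens.insert(0, BOS)
    if tokens[-1] != EOS:
        tokens.append(EOS)
    return " ".join(tokens)


plain = wrap_line("dar de baja el contrato")
marked = wrap_line("<s> dar de baja el contrato </s>")
```

Both calls produce the same wrapped line, so a training file can be written either way without the markers being duplicated.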

piedralaves commented 1 year ago

Thanks a lot. G

zhongkaifu / Seq2SeqSharp

<unk> in vocab and embedding when using pretrained word representations #51