piedralaves opened this issue 2 years ago
Hi @piedralaves
Yes, you can do it by calling "public void LoadTextModel(string strFileName)"
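For example (just a minimal sketch; it assumes the method lives on the Txt2Vec Model class that the later messages also call "model", and the file name is only a placeholder):

Txt2Vec.Model model = new Txt2Vec.Model();
// Load a model stored in plain text format
model.LoadTextModel("embeddings.txt");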
Thanks Zhongkai Fu
Thanks a lot. G
Any suggestion on how to save the model in binary format after calling LoadTextModel? I mean:
You could call the "SaveModel" function to save your model in binary format.
My problem is that SaveModel has some arguments that I don't understand well.
public static void SaveModel(string strFileName, int vocab_size, int vector_size, List<vocab_word> vocab, double[] syn) {
This is the case for "double[] syn".
The object "model" doesn't have syn.
G
"double[] syn" comes from the hidden layer; it is the embedding of the tokens. Here is what you can do:
The model format is pretty straightforward. Here is an example of loading the model (set vqSize to 0); it shows what the model format looks like. Then just implement some code to convert your model from the text format to this binary format.
// strModelFileName is the path to the binary model file (the name here is just an example)
StreamReader sr = new StreamReader(strModelFileName);
BinaryReader br = new BinaryReader(sr.BaseStream);

//The number of words
int words = br.ReadInt32();
//The size of vector
int vectorSize = br.ReadInt32();
//The vector quantization size (write 0 here when you build the model yourself)
int vqSize = br.ReadInt32();

for (int b = 0; b < words; b++)
{
    //Each record is the term string followed by vectorSize single-precision floats
    Term term = new Term();
    term.strTerm = br.ReadString();
    term.vector = new float[vectorSize];
    for (int i = 0; i < vectorSize; i++)
    {
        term.vector[i] = br.ReadSingle();
    }
}

br.Close();
sr.Close();
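A converter in the other direction (text to binary) could look roughly like this. It is only a sketch under two assumptions: each line of the text model is a token followed by its vector values separated by whitespace, and the binary layout mirrors the reader above. The class name, method name, and file paths are placeholders. Note that BinaryWriter.Write(string) writes a length-prefixed string, which is exactly what br.ReadString() in the loader expects.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class TextModelConverter
{
    // Hypothetical helper: converts a plain-text model (one token per line,
    // followed by its vector values) into the binary layout read above.
    public static void TextToBinary(string txtPath, string binPath)
    {
        var terms = new List<KeyValuePair<string, float[]>>();
        foreach (string line in File.ReadLines(txtPath))
        {
            string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 2) continue;
            var vector = new float[parts.Length - 1];
            for (int i = 1; i < parts.Length; i++)
                vector[i - 1] = float.Parse(parts[i], CultureInfo.InvariantCulture);
            terms.Add(new KeyValuePair<string, float[]>(parts[0], vector));
        }

        int vectorSize = terms[0].Value.Length;
        using (var bw = new BinaryWriter(File.Create(binPath)))
        {
            bw.Write(terms.Count);   // the number of words
            bw.Write(vectorSize);    // the size of vector
            bw.Write(0);             // vqSize = 0 (no vector quantization)
            foreach (var kv in terms)
            {
                bw.Write(kv.Key);    // length-prefixed string, matches br.ReadString()
                foreach (float v in kv.Value)
                    bw.Write(v);     // matches br.ReadSingle()
            }
        }
    }
}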
Hi Zhongkai
I did something like the advice you gave in your response:
The problem is that I obtained a smaller binary model and a smaller weight matrix than the original model. It seems that it only takes into account the words of the small training sample from point one.
The trace is this:
Txt2VecConsole.exe -mode train -trainfile CIE_10.txt -modelfile vector_new.bin -vocabfile vocab.vocab -debug 1 -cbow 1 -iter 10 -pre-trained-modelfile cie.txt
Alpha: 0,025
CBOW: 1
Sample: 0
Min Count: 5
Threads: 1
Context Size: 5
Debug Mode: 1
Save Step: 97656K
Iteration: 10
Only Update Corpus Words: 0
Negative Examples: 5
Pre-trained model file: cie.txt
info,08/09/2022 18:25:03 Starting training using file cie_10.txt
info,08/09/2022 18:25:03 Loading vocabulary vocab.vocab from file...
info,08/09/2022 18:25:03 Load vocabulary from pre-trained model file cie.txt
info,08/09/2022 18:25:04 Apply the following options from pr-trained model file Txt2Vec.Model
info,08/09/2022 18:25:04 Vector Size: 200
info,08/09/2022 18:25:04 Calculating how many words need to be train...
info,08/09/2022 18:25:04 Total training words : 1584
info,08/09/2022 18:25:04 Initializing acculumate term frequency...
info,08/09/2022 18:25:04 Acculumate factor: 1
info,08/09/2022 18:25:04 Acculumated total frequency : 1087
info,08/09/2022 18:25:04 Loading syn0 from pre-trained model...
info,08/09/2022 18:25:04 Saving term and vector into model file...
info,08/09/2022 18:25:04 Saving term and vector into model file...
The matrix to load:
The small dummy corpus:
The generated model:
Thanks a lot, and sorry for the long replies. I want to do research on embeddings with seq2seq, and we need to load an embedding matrix converted into the Txt2Vec format.
regards
G
Hi @piedralaves,
Your model may have been shrunk because "Min Count" is 5. You could try setting "-min-count" to 0 and retrying.
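For example, taking the command from your trace and just adding the flag (assuming -min-count is accepted alongside the other options, analogous to the "Min Count: 5" line in the trace):

Txt2VecConsole.exe -mode train -trainfile CIE_10.txt -modelfile vector_new.bin -vocabfile vocab.vocab -debug 1 -cbow 1 -iter 10 -min-count 0 -pre-trained-modelfile cie.txt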
Thanks Zhongkai Fu
Hi Zhongkai
I only want to know whether there is any way in Txt2Vec to load a plain-text matrix and build a bin model for use in Txt2Vec. In other words: given a well-formed plain-text matrix (in the format produced by Dump), is it possible to load it into Txt2Vec?
Many thanks.
G