zhongkaifu / Txt2Vec

Txt2Vec is a toolkit for representing text as vectors. It's based on Google's word2vec project, but with some new features, such as incremental training, model vectors and so on.
BSD 3-Clause "New" or "Revised" License

plain text matrix to load in TXT2VEC (Inverse to DUMP) #7

Open piedralaves opened 2 years ago

piedralaves commented 2 years ago

Hi Zhongkai

I just want to know whether there is any way in Txt2Vec to load a plain text matrix and build a bin model from it for use in Txt2Vec. In other words, given a well-formed plain text matrix (in the format that Dump produces), can it be loaded into Txt2Vec?

Many thanks.

G

zhongkaifu commented 2 years ago

Hi @piedralaves

Yes, you can do it by calling "public void LoadTextModel(string strFileName)"
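
For example (a minimal sketch, assuming LoadTextModel is an instance method on Txt2Vec's Model class with a default constructor, neither of which is shown in this thread, and that the text matrix is in the layout that dump mode writes):

        // Hypothetical usage: load a plain text matrix produced by dump mode.
        // The class name and constructor are assumptions; only LoadTextModel's
        // signature is confirmed above.
        Txt2Vec.Model model = new Txt2Vec.Model();
        model.LoadTextModel("cie.txt");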

Thanks Zhongkai Fu

piedralaves commented 2 years ago

Thanks a lot g

piedralaves commented 2 years ago

Any suggestions for saving the model in binary format after calling LoadTextModel? I mean:

  1. Load a model from text matrix
  2. Save a binary of such model
zhongkaifu commented 2 years ago

You could call "SaveModel" function to save your model in binary format.

piedralaves commented 2 years ago

My problem is that SaveModel has some arguments that I don't understand well:

    public static void SaveModel(string strFileName, int vocab_size, int vector_size, List<vocab_word> vocab, double[] syn)

In particular, the "double[] syn" argument.

The object "model" does't have syn G

zhongkaifu commented 2 years ago

double[] syn comes from the hidden layer; it holds the embeddings of the tokens. Here are some steps you can take:

  1. Create an empty or tiny data set as training set for incremental training.
  2. Load your model from txt file
  3. Kick off incremental training mode, but don't actually run any training
  4. Save your model into binary file

The model format is pretty straightforward. Here is an example of loading the model (set vqSize to 0); it shows what the model format looks like, and then you can just implement some code to convert your model from txt format to this binary format (a rough sketch of that conversion follows the loader below).

        // 'sr' is a StreamReader opened on the binary model file, e.g.:
        StreamReader sr = new StreamReader("vector.bin");
        BinaryReader br = new BinaryReader(sr.BaseStream);

        //The number of words in the vocabulary
        int words = br.ReadInt32();
        //The size of each vector (a field in the original source; declared locally here)
        int vectorSize = br.ReadInt32();
        //The vector quantization size (0 means no quantization)
        int vqSize = br.ReadInt32();

        //Read each term and its embedding vector
        for (int b = 0; b < words; b++)
        {
            Term term = new Term();
            term.strTerm = br.ReadString();
            term.vector = new float[vectorSize];

            for (int i = 0; i < vectorSize; i++)
            {
                term.vector[i] = br.ReadSingle();
            }
        }
        sr.Close();
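
If you would rather skip the incremental-training trick, here is a rough sketch of the opposite direction: it writes the binary layout that the loader above reads. It assumes the text matrix starts with a header line "<word_count> <vector_size>" followed by one "token v1 v2 ..." line per word, and that decimals use '.'; neither assumption is confirmed in this thread, so adjust the parsing to match the actual dump output.

        // Sketch only: convert a text matrix into the binary layout read above.
        // Requires: using System.IO; using System.Globalization;
        using (StreamReader reader = new StreamReader("cie.txt"))
        using (BinaryWriter bw = new BinaryWriter(File.Create("cie.bin")))
        {
            // Assumed header line: "<word_count> <vector_size>"
            string[] header = reader.ReadLine().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            int words = int.Parse(header[0]);
            int vectorSize = int.Parse(header[1]);

            bw.Write(words);
            bw.Write(vectorSize);
            bw.Write(0);   // vqSize = 0, i.e. no vector quantization

            for (int b = 0; b < words; b++)
            {
                // Assumed row layout: "token v1 v2 ... v<vector_size>"
                string[] parts = reader.ReadLine().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                bw.Write(parts[0]);    // the token, length-prefixed as ReadString expects
                for (int i = 0; i < vectorSize; i++)
                {
                    bw.Write(float.Parse(parts[i + 1], CultureInfo.InvariantCulture));
                }
            }
        }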
piedralaves commented 2 years ago

Hi Zhongkai

I did something like what you advised in your response:

  1. Create a small corpus for incremental
  2. Load the model from txt file
  3. Kick off incremental training mode, but without training
  4. Save the model in bin
  5. Test it with dump (write out the matrix of the model from step 4)

The problem is that I obtained a smaller binary model and a smaller weight matrix than the original model. It seems to only take into account the words of the small training sample from step 1.

The trace is this:

    Txt2VecConsole.exe -mode train -trainfile CIE_10.txt -modelfile vector_new.bin -vocabfile vocab.vocab -debug 1 -cbow 1 -iter 10 -pre-trained-modelfile cie.txt

    Alpha: 0,025
    CBOW: 1
    Sample: 0
    Min Count: 5
    Threads: 1
    Context Size: 5
    Debug Mode: 1
    Save Step: 97656K
    Iteration: 10
    Only Update Corpus Words: 0
    Negative Examples: 5
    Pre-trained model file: cie.txt

    info,08/09/2022 18:25:03 Starting training using file cie_10.txt
    info,08/09/2022 18:25:03 Loading vocabulary vocab.vocab from file...
    info,08/09/2022 18:25:03 Load vocabulary from pre-trained model file cie.txt
    info,08/09/2022 18:25:04 Apply the following options from pr-trained model file Txt2Vec.Model
    info,08/09/2022 18:25:04 Vector Size: 200
    info,08/09/2022 18:25:04 Calculating how many words need to be train...
    info,08/09/2022 18:25:04 Total training words : 1584
    info,08/09/2022 18:25:04 Initializing acculumate term frequency...
    info,08/09/2022 18:25:04 Acculumate factor: 1
    info,08/09/2022 18:25:04 Acculumated total frequency : 1087
    info,08/09/2022 18:25:04 Loading syn0 from pre-trained model...
    info,08/09/2022 18:25:04 Saving term and vector into model file...
    info,08/09/2022 18:25:04 Saving term and vector into model file...

The matrix to load:

cie.txt

The small dummy corpus:

CIE_10.txt

The generated model:

ggg.txt

Thanks a lot, and sorry for all the questions. I want to do research on embeddings with seq2seq, and we need to load an embedding matrix converted into Txt2Vec format.

regards

G

zhongkaifu commented 2 years ago

Hi @piedralaves ,

Your model may have been shrunk, since "Min Count" is 5. You could try setting "-min-count" to 0 and retrying it.
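
For example, taking the command line from your trace and adding the flag (assuming Txt2VecConsole accepts -min-count in this position; please double-check against its usage message):

    Txt2VecConsole.exe -mode train -trainfile CIE_10.txt -modelfile vector_new.bin -vocabfile vocab.vocab -debug 1 -cbow 1 -iter 10 -min-count 0 -pre-trained-modelfile cie.txt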

Thanks Zhongkai Fu