zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and so on.

Contextual embeddings #64

Closed · piedralaves closed this issue 1 year ago

piedralaves commented 1 year ago

Hi zhongkai:

We want to write the contextual embeddings at certain timestamps to a text file in test mode.

Something like:

Timestamp_5 to  [.................contextualized vector....................]
Timestamp_6 the [.................contextualized vector....................]
Timestamp_7 Cat [.................contextualized vector....................]

We guess the contextualized vector of an input word could be the hidden state at its timestamp, or the output of that same timestamp. Do you think this assertion is right?

If so, what is the best way to do it?

Thanks a lot

zhongkaifu commented 1 year ago

Hi @piedralaves ,

You could use the outputs of any layer as the contextualized vectors, such as the top layer, the second top layer, or others. For example:

                encOutput = Encoder.Run(computeGraph, sntPairBatch, encoder, m_modelMetaData, m_shuffleType, srcEmbedding, posEmbedding, segmentEmbedding, srcTokensList, originalSrcLengths); // Shape: [batchsize * seqLen, embedding_dim]

You could reshape it by encOutput = g.View(encOutput, new long[]{batchsize, seqLen, embedding_dim});

Then you can extract a contextualized vector from it, such as:

[0,0] // the embeddings of the first token in the first batch
[0,1] // the embeddings of the second token in the first batch
...
[1,0] // the embeddings of the first token in the second batch
[1,1] // the embeddings of the second token in the second batch
...

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Sorry Zhongkai, do you mean the "Encoder.Run" called in the function "RunForwardOnSingleDevice"?

Could you expand a little more on what you mean by "reshape"?

Thanks a lot

G

zhongkaifu commented 1 year ago

yes. "reshape" means to return a tensor with the same data and number of elements as input , but with the specified shape

piedralaves commented 1 year ago

Hi Zhongkai,

I suppose you mean something like:

// Compute src tensor
encOutput = Encoder.Run(computeGraph, sntPairBatch, encoder, m_modelMetaData, m_shuffleType, srcEmbedding, posEmbedding, segmentEmbedding, srcTokensList, originalSrcLengths);

 //getting the embeddings
int srcSeqLen = encOutput.Rows / m_options.BatchSize;
encOutput = computeGraph.View(encOutput, new long[] { m_options.BatchSize, srcSeqLen, m_options.SrcEmbeddingDim });

But what is the method in encOutput to access, for example, [0,0], that is, the embeddings of the first token in the first batch?

zhongkaifu commented 1 year ago

encOutput is a WeightTensor, so you could check the methods in .\Seq2SeqSharp\Tools\WeightTensor.cs, such as "public float GetWeightAt(long[] indices)", to get values from it, or use other methods for your requirement.
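
For example, a minimal sketch (not code from the repository) of reading one token's contextual vector with GetWeightAt, assuming computeGraph, encOutput, batchSize, srcSeqLen and embeddingDim are already in scope:

    // Illustrative sketch only: reshape the encoder output to [batchSize, srcSeqLen, embeddingDim]
    // and read the contextual vector of one token via GetWeightAt.
    IWeightTensor encOutput3d = computeGraph.View(encOutput, new long[] { batchSize, srcSeqLen, embeddingDim });

    int batchIdx = 0; // first sentence in the batch
    int tokenIdx = 0; // first token in that sentence
    float[] tokenVector = new float[embeddingDim];
    for (int k = 0; k < embeddingDim; k++)
    {
        tokenVector[k] = encOutput3d.GetWeightAt(new long[] { batchIdx, tokenIdx, k });
    }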

piedralaves commented 1 year ago

Hi Zhongkai,

We have done what you said previously. We are now writing vectors that seem to be the hidden states at each timestamp.

public static void writeContextualEmbeddings(IComputeGraph computeGraph, IWeightTensor encOutput, int BatchSize, int embeddingDim, Seq2SeqOptions opts, List<List<String>> srcSnts)
        {
            long[] auxLong = new long[3]; // indices used to read the embedding matrix
            string auxString = "";
            // This is to try to write out the contextualized embeddings
            int srcSeqLen = encOutput.Rows / BatchSize;
            IWeightTensor encOutput2;

            Random rnd = new Random();

            encOutput2 = computeGraph.View(encOutput, new long[] { BatchSize, srcSeqLen, embeddingDim });
            // We need a function that takes encOutput2 and writes whatever is wanted in customTools.
            // encOutput2 is a WeightTensor, so you could check methods in .\Seq2SeqSharp\Tools\WeightTensor.cs,
            // such as "public float GetWeightAt(long[] indices)" to get values from it, or use other methods for your requirement.
            using (StreamWriter writer = new StreamWriter(opts.ValidCorpusPaths + "/" + "ContextualEmbeddings" + rnd.Next().ToString() + ".txt"))
            {
                for (int i = 0; i < encOutput2.Rows; i++) // number of sentences in the batch
                {
                    List<String> sentenceList = srcSnts[i];

                    for (int j = 0; j < srcSeqLen; j++) // sentence length
                    {
                        auxString = auxString + sentenceList[j];

                        for (int k = 0; k < embeddingDim; k++) // embedding dimensions
                        {
                            auxLong[0] = i; // batch index
                            auxLong[1] = j; // token index
                            auxLong[2] = k; // embedding dimension index

                            auxString = auxString + " " + encOutput2.GetWeightAt(auxLong).ToString();
                        }
                        writer.WriteLine(auxString);
                        auxString = "";
                    }

                    writer.WriteLine("-----------------------------");
                }
            }
        }

But we want to ensure that our code works properly. For this purpose, we ask you some questions:

  1. What is the encOutput object? We know it is a tensor, but does it contain the outputs of the hidden-layer nodes for a batch?

  2. When you said "top layer, the second top layer", do you mean the different hidden layers, if there is more than one?

  3. If the encOutput object contains the output of a hidden layer for a batch, how can we access the outputs of the nodes of the output layer (softmax)? That is, how can we get the prediction at each timestamp (vectors of vocabulary size, with the probabilities provided by a softmax function)?

Thanks a lot, and sorry for these questions.

zhongkaifu commented 1 year ago

Your code looks good to me.

For your questions:

  1. encOutput is the output tensor of the encoder. Its shape is [batchSize * sequence_length, hidden_size].
  2. Yes. For the encoder, the top layer is usually the softmax layer, the second top layer is the top hidden layer, the third top layer is the second top hidden layer, and so on.
  3. That question is already answered above. If you want to use the softmax result, you can directly use encOutput.

piedralaves commented 1 year ago

We did not understand the sentence "If you want to use the softmax result, you can directly use encOutput". Until now, we have been getting the hidden states (the output of the hidden layer) at each timestamp using the function above in test mode. Such hidden states are extracted from encOutput, aren't they? These hidden states are vectors that have the same dimensionality as the embeddings or the nodes of the hidden layer.

But how do we get the output of the top layer (the output layer), that is, vectors of the same size as the vocabulary? In other words, how do we get the final vectors which represent the output words in the sequence?

Probably, we misunderstood something. Sorry.

The question is how to use encOutput to get the output of the top layer. Could you please give a little code?

zhongkaifu commented 1 year ago

I see. Sorry that I misunderstood your question. Here is some code for you.

    IFeedForwardLayer encoderFFLayer = ... // You need to initialize it and it's trainable
    IWeightTensor ffLayer = encoderFFLayer.Process(encOutput, batchSize, g);
    IWeightTensor probs = g.Softmax(ffLayer, inPlace: true);

Then the probs tensor is the softmax output tensor you asked about. You could also check the RunForwardOnSingleDevice method in Seq2SeqSharp\Applications\SeqSimilarity.cs, which is used to calculate sentence similarity.
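
If it helps, here is a minimal sketch (not repository code) of reading the per-timestamp distribution back out of probs, assuming its shape is [batchSize * seqLen, vocabSize], analogous to the encoder output discussed above:

    // Illustrative sketch only: reshape probs and read the softmax distribution
    // over the vocabulary for one timestamp of one sentence.
    static float[] GetTimestampDistribution(IComputeGraph g, IWeightTensor probs,
        int batchSize, int seqLen, int vocabSize, int batchIdx, int tokenIdx)
    {
        IWeightTensor probs3d = g.View(probs, new long[] { batchSize, seqLen, vocabSize });
        float[] distribution = new float[vocabSize];
        for (int v = 0; v < vocabSize; v++)
        {
            distribution[v] = probs3d.GetWeightAt(new long[] { batchIdx, tokenIdx, v });
        }
        return distribution;
    }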

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Hi Zhongkai,

With respect to your last answer: we have some work done. We are now writing the softmax embeddings at each timestamp in the gptConsole decoder (the only part it has), and we also write them for the encoder of sequence2sequenceConsole. But how do we write the softmax embeddings of the decoder part in the sequence2sequence console? The encOutput object is from the encoder part, and there does not seem to be a direct decOutput object in RunForwardOnSingleDevice.

We guess your code above needs a decOutput to generate the softmax embeddings of the decoder part.

IWeightTensor ffLayer = decodeFFLayer.Process(decOutput, batchSize, g);
IWeightTensor probs = g.Softmax(ffLayer, inPlace: true);

For example, in the gptConsole we get the decOutput from (decOutput, _) = decoder.Decode(inputEmbs, tgtSelfTriMask, batchSize, g); in Decoder.cs

What is the best way to get decOutput for the decoder part in sequence2sequence (especially for DecodeAttentionLSTM)?

Thanks a lot

P.S.: As you know, we are also working on a tentative approach to weight updating along the lines we discussed. We are now exploring some results. We will keep you informed.

zhongkaifu commented 1 year ago

Hi @piedralaves ,

For transformer model, you could check code here: https://github.com/zhongkaifu/Seq2SeqSharp/blob/840dcea4f766aeab5fe03d7fbfa315d9cd6971a5/Seq2SeqSharp/Applications/Decoder.cs#L303-L317

For LSTM attention model, you could check code here: https://github.com/zhongkaifu/Seq2SeqSharp/blob/840dcea4f766aeab5fe03d7fbfa315d9cd6971a5/Seq2SeqSharp/Applications/Decoder.cs#L605-L608
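
For the LSTM attention path, a rough sketch of driving the decoder one timestamp at a time, built only from the calls already shown in this thread (decoder.PreProcess, decoder.Decode, the feed-forward Process, g.Softmax, g.Peek); the variable names are hypothetical, and it assumes the decoder consumes a single-step input of shape (batchSize, EmbeddingDim):

    // Rough illustration only, not code from the repository.
    AttentionPreProcessResult attPreProcessResult = decoder.PreProcess(encOutputs, batchSize, g);

    for (int t = 0; t < seqLen; t++)
    {
        // Embedding of the t-th target token for a batch of size 1 (hypothetical indexing).
        IWeightTensor stepInput = g.Peek(tgtEmbedding, 0, outputSnts[0][t]); // shape: (1, EmbeddingDim)

        IWeightTensor dOutputStep = decoder.Decode(stepInput, attPreProcessResult, batchSize, g);

        // Project this step to vocabulary size and apply softmax.
        IWeightTensor ffStep = decoderFFLayer.Process(dOutputStep, batchSize, g);
        IWeightTensor probsStep = g.Softmax(ffStep, inPlace: true);
        // ... write probsStep out as needed ...
    }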

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Hi again, Zhongkai,

I have been working on what you said, but I have some questions about an attempt.

I have some code (the code below) to encode a batch with only one sentence in test mode. This sentence is just what is produced by the decoder in RunForwardOnSingleDevice in seq2seqConsole in test mode. What we want is to get a dOutput with the result to feed the softmax layer, so we can get the softmax probabilities at each timestamp of this sentence.

But I guess in:

decoder.Decode(inputsM, attPreProcessResult, batchSize, g);

inputsM should have a particular format, shouldn't it?

The code is as follows. Any help is very welcome. We appreciate your code very much.

public static void writeExpectancyEmbeddingsAttentionLSTM(List<List<int>> outputSnts, List<List<String>> Snts, Seq2SeqOptions opts, IComputeGraph g, IWeightTensor encOutputs, AttentionDecoder decoder, IFeedForwardLayer decoderFFLayer, IWeightTensor tgtEmbedding, Vocab tgtVocab, int batchSize, int embeddingDim, bool isTraining = false)
        {

            int eosTokenId = tgtVocab.GetWordIndex(BuildInTokens.EOS, logUnk: true);

            // Initialize variables according to the current mode
            var originalOutputLengths = isTraining ? BuildInTokens.PadSentences(outputSnts, eosTokenId) : null;
            int seqLen = outputSnts[0].Count;

            // Pre-process for attention model
            AttentionPreProcessResult attPreProcessResult = decoder.PreProcess(encOutputs, batchSize, g);
            List<IWeightTensor> inputs = new List<IWeightTensor>();

            for (int i = 0; i < seqLen; i++)
            {
                //outputSnts[0] is a list of words produced by the decoder in RunForwardOnSingleDevice for a sentence in the encoder in test mode
                inputs.Add(g.Peek(tgtEmbedding, 0, outputSnts[0][i]));              

            }
            // How to format a tensor to be decoded? A tensor of a batch with only one sentence in outputSnts[0].
            // We think we managed to do this in GPTconsole:
            IWeightTensor inputsM = g.Concate(inputs, 0);
            IWeightTensor dOutput = decoder.Decode(inputsM, attPreProcessResult, batchSize, g);

            // Writing the softmax output at each timestamp. A method that is working.
            CustomTools.writeExpectancyEmbeddings(decoderFFLayer, g, dOutput, batchSize, embeddingDim, opts, Snts);

        }

zhongkaifu commented 1 year ago

Hi @piedralaves

The shape of inputsM should be (batchsize * sequenceLength, EmbeddingDim). If your batch size is 1, it will be (sequenceLength, EmbeddingDim)

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Sorry Zhongkai, (sequenceLength, EmbeddingDim) is what I have in the code above, but it fails. Thanks.

zhongkaifu commented 1 year ago

What error message did you get? Can you please paste the error message and call stack here?

piedralaves commented 1 year ago

Inconsistent tensor sizes. 0: (4, 15) 1: (1, 15) 2: (1, 15)
   at TensorSharp.Core.TensorConcatenation.ConcatTensorSize(Int32 ndim, Int32 dimension, Tensor[] tensors) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\TensorSharp\Core\TensorConcatenation.cs:line 80
   at TensorSharp.Core.TensorConcatenation.Concat(Tensor result, Int32 dimension, Tensor[] inputs) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\TensorSharp\Core\TensorConcatenation.cs:line 26
   at TensorSharp.Ops.Concat(Tensor result, Int32 dimension, Tensor[] inputs) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\TensorSharp\Ops.cs:line 39
   at Seq2SeqSharp.Tools.ComputeGraphTensor.Concate(List`1 wl, Int32 dim) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Tools\ComputeGraphTensor.cs:line 1622
   at Seq2SeqSharp.Tools.ComputeGraphTensor.Concate(Int32 dim, IWeightTensor[] wl) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Tools\ComputeGraphTensor.cs:line 1584
   at Seq2SeqSharp.LSTMAttentionDecoderCell.Step(IWeightTensor context, IWeightTensor input, IComputeGraph g) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Layers\LSTMAttentionDecoderCell.cs:line 54
   at Seq2SeqSharp.AttentionDecoder.Decode(IWeightTensor input, AttentionPreProcessResult attenPreProcessResult, Int32 batchSize, IComputeGraph g) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Networks\AttentionDecoder.cs:line 95
   at Seq2SeqSharp.Utils.CustomTools.writeExpectancyEmbeddingsAttentionLSTM(List`1 outputSnts, List`1 Snts, Seq2SeqOptions opts, IComputeGraph g, IWeightTensor encOutputs, AttentionDecoder decoder, IFeedForwardLayer decoderFFLayer, IWeightTensor tgtEmbedding, Vocab tgtVocab, Int32 batchSize, Int32 embeddingDim, Boolean isTraining) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Utils\CustomTools.cs:line 253
   at Seq2SeqSharp.Seq2Seq.RunForwardOnSingleDevice(IComputeGraph computeGraph, ISntPairBatch sntPairBatch, DecodingOptions decodingOptions, Boolean isTraining) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Applications\Seq2Seq.cs:line 399
   at Seq2SeqSharp.Tools.BaseSeq2SeqFramework`1.<>c__DisplayClass47_0`1.b__0(Int32 i) in C:\Users\jorge\source\repos\Seq2SeqSharp-RELEASE_2_5_0\Seq2SeqSharp\Tools\BaseSeq2SeqFramework.cs:line 770

zhongkaifu commented 1 year ago

The exception shows these input tensors have different row sizes. Can you please print out the shape (Sizes field) of the tensors in inputs before calling g.Concate?

Thanks Zhongkai Fu

piedralaves commented 1 year ago

The inputs: [screenshot] And inputsM: [screenshot]

piedralaves commented 1 year ago

Ok, let me revise my code. Maybe I will find the cause and make it work. G

zhongkaifu commented 1 year ago

> The inputs: [screenshot] And inputsM: [screenshot]

It seems it has already passed g.Concate. Do you still have any other problems?

piedralaves commented 1 year ago

No. It is ok now. I redesigned the procedure and it seems to work. Thanks a lot.

zhongkaifu commented 1 year ago

Glad to know it.