zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), a multimodal model for text and images, and more.

Moment of updating weights #66

Closed. piedralaves closed this issue 1 year ago.

piedralaves commented 1 year ago

Hi Zhongkai:

We want to do the following:

At the moment the weights of the embedding matrix are updated, we want to update other words that are not in the sentence. The criterion by which we update them is similarity with the words affected by the update. That is, if a word (the part of the embedding matrix that represents it, its vector) is updated, some other words are also updated proportionally (by a coefficient). To do that, we need to find the function or functions involved in the update.

Tentatively, we call this mechanism "family updating". If we can make it work, it will be deployed to help deal with the phenomenon called "systematic compositionality" in rule acquisition (the "stimulus poverty" phenomenon), which some studies report to be a problem, and which was recently discussed by Chomsky in the New York Times: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html We want to deal with it through two things:

  1. An initialized "proto knowledge" in some part of the embedding matrix.
  2. A mechanism like "family updating" (sketched in the formula just below).
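In symbols, our current tentative reading of this rule (with cosine similarity as the coefficient, the choice we discuss later in the thread) would be:

$$
\Delta e_j \leftarrow c_{ij}\,\Delta e_i, \qquad c_{ij} = \cos(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert\,\lVert e_j \rVert}
$$

where $e_i$ is the embedding vector of a word present in the sentence (a "key" word) and $e_j$ that of a similar word outside it.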

If you are also interested in this kind of thing, we can report our first results to you, and you could even participate in papers and white papers. Remember that we are interested in the cognitive aspects of the models, but also in applications to solving technical problems. In any case, we understand that you are a busy person with important projects. Seq2SeqSharp is one of them, and we really appreciate it.

So the first questions are: At what moment are the weights updated? And at what moment is the update completed? We want to manipulate the embedding at those times. We have explored some parts of the code, but we would like to hear your advice, if possible.

Thanks a lot for everything.

zhongkaifu commented 1 year ago

Hi @piedralaves ,

For the weights of those similar words that also need to be updated, how do you calculate their gradients? Do they use the same gradients as the "key" words, like the formula: [gradient of similar word] = coefficient * [gradient of key word]? Is that correct? If so, how do you calculate the coefficient? Using cosine similarity or something else?

The implementation of weight updates is under the "Seq2SeqSharp\Seq2SeqSharp\Optimizer" folder, which currently has two optimizers: AdamOptimizer.cs and RMSPropOptimizer.cs. Seq2SeqSharp calls the optimizer after the backward pass and gradient calculations. You can take a look at the code there and modify it if necessary. In the BaseSeq2SeqFramework.cs file, it's called in the TrainOneEpoch method; you can find "solver.UpdateWeights(models, processedLine, lr, m_regc, m_weightsUpdateCount);" there.
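For illustration, here is a minimal, self-contained sketch of what a "family updating" pass could look like if applied to the embedding gradients just before the optimizer runs. It uses plain float arrays rather than Seq2SeqSharp's tensor types, and every name in it (`FamilyUpdate`, `embeddings`, `grads`, `keyWords`) is a hypothetical assumption for this sketch, not part of the Seq2SeqSharp API:

```csharp
using System;

// Hypothetical sketch of "family updating": before the optimizer step,
// propagate each updated ("key") word's gradient to similar words,
// scaled by a cosine-similarity coefficient.
public static class FamilyUpdate
{
    // embeddings: [vocabSize][dim] embedding matrix
    // grads:      [vocabSize][dim] gradients accumulated by backprop
    // keyWords:   indices of words that occurred in the batch
    // threshold:  only propagate to words whose similarity exceeds this value
    public static void Apply(float[][] embeddings, float[][] grads,
                             int[] keyWords, float threshold = 0.5f)
    {
        foreach (int i in keyWords)
        {
            for (int j = 0; j < embeddings.Length; j++)
            {
                if (Array.IndexOf(keyWords, j) >= 0) continue; // already has its own gradient

                float c = Cosine(embeddings[i], embeddings[j]);
                if (c <= threshold) continue;

                // [gradient of similar word] += coefficient * [gradient of key word]
                for (int d = 0; d < grads[j].Length; d++)
                    grads[j][d] += c * grads[i][d];
            }
        }
    }

    private static float Cosine(float[] a, float[] b)
    {
        float dot = 0f, na = 0f, nb = 0f;
        for (int d = 0; d < a.Length; d++)
        {
            dot += a[d] * b[d];
            na  += a[d] * a[d];
            nb  += b[d] * b[d];
        }
        return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-8f);
    }
}
```

The point is only the placement: such a pass would run after the backward/gradient calculation and before the call to solver.UpdateWeights(...), so the optimizer then applies the enlarged set of gradients as usual.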

As for the article at the link above, sorry, I don't have a subscription to the New York Times, so I cannot read it. Maybe it mentions the problem you are trying to deal with and how to apply it in the real world. If you don't mind, could you please explain in more detail the findings you have?

Yes, I'm really busy with my daily work, but I would be glad to help you with any questions about Seq2SeqSharp, and to discuss NLP and machine learning problems in my spare time. :)

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Thanks a lot.

Yes, we planned to use the same gradients as the key words and to calculate the coefficient based on cosine similarity, as in vector space models. Do you think that is a reasonable approximation?

We will review what you said and let you know.

In this message we are sending you some of the papers that discuss, among other things, what we are dealing with.

Linguistic generalization and compositionality in.pdf (https://github.com/zhongkaifu/Seq2SeqSharp/files/11081939/Linguistic.generalization.and.compositionality.in.pdf)

Exploring_Processing_of_Nested_Dependencies_in_Neu.pdf (https://github.com/zhongkaifu/Seq2SeqSharp/files/11081933/BaroniRNN.pdf)

Thanks a lot

zhongkaifu commented 1 year ago

Thanks @piedralaves . I will take a look.

It seems one of the challenging parts is how to deal with weight updates of ambiguous similar words in different contexts.

Thanks Zhongkai Fu

piedralaves commented 1 year ago

Let me think about it, but in principle, I guess it is not a big problem.

  1. Each polysemous (ambiguous) word will be updated considerably by different key words corresponding to its different contexts.
  2. If a polysemous (ambiguous) word is updated by a key word corresponding to one context, only the weights (the latent dimensions) of that context are mainly impacted.
  3. In our proposal, part of the embedding matrix (part of the weights) is initialized with a proto knowledge "informing" whether a word is an action or an object (or both, or something else), so the syntactic role is present in the similarity calculation. That is important for syntactic legitimacy in sentence generation. This "baby proto knowledge" is the seed from which humans learn further implicit rules (some studies on skill development propose this). We think it is also important for networks dealing with compositionality (see the sketch after this list).
  4. In functional terms, the embedding matrix can be a static representation of words that is updated over a sequence of sample presentations. It doesn't matter if a word is represented, or even updated, in an ambiguous way in the embedding matrix; the important thing is its representation in a situated manner. For this reason, we are now writing out the contextual embeddings, also in order to take measures of our proposal.
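As an illustration of point 3, here is a minimal sketch of how part of each embedding row could be reserved for such proto knowledge. The role flags and the layout are hypothetical assumptions for this sketch, not anything from Seq2SeqSharp:

```csharp
using System;

// Hypothetical sketch: reserve the first few dimensions of each embedding
// row for fixed "proto knowledge" (syntactic role flags) and initialize
// the remaining dimensions randomly, to be learned as usual.
public enum ProtoRole { Object, Action, Both, Other }

public static class ProtoInit
{
    public static float[] InitRow(int dim, ProtoRole role, Random rng)
    {
        var row = new float[dim];

        // Dimensions 0-1 encode the role; they carry the prior that the
        // similarity computation can pick up when choosing "family" words.
        row[0] = (role == ProtoRole.Object || role == ProtoRole.Both) ? 1f : 0f;
        row[1] = (role == ProtoRole.Action || role == ProtoRole.Both) ? 1f : 0f;

        // Remaining dimensions: small random initialization.
        for (int d = 2; d < dim; d++)
            row[d] = (float)(rng.NextDouble() - 0.5) * 0.1f;

        return row;
    }
}
```

Because the role dimensions participate in the cosine computation, words sharing a syntactic role would start out more similar, so family updating would be biased toward role-consistent neighbors.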

At this time, we are testing some issues as a conceptual test, working like a laboratory, but we want to have the "family updating" mechanism ready in order to put to the test a full version of compositionality, and even the possibility of a better kind of generalization that does not depend on the training sample.

Any remarks on the points above will be appreciated. We are obviously very interested in your background.