zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Apache License 2.0

Word Embeddings #39

Closed gayatrivenugopal closed 5 years ago

gayatrivenugopal commented 5 years ago

Can we retrieve word embeddings from the model?

kimiyoung commented 5 years ago

Sure. See https://github.com/zihangdai/xlnet/blob/master/xlnet.py#L278

kottas commented 5 years ago

Could someone please elaborate on @kimiyoung's answer? I would like to perform a "BERT-like" word-embedding extraction from the pretrained model.

gayatrivenugopal commented 5 years ago

My objective is the same, but I need the embeddings for a different language. For English, you could try loading an existing XLNet model and calling get_embedding_table to get the vectors. Not sure about this, though...

Arpan142 commented 5 years ago

I'm new to this field. For embeddings, if I want to use the 'Custom usage of XLNet', I have to tokenize my input file first with SentencePiece to get the input_ids, right?

SivilTaram commented 5 years ago

@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

kottas commented 5 years ago

> @kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

Right.

SivilTaram commented 5 years ago

@kottas There is currently no explicit interface for that purpose. However, I expect the authors or other developers familiar with TensorFlow will publish instructions for extracting contextual embeddings. :)

kimiyoung commented 5 years ago

get_sequence_output() returns contextual embeddings, while get_embedding_table() returns non-contextual embeddings. An example of tokenization has also been added.
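As a quick sketch of the difference (assuming xlnet_model has been built as in the README's "Custom usage of XLNet" example; shapes are indicative, not tested here):

```python
# Assuming xlnet_model = xlnet.XLNetModel(...) was built as in the README's
# "Custom usage of XLNet" example.

# Contextual: one vector per input token, roughly [seq_len, batch_size, hidden_size].
sequence_output = xlnet_model.get_sequence_output()

# Non-contextual: the raw lookup table, [vocab_size, hidden_size]
# (e.g. 32000 x 1024 for the released large model), independent of the input.
embedding_table = xlnet_model.get_embedding_table()
```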

Arpan142 commented 5 years ago

@kimiyoung Can you please tell me what input_mask I have to provide to get the word embeddings? I passed None, and it gives an error saying 'Expected binary or unicode string, got [20135, 17, 88, 10844, 4617]', where [20135, 17, 88, 10844, 4617] is the SentencePiece encoding of the first line of my data.

kimiyoung commented 5 years ago

If there's nothing to mask, you can set input_mask to None. This error most likely has another cause; it would help if you posted more details.
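For reference, if you do need a mask, the convention in the classifier utilities (classifier_utils.py) is a float array with 1.0 for padding positions and 0.0 for real tokens, with padding placed at the front. A small illustrative example (values are made up):

```python
import numpy as np

# A 5-token example left-padded to length 8: 1.0 marks padding, 0.0 marks real tokens.
input_mask = np.array([1., 1., 1., 0., 0., 0., 0., 0.], dtype=np.float32)
```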

Arpan142 commented 5 years ago

```python
import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids
from absl import flags
import sys

FLAGS = flags.FLAGS
spiece_model_file = 'D://xlnet_cased_L-24_H-1024_A-16//xlnet-master//spiece.model'
sp_model = spm.SentencePieceProcessor()
xp = []
sp_model.Load(spiece_model_file)

with open('input.txt') as foo:
    text = foo.readline()
    while text:
        text = preprocess_text(text, lower=False)
        print(text)

        ids = encode_ids(sp_model, text)
        ids = sp_model.EncodeAsPieces(text.encode('utf-8'))
        xp.append(ids)
        text = foo.readline()

import pickle
with open('token1.pickle', 'wb') as get:
    pickle.dump(xp, get)
```

I used this code for tokenization, trying both unicode encoding and id encoding. Then I used the following code for the word embeddings.

```python
import xlnet
from data_utils import SEP_ID, CLS_ID
from absl import flags
import pickle
import numpy as np
import sys

SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4

import os
import tensorflow as tf

def assign_to_gpu(gpu=0, ps_dev="/device:CPU:0"):
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            return ps_dev
        else:
            return "/gpu:%d" % gpu
    return _assign

flags.DEFINE_bool("use_tpu", False, help="whether to use TPUs")
flags.DEFINE_bool("use_bfloat16", False, help="whether to use bfloat16")
flags.DEFINE_float("dropout", default=0.1, help="Dropout rate.")
flags.DEFINE_float("dropatt", default=0.1, help="Attention dropout rate.")
flags.DEFINE_enum("init", default="normal", enum_values=["normal", "uniform"],
                  help="Initialization method.")
flags.DEFINE_float("init_range", default=0.1, help="Initialization std when init is uniform.")
flags.DEFINE_float("init_std", default=0.02, help="Initialization std when init is normal.")
flags.DEFINE_integer("clamp_len", default=-1, help="Clamp length")
flags.DEFINE_integer("mem_len", default=70, help="Number of steps to cache")
flags.DEFINE_integer("reuse_len", 256,
                     help="Number of token that can be reused as memory. "
                          "Could be half of seq_len.")
flags.DEFINE_bool("bi_data", default=True,
                  help="Use bidirectional data streams, i.e., forward & backward.")
flags.DEFINE_bool("same_length", default=False, help="Same length attention")

with open('token.pickle', 'rb') as new:
    tokens = pickle.load(new)

input_ids = np.asarray(tokens)
seg_ids = None
input_mask = None

FLAGS = flags.FLAGS
FLAGS.use_tpu = False
FLAGS.bi_data = False
FLAGS(sys.argv)

xlnet_config = xlnet.XLNetConfig(json_path='D://xlnet_cased_L-24_H-1024_A-16//xlnet_config.json')
run_config = xlnet.create_run_config(is_training=False, is_finetune=False, FLAGS=FLAGS)
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)
embed = xlnet_model.get_embedding_table()
```

For both unicode encoding and id encoding, the code gave the same error.

kimiyoung commented 5 years ago

You need to pass a placeholder into xlnet, and use a tf session to fetch the output from xlnet. In other words, you need to construct a computational graph first, and then do the actual computation on it. You may find the tutorials and guides useful.
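A rough, untested sketch of that pattern (FLAGS is assumed to be set up with the usual model flags as in your script, the checkpoint path is a placeholder, and input_mask is left as None since there is nothing to mask):

```python
import numpy as np
import tensorflow as tf
import xlnet

# 1) Build the computational graph once, using placeholders instead of concrete arrays.
input_ids = tf.placeholder(tf.int32, shape=[None, None])  # [seq_len, batch_size]
seg_ids = tf.placeholder(tf.int32, shape=[None, None])    # [seq_len, batch_size]

xlnet_config = xlnet.XLNetConfig(json_path="xlnet_config.json")
run_config = xlnet.create_run_config(is_training=False, is_finetune=True, FLAGS=FLAGS)

xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=None)

sequence_output = xlnet_model.get_sequence_output()  # [seq_len, batch_size, hidden_size]

# 2) Only then run the actual computation, feeding numpy arrays for one tokenized sentence.
token_ids = np.array([[20135], [17], [88], [10844], [4617]], dtype=np.int32)  # [seq_len, 1]

with tf.Session() as sess:
    tf.train.Saver().restore(sess, "xlnet_model.ckpt")  # path to the pretrained checkpoint
    embeddings = sess.run(sequence_output, feed_dict={
        input_ids: token_ids,
        seg_ids: np.zeros_like(token_ids),
    })

print(embeddings.shape)  # (5, 1, hidden_size)
```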

cpury commented 5 years ago

Great job on this model and thanks for publishing the code!

Unfortunately, the code is not very nice to use for simple tasks. I've been trying to load the model and get the output for a single string. I gave up after 3 hours. There are just too many TF details that I have to deal with before I can even use the model...

It would be amazing if you could provide a simpler API and more modular helpers. I don't know why a lot of the helper functions take the FLAGS argument. Don't you want people to use your library outside of scripts?

kimiyoung commented 5 years ago

Thanks for your suggestion. We will try to improve the interface.

As for how to use it as is, if you look at the code here, the only thing created from FLAGS is the run_config. Alternatively, you can construct a RunConfig directly.
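For example, something along these lines (a sketch; see the RunConfig constructor in xlnet.py for the exact arguments and defaults):

```python
import xlnet

# Build the run config by hand instead of going through absl FLAGS.
run_config = xlnet.RunConfig(
    is_training=False,
    use_tpu=False,
    use_bfloat16=False,
    dropout=0.0,
    dropatt=0.0)
```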

cpury commented 5 years ago

Thanks! That example indeed looks simple, but this omitted part is my problem:

> initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

If you could give an example of how to do this for a single sample or a set of samples, that would be amazing. I tried with your data_utils and model_utils, but they are not well documented and mostly require a FLAGS object. I also tried following the logic of the classifier examples but just got lost in a maze.

matthias-samwald commented 5 years ago

I agree that it would be great to have a simple notebook showing us how to turn a string (phrase, sentence, paragraph, etc.) into numeric features!

Arpan142 commented 5 years ago

@kimiyoung I tried to use the 'Custom usage of XLNet' for sentence embeddings, but I'm getting the vocabulary embeddings. My dataset contains around 27000 lines, but the output I'm getting has dimension 32000 x 1024. Any idea what I'm doing wrong? Any suggestion would be of great help to me.

Dhanasekar-S commented 5 years ago

@Arpan142 Exactly the same issue! get_embedding_table() just gives the embeddings for the 32000 vocabulary tokens of the trained model itself.

Hazoom commented 5 years ago

@gayatrivenugopal I've just opened a Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, such that each line:

  1. Contains the contextual word embeddings for each token.
  2. Contains a pooled vector from all the tokens, using the pooling strategy input parameter (see the toy illustration below).

I hope it will be useful.
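If it helps, the pooling strategy just refers to how the per-token contextual vectors are collapsed into a single sentence vector. A toy illustration (made-up array, not the actual code from the PR):

```python
import numpy as np

# token_vectors: contextual embeddings of one sentence, shape [num_tokens, hidden_size].
token_vectors = np.random.randn(7, 1024)

mean_pooled = token_vectors.mean(axis=0)  # "mean" strategy: average over all tokens
max_pooled = token_vectors.max(axis=0)    # "max" strategy: element-wise maximum
print(mean_pooled.shape, max_pooled.shape)  # each is (1024,)
```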

cpury commented 5 years ago

@Hazoom Awesome, thank you! That seems to answer all my questions. It would be great if it could get merged!

gayatrivenugopal commented 5 years ago

> @gayatrivenugopal I've just opened a Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, such that each line:
>
>   1. Contains the contextual word embeddings for each token.
>   2. Contains a pooled vector from all the tokens, using the pooling strategy input parameter.
>
> I hope it will be useful.

That's GREAT!!! Will try it out and let you know. Thank you!

Hazoom commented 5 years ago

@gayatrivenugopal @cpury Please check again; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

hiwaveSupport commented 5 years ago

Just tried running it and I got the JSON output of word embeddings. I used the gpu_extract script to get the word embeddings, with Python 2.7.

hiwaveSupport commented 5 years ago

@Hazoom -- how do I force the use of the GPU with the gpu_extract script? I currently have 1 GPU but I'm not sure how to specify it, as by default it runs on the CPUs. Thanks in advance.

gayatrivenugopal commented 5 years ago

> @gayatrivenugopal @cpury Please check again; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

Thanks a lot. This is extremely useful. I ran the script and got the JSON output successfully. Thanks again!