Dinxin opened this issue 2 years ago
Thanks for asking! FYI, the pre-trained T5 model was obtained directly from Hugging Face, and I believe it was trained on C4. I used the Amazon Review dataset to fine-tune both the extractor (whose initial weights are copied from the encoder) and the encoder.
You can find some results in the writeup: https://github.com/xiyan128/text_style_transfer_transformer/blob/master/writeup/%5BSEE%20THIS%5D%20final_submission.pdf
I'm sorry there's no benchmark available; it was a relatively short course project.
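In case it helps, the extractor initialization is roughly the following (a minimal sketch assuming the Hugging Face transformers API; the t5-small checkpoint name is just a placeholder, not necessarily the exact notebook code):

    import copy
    from transformers import T5ForConditionalGeneration

    # Load the pre-trained T5 and reuse its encoder as the starting point.
    t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
    encoder = t5.get_encoder()
    # The extractor starts as a weight copy of the encoder and is then
    # fine-tuned, together with the encoder, on the Amazon Review data.
    extractor = copy.deepcopy(encoder)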
Thanks for your answer. One more question: what is the GPU configuration for the mini TextSETTR model pre-training? I used 8 P40 GPUs to train this model, and each epoch took about 3.5 hours, so a 10-epoch run would take 35 hours in total, which exceeds the "10 hours" suggested by your checkpoint filename.
Therefore, I am very curious about the GPU configuration of your training and the total time it took.
You can find the GPU configuration in the second cell of the notebook. It was a single P100 GPU.
I don't remember "claiming" 10 epochs -- do you mean the maximum number of epochs was set to 10? The checkpoint was saved when training reached 10 hours (~2 epochs, as I said in the write-up).
Yeah, I mean the maximum number of epochs was set to 10.
So, did you use the 10-hours.ckpt checkpoint to obtain the results in your t5-extractor.ipynb notebook, or a different one?
The 10-hour checkpoint.
I trained the mini TextSETTR model using 8 P40 GPUs for 2 epochs (slightly less than 7 hours), but I could not get satisfactory results like yours. Then I retrained the model for 4 epochs, and the results seem to become better. Are 2 epochs not enough?
Here are the typical results:
Formal input: I hereby commit to never purchase anything from this institution in the future.
Informal output(2 epochs): i u, ive a and if d and and now
Informal output(4 epochs): i u gonna never purchase anything from this brand in ill never buy anything else
There could be some other variable that influenced the model output; for example, the authors mention that "delta scale λ is an important hyperparameter to tune". In my experience, you may want to change λ depending on the input. I do remember obtaining the results in the notebook with a slightly different λ, and yes, on my end 2 epochs can indeed produce some interesting results.
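For reference, the role of λ is roughly the following (a minimal sketch of the delta operation described in the paper; the function name, tensor shapes, and default value are illustrative, not the repo's exact code):

    import torch

    def shift_style_vector(input_style, src_exemplar_style, tgt_exemplar_style,
                           delta_scale=4.0):
        # Move the input's style vector along the source -> target direction.
        # A larger delta_scale pushes the output further toward the target
        # style, but too large a value tends to produce degenerate text.
        delta = tgt_exemplar_style - src_exemplar_style
        return input_style + delta_scale * delta

    # Hypothetical usage with mean-pooled extractor states as style vectors,
    # for a formal -> informal transfer like the example above.
    input_style = torch.randn(1, 512)
    formal_style = torch.randn(1, 512)    # averaged over formal exemplars
    informal_style = torch.randn(1, 512)  # averaged over informal exemplars
    shifted = shift_style_vector(input_style, formal_style, informal_style)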
class AmazonDataset(Dataset):
    def __init__(self, data):
        self.data = data
        self.len = raw_data["sents"].apply(lambda x: len(x) - 1).sum()
The AmazonDataset class implementation may be problematic. Should the variable raw_data be self.data?
It's probably a typo, but it didn't affect correctness.
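With the typo fixed, the constructor would read something like this (a sketch; the __len__ method and the interpretation of the "sents" column are my assumptions, not necessarily the notebook's exact code):

    from torch.utils.data import Dataset

    class AmazonDataset(Dataset):
        def __init__(self, data):
            self.data = data
            # One training pair per adjacent-sentence pair in each review,
            # assuming data["sents"] holds the list of sentences of a review.
            self.len = self.data["sents"].apply(lambda x: len(x) - 1).sum()

        def __len__(self):
            return self.len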
def drop_noise(sent, drop_rate=0.2):
    """
    For drop noise, we drop each token in `sent` with probability `drop_rate`.
    """
    # need to be optimized. During iteration, the `sent` will be changed.
    # num_to_drop = int(((sent > 1).sum() * drop_rate))
    for i in range(int(((sent > 1).sum() * drop_rate))):
        # for i in range(num_to_drop):
        randIdx = np.random.choice(np.where((sent > 1).cpu())[0])
        sent = torch.cat((sent[:randIdx], sent[randIdx + 1:]))
    return sent
It seems that this function for adding noise to sentences may be problematic: during the loop, the number of iterations could change because the content of the variable sent is modified. The same situation would also apply to the add_noise function.
Give me a counterexample if I'm wrong, but Python evaluates a for loop's iterator expression only once. AFAIK, extracting the loop bound into a variable won't make a difference.
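A quick self-contained check (toy tensor values chosen just for illustration):

    import torch

    sent = torch.tensor([5, 7, 9, 2, 0])
    count = 0
    # range(...) is evaluated once, before the first iteration, so mutating
    # `sent` inside the loop does not change the number of iterations.
    for i in range(int((sent > 1).sum())):
        sent = sent[:-1]  # shrink the tensor on every iteration
        count += 1
    print(count)  # prints 4 even though `sent` now has only one element left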
I see what you mean now. There are no bugs here.
The paper mentions that "We provide these rates to the decoder as ranges ...". How can the input ranges make a difference to the decoder?
I am just confused by this part of the paper and wanted to discuss it with you. The authors have not replied to my question.
Hello, I am curious about your reimplementation of Google's few-shot text style transfer model, TextSETTR. In your implementation, you chose the Amazon Review dataset as the pre-training corpus, and it achieves satisfactory results on the formality transfer task. This differs from the original paper, where a general-purpose model fine-tuned on the C4 English dataset obtains satisfying results on a variety of style transfer tasks, including emotiveness, formality, dialect, sentiment, and so on.
So, have you tried a general-domain dataset like C4 as the pre-training corpus for the mini TextSETTR model? If so, were you able to obtain satisfying results on a variety of text style transfer tasks, as in the original paper (I suspect they would be worse than the original paper's because of the scale of the model and corpus)?