Dinxin opened this issue 2 years ago
Thanks for asking! FYI, the pre-trained T5 model was obtained directly from Hugging Face, and I believe it was trained on C4. I used the Amazon Review dataset to fine-tune both the extractor (whose initial weights are copied from the encoder) and the encoder.
You can find some results in the writeup: https://github.com/xiyan128/text_style_transfer_transformer/blob/master/writeup/%5BSEE%20THIS%5D%20final_submission.pdf
I'm sorry there's no benchmark available; it was a relatively short course project.
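In case it helps, the extractor initialization is roughly the following (a minimal sketch assuming the Hugging Face transformers API; the t5-small checkpoint name is just a placeholder, not necessarily the exact notebook code):

    import copy
    from transformers import T5ForConditionalGeneration

    # Load the pre-trained T5 and reuse its encoder as the starting point.
    t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
    encoder = t5.get_encoder()
    # The extractor starts as a weight copy of the encoder and is then
    # fine-tuned, together with the encoder, on the Amazon Review data.
    extractor = copy.deepcopy(encoder)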
Thanks for your answer. One more question: what is the GPU configuration for the mini TextSETTR model pre-training? I used 8 P40 GPUs to train this model, and each epoch took about 3.5 hours, so a 10-epoch run would take 35 hours in total, which exceeds the "10 hours" suggested by your checkpoint filename.
Therefore, I am very curious about the GPU configuration of your training and the total time it took.
You can find the GPU configuration in the second cell of the notebook. It was a single P100 GPU.
I don't remember "claiming" 10 epochs -- do you mean the maximum number of epochs was set to 10? The checkpoint was saved when training reached 10 hours (~2 epochs, as I said in the write-up).
Yeah, I mean the maximum number of epochs was set to 10.
So, did you use the 10-hours.ckpt checkpoint to obtain the results in your t5-extractor.ipynb notebook, or a different one?
The 10-hour checkpoint.
I trained the mini TextSETTR model using 8 P40 GPUs for 2 epochs (slightly less than 7 hours), but I could not get satisfactory results like yours. Then I retrained the model for 4 epochs, and the results seem to become better. Are 2 epochs not enough?
Here are the typical results:
Formal input: I hereby commit to never purchase anything from this institution in the future.
Informal output(2 epochs): i u, ive a and if d and and now
Informal output(4 epochs): i u gonna never purchase anything from this brand in ill never buy anything else
There could be some other variable that influenced the model output; for example, the authors mention that "delta scale λ is an important hyperparameter to tune". In my experience, you may want to change λ depending on the input. I do remember obtaining the results in the notebook with a slightly different λ, and yes, on my end 2 epochs can indeed produce some interesting results.
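For reference, the role of λ is roughly the following (a minimal sketch of the delta operation described in the paper; the function name, tensor shapes, and default value are illustrative, not the repo's exact code):

    import torch

    def shift_style_vector(input_style, src_exemplar_style, tgt_exemplar_style,
                           delta_scale=4.0):
        # Move the input's style vector along the source -> target direction.
        # A larger delta_scale pushes the output further toward the target
        # style, but too large a value tends to produce degenerate text.
        delta = tgt_exemplar_style - src_exemplar_style
        return input_style + delta_scale * delta

    # Hypothetical usage with mean-pooled extractor states as style vectors,
    # for a formal -> informal transfer like the example above.
    input_style = torch.randn(1, 512)
    formal_style = torch.randn(1, 512)    # averaged over formal exemplars
    informal_style = torch.randn(1, 512)  # averaged over informal exemplars
    shifted = shift_style_vector(input_style, formal_style, informal_style)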
class AmazonDataset(Dataset):
    def __init__(self, data):
        self.data = data
        self.len = raw_data["sents"].apply(lambda x: len(x) - 1).sum()
The AmazonDataset class implementation may be problematic. Should the variable raw_data be self.data?
It's probably a typo, but it didn't affect correctness.
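With the typo fixed, the constructor would read something like this (a sketch; the __len__ method and the interpretation of the "sents" column are my assumptions, not necessarily the notebook's exact code):

    from torch.utils.data import Dataset

    class AmazonDataset(Dataset):
        def __init__(self, data):
            self.data = data
            # One training pair per adjacent-sentence pair in each review,
            # assuming data["sents"] holds the list of sentences of a review.
            self.len = self.data["sents"].apply(lambda x: len(x) - 1).sum()

        def __len__(self):
            return self.len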
def drop_noise(sent, drop_rate=0.2):
    """
    For drop noise, we drop each token in `sent` with probability `drop_rate`.
    """
    # need to be optimized. During iteration, the `sent` will be changed.
    # num_to_drop = int(((sent > 1).sum() * drop_rate))
    for i in range(int(((sent > 1).sum() * drop_rate))):
        # for i in range(num_to_drop):
        randIdx = np.random.choice(np.where((sent > 1).cpu())[0])
        sent = torch.cat((sent[:randIdx], sent[randIdx + 1:]))
    return sent
It seems that this function for adding noise to sentences may be problematic: during the loop, the number of iterations could change because the content of the variable sent is modified. The same situation would also apply to the add_noise function.
Give me a counterexample if I'm wrong, but Python evaluates a for loop's iterator expression only once. AFAIK, extracting the loop bound into a variable won't make a difference.
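A quick self-contained check (toy tensor values chosen just for illustration):

    import torch

    sent = torch.tensor([5, 7, 9, 2, 0])
    count = 0
    # range(...) is evaluated once, before the first iteration, so mutating
    # `sent` inside the loop does not change the number of iterations.
    for i in range(int((sent > 1).sum())):
        sent = sent[:-1]  # shrink the tensor on every iteration
        count += 1
    print(count)  # prints 4 even though `sent` now has only one element left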
I see what you mean now. There are no bugs here.
The paper mentions that "We provide these rates to the decoder as ranges ...". How can the input ranges make a difference to the decoder?
I am just confused by this part of the paper and wanted to discuss it with you. The authors have not replied to my question.
Hello, I am curious about your reimplementation of Google's few-shot text style transfer model, TextSETTR. In your implementation, you chose the Amazon Review dataset as the pre-training corpus, and it achieves satisfactory results on the formality transfer task. This differs from the original paper, where a general-purpose model fine-tuned on the C4 English dataset obtains satisfying results on a variety of style transfer tasks, including emotiveness, formality, dialect, sentiment, and so on.
So, have you tried a general-domain dataset like C4 as the pre-training corpus for the mini TextSETTR model? If so, were you able to obtain satisfying results on a variety of text style transfer tasks, as in the original paper (I suspect they would be worse than the original paper's because of the scale of the model and corpus)?