Saved the corresponding .pth file, but the embedding dimensions don't match when loading it

Aleczhang13 commented 3 years ago

Hello! I tried running your code on the MIND large dataset, but during evaluation the model is loaded with parameter sizes that do not match. The error is as follows:

  File "src/evaluate.py", line 324, in <module>
    model.load_state_dict(checkpoint['model_state_dict'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for NAML:
    size mismatch for news_encoder.text_encoders.abstract.word_embedding.weight: copying a param with shape torch.Size([101359, 300]) from checkpoint, the shape in current model is torch.Size([101221, 300]).
    size mismatch for news_encoder.text_encoders.title.word_embedding.weight: copying a param with shape torch.Size([101359, 300]) from checkpoint, the shape in current model is torch.Size([101221, 300]).

yusanshi commented 3 years ago

Please make sure you are loading the correct checkpoint. In evaluate.py, the code tries to read the checkpoint with the largest index from the target folder: https://github.com/yusanshi/NewsRecommendation/blob/master/src/train.py#L54-L64

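import os


def latest_checkpoint(directory):
    # Return the path of the checkpoint with the largest index, or None.
    if not os.path.exists(directory):
        return None
    # Map the index parsed from names such as "ckpt-1.pth" to the file name.
    all_checkpoints = {
        int(x.split('.')[-2].split('-')[-1]): x
        for x in os.listdir(directory)
    }
    if not all_checkpoints:
        return None
    return os.path.join(directory,
                        all_checkpoints[max(all_checkpoints.keys())])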

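One possible cause: the checkpoint folder contains both new and old checkpoints, an old checkpoint has a larger index than the new ones, and that old checkpoint came from a run of earlier code with a different word embedding size.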

You can delete the whole checkpoint folder, then train and test again, to see whether this is the problem.
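To make stale checkpoints easier to spot before deleting anything, the selection logic above could be sketched in a more defensive form. This is only an illustration, not code from the repository: the names `CKPT_PATTERN`, `list_checkpoints` and `latest_checkpoint_safe` are hypothetical, and the `ckpt-<index>.pth` naming scheme is assumed from the `ckpt-1.pth` file name in the log.

```python
import os
import re

# Assumed naming scheme: ckpt-<index>.pth (as in "ckpt-1.pth" from the log).
CKPT_PATTERN = re.compile(r"^ckpt-(\d+)\.pth$")


def list_checkpoints(directory):
    """Return (index, path) pairs sorted by index; unrelated files are skipped."""
    if not os.path.isdir(directory):
        return []
    found = []
    for name in os.listdir(directory):
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(directory, name)))
    return sorted(found)


def latest_checkpoint_safe(directory):
    """Hypothetical variant of latest_checkpoint that tolerates stray files."""
    checkpoints = list_checkpoints(directory)
    return checkpoints[-1][1] if checkpoints else None
```

Printing the result of `list_checkpoints('./checkpoint/NAML')` would show every candidate file, so a leftover checkpoint with a higher index than the current run stands out.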

Aleczhang13 commented 3 years ago

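OK, thank you for your answer~ I'll try it right away. I noticed that your evaluation splits the work across multiple dataloaders; could that be the cause? Thanks for the quick reply.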

Aleczhang13 commented 3 years ago

Using device: cuda:0
Evaluating model NAML
Load saved parameters in ./checkpoint/NAML/ckpt-1.pth
Traceback (most recent call last):
  File "src/evaluate.py", line 331, in <module>
    model.load_state_dict(checkpoint['model_state_dict'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for NAML:
    size mismatch for news_encoder.text_encoders.abstract.word_embedding.weight: copying a param with shape torch.Size([101359, 300]) from checkpoint, the shape in current model is torch.Size([101221, 300]).
    size mismatch for news_encoder.text_encoders.title.word_embedding.weight: copying a param with shape torch.Size([101359, 300]) from checkpoint, the shape in current model is torch.Size([101221, 300]).

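Hello, I deleted all checkpoints except one, and I ran the evaluation directly on the val dataset rather than the test dataset. What puzzles me is that train also calls eval() on the val dataset, and no error is raised there.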

yusanshi commented 3 years ago

Did you change any of the num* values in config.py?

You should have seen a hint when running the data preprocessing script:

https://github.com/yusanshi/NewsRecommendation/blob/master/src/data_preprocess.py#L285-L287 https://github.com/yusanshi/NewsRecommendation/blob/master/src/config.py#L29

The README mentions it too 😁

yusanshi commented 3 years ago

In your case, just change num_words to 1 + 101358.
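The value 1 + 101358 equals 101359, which is the first dimension of the embedding stored in the checkpoint. A minimal sketch of how such a mismatch could be located follows. It is an illustration only: `find_shape_mismatches` is a hypothetical helper, not part of the repository, and the shapes are hard-coded from the traceback in this thread instead of being read from a real state dict.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that differ."""
    return {
        name: (shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    }


# Shapes copied from the traceback in this thread.
# With PyTorch they could be collected from a state dict as
#   {k: tuple(v.shape) for k, v in state_dict.items()}
key = "news_encoder.text_encoders.title.word_embedding.weight"
checkpoint_shapes = {key: (101359, 300)}
model_shapes = {key: (101221, 300)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (saved, current) in mismatches.items():
    print(f"{name}: checkpoint {saved} vs model {current}")

# The first dimension of the saved embedding is the vocabulary size
# that num_words must match for the checkpoint to load.
required_num_words = checkpoint_shapes[key][0]
print(required_num_words == 1 + 101358)
```

The current model was built with 101221 rows, so num_words in config.py was presumably still at a value that did not match the preprocessed data used for training.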

Aleczhang13 commented 3 years ago

Thank you for your patience, I'll give it a try~

yusanshi commented 3 years ago

I removed the content that contained your personal information 🤣

yusanshi / news-recommendation

Saved the corresponding .pth file, but the embedding dimensions don't match when loading it #5