xuewyang / Fashion_Captioning

ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.

About structure details and attribute learning #5

Open tangyuhao2016 opened 3 years ago

tangyuhao2016 commented 3 years ago

Thank you. I have some questions while reproducing the model.

1. The paper says "the encoder is a pre-trained CNN, which takes an image as the input and extracts B image features, X = {x0, x1, ..., xB}". Does X refer to the feature map (batchsize x 2048 x 14 x 14) output by the last convolutional layer of ResNet-101?

2. In Figure 3, the average-pooled feature map (batchsize x 2048) is fed into the feed-forward network. How many layers does the FF network have: only one layer (2048 x 990) followed by a sigmoid, or more? And is z taken as the FF output before the sigmoid or after the sigmoid?

3. Is attribute learning pretrained separately and then added to fine-tune the caption model, or are attribute learning and the caption model trained together from the beginning?

4. Once we have z, is z concatenated with y (the word embedding) as the input to the caption model, or concatenated with the image features output by the attention model?

Looking forward to your reply.

xuewyang commented 3 years ago


1. Yes.
2. I only use one layer. Your understanding of z is correct.
3. The latter: they are trained together from the beginning.
4. The former: z is concatenated with the word embedding.
5. I have updated the aligned dataset; see the new link in the README.
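
To make the answers above concrete, here is a minimal PyTorch sketch of one way to read them. It is illustrative only, not the actual repo code: the 990-attribute size comes from the question, while the input resolution, vocabulary, and embedding sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torchvision


class AttributeHead(nn.Module):
    """Average-pool the encoder feature map and predict attributes with one FC layer."""

    def __init__(self, feat_dim=2048, num_attrs=990):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # (N, 2048, 14, 14) -> (N, 2048, 1, 1)
        self.fc = nn.Linear(feat_dim, num_attrs)  # a single layer, 2048 -> 990

    def forward(self, feat_map):
        pooled = self.pool(feat_map).flatten(1)   # (N, 2048)
        logits = self.fc(pooled)
        z = torch.sigmoid(logits)                 # z taken after the sigmoid here;
        return z, logits                          # the thread leaves pre- vs post-sigmoid implicit


# Encoder: pre-trained ResNet-101 with the avgpool/fc head removed,
# so it outputs the (N, 2048, H/32, W/32) feature map.
resnet = torchvision.models.resnet101(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-2])

images = torch.randn(2, 3, 448, 448)              # illustrative 448x448 input -> 14x14 map
feat_map = encoder(images)                        # (2, 2048, 14, 14)
z, attr_logits = AttributeHead()(feat_map)

# Attribute learning and captioning are trained jointly; at each decoding step
# z is concatenated with the word embedding y_t as the decoder input.
embed = nn.Embedding(10000, 512)                  # illustrative vocab and embedding sizes
y_t = embed(torch.tensor([5, 42]))                # (2, 512)
decoder_input = torch.cat([z, y_t], dim=-1)       # (2, 990 + 512)
```
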
tangyuhao2016 commented 3 years ago

> 5. I have updated the aligned dataset; see the new link in the README.

Thank you for the update. I find that train.hdf5 was about 218 GB before the update and is about 163 GB after, test.hdf5 was about 27 GB before and is about 19 GB after, and val.hdf5 in particular was about 27 GB before but is only about 4 GB after. Has the number of images in the training, validation, and test sets been reduced?

xuewyang commented 3 years ago

> Has the number of images in the training, validation, and test sets been reduced?

No. Instead of 384x256, I now keep the images at 256x256, although in truth the height of the images is 1.5 times the width. I also moved some data from validation to training, since we don't actually need 100K images for validation and it takes too long to validate.

xuewyang commented 3 years ago

Let me know if you can't download them. I may split them into several parts and upload them to OneDrive instead.

tangyuhao2016 commented 3 years ago


Thank you. But how can we align the hdf5 data with the raw images if I want to visualize results on the original images? With the former link, I could find the image id from image_path.json.

xuewyang commented 3 years ago

Maybe you can try to get the caption first and then align it with the original data.

tangyuhao2016 commented 3 years ago


Thank you very much. Can I take the current data split as the final official split?

xuewyang commented 3 years ago

> How can we align the hdf5 data with the raw images if I want to visualize results on the original images?

I think you can use the low-resolution images but rescale them to 384x256 for visualization. I personally don't think getting the original resolution is necessary, but if you do need it, you can probably align by matching the captions.
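
A minimal sketch of that rescaling, assuming the images are stored as uint8 HxWx3 arrays under an `images` dataset (the key name and layout are guesses, not the actual hdf5 schema):

```python
import h5py
import numpy as np
from PIL import Image

# Illustrative only: the dataset key and image layout are assumptions.
with h5py.File("val.hdf5", "r") as f:
    img = np.asarray(f["images"][0], dtype=np.uint8)       # stored at 256x256

# Stretch back to the rough original aspect ratio (H = 1.5 * W) for display.
# PIL's resize takes (width, height), so this yields a 384-high, 256-wide image.
Image.fromarray(img).resize((256, 384), Image.BILINEAR).save("vis_0.png")
```
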

xuewyang commented 3 years ago

> Can I take the current data split as the final official split?

Yes. I will also update the arXiv paper.

tangyuhao2016 commented 3 years ago

> Let me know if you can't download them. I may split them into several parts and upload them to OneDrive instead.

Thank you, I have downloaded the new hdf5 data.

I also want to ask: when learning captioning together with attribute learning, how do you fuse the two losses, by adding them directly or by weighting them in some proportion? In my experiments, direct addition did not lead to a substantial improvement. For now I use XE (cross-entropy) training instead of RL.
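
To be concrete about what I mean by fusing the two losses, here is a minimal sketch; the choice of loss functions, the pad-token index, and the weight are assumptions, not the paper's exact setup. Direct addition is the special case `lambda_attr = 1.0`.

```python
import torch.nn as nn

xe_loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # assumes index 0 is the pad token
bce_loss_fn = nn.BCEWithLogitsLoss()               # multi-label attribute loss


def total_loss(word_logits, word_targets, attr_logits, attr_labels, lambda_attr=1.0):
    # Caption XE loss over all time steps, flattened to (N*T, vocab).
    caption_loss = xe_loss_fn(word_logits.reshape(-1, word_logits.size(-1)),
                              word_targets.reshape(-1))
    # Attribute loss on the pre-sigmoid logits.
    attr_loss = bce_loss_fn(attr_logits, attr_labels.float())
    # Weighted sum; lambda_attr = 1.0 is plain addition.
    return caption_loss + lambda_attr * attr_loss
```
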

tangyuhao2016 commented 3 years ago

> Yes. I will also update the arXiv paper.

I find that the new data split has a problem. In the val split, about 400 items (about 3,000 images) are the same as in train. In the test split, about 3,000 items (about 20,000 images) are the same as in train and val. So I keep the train split unchanged and remove items from the val and test splits if they also appear in the train split. After this revision, the val split has about 16,000 images and the test split has about 82,000 images.

I trained the att2in model from ruotian's code on this new data split.

The evaluation results are as follows:

| BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|--------|--------|--------|--------|--------|---------|-------|
| 20.3   | 8.9    | 4.9    | 3.3    | 7.6    | 17.5    | 24.2  |

I do not know where the problem is; I have checked the data processing and the model several times.

When will you update the latest results in your arXiv paper? I think an authoritative baseline is very important for this task.

xuewyang commented 3 years ago

I am running the experiments now and will update the paper in about one or two weeks. How do you know that there is the same data in val as in train?

xuewyang commented 3 years ago

It's possible that some items have the same captions but are actually different in some way, colors for example.

tangyuhao2016 commented 3 years ago

> How do you know that there is the same data in val as in train?

Because you provide attrs.json and the caption json, I use these two as control conditions and align them with the imgid from meat_all_129927.json. I think very few images should meet both conditions at the same time. I looked at some examples and found that their descriptions and attributes are exactly the same; for example, items 63, 89, 90, 256, 297, ... seem to appear in both the train split and the val split.
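
The check I ran is roughly the sketch below; the file names and JSON layout are placeholders, not the exact files from the release.

```python
import json


def load_keys(caption_path, attr_path):
    """Build a (caption, sorted attributes) key per item id."""
    captions = json.load(open(caption_path))   # assumed: {item_id: caption}
    attrs = json.load(open(attr_path))         # assumed: {item_id: [attribute, ...]}
    return {item_id: (captions[item_id], tuple(sorted(attrs.get(item_id, []))))
            for item_id in captions}


train_keys = set(load_keys("train_captions.json", "train_attrs.json").values())
val_items = load_keys("val_captions.json", "val_attrs.json")

# val items whose (caption, attributes) pair also appears somewhere in train
overlap = [item_id for item_id, key in val_items.items() if key in train_keys]
print(f"{len(overlap)} val items share caption + attributes with train")
```
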

xuewyang commented 3 years ago

Yes, that is possible. When I scraped the image data, I found the same problem: some captions of different items are actually the same. It happens rarely, and I think that is fine because the number is small.

tangyuhao2016 commented 3 years ago

> It's possible that some items have the same captions but are actually different in some way, colors for example.

Could you check whether the data splits overlap, or provide the specific partition files (item ids for train, val, and test respectively), so that the final comparison stays fair? Could you also release some of your experimental results?

tangyuhao2016 commented 3 years ago

> Yes, that is possible. When I scraped the image data, I found the same problem: some captions of different items are actually the same. It happens rarely, and I think that is fine because the number is small.

I find that about 372 items (about 3,000 images) have exactly the same attributes and captions in train and val, about 1,673 items are exactly the same between train and test, and 45 items are the same between test and val.

xuewyang commented 3 years ago

Be aware of this line: it should be `self.imgs[i] / 255`.
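
For context, the normalization being pointed out would look roughly like this in an hdf5-backed dataset; this is an assumed sketch, not the repo's actual class, and the dataset key is a placeholder.

```python
import h5py
import torch
from torch.utils.data import Dataset


class HDF5CaptionImages(Dataset):
    """Illustrative sketch only, not the actual dataset class in the repo."""

    def __init__(self, hdf5_path):
        self.h = h5py.File(hdf5_path, "r")
        self.imgs = self.h["images"]     # assumed uint8 pixels in [0, 255]

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, i):
        # The point above: divide by 255 so pixel values land in [0, 1].
        return torch.from_numpy(self.imgs[i]).float() / 255.
```
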

tangyuhao2016 commented 3 years ago

> Be aware of this line: it should be `self.imgs[i] / 255`.

Thank you, I had already fixed this problem.

1. Could you check whether the new data splits are mixed, or provide the item ids for train, val, and test?

2. In my experiments, a larger batch size gives higher performance: even when CIDEr on the train split reaches 210, CIDEr on val still increases slowly. What batch size and learning rate do you use, and do you run the experiments on multiple GPUs?

xuewyang commented 3 years ago
  1. The item ids are different, so they are not mixed.
  2. I used a batch size of 20. I am trying ruotian's repo.
tangyuhao2016 commented 3 years ago

1. That is strange. I checked the caption.json and attribute.json you provided again and found that some entries are exactly the same across train, val, and test, and the number is not small. Could you provide the item ids for train, val, and test?

2. Thank you. Do you have any results yet? I trained the att2in model on a 2080 Ti; training is very slow with batch size 32, and CIDEr reaches about 30 after 30 epochs. But on a P100 with batch size 128, CIDEr reaches about 36 after 18 epochs.

BrandonHanx commented 2 years ago

Same questions here. Could you please share the item ID for each image? I cannot match the caption to the image.

xuewyang commented 2 years ago

I will share all the data. I may not have much time to look into it. But I will try.