ykotseruba / PedestrianActionBenchmark

Code and models for the WACV 2021 paper "Benchmark for Evaluating Pedestrian Action Prediction"
https://openaccess.thecvf.com/content/WACV2021/papers/Kotseruba_Benchmark_for_Evaluating_Pedestrian_Action_Prediction_WACV_2021_paper.pdf
MIT License

Asking about C3D and PCPA #2

Closed · OSU-Haolin closed 3 years ago

OSU-Haolin commented 3 years ago

Hi,

I am looking into your model and benchmark. Since you are experienced in this topic, I'd like to ask about some details of the C3D and PCPA models. From a coding perspective, there are two options: 1. directly using the C3D model by calling the C3D class; 2. keeping the C3D model inside the PCPA class while removing the other features (speed, box, pose) as well as the attention modules. Will these two options give the same results? In my view, they both use only the C3D model for prediction. If not, how do they differ in detail?

I am interested in these models and want to do some exploration based on your repo. Could you clarify this for me?

Thanks, Haolin

ykotseruba commented 3 years ago

Hi, C3D without the top layer is used as one of the processing streams in PCPA; the C3D network itself was not modified. If you remove the other streams (RNN encoder-decoders for speed, box, and pose) and the attention modules from PCPA, you'll get C3D. The results from the stripped PCPA and C3D should be the same as long as the dense layers are the same.
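
Conceptually, a minimal sketch of what I mean (the layer shapes here are purely illustrative, not the exact C3D definition from the repo):

```python
from tensorflow.keras.layers import Input, Conv3D, GlobalAveragePooling3D, Dense
from tensorflow.keras.models import Model

# Illustrative stand-in for the C3D backbone without its top layer;
# the real network in the repo is the full (unmodified) C3D architecture.
def c3d_backbone(input_shape=(16, 112, 112, 3)):
    inp = Input(shape=input_shape, name='images')
    x = Conv3D(64, (3, 3, 3), padding='same', activation='relu')(inp)
    x = GlobalAveragePooling3D()(x)
    x = Dense(512, activation='relu')(x)  # penultimate feature layer
    return Model(inputs=inp, outputs=x)

# "Stripped" PCPA: only the visual stream remains; no RNN encoders for
# speed/box/pose and no attention, just the dense classifier on top.
backbone = c3d_backbone()
out = Dense(1, activation='sigmoid')(backbone.output)
stripped_pcpa = Model(inputs=backbone.input, outputs=out)
# With identical dense layers (and weights), this is functionally a C3D classifier.
```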

OSU-Haolin commented 3 years ago

> Hi, C3D without the top layer is used as one of the processing streams in PCPA; the C3D network itself was not modified. If you remove the other streams (RNN encoder-decoders for speed, box, and pose) and the attention modules from PCPA, you'll get C3D. The results from the stripped PCPA and C3D should be the same as long as the dense layers are the same.

Thanks for your reply!

I still have one question about the input data in the PCPA model.

In 'action_predict.py', line 3128:

```python
conv3d_model = self._3dconv()
network_inputs.append(conv3d_model.input)
```

I read the code carefully and understand that it feeds 'local_box' into the C3D model. Since I previously used PyTorch and am not familiar with TensorFlow, I have a question: how does the model know that `.input` is the image data? Why not use something like the following?

```python
network_inputs.append(Input(shape=data_sizes[i], name='input_' + data_types[i]))
conv3d_model = self._3dconv(input=network_inputs[0])
```
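
Is my understanding correct that `.input` simply returns the symbolic `Input` tensor the sub-model was built on, so the final `Model(inputs=network_inputs, ...)` routes the data at that position into the C3D stream? A toy sketch of what I mean (not the repo's code):

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Toy stand-in for self._3dconv(): a functional sub-model built on its own Input.
inp = Input(shape=(10,), name='toy_input')
sub_model = Model(inputs=inp, outputs=Dense(4)(inp))

# sub_model.input is the same symbolic Input tensor the sub-model was built on,
# so appending it to network_inputs ties this input position to that stream.
network_inputs = [sub_model.input]
final_model = Model(inputs=network_inputs, outputs=sub_model.output)
final_model.summary()
```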

If I want to add a channel that feeds local_context into a parallel C3D model in PCPA, can I use code like the following?

```python
conv3d_model = self._3dconv()
network_inputs.append(conv3d_model.input)
conv3d_model2 = self._3dconv()
network_inputs.append(conv3d_model2.input)
```

Or, how can I realize the following model? The features would now be [local_box, local_context, speed, box, pose]:

```
local_box     --> C3D                       --> fuse
local_context --> C3D (new parallel stream) --> fuse
pose          --> GRU                       --> fuse
speed         --> GRU                       --> fuse
box           --> GRU                       --> fuse
```
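
In Keras terms, I imagine something roughly like the sketch below. This is only my guess: `c3d_features` is a hypothetical stand-in for `self._3dconv()` without its top layer, the input shapes are assumed, and the plain concatenation fusion replaces PCPA's attention.

```python
from tensorflow.keras.layers import (Input, Conv3D, GlobalAveragePooling3D,
                                     Dense, GRU, Concatenate)
from tensorflow.keras.models import Model

def c3d_features(x):
    # Hypothetical stand-in for self._3dconv() without its top layer;
    # calling it twice builds two separate (non-shared) parallel streams.
    x = Conv3D(64, (3, 3, 3), padding='same', activation='relu')(x)
    x = GlobalAveragePooling3D()(x)
    return Dense(256, activation='relu')(x)

# Guessed input shapes (16-frame crops, 16-step sequences); not the repo's exact sizes.
local_box_in     = Input(shape=(16, 112, 112, 3), name='local_box')
local_context_in = Input(shape=(16, 112, 112, 3), name='local_context')
pose_in  = Input(shape=(16, 36), name='pose')
speed_in = Input(shape=(16, 1),  name='speed')
box_in   = Input(shape=(16, 4),  name='box')

# Two parallel visual streams plus GRU encoders for the non-visual features,
# fused by concatenation (PCPA itself applies attention at this stage).
fused = Concatenate()([
    c3d_features(local_box_in),
    c3d_features(local_context_in),
    GRU(256)(pose_in),
    GRU(256)(speed_in),
    GRU(256)(box_in),
])

out = Dense(1, activation='sigmoid')(fused)
model = Model(inputs=[local_box_in, local_context_in, pose_in, speed_in, box_in],
              outputs=out)
```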

Could you help me with this?

Sincerely, Haolin

ykotseruba commented 3 years ago

Sorry, but only the existing codebase is currently supported, i.e., issues related to running the code or using the data, not adding new functionality.