ykotseruba / PedestrianActionBenchmark

Code and models for the WACV 2021 paper "Benchmark for evaluating pedestrian action prediction"

https://openaccess.thecvf.com/content/WACV2021/papers/Kotseruba_Benchmark_for_Evaluating_Pedestrian_Action_Prediction_WACV_2021_paper.pdf

MIT License

The meaning of overlap #6

Closed xingchenzhang closed 3 years ago

xingchenzhang commented 3 years ago

Hi,

Thank you very much for your nice work!

Could you please tell me what 'overlap' means in your paper and your code?

For example, in your paper you said 'The sample overlap is set to 0.6 for PIE and 0.8 for JAAD'. This is also set in the yaml file.

Many thanks again! Bests, Xingchen

ykotseruba commented 3 years ago

Hi Xingchen,

Thank you for your interest in our work. To answer your question: from each pedestrian track we generate multiple observations of 16 frames (the default observation length) within 1-2s time-to-event (TTE). The overlap parameter controls the step of the sliding window. At 0 overlap the samples start at frames 0, 16, 32, and so on, i.e. every observation sample starts after the previous one ends. At the maximum overlap of 1, a sample is collected starting at every frame.

The purpose of overlap is therefore to increase the amount of training data. For a smaller dataset such as JAAD we use a higher overlap of 0.8 to get a number of training samples comparable to that generated from the PIE dataset with 0.6 overlap. In our implementation, the function action_predict.py:get_data_sequence() is where observation samples are extracted from pedestrian tracks.
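The relationship between overlap and window step described above can be sketched roughly as follows. This is only an illustration, not the repository's code: the helper names `window_step` and `window_starts` are hypothetical, and the truncate-then-clamp rounding is an assumption chosen so that overlap 0 gives a step of 16 and overlap 1 gives a step of 1.

```python
def window_step(obs_length, overlap):
    """Hypothetical helper: sliding-window step for an overlap in [0, 1]."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    # Assumed rule: the step is the non-overlapping fraction of the window,
    # truncated to an integer and clamped to at least one frame.
    return max(1, int((1.0 - overlap) * obs_length))


def window_starts(num_frames, obs_length, overlap):
    """Start indices of all full windows that fit inside num_frames frames."""
    step = window_step(obs_length, overlap)
    return list(range(0, num_frames - obs_length + 1, step))
```

Under these assumptions, `window_starts(48, 16, 0)` gives non-overlapping windows starting at 0, 16 and 32, while `window_starts(20, 16, 1)` gives a window starting at every frame from 0 to 4.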

xingchenzhang commented 3 years ago

Hi,

Thank you so much for your quick and detailed reply! You are very kind.

I think now I know what overlap means and how you handle the training data.

If I understand correctly, you used the training data in this way:

Each pedestrian track you mentioned contains an event (cross or not) and 30 frames (1-2s before this event), say from frame 0 to frame 29.
You apply a sliding window (controlled by 'overlap') within this 30-frame window prior to the event.
If overlap is 1, then the samples will be collected starting at every frame. So we can generate 15 observations of 16 frames (frames 0-15, 1-16, 2-17, 3-18, 4-19, 5-20, 6-21, 7-22, 8-23, 9-24, 10-25, 11-26, 12-27, 13-28, 14-29). These 15 observations have the same label: C or NC.

Could you please kindly let me know if I understand correctly?

Thank you very much again for your nice work!

Bests, Xingchen

ykotseruba commented 3 years ago

Hi Xingchen, You are welcome and yes, your understanding is correct. Yulia

xingchenzhang commented 3 years ago

Hi Yulia,

Thank you very much for your reply and confirmation!

Very nice work! Hope I can develop a new method using your data.

Bests, Xingchen

xingchenzhang commented 3 years ago

Hi Yulia,

Sorry, I just read your paper again. I think I may have made a mistake in my previous reply.

Actually, for each pedestrian track you have 76 frames (16 for observation and 60 for TTE). You actually apply the sliding window on the first 46 frames rather than 30 frames, right? In my previous reply, I mentioned that you applied the sliding window on the 30-frame window (1-2 seconds) because I forgot the observation period.

Could you let me know if now I understand correctly?

Thanks a lot! Xingchen

ykotseruba commented 3 years ago

Hi Xingchen, If the observation is between 1-2s TTE, we start observing 60 frames before the event and stop at 30 before the event. In this case, we apply a sliding window within the 30 frame range. If a single TTE instead of a range is set, then only 16 frames ending at that TTE are collected.

Please take a look at the function action_predict.py:get_data_sequence(). Pedestrian tracks stored in dictionary d are already cropped so that they end at the event (crossing or not crossing). Lines 383-384 show how the first and last index of observation is computed.

For example, suppose the track is 170 frames long, i.e. the event happens 170 frames after the pedestrian appears on screen.

start_idx = track length - observation length - 60 = 170 - 16 - 60 = 94
end_idx = track length - observation length - 30 = 170 - 16 - 30 = 124

The 16-frame segments are sampled within the 30-frame range starting at start_idx and ending at end_idx+1, with the step determined by the overlap parameter (at 0.8 it is every 3 frames, or every frame when overlap is 1).

Hope this clarifies your question.

Yulia
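The 170-frame example can be reproduced with a short sketch. The function name `observation_windows`, its defaults, and the step rounding are assumptions for illustration, not the contents of get_data_sequence(); each window is treated as a slice whose end index is exclusive.

```python
def observation_windows(track_len, obs_length=16, tte=(30, 60), overlap=0.8):
    """Hypothetical sketch: (start, end) slices for one track, end exclusive."""
    # Assumed rounding rule: truncate, then use a step of at least one frame.
    step = max(1, int((1.0 - overlap) * obs_length))
    start_idx = track_len - obs_length - tte[1]  # earliest window start
    end_idx = track_len - obs_length - tte[0]    # latest window start
    return [(i, i + obs_length) for i in range(start_idx, end_idx + 1, step)]


windows = observation_windows(170)
# First window, last window, number of windows at overlap 0.8.
print(windows[0], windows[-1], len(windows))  # (94, 110) (124, 140) 11
```

With `overlap=1` the same call returns 31 windows, one for every start index from 94 to 124 (170 - 16 - 30 = 124).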

xingchenzhang commented 3 years ago

Hi Yulia,

I appreciate your detailed clarification very much!

Now I know what you mean. I previously thought that in this case the end_idx is 170-30 = 140. This is why I said the sliding window was applied to a 46-frame window.

By the way, I am just curious why you did not use an end_idx of 140 in this case. In your paper (on page 3) you said 'so the last frame of observation is between 1 and 2s (or 30-60 frames) prior to the crossing event start'. If end_idx is computed as in your example, the last frame of observation would actually be between 46 and 60 frames prior to the crossing event start.

Anyway, I am just curious about this. Maybe this is more practical in real cases.

Many thanks again for your help!

Bests, Xingchen

ykotseruba commented 3 years ago

In the example above, the first sample of 16 frames will start at 94 and end at 110, which is 60 frames TTE. The last sample starts at 124 and ends at 140, which is 30 frames TTE. In lines 383-384 we already subtracted the observation length (16 frames) to ensure that this is the case. This corresponds to the description in the paper. In other words, we count TTE at the last frame of the observation, not the first frame. If we started sampling at sequence_length - 60, then the TTE would be 44, not 60.
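To make the convention concrete, here is a small sketch that measures TTE from the end of an observation window. The helper `time_to_event` is hypothetical and assumes the track ends at the event and that a window is the slice [start, start + obs_length), so its end index is exclusive.

```python
def time_to_event(track_len, start, obs_length=16):
    """Hypothetical helper: frames between the end of a window and the event."""
    return track_len - (start + obs_length)


track_len = 170
print(time_to_event(track_len, 94))              # 60: window starting at start_idx
print(time_to_event(track_len, 124))             # 30: window starting at 170 - 16 - 30
print(time_to_event(track_len, track_len - 60))  # 44: start not shifted by obs_length
```

Subtracting the observation length when computing the start index is what keeps the end of every window, rather than its beginning, inside the 30-60 frame TTE range.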

xingchenzhang commented 3 years ago

Thank you very much Yulia! Now it is very clear!

ykotseruba commented 3 years ago

You are welcome :)