After reading Section 3.3 of the FCSN paper several times, I still cannot figure out the exact structure of the unsupervised part. Does it mean the following?
1. Select Y frames: take the features of the Y frames with the top scores, with dimension:
`batch * 2 * Y`
2. Apply a 1x1 conv to decode the features above and reconstruct their original feature representations:
`batch * 2 * Y -> batch * 10 * Y` (the shape of the output of conv8)
3. Merge the input frame-level feature vectors of these selected Y frames using a skip connection:
`batch * 1024 * Y -> batch * 10 * Y`
and then add the result to the output of step 2.
4. Obtain the final reconstructed features of the Y frames:
`batch * 10 * Y -> batch * 1024 * Y`
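
To make the question concrete, here is a shape-only NumPy sketch of my reading of the four steps. It is not the paper's or the repository's implementation. The weights are random placeholders, the helper `conv1x1` and all variable names are hypothetical, treating channel 1 as the "keyframe" score is my assumption, and the sizes 2, 10 and 1024 are simply the ones from the list above.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution over time, i.e. a per-frame linear map on channels.

    x: (batch, in_ch, T), weight: (out_ch, in_ch) -> (batch, out_ch, T)
    """
    return np.einsum("oc,bct->bot", weight, x)

rng = np.random.default_rng(0)
batch, T, Y = 2, 32, 5
n_cls, mid_dim, feat_dim = 2, 10, 1024  # sizes taken from the question

frames = rng.standard_normal((batch, feat_dim, T))  # input frame-level features
scores = rng.standard_normal((batch, n_cls, T))     # stand-in for the decoder output

# Step 1: keep the Y frames with the highest score in channel 1 (assumed
# to be the "keyframe" channel), restored to temporal order.
top = np.argsort(-scores[:, 1, :], axis=1)[:, :Y]
idx = np.sort(top, axis=1)[:, None, :]                   # (batch, 1, Y)
sel_scores = np.take_along_axis(scores, idx, axis=2)     # (batch, 2, Y)
sel_frames = np.take_along_axis(frames, idx, axis=2)     # (batch, 1024, Y)

# Step 2: 1x1 conv on the selected decoder features, 2 -> 10 channels.
decoded = conv1x1(sel_scores, rng.standard_normal((mid_dim, n_cls)))

# Step 3: skip connection, project the input features 1024 -> 10 and add.
skip = conv1x1(sel_frames, rng.standard_normal((mid_dim, feat_dim)))
merged = decoded + skip                                  # (batch, 10, Y)

# Step 4: final 1x1 conv, 10 -> 1024 channels.
recon = conv1x1(merged, rng.standard_normal((feat_dim, mid_dim)))

print(sel_scores.shape, merged.shape, recon.shape)
```

If this reading is right, `recon` would then be compared against `sel_frames` by the reconstruction loss. If any of the four shapes is wrong, that is the step I have misunderstood.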