Any ideas about the structure of the unsupervised SUM-FCN?

After reading Section 3.3 of FCSN several times, I still cannot figure out the exact structure of the unsupervised part. Does it mean the following?

1. Select Y frames: take the features of the Y frames with the top scores, with shape `batch * 2 * Y`.
2. Apply a 1x1 conv to decode the features above and reconstruct their original feature representations: `batch * 2 * Y -> batch * 10 * Y` (the shape of the output of conv8).
3. Merge in the input frame-level feature vectors of these selected Y frames using a skip connection: `batch * 1024 * Y -> batch * 10 * Y`, then add the result to the output of step 2.
4. Obtain the final reconstructed features of the Y frames: `batch * 10 * Y -> batch * 1024 * Y`.
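
To make the question concrete, here is a rough NumPy sketch of the four steps as I understand them. It only tracks shapes and is not taken from the paper or from this repo. Everything named here is my own assumption: the helper names `conv1x1` and `unsupervised_head`, the random weights, the use of channel 1 as the "keyframe" score, and the choice to keep the selected frames in temporal order. A 1x1 conv without bias is written as a per-frame linear map over the channel axis.

```python
import numpy as np

def conv1x1(x, weight):
    """Bias-free 1x1 conv: x is (batch, c_in, T), weight is (c_out, c_in)."""
    return np.einsum("oc,bct->bot", weight, x)

def unsupervised_head(scores, inputs, w_decode, w_skip, w_recon, y):
    """Hypothetical reading of the unsupervised part, steps 1 to 4.

    scores: (batch, 2, T) per-frame scores; channel 1 is ASSUMED to be the keyframe score
    inputs: (batch, 1024, T) input frame-level features
    """
    # Step 1: pick the Y frames with the highest keyframe score.
    idx = np.argsort(-scores[:, 1, :], axis=1)[:, :y]   # (batch, Y)
    idx = np.sort(idx, axis=1)                          # assumption: keep temporal order
    sel_scores = np.take_along_axis(scores, idx[:, None, :], axis=2)  # (batch, 2, Y)
    sel_inputs = np.take_along_axis(inputs, idx[:, None, :], axis=2)  # (batch, 1024, Y)

    # Step 2: 1x1 conv decodes the selected scores: (batch, 2, Y) -> (batch, 10, Y).
    decoded = conv1x1(sel_scores, w_decode)

    # Step 3: skip connection from the inputs: (batch, 1024, Y) -> (batch, 10, Y), then add.
    merged = decoded + conv1x1(sel_inputs, w_skip)

    # Step 4: final reconstruction: (batch, 10, Y) -> (batch, 1024, Y).
    recon = conv1x1(merged, w_recon)
    return recon, sel_inputs, idx

# Shape check with random data (batch=2, T=20, Y=5).
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 2, 20))
inputs = rng.normal(size=(2, 1024, 20))
w_decode = rng.normal(size=(10, 2))
w_skip = rng.normal(size=(10, 1024))
w_recon = rng.normal(size=(1024, 10))

recon, sel_inputs, idx = unsupervised_head(scores, inputs, w_decode, w_skip, w_recon, y=5)
print(recon.shape, sel_inputs.shape, idx.shape)
```

If this reading is right, `recon` and `sel_inputs` have the same shape, so a reconstruction loss could compare them directly.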
