Closed (fuweifu-vtoo closed this issue 2 years ago)
We follow a feature-reconstruction paradigm: both the reconstruction source and the reconstruction target are features. Features are hard to visualize directly, which makes it especially hard to show the "shortcut problem". Therefore, we train separate decoders whose only job is to project backbone-extracted features to pixel space.
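To make the idea concrete, here is a minimal PyTorch sketch of training a decoder that maps frozen backbone features back to pixels. It is not the code from this repository: the tiny `backbone`, the `decoder` layers, the image size, and the random placeholder batch are all assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone:
# 3x32x32 image -> 16x8x8 feature map. It stays frozen.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=4, stride=4),
    nn.ReLU(),
)
backbone.requires_grad_(False)
backbone.eval()

# Hypothetical visualization decoder: 16x8x8 features -> 3x32x32 image.
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2),
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2),
    nn.Sigmoid(),  # keep outputs in [0, 1] like normalized pixels
)

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
images = torch.rand(8, 3, 32, 32)  # random placeholder batch, not real data

losses = []
for step in range(50):
    # Features come from the frozen backbone; no gradients flow into it.
    with torch.no_grad():
        feats = backbone(images)
    recon = decoder(feats)
    # Only the decoder is trained, with a pixel-space reconstruction loss.
    loss = nn.functional.mse_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At visualization time, a feature map of the same shape can be rendered.
with torch.no_grad():
    rendered = decoder(backbone(images))
print(rendered.shape)
```

Under these assumptions, the same trained decoder could be applied both to the original features and to the reconstructed features (provided they share the same shape), so the two can be compared side by side as images.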
Note that:
Thanks for your helpful reply!
Thanks for your excellent work! But I wonder why the decoders need to be re-trained for visualization?