The open-set recognition capability can be achieved through textual queries by CLIP [22], but has little impact on the seen categories in training.

xinyu1205 / recognize-anything

Open-source and strong foundation image recognition models.

https://recognize-anything.github.io/

Apache License 2.0


The open-set recognition capability can be achieved through textual queries by CLIP [22], but has little impact on the seen categories in training. #99

Open tigerzjh opened 1 year ago

tigerzjh commented 1 year ago

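I don't understand this sentence from the paper. Doesn't open-set recognition rely on CLIP for text encoding? That encoding isn't used during training, is it? During training, there are simply N learnable parameters corresponding to the N categories.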

xinyu1205 commented 1 year ago

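The original Query2Label training scheme assigns N learnable parameters to N categories, so it can only recognize seen categories. We replaced the learnable parameters with textual queries, obtained by passing each category through a fixed text encoder, which lets the model recognize open-set categories. For example, if the only seen category is dog, the model can still recognize puppy in the open-set setting, because the textual queries for dog and puppy are close in similarity.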
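A minimal sketch of why nearby textual queries matter (illustrative only, not code from the repository). `encode_text` and the vectors in `TOY_TEXT_EMBEDDINGS` are hypothetical stand-ins for a fixed text encoder such as CLIP's; the numbers are hand-picked, not real embeddings.

```python
import math

# Toy stand-in for a frozen text encoder. The vectors are hand-picked for
# illustration only; they are NOT real CLIP embeddings.
TOY_TEXT_EMBEDDINGS = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.3, 0.1],
    "car":   [0.0, 0.1, 0.9],
}


def encode_text(tag):
    """Hypothetical frozen encoder: the same tag always maps to the same vector."""
    return TOY_TEXT_EMBEDDINGS[tag]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_seen(tag, seen_tags):
    """Return the seen tag whose textual query is closest to the query for `tag`."""
    query = encode_text(tag)
    return max(seen_tags, key=lambda seen: cosine(query, encode_text(seen)))
```

With these toy vectors, the query for puppy lies much closer to dog than to car, which is the property the open-set behaviour depends on.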

tigerzjh commented 1 year ago

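@xinyu1205 My confusion is about the approach in the second line of your answer: it is only used at inference time, right? It doesn't take part in training?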

xinyu1205 commented 1 year ago

Yes. Training uses only the textual queries of the seen categories. Because the text encoder is fixed, these textual queries are fixed as well.
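A toy sketch of that setup (an assumption-laden illustration, not the RAM implementation): the queries in `QUERIES` are fixed, only the decoder weights `W` are updated, and an unseen query is scored at inference. The bilinear scorer, the 2-d vectors, and all names here are hypothetical.

```python
import math

# Fixed textual queries for the seen categories (toy values, never updated).
QUERIES = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(W, query, feat):
    """Toy learnable decoder: sigmoid of the bilinear form query^T W feat."""
    z = sum(query[i] * W[i][j] * feat[j] for i in range(2) for j in range(2))
    return sigmoid(z)


def train_step(W, feat, labels, lr=0.5):
    """One SGD step on binary cross-entropy; only W changes, queries stay fixed."""
    for tag, y in labels.items():
        q = QUERIES[tag]
        p = score(W, q, feat)
        for i in range(2):
            for j in range(2):
                # d(BCE)/dW[i][j] = (p - y) * q[i] * feat[j]
                W[i][j] -= lr * (p - y) * q[i] * feat[j]
    return W


def train(epochs=200):
    W = [[0.0, 0.0], [0.0, 0.0]]
    # Toy "image features": [1, 0] for dog images, [0, 1] for car images.
    data = [([1.0, 0.0], {"dog": 1, "car": 0}),
            ([0.0, 1.0], {"dog": 0, "car": 1})]
    for _ in range(epochs):
        for feat, labels in data:
            train_step(W, feat, labels)
    return W


W = train()
# Inference with an unseen query that lies close to the "dog" query.
puppy_query = [0.9, 0.2]
puppy_on_dog_image = score(W, puppy_query, [1.0, 0.0])
puppy_on_car_image = score(W, puppy_query, [0.0, 1.0])
```

In this toy, the unseen puppy query scores high on the dog feature and low on the car feature even though it never appeared in training, because it sits near a query the decoder was trained on.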

tigerzjh commented 1 year ago

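In that case: 1) Does the open-set capability rely mainly on the generalization ability of CLIP + BERT (the image-tag recognition decoder)? 2) When the CLIP image encoder is used to distill RAM's image encoder, RAM's image features are implicitly aligned with the CLIP text features from 1), which improves the open-set capability. Is that the right way to understand it?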

xinyu1205 commented 1 year ago

Your understanding is correct. One small caveat: the BERT (image-tag recognition decoder) is learnable, so the open-set capability comes mainly from CLIP.
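A minimal sketch of the distillation idea from point 2), assuming an L1 distance between student and teacher features. The choice of L1, the function names, and the sign-based update are assumptions for illustration; the thread does not state which loss is used.

```python
def l1_distill_loss(student_feat, teacher_feat):
    """Mean absolute difference between student and frozen teacher features."""
    assert len(student_feat) == len(teacher_feat)
    total = sum(abs(s - t) for s, t in zip(student_feat, teacher_feat))
    return total / len(student_feat)


def sign(x):
    return (x > 0) - (x < 0)


def distill_step(student_feat, teacher_feat, lr=0.5):
    """Move the student features one step toward the teacher.

    Follows the (sub)gradient direction of the L1 distance; the teacher is
    treated as frozen and is never modified.
    """
    return [s - lr * sign(s - t) for s, t in zip(student_feat, teacher_feat)]


teacher = [1.0, 3.0]   # stands in for a CLIP image feature (toy values)
student = [0.0, 0.0]   # stands in for a RAM image feature (toy values)
loss_before = l1_distill_loss(student, teacher)
student = distill_step(student, teacher)
loss_after = l1_distill_loss(student, teacher)
```

Pulling the student's image features toward the teacher's is what would place them in the same space as the CLIP text features, under the assumption that CLIP's image and text embeddings are already aligned.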

tigerzjh commented 1 year ago

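One more small question. For the clustering step: 1) The clustering is done on image features, right? Are these features the output of RAM's own image encoder (swin_L)? 2) Is the number of K-means++ clusters equal to the number of RAM categories?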