EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
In this work, we present ESAM, an efficient framework that leverages vision foundation models (VFMs) for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.
The demos are fairly large, so please allow a moment for them to load. See the project home page for more complete demos and a detailed introduction.
Method Pipeline:
For environment setup and dataset preparation, please follow:
For training and evaluation, please follow:
For visualization demo, please follow:
We provide the checkpoints for quick reproduction of the results reported in the paper.
Class-agnostic 3D instance segmentation results on ScanNet200 dataset:
Method | Type | VFM | AP | AP@50 | AP@25 | Speed(ms) | Downloads |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | SAM | 18.0 | 32.8 | 56.1 | -- | -- |
SAI3D | Offline | SemanticSAM | 30.8 | 50.5 | 70.6 | -- | -- |
SAM3D | Online | SAM | 20.6 | 35.7 | 55.5 | 1369+1518 | -- |
ESAM | Online | SAM | 42.2 | 63.7 | 79.6 | 1369+80 | model |
ESAM-E | Online | FastSAM | 43.4 | 65.4 | 80.9 | 20+80 | model |
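The Speed(ms) column can be turned into a rough per-frame throughput figure. The sketch below assumes the two terms are the VFM inference time and the time for the rest of the pipeline, and that the two stages run one after the other on each frame; the README does not spell this out, so treat both as assumptions. The helper names `parse_speed` and `frames_per_second` are made up for illustration and are not part of the ESAM codebase.

```python
def parse_speed(speed: str) -> tuple[float, float]:
    """Split a latency string such as '20+80' into its two terms (ms).

    Assumption: the first term is the VFM time, the second is the rest.
    """
    vfm_ms, rest_ms = (float(part) for part in speed.split("+"))
    return vfm_ms, rest_ms


def frames_per_second(speed: str) -> float:
    """Convert a per-frame latency string to FPS, assuming sequential stages."""
    vfm_ms, rest_ms = parse_speed(speed)
    return 1000.0 / (vfm_ms + rest_ms)


# Speed(ms) entries copied from the table above.
speeds = {"SAM3D": "1369+1518", "ESAM": "1369+80", "ESAM-E": "20+80"}
fps = {name: frames_per_second(value) for name, value in speeds.items()}

for name, value in fps.items():
    print(f"{name}: {value:.2f} FPS")
```

Under these assumptions ESAM-E comes out at 1000 / (20 + 80) = 10 FPS, which agrees with the FPS reported for ESAM-E in the ScanNet table, while the SAM-based variants are dominated by the 1369 ms VFM term.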
Dataset transfer results from ScanNet200 to SceneNN and 3RScan:
Method | Type | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | 3RScan AP | 3RScan AP@50 | 3RScan AP@25 |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | 12.6 | 25.8 | 53.2 | 3.9 | 8.0 | 21.0 |
SAI3D | Offline | 18.6 | 34.7 | 65.7 | 5.4 | 11.8 | 27.4 |
SAM3D | Online | 15.1 | 30.0 | 51.8 | 6.2 | 13.0 | 33.9 |
ESAM | Online | 28.8 | 52.2 | 69.3 | 14.1 | 31.2 | 59.6 |
ESAM-E | Online | 28.6 | 50.4 | 71.0 | 13.9 | 29.4 | 58.8 |
3D instance segmentation results on ScanNet dataset:
Method | Type | ScanNet AP | ScanNet AP@50 | ScanNet AP@25 | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | FPS | Download |
---|---|---|---|---|---|---|---|---|---|
TD3D | offline | 46.2 | 71.1 | 81.3 | -- | -- | -- | -- | -- |
Oneformer3D | offline | 59.3 | 78.8 | 86.7 | -- | -- | -- | -- | -- |
INS-Conv | online | -- | 57.4 | -- | -- | -- | -- | -- | -- |
TD3D-MA | online | 39.0 | 60.5 | 71.3 | 26.0 | 42.8 | 59.2 | 3.5 | -- |
ESAM-E | online | 41.6 | 60.1 | 75.6 | 27.5 | 48.7 | 64.6 | 10 | model |
ESAM-E+FF | online | 42.6 | 61.9 | 77.1 | 33.3 | 53.6 | 62.5 | 9.8 | model |
Open-vocabulary 3D instance segmentation results on ScanNet200 dataset:
Method | AP | AP@50 | AP@25 |
---|---|---|---|
SAI3D | 9.6 | 14.7 | 19.0 |
ESAM | 13.7 | 19.2 | 23.9 |
Both students below contributed equally and the order is determined by random draw.
Both advised by Jiwen Lu.
We are grateful for the flexible codebases of Oneformer3D and Online3D, as well as the valuable datasets provided by ScanNet, SceneNN and 3RScan.
@article{xu2024esam,
title={EmbodiedSAM: Online Segment Any 3D Thing in Real Time},
author={Xiuwei Xu and Huangxing Chen and Linqing Zhao and Ziwei Wang and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2408.11811},
year={2024}
}