yformer / EfficientSAM

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Apache License 2.0

Multi-bbox inference #11

Open sulaimanvesal opened 9 months ago

sulaimanvesal commented 9 months ago

Thanks for sharing this repo.

In the demo Colab notebook, how can we pass multiple bounding boxes to the model as a prompt?

I have a widget that collects the bboxes from users, and I want to pass them to the model as in FastSAM.

sulaimanvesal commented 9 months ago

One more question: the CPU version on a Core i7 with an input size of 1024x512 is quite slow. FastSAM-S (Ultralytics) on the same machine with the same input size has an inference time of around 400 ms.

Inference using:  efficientsam_ti_cpu.jit
Input size: torch.Size([3, 512, 1024])
Preprocess Time: 79.8783 ms
Inference Time: 6939.1549 ms
yformer commented 9 months ago

@klightz, can you help @sulaimanvesal with passing multiple bounding boxes to the model as a prompt?

yformer commented 9 months ago

@sulaimanvesal, for EfficientSAM, we resize the input image to 1024x1024 for model input. The preprocessing and postprocessing are both included in the TorchScript model, so you need to include those for FastSAM-S as well. Actually, the demo we currently host on our server runs on a CPU (Intel(R) Xeon(R) Platinum 8339HC @ 1.80GHz), which does not seem that slow even for efficientsam_s_cpu.jit.
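
For a like-for-like comparison, the timing should wrap the full TorchScript call, since the resizing and postprocessing happen inside it. Below is a minimal timing sketch; the call signature (image, points, point_labels) is assumed from the repo's demo notebook and the inputs are placeholders:

```python
import time
import torch

# Assumption: the .jit file bundles preprocessing and postprocessing, so the
# whole call is timed end to end. The (image, points, labels) signature is
# taken from the repo's demo notebook and is not confirmed here.
model = torch.jit.load("efficientsam_ti_cpu.jit")
model.eval()

image = torch.rand(1, 3, 512, 1024)  # placeholder; use a real [1, 3, H, W] tensor
# One box prompt given as its two corner points (see the label discussion below).
points = torch.tensor([[[[100.0, 100.0], [400.0, 300.0]]]])  # [1, 1, 2, 2]
labels = torch.tensor([[[2.0, 3.0]]])                        # [1, 1, 2]

with torch.inference_mode():
    model(image, points, labels)  # warm-up run
    start = time.perf_counter()
    model(image, points, labels)
    print(f"End-to-end time: {(time.perf_counter() - start) * 1e3:.1f} ms")
```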

sulaimanvesal commented 9 months ago

@yformer thanks for the reply. @klightz, would you please let us know how to use multiple bounding boxes as a prompt, similar to FastSAM?

sulaimanvesal commented 9 months ago

Hi @yformer, any update on how to run multiple bounding boxes? Thank you.

yformer commented 9 months ago

@balakv504, can you provide an example of using multiple bounding boxes as a prompt?

balakrishnanv commented 9 months ago

The input_point to the model has shape [batch_size, num_masks, num_points, 2]. For multiple bounding boxes, you feed in a tensor of shape [1, num_bounding_boxes, 2, 2] (assuming you are querying one image). For EfficientSAM, the encoder runs only once and the decoder performs batched inference. Happy to provide an example in the Colab if you have issues using this API.
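
For concreteness, here is a minimal sketch of that layout, assuming the eager-mode builder build_efficient_sam_vitt from this repo (with its checkpoint downloaded per the README) and a forward signature of (batched_images, batched_points, batched_point_labels); the box coordinates are placeholders, and the label values 2 and 3 are explained further down the thread:

```python
import torch

# Assumption: builder and forward signature follow the repo's example code;
# the checkpoint must be downloaded first per the README.
from efficient_sam.build_efficient_sam import build_efficient_sam_vitt

model = build_efficient_sam_vitt()
model.eval()

image = torch.rand(1, 3, 512, 1024)  # placeholder; one RGB image tensor

# Three boxes in xyxy format (placeholder coordinates).
boxes = torch.tensor([
    [100.0, 100.0, 400.0, 300.0],
    [450.0, 120.0, 700.0, 380.0],
    [ 50.0, 350.0, 250.0, 500.0],
])
num_boxes = boxes.shape[0]

# [1, num_boxes, 2, 2]: each box becomes its top-left and bottom-right corners.
batched_points = boxes.reshape(1, num_boxes, 2, 2)
# [1, num_boxes, 2]: corner labels; 2 = top-left, 3 = bottom-right (see below).
batched_point_labels = torch.tensor([[[2.0, 3.0]]]).repeat(1, num_boxes, 1)

with torch.inference_mode():
    predicted_logits, predicted_iou = model(image, batched_points, batched_point_labels)

# Take the first mask candidate per box as an illustration; the repo's demo
# sorts candidates by predicted IoU before choosing one.
masks = predicted_logits[0, :, 0] >= 0
print(masks.shape)  # expected: [num_boxes, H, W]
```

Since the image embedding is computed once and the decoder handles all boxes in one batched pass, adding boxes is much cheaper than re-running the whole model per box.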

sulaimanvesal commented 9 months ago

Thanks @balakrishnanv! It would be great to see an example; it would help not only my case but many others.

glennliu commented 9 months ago

I met the same issue and found an example in the Grounded-Segment-Anything repo, here.

They set batched_points to shape [B, num_box, 2, 2] and batched_point_labels to [B, num_box, 2]. One of the two box point labels is set to 2, while the other is 3, but I don't understand how the batched_point_labels are decided here.

glennliu commented 9 months ago

I just found the related code. So, for a bounding box, we can simply set the labels to [2, 3] (2 for the top-left corner, 3 for the bottom-right corner), similar to the example in Grounded-SAM. It should work.
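
To make the convention concrete, here is a small hypothetical helper (the name and the xyxy input format are illustrative, not from the repo) that maps boxes to the prompt tensors:

```python
import torch

def boxes_to_prompts(boxes_xyxy: torch.Tensor):
    """Map [N, 4] xyxy boxes to ([1, N, 2, 2] points, [1, N, 2] labels).

    Label 2 marks the top-left corner and label 3 the bottom-right corner,
    matching the convention in the Grounded-SAM example above.
    """
    num_boxes = boxes_xyxy.shape[0]
    points = boxes_xyxy.reshape(1, num_boxes, 2, 2)
    labels = torch.tensor([[[2.0, 3.0]]]).repeat(1, num_boxes, 1)
    return points, labels

points, labels = boxes_to_prompts(torch.tensor([[10.0, 20.0, 200.0, 180.0]]))
print(points.shape, labels.shape)  # torch.Size([1, 1, 2, 2]) torch.Size([1, 1, 2])
```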

balakv504 commented 9 months ago

We will add an example of multi-bbox inference soon; thanks for your patience. @glennliu, yes, that is correct. Thanks for digging that up.

sulaimanvesal commented 8 months ago

@yformer, pinging in case any of the authors has had a chance to put together a simple example of multi-bbox inference. I know it's not that hard!