xyjbaal / FPCC

MIT License

[Possible Idea] Use features instead of geometric center for simpler and faster position recognition #10

Closed Levaru closed 2 years ago

Levaru commented 2 years ago

@xyjbaal I'm a layman when it comes to neural networks but as far as I understood your paper and code, the process is basically this:

  1. Generate or obtain a pointcloud dataset, where each point is assigned a score depending on its distance to the center of the part it belongs to.
  2. Train the model.
  3. Run inference on a new pointcloud; the model tries to assign a center score to every point. This creates clusters of points with high center scores.
  4. Apply a score threshold to obtain those clusters.
  5. Apply a radius threshold to collect points around the clusters and assign them a segment (I believe more is happening at this step than I really understand; see the sketch after this list).
  6. The result is a pointcloud with two additional columns that represent the center score and segment number of each point.
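
To check my understanding, here is a rough numpy/scipy sketch of how I imagine steps 4-6 working; the function name, thresholds, and the exact grouping logic are all guesses on my part, not your actual code:

```python
import numpy as np
from scipy.spatial import cKDTree

def naive_segment(points, scores, score_thresh=0.6, radius=0.05):
    """My guess at steps 4-6: threshold the center scores, then grow a
    segment around each remaining high-score seed via a radius search."""
    labels = np.full(len(points), -1, dtype=int)  # -1 = unassigned
    tree = cKDTree(points)
    next_label = 0
    for i in np.argsort(-scores):                 # highest score first
        if scores[i] < score_thresh:
            break                                 # all remaining scores are lower
        if labels[i] != -1:
            continue                              # already claimed by a segment
        for j in tree.query_ball_point(points[i], r=radius):
            if labels[j] == -1:
                labels[j] = next_label
        next_label += 1
    return labels
```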

My problem is that I have a part shaped like a crate (not exactly, but the principle is the same). Take a look at this sectional view:

+                               +
L                               L
L                               L
H             Center            H
H             +---+             H
H             | X |             H
H             +---+             H
H                               H
L                               L
L                               L
L                               L
+LLLLLLLLLLHHHHHHHHHHHLLLLLLLLLL+

H: high center score L: low center score X: center point

When trying to segment a bin full of those parts, this poses a problem. For example: a part is positioned in such a way that the bottom and two sides are visible. After inference there will be 3 point clusters for the same part. Depending on how I set the score and distance thresholds, I will either get 2 to 3 different segments for the same part, or one segment with additional points from other parts.

All of this led me to an idea that could possibly simplify and even speed up the matching process that follows the segmentation:

What if we use multiple geometric features instead of the center score when training the model in step 1 from the list above? Take, for example, every corner of the part: each point gets a "feature score" assigned depending on its distance to the nearest corner. After inference we would ideally get multiple point clusters for each part, which we could extract with a score threshold and then reduce to their centroids. In theory the result would be a very lightweight pointcloud consisting of those centroids, against which one could then perform ICP matching with the known corner positions of the part.

Here is the process again step by step:

  1. Generate or obtain a pointcloud dataset, where each point is assigned a score depending on its distance to a geometric feature (e.g. corner, edge, surface center, etc.) of the part it belongs to.
  2. Train the model.
  3. Run inference on a new pointcloud; the model tries to assign a feature score to every point. This creates clusters of points around those geometric features.
  4. Apply a score threshold to obtain those clusters.
  5. Calculate the centroids of those clusters.
  6. The result is a pointcloud consisting only of those centroids that represent the position of a geometric feature (ideally).
  7. Perform point matching via ICP (or something similar) against the known ideal positions of the geometric features (see the sketch after this list).
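
To make the idea concrete, here is a rough sketch of steps 4-7; DBSCAN for the clustering and Open3D for the ICP are arbitrary choices on my part, and all names and thresholds are made up:

```python
import numpy as np
import open3d as o3d
from sklearn.cluster import DBSCAN

def feature_centroids(points, feature_scores, score_thresh=0.7, eps=0.01):
    """Steps 4-6: threshold the predicted feature scores, cluster the
    surviving points, and reduce each cluster to its centroid."""
    high = points[feature_scores > score_thresh]
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(high)
    return np.array([high[labels == k].mean(axis=0)
                     for k in set(labels) if k != -1])

def match_features(centroids, model_keypoints, max_dist=0.02):
    """Step 7: align the part's known feature positions to the detected
    centroids with point-to-point ICP."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(model_keypoints))
    dst = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(centroids))
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 pose estimate
```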

I wanted to ask you if this idea is even feasible and, if so, what would need to be done to achieve it. Maybe this could be a topic for your new paper :)

xyjbaal commented 2 years ago

I'm honored that you would pay attention to my work.

I'm happy to discuss it with you. Forgive my poor English.

  1. I have uploaded a slide on prediction, which I hope will be helpful. (NMS picks only one point (the center point) to correspond to an object.)

  2. Is the middle of your part empty, like a ring? If the middle is mostly empty, then the way to find the center point will be difficult to apply to your object.

  3. I believe your idea is right and feasible. You can add m more branches, like the center score branch, to predict the scores of the m key points.

The method (center score + NMS) is somewhat similar to keypoint prediction. (But I haven't read the related papers on 3D keypoint prediction, so please forgive me if I misunderstand it.)

I have a question about your idea: if there are 2 close parts, A and B, and we find 3 key points for each part, naming them a1, a2, a3 and b1, b2, b3, then how do we determine that a1 and b2 belong to different parts?

Maybe I should update my knowledge of 3D keypoint prediction. (I remember that human body keypoint prediction on images has methods for distinguishing which key points belong to which person.)

Levaru commented 2 years ago

Thank you for the quick reply! Your English is very good :) If you are unsure about some sentences, then I can highly recommend translating with DeepL. It's way better than Google Translate but I haven't tried it with Japanese yet.

  1. The slide is very helpful, I see that my understanding was mostly correct.
  2. Yes, the middle is empty, but I believe this is a general problem for shapes like this one. A closed box or a cube would have the same problem, since we only see the outer surface. The parts need to be thin in one (e.g. a disk) or two (e.g. a cylinder) dimensions. Or maybe I misunderstood something.
  3. What do you mean by "branch"? An additional neural network layer?

I have a question about your idea: if there are 2 close parts, A and B, and we find 3 key points for each part, naming them a1, a2, a3 and b1, b2, b3, then how do we determine that a1 and b2 belong to different parts?

You wouldn't be able to determine it directly. Basically, the idea is that we don't segment the parts anymore but instead use the DGCNN backbone to predict and collect the keypoints. The (distance) relationships between the keypoints (a1-a2)&(a2-a3)&(a3-a1) should provide a better match than, for example, (b2-a2)&(a2-a3)&(a3-b2). Of course there could be situations (inaccuracies, missing or bad points, etc.) where (b2-a2)&(a2-a3)&(a3-b2) would be the better match, but I hope the probability of that is low.
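
As a toy sketch of what I mean (brute force over all triples, so only practical for a small number of candidates; the tolerance value is made up):

```python
import numpy as np
from itertools import combinations

def grouping_candidates(detected, model_dists, tol=0.005):
    """Score every triple of detected keypoints by how well its sorted
    pairwise distances match the part's known pairwise distances.
    Mixed triples like (b2, a2, a3) should score worse than (a1, a2, a3)."""
    model = np.sort(np.asarray(model_dists))
    candidates = []
    for i, j, k in combinations(range(len(detected)), 3):
        d = np.sort([np.linalg.norm(detected[i] - detected[j]),
                     np.linalg.norm(detected[j] - detected[k]),
                     np.linalg.norm(detected[k] - detected[i])])
        error = np.abs(d - model).sum()
        if error < tol:
            candidates.append(((i, j, k), error))
    return sorted(candidates, key=lambda c: c[1])  # best match first
```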

Another problem (easily solved, though) is that some poses are ambiguous. For example: we have a pointcloud where only the top surface of a box is visible, and we find the keypoints for the 4 corners. The matching algorithm could now position the box upside down (flipped to the wrong side). This can be solved by checking whether the keypoints would even be visible in that pose. I made a quick sketch to show what I mean:

[Image: FPCC diagram]
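
The check itself could be a simple back-face test on the matched keypoints; here is a sketch assuming each model keypoint has an outward normal and the camera looks along -Z (both assumptions are mine):

```python
import numpy as np

def pose_is_visible(pose, keypoint_normals, view_dir=np.array([0.0, 0.0, -1.0])):
    """Reject a pose hypothesis if any matched keypoint's outward normal
    faces away from the camera (e.g. the upside-down box)."""
    rotated = keypoint_normals @ pose[:3, :3].T    # normals in camera frame
    return bool(np.all(rotated @ view_dir < 0.0))  # all normals toward camera
```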

The method (center score + NMS) is somewhat similar to keypoint prediction. (But I haven't read the related papers on 3D keypoint prediction, so please forgive me if I misunderstand it.) [...] Maybe I should update my knowledge of 3D keypoint prediction. (I remember that human body keypoint prediction on images has methods for distinguishing which key points belong to which person.)

I'm also not familiar with 3D keypoint prediction. I'm just trying to increase the speed and accuracy of my bin-picking project. Right now I'm using an object detection model on the 2D image to segment out the parts inside the pointcloud. But this requires a lot of images, and it takes ages to label them, so I was researching other segmentation methods; that's how I found your paper. The ability to automatically simulate and label several thousand pointclouds is amazing compared to drawing countless bounding boxes for multiple days.

xyjbaal commented 2 years ago

Thank you. DeepL is really better than Google Translate.

I suddenly realized I had made a big mistake. The five parts used in the paper are flat or thin, so there is no problem using the geometric center as the reference point for calculating the other points' center scores. BUT if it is a box or cube as you described, there is a big problem. For a box- or cube-shaped object, we should use the center of the visible part as the reference point to calculate the center score. For a ring-shaped object, I think my method is not feasible.

I drew a sketch to explain my meaning. (I didn't even know we could upload images)

[Image: sketch]

Image annotation is really time-consuming. This is why I try to segment parts directly from point clouds.

At the moment, I have 2 ideas about how to speed things up.

  1. In fact, DGCNN is a relatively slow feature extractor. You may consider replacing it with another one. I've improved DGCNN, and the new feature extractor is about twice as fast. However, I found that sparse convolution can be even faster. I wonder if we can use sparse convolution as a feature extractor. Sparse convolution seems to run only on Linux and on specific types of GPU.
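
For example, something like this (only a sketch assuming the MinkowskiEngine library, which FPCC does not use; the voxel size is arbitrary):

```python
import torch
import MinkowskiEngine as ME

points = torch.rand(10000, 3)  # dummy cloud; XYZ reused as input features

# Voxelize the cloud, then run one sparse 3D convolution over it.
voxel_size = 0.005
coords = torch.floor(points / voxel_size).int()
coords, feats = ME.utils.sparse_collate([coords], [points])
x = ME.SparseTensor(features=feats, coordinates=coords)

conv = ME.MinkowskiConvolution(in_channels=3, out_channels=64,
                               kernel_size=3, dimension=3)
y = conv(x)  # per-voxel features, computed only at occupied voxels
```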

  2. Our final goal is to estimate the pose. Whether with ICP or other methods, there is a downsampling step. So consider downsampling the point cloud of the whole scene first, then segmenting it, and taking the segmentation result into ICP to calculate the pose (without downsampling again).
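
In Open3D that pipeline would look roughly like this (a sketch: the file paths, voxel size, and correspondence distance are placeholders, and the segmentation step is stubbed out):

```python
import numpy as np
import open3d as o3d

scene = o3d.io.read_point_cloud("scene.pcd")        # placeholder path
model = o3d.io.read_point_cloud("part_model.pcd")   # placeholder path

# 1) Downsample the whole scene once, up front.
scene_down = scene.voxel_down_sample(voxel_size=0.003)

# 2) Segment the downsampled scene (FPCC or similar would go here).
segments = [scene_down]  # stand-in: pretend the scene is one segment

# 3) Run ICP on each segment directly, without downsampling again.
for seg in segments:
    result = o3d.pipelines.registration.registration_icp(
        model, seg, 0.01, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    print(result.transformation)
```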

xyjbaal commented 2 years ago

What do you mean by "branch"? An additional neural network layer?

Yes, add several structures like the center score branch to predict the other key points' scores.
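
A sketch of what I mean in PyTorch (the feature dimension and layer sizes are made up; this is not FPCC's actual architecture, it only shows the branching idea):

```python
import torch
import torch.nn as nn

class KeypointScoreHeads(nn.Module):
    """m parallel branches on top of shared per-point features, each
    predicting one key point's score map, like the center score branch."""
    def __init__(self, feature_dim=256, num_keypoints=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv1d(feature_dim, 64, 1), nn.ReLU(),
                          nn.Conv1d(64, 1, 1), nn.Sigmoid())
            for _ in range(num_keypoints))

    def forward(self, feats):  # feats: (batch, feature_dim, num_points)
        return torch.cat([b(feats) for b in self.branches], dim=1)
```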

Some of my thoughts:

  1. I use the center point as a reference point because it's well defined, and the local information of the center point is not easily disturbed by the points of other parts.
  2. If I use the corner as the key point, the local information of the corner can be easily disturbed by other parts' points. So I wonder if the network can learn it.
  3. It may be feasible if multiple points near the middle of the object are used as reference points.

Levaru commented 2 years ago

For a box- or cube-shaped object, we should use the center of the visible part as the reference point to calculate the center score.

So the "center" doesn't have to be the geometric center? How would this work if the "box" has a trapezoidal shape, or any similar shape that is not symmetric? On which side do I put the "center"? If the surface with the center point is on the bottom, then there wouldn't be any point with a high center score visible. Or am I misunderstanding something?

(P.S.: After reading through your paper again I finally got it (I think). My example with the trapezoid shape and the resulting occlusion is stupid, because in that case one should actually use the geometric center.)

What I'm still unsure about is how to choose d_max for the dataset generation and training. If it's too small, I risk the center points being occluded in certain positions; if it's too big, the center point will be imprecise. The values in your code for the ring and gear shaft also don't seem to correspond to any dimension of the parts' bounding boxes.

In fact, DGCNN is a relatively slow feature extractor. You may consider replacing it with another one. I've improved DGCNN, and the new feature extractor can be about twice as fast.

Yes, I noticed that it's slower than a 2D object detection model, but the time saved on annotating images is really worth it. Is the improved DGCNN already included in this paper's code, or is it the one in your other repository?

However, I found that sparse convolution can be even faster. I wonder if we can use sparse convolution as a feature extractor. Sparse convolution seems to run only on Linux and on specific types of GPU.

I'm going to look into that, thank you for the tip!

Our final goal is to estimate the pose. Whether with ICP or other methods, there is a downsampling step. So consider downsampling the point cloud of the whole scene first, then segmenting it, and taking the segmentation result into ICP to calculate the pose.

I'm already doing all of that. At the beginning I was actually performing the pose estimation on the whole pointcloud, which was quite slow and inaccurate. My current problem is the segmentation accuracy.

I'm sorry if I'm bothering you with my questions. I believe your method is really great for some shapes and I'm just trying to figure out the limitations. And again, thank you very very much for your answers and thoughts, so far!

xyjbaal commented 2 years ago

So the "center" doesn't have to be the geometric center? How would this work if the "box" has a trapezoidal shape, or any similar shape that is not symmetric? On which side do I put the "center"? If the surface with the center point is on the bottom, then there wouldn't be any point with a high center score visible. Or am I misunderstanding something?

I think that if it's a box, maybe you should use the center of the visible part as the reference point instead of the geometric center. E.g. if face A of the box is visible, then the reference point is the center of face A; if face B is visible, then the reference point is the center of face B.

[Image: center illustration]

This is what I did when segmenting a new object, the wave-dissipating block. https://www.fudotetra.co.jp/solution/block/tetrapod/

What I'm still unsure about is how to choose d_max for the dataset generation and training. If it's too small I will risk the center points being occluded in certain positions and if it's too big the center point will be unprecise. The values in your code for the ring and gear shaft also don't seem to correspond to any dimension of the parts bounding box.

The center score is calculated by the equation s_i = 1 - (||p_i - c_i|| / d)^β; here I set d = d_max. This is because d_max is more general and easier to define clearly. Otherwise I would have considered d a hyperparameter and given an optimal value for each part. As you said, it is possible to use a different d for each part to get the best result. It is obviously not reasonable to use the gear's d_max to calculate another part's center scores.
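
In code, the labeling is simply the following (a sketch; clipping the score to [0, 1] matches the fact that points farther than d get a score of 0, and β is a hyperparameter):

```python
import numpy as np

def center_scores(points, center, d, beta=1.0):
    """s_i = 1 - (||p_i - c|| / d)^beta, clipped so that points farther
    than d from the reference point get a score of 0."""
    dist = np.linalg.norm(points - center, axis=1)
    return np.clip(1.0 - (dist / d) ** beta, 0.0, 1.0)
```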

What we need to consider is a balancing problem.

Problem with a large d:

  1. In the prediction phase, you need a larger r_nms to suppress other points. This could cause the center of a nearby object to be suppressed by another's. For example: two gears that are parallel and next to each other.

Problem with a small d:

  1. Many points get a score of 0, and the network cannot learn from them.

  2. In the prediction phase, some object points will be obscured (as you said).

The above problems are weaknesses of the NMS algorithm, and I'm trying to modify it. For now, it seems you should simply try different values of d. I think you can gradually reduce d, as I did in the ablation experiment: use a smaller d for the dataset and find a suitable set of (d, screening radius r_nms, minimum threshold θ_th).
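
For reference, the NMS step works roughly like this (a simplified sketch, not the exact implementation):

```python
import numpy as np

def center_nms(points, scores, r_nms, theta_th):
    """Greedy NMS: repeatedly take the highest-scoring remaining point as
    a center and suppress everything within r_nms of it, stopping once
    the scores fall below theta_th."""
    suppressed = np.zeros(len(points), dtype=bool)
    centers = []
    for i in np.argsort(-scores):          # highest score first
        if scores[i] < theta_th:
            break                          # all remaining scores are lower
        if suppressed[i]:
            continue
        centers.append(i)
        suppressed |= np.linalg.norm(points - points[i], axis=1) < r_nms
    return points[centers]
```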

Is the improved DGCNN already included in this papers code or is it the one in your other repository?

I expect to publish the new network structure by September. But I don't think it will be faster than sparse convolution.

I'm sorry if I'm bothering you with my questions. I believe your method is really great for some shapes and I'm just trying to figure out the limitations. And again, thank you very very much for your answers and thoughts, so far!

I understand this. FPCC can't be applied to all parts. Hopefully we can find a more general method for more objects. I'm pleased to discuss this with you. Discussing ideas with people from all over the world is the fascinating part of IT.

Levaru commented 2 years ago

I think that if it's a box, maybe you should use the center of the visible part as the reference point instead of the geometric center. E.g. if face A of the box is visible, then the reference point is the center of face A; if face B is visible, then the reference point is the center of face B.

But if I have a full bin of those parts, both sides can be visible depending on the position. So I would need to train both reference points, but this would mess up the segmentation, wouldn't it? In that case FPCC would not be suitable for parts where the geometric center is not on the part (like rings), nor for parts where surfaces are far away from the center (like boxes). Or do you mean that I would need to train two separate models and perform the inference twice?

There is one last thing that I don't understand (maybe it's a knowledge gap): in the paper you used a d_max value of 0.08 for the gear shaft. A sphere with that radius would look like this:

[Image: screenshot]

Looking at Figure 8, this appears to be the case:

[Image: screenshot]

In the paper, in section [3.2. Inference Phase] the following is mentioned:

Note that, we find the nearest center point c_k of point p_i in the feature embedding space, and then calculate Euclidean distance d(p_i, c_k) between p_i and c_k in 3D space. If d(p_i, c_k) exceeds d_max, p_i is regarded as noise, and an instance label will not be assigned to p_i

When applied to the gear shaft with d_max = 0.08, this would mean the ends are never assigned a label, no? But judging from the images in the paper (Figure 7 and Figure 8), this doesn't appear to be the case. How does that work?

Looking at the code in fpcc_test.py, max_feature_distance is set to None. I guess the images in the paper were generated with the same setting; at least that would explain it, from my understanding.
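
Just to write down my reading of that rule as a sketch (not the actual code in fpcc_test.py; the function and argument names are mine):

```python
import numpy as np

def assign_instances(points, embeddings, center_idx, max_feature_distance=None):
    """Assign each point to the nearest predicted center in embedding
    space; if max_feature_distance is set, points farther than that from
    their center in 3D space stay unlabeled (-1, i.e. noise)."""
    labels = np.full(len(points), -1, dtype=int)
    for i in range(len(points)):
        # nearest center c_k in the feature embedding space
        k = int(np.argmin(np.linalg.norm(embeddings[center_idx] - embeddings[i],
                                         axis=1)))
        # Euclidean distance d(p_i, c_k) in 3D space
        d3 = np.linalg.norm(points[center_idx[k]] - points[i])
        if max_feature_distance is None or d3 <= max_feature_distance:
            labels[i] = k
    return labels
```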

I'm pleased to discuss this with you. Discussing ideas with people from all over the world is the fascinating part of IT.

I agree :)

xyjbaal commented 2 years ago

FPCC isn't suitable for parts where the geometric center is not on the part (like rings). For the box, I think FPCC might be OK. Although the center point of each box is different, it might be feasible.

[Image: center illustration]

Looking at the code in fpcc_test.py the max_feature_distance is set to None. I guess the images from the paper were generated with the same setting, at least this would explain it from my understanding.

Your understanding is correct. If you want to enable max_feature_distance, its value should be set strictly to the maximum radius of the object.

The results in the paper did not use max_feature_distance.

Levaru commented 2 years ago

Your understanding is correct. If you want to enable max_feature_distance, its value should be set strictly to the maximum radius of the object. The results in the paper did not use max_feature_distance.

I see, this explains a lot. I was very confused at first because of the difference between the code and paper.

FPCC isn't suitable for parts where the geometric center is not on the part (like rings). For the box, I think FPCC might be OK. Although the center point of each box is different, it might be feasible.

I'm sorry but I still don't understand what you mean by using the center of the visible part. The visible part is always different depending on the object position inside the bin.

xyjbaal commented 2 years ago

Yes, the center point of each object is different. You can find the center (c1, c2, c3, ...) of the visible part of each object, and then calculate the center scores of the 1st object's points from c1, the 2nd object's from c2, and so on. I'm not sure if this method will work for your objects, but I think it's worth a try.
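
In code, the label generation would look something like this (a sketch; the β value and clipping follow the center score equation above, and all names are mine):

```python
import numpy as np

def per_object_reference_scores(points, instance_ids, d, beta=1.0):
    """Use the centroid of each object's *visible* points as its reference
    point c_k, then compute center scores object by object.
    `points` is the scanned cloud, `instance_ids` the ground-truth part id
    of every point from the simulation."""
    scores = np.zeros(len(points))
    for k in np.unique(instance_ids):
        mask = instance_ids == k
        c_k = points[mask].mean(axis=0)        # center of the visible part
        dist = np.linalg.norm(points[mask] - c_k, axis=1)
        scores[mask] = np.clip(1.0 - (dist / d) ** beta, 0.0, 1.0)
    return scores
```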