<!DOCTYPE html>
<html>
<head>
<title>Ch. 4 - Geometric Pose Estimation</title>
<meta name="Ch. 4 - Geometric Pose Estimation" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://manipulation.csail.mit.edu/pose.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('manipulation');">
<div data-type="titlepage" pdf="no">
<header>
<h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
<p data-type="subtitle">Perception, Planning, and Control</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020-2021<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2021/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2021 semester. <!-- <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.--></p>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=pick.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=clutter.html>Next Chapter</a></td>
</tr></table>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 3"><h1>Geometric Pose Estimation</h1>
<a href="https://deepnote.com/project/Ch-4-Geometric-Pose-Estimation-zGNA9TdORJqhlYOaPO3sSg/%2Fpose.ipynb" style="float:right; margin-top:-70px;background:white;border:0;" target="pose">
<img src="https://deepnote.com/buttons/launch-in-deepnote-white.svg"></a>
<div style="clear:right;"></div>
<p>In the last chapter, we developed an initial solution to moving objects
around, but we made one major assumption that would prevent us from using it
on a real robot: we assumed that we knew the initial pose of the object. This
chapter is going to be our first pass at removing that assumption, by
developing tools to estimate that pose using the information obtained from the
robot's depth cameras. The tools we develop here will be most useful when you
are trying to manipulate <i>known objects</i> (e.g. you have a mesh file of
their geometry) and are in a relatively <i>uncluttered</i> environment. But
they will also form a baseline for the more sophisticated methods we will
study.</p>
<p>The approach that we'll take here is very geometric. This is in contrast
to, and very complementary with, approaches that are more fundamentally driven
by data. There is no question that deep learning has had an enormous impact
in perception -- it's fair to say that it has enabled the current surge in
manipulation advances -- and we will certainly cover it in these notes. But
when I've heard colleagues say that "all perception these days is based on
deep learning", I can't help but cry foul. There has been another surge of
progress in the last few years that has been happening in parallel: the
revolution in geometric reasoning, fueled by applications in autonomous
driving and virtual/augmented reality in addition to manipulation. Most
advanced manipulation systems these days combine both "deep perception" and
"geometric perception".</p>
<section><h1>Cameras and depth sensors</h1>
<p>Just as we had many choices when selecting the robot arm and hand, we
have many choices for instrumenting our robot/environment with sensors. Even
more so than our robot arms, the last few years have seen incredible
improvements in the quality and reductions in cost and size for these
sensors. This is largely thanks to the cell phone industry, but the race
for autonomous cars has been fueling high-end sensors as well.</p>
<p>These changes in hardware quality have caused sometimes dramatic changes
in our algorithmic approaches. For example, estimation can be much easier
when the resolution and frame rate of our sensors are high enough that not
much can change in the world between two images; this undoubtedly
contributed to the revolutions in the field of "simultaneous localization
and mapping" (SLAM) we have seen over the last decade or so.</p>
<p>One might think that the most important sensors for manipulation are the
touch sensors (you might even be right!). But in practice today, most of
the emphasis is on camera-based and/or range sensing. At the very least, we
should consider this first, since our touch sensors won't do us much good if
we don't know where in the world we need to touch.</p>
<p>Traditional cameras, which we think of as a sensor that outputs a color
image at some framerate, play an important role. But robotics makes heavy
use of sensors that make an explicit measurement of the distance (between
the camera and the world) or depth; sometimes in addition to color and
sometimes in lieu of color. Admittedly, some researchers think we should
only rely on color images.</p>
<subsection><h1>Depth sensors</h1>
<p>Primarily RGB-D (ToF vs projected texture stereo vs ...) cameras and Lidar</p>
<p>The cameras we are using in this course are <a href="https://software.intel.com/en-us/realsense/d400">Intel RealSense D415</a>....</p>
<p>Monocular depth.</p>
<todo>How the kinect works in 2 minutes: https://www.youtube.com/watch?v=uq9SEJxZiUg</todo>
</subsection>
<subsection><h1>Simulation</h1>
<p>There are a number of levels of fidelity at which one can simulate a
camera like the D415. We'll start our discussion here using an "ideal"
RGB-D camera simulation -- the pixels returned in the depth image
represent the true geometric depth in the direction of each pixel
coordinate. In <drake></drake> that is represented by the
<code>RgbdSensor</code> system, which can be wired up directly to the
<code>SceneGraph</code>.</p>
<div>
<script src="htmlbook/js-yaml.min.js"></script>
<script type="text/javascript">
var sys = jsyaml.load(`
name: RgbdSensor
input_ports:
- geometry_query
output_ports:
- color_image
- depth_image_32f
- depth_image_16u
- label_image
- X_WB`);
document.write(system_html(sys, "https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_rgbd_sensor.html"));
</script>
</div>
<p>The signals and systems abstraction here is encapsulating a lot of
complexity. Inside the implementation of that system is a complete
rendering engine, like one that you will find in high-end computer games.
Drake actually supports multiple rendering engines; for the purposes of
this class we will primarily use an OpenGL-based renderer that is suitable
for real-time simulation. In our research we also use Drake's rendering
API to connect to high-end <a
href="https://en.wikipedia.org/wiki/Physically_based_rendering">physically-based
rendering (PBR)</a> based on ray tracing, such as the <a
href="https://www.blender.org/features/rendering/">Cycles renderer
provided by Blender</a>. These are most useful when we are trying to
render higher quality images e.g. to <i>train</i> a deep perception
system; we will discuss them in the deep perception chapters.</p>
<example class="drake"><h1>Simulating an RGB-D camera</h1>
<p><a
href="https://deepnote.com/project/Ch-4-Geometric-Pose-Estimation-zGNA9TdORJqhlYOaPO3sSg/%2Fpose.ipynb"
style="background:none; border:none;" target="pose">
<img
src="https://deepnote.com/buttons/launch-in-deepnote-white.svg"></a></p>
<p>As a simple example of depth cameras in Drake, I've constructed a
scene with a single object (the mustard bottle from the <a
href="https://github.com/RobotLocomotion/models/tree/master/ycb">YCB
dataset</a>), and added an <code>RgbdSensor</code> to the diagram. Once
this is wired up, we can simply evaluate the output ports in order to
obtain the color and depth images:</p>
<figure><img src="data/mustard.svg"/></figure>
<p><img style="height:200px;float:right;margin-left:10px"
src="data/mustard_meshcat.png"/> Please make sure you spend a minute
with the MeshCat visualization (<a href="data/mustard.html">available
here</a>). You'll see a camera and the camera frame, and you'll see the
mustard bottle as always... but it might look a little funny. That's
because I'm displaying both the ground truth mustard bottle model
<i>and</i> the point cloud rendered from the cameras. You can use the
MeshCat GUI to uncheck the ground truth visualization (image right), and you'll be able to see just the point cloud.</p>
<p>Remember that, in addition to looking at the source code if you like,
you can always inspect the block diagram to understand what is happening
at the "systems level". <a href="data/depth_camera_diagram.svg">Here is the diagram</a> used in this example.</p>
</example>
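<p>For readers who want to see the mechanics, here is a minimal sketch of the
kind of port evaluation the notebook performs. It assumes a
<code>diagram</code> has already been built containing an
<code>RgbdSensor</code> subsystem that I've (hypothetically) named "camera";
the port names match the system diagram above.</p>
<pre><code class="language-python">
import matplotlib.pyplot as plt

# Assumes `diagram` was built with a DiagramBuilder and contains an RgbdSensor
# added under the (hypothetical) name "camera".
context = diagram.CreateDefaultContext()
camera = diagram.GetSubsystemByName("camera")
camera_context = camera.GetMyContextFromRoot(context)

# Evaluating the output ports renders the images.
color_image = camera.GetOutputPort("color_image").Eval(camera_context)
depth_image = camera.GetOutputPort("depth_image_32f").Eval(camera_context)

plt.subplot(1, 2, 1); plt.imshow(color_image.data)
plt.subplot(1, 2, 2); plt.imshow(depth_image.data.squeeze())
plt.show()
</code></pre>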
<p>In the <code>ManipulationStation</code> simulation of the entire robot
system for class, the <code>RgbdSensors</code> have already been added,
and their output ports exposed as outputs for the station.</p>
<subsection><h1>Sensor noise and depth dropouts</h1>
<p>Real depth sensors, of course, are far from ideal -- and errors in
depth returns are not simple Gaussian noise, but rather are dependent on
the lighting conditions, the surface normals of the object, and the
visual material properties of the object, among other things. The color
channels are an approximation, too. Even our high-end renderers can
only do as well as the geometries, materials and lighting sources that
we add to our scene, and it is very hard to capture all of the nuances
of a real environment. We will examine real sensor data, and the gaps
between modeled sensors and real sensors, after we start understanding
some of the basic operations.</p>
<figure><img style="width:95%"
src="data/Sweeney18a_fig3.png"/><figcaption>Real depth sensors are
messy. The red pixels in the far right image are missed returns -- the
depth sensor returned "maximum depth". Missed returns, especially on
the edges of objects (or on reflective objects) are a common sighting in
raw depth data. This figure is reproduced from
<elib>Sweeney18a</elib>.</figcaption></figure>
</subsection>
<subsection><h1>Occlusions and partial views</h1>
<p>The view of the mustard bottle in the example above makes one primary
challenge of working with cameras very clear. Cameras don't get to see
everything! They only get line of sight. If you want to get points from
the back of the mustard bottle, you'll need to move the camera. And if
the mustard bottle is sitting on the table, you'll have to pick it up if
you ever want to see the bottom. This situation gets even worse in
cluttered scenes, where we have our views of the bottle <i>occluded</i> by
other objects. Even when the scene is not cluttered, a head-mounted or
table-mounted camera is often blocked by the robot's hand when it goes to
do manipulation -- the moments when you would like the sensors to be at
their best are often exactly when they give no information!</p>
<p>It is quite common to mount a depth camera on the robot's wrist in
order to help with the partial views. But it doesn't completely resolve
the problem. All of our depth sensors have both a maximum range and a
minimum range. The RealSense D415 returns matches from about 0.3m to
1.5m. That means for the last 30cm of the approach to the object --
again, where we might like our cameras to be at their best -- the object
effectively disappears!</p>
<figure><img style="width:80%" src="data/dishloading_cameras.svg">
<figcaption>When luxury permits, we try to put as many cameras as we can
into the robot's environment. In this dish-loading testbed at <a
href="https://www.tri.global/news/tri-taking-on-the-hard-problems-in-manipulation-re-2019-6-27">TRI</a>,
we put a total of eight cameras into the scene (and it still doesn't
resolve the occlusion problem).</figcaption></figure>
</subsection>
</section>
<section><h1>Representations for geometry</h1>
<p>Depth image, point cloud, triangulated mesh, signed distance functions, (+ surfels, occupancy maps, ...)</p>
<p>The data returned by a depth camera takes the form of an image, where
each pixel value is a single number that represents the distance between the
camera and the nearest object in the environment along the pixel direction.
If we combine this with the basic information about the camera's intrinsic
parameters (e.g. lens parameters, stored in the <a
href="https://drake.mit.edu/doxygen_cxx/classdrake_1_1systems_1_1sensors_1_1_camera_info.html"><code>CameraInfo</code></a>
class in Drake) then we can transform this <i>depth image representation</i>
into a collection of 3D points, $s_i$. I use $s$ here because they are
commonly referred to as the "scene points" in the algorithms we will present
below. These points have a pose and (optionally) a color value or other
information attached. This is known as a <i>point cloud representation</i>.
By default, their pose is described in the camera frame, $^CX^{s_i}$, but
the <code>DepthImageToPointCloud</code> system in Drake also accepts $^PX^C$
to make it easy to transform them into another frame (such as the world
frame).</p>
<div>
<script src="htmlbook/js-yaml.min.js"></script>
<script type="text/javascript">
var sys = jsyaml.load(`
name: DepthImageToPointCloud
input_ports:
- depth_image
- color_image (optional)
- camera_pose (optional)
output_ports:
- point_cloud`);
document.write(system_html(sys, "http://drake.mit.edu/doxygen_cxx/classdrake_1_1perception_1_1_depth_image_to_point_cloud.html"));
</script>
</div>
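<p>To make the conversion concrete, here is a small numpy sketch of the
pinhole back-projection that <code>DepthImageToPointCloud</code> performs (in
Drake, the intrinsics come from the <code>CameraInfo</code> class; the focal
lengths and principal point below are made-up values for illustration).</p>
<pre><code class="language-python">
import numpy as np

def depth_image_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into a 3xN array of points in the camera frame."""
    height, width = depth.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    p_C = np.stack([x, y, z]).reshape(3, -1)
    # Drop missed returns (reported as infinite depth).
    return p_C[:, np.isfinite(p_C[2, :])]

# Example usage with a tiny fake depth image (np.inf marks a missed return):
depth = np.array([[1.0, 1.1], [np.inf, 0.9]])
p_C = depth_image_to_point_cloud(depth, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
</code></pre>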
<p>As depth sensors have become so pervasive the field has built up
libraries of tools for performing basic geometric operations on point
clouds, and that can be used to transform back and forth between
representations. We can implement many of the basic operations directly
in Drake, but we will use the <a href="http://www.open3d.org/">Open3D</a>
library if we need more. Many older projects used the <a
href="http://pointclouds.org/">Point Cloud Library (PCL)</a>, which is now
defunct but still has some very useful documentation.</p>
<p>It's important to realize that the conversion of a depth image into a
point cloud does lose some information -- specifically the information
about the ray that was cast from the camera to arrive at that point. In
addition to declaring "there is geometry at this point", the depth image
combined with the camera pose also implies that "there is no geometry in the
straight line path between camera and this point". We will make use of this
information in some of our algorithms, so don't discard the depth images
completely! More generally, we will find that each of the different
geometry representations has strengths and weaknesses -- it is very common
to keep multiple representations around and to convert back and forth
between them.</p>
</section>
<!--
<section><h1>Working with point clouds</h1>
<p>First, let us get a few simulated point clouds to work with. We can use either <drake></drake> or Open3d to convert the depth image into a point cloud; we'll use Drake for this step here simply because we already have the camera parameters in the Drake format. Although we can call the conversion functions directly, we can also add a system that does the conversion for us in the systems framework:</p>
<table align="center" cellpadding="0" cellspacing="0">
<tr align="center">
<td><table cellspacing="0" cellpadding="0">
<tr>
<td align="right" style="padding:5px 0px 5px 0px">depth_image → </td></tr>
<tr>
<td align="right" style="padding:5px 0px 5px 0px">color_image (optional) → </td></tr>
<tr>
<td align="right" style="padding:5px 0px 5px 0px">camera_pose (optional) →</td></tr>
</table>
</td><td align="center" style="border:solid;padding-left:20px;padding-right:20px" bgcolor="#F0F0F0"><a class="el" href="http://drake.mit.edu/doxygen_cxx/classdrake_1_1perception_1_1_depth_image_to_point_cloud.html" title="Converts a depth image to a point cloud.">DepthImageToPointCloud</a></td><td><table cellspacing="0" cellpadding="0">
<tr>
<td align="left" style="padding:5px 0px 5px 0px">→ point_cloud </td></tr>
</table>
</td></tr>
</table>
<p>You can infer from the optional input ports on this system that point cloud representations can potentially include color values for each of the Cartesian points. The following example will get us a few point clouds to work with.</p>
<todo>Combine all point clouds via https://drake.mit.edu/pydrake/pydrake.systems.perception.html?highlight=pointcloud#pydrake.systems.perception.PointCloudConcatenation</todo>
<todo>Show the point cloud in Open3D's visualizer</todo>
There are a number of basic operations that one can do on point clouds. I'll give just a few examples here.
<subsection><h1>Surface normal estimation</h1>
<p>The simplest and most straight-forward algorithm for estimating the normals of a point cloud surface is to find the $k$ nearest neighbors, and fit a plane through those points using least-squares. Let's denote the nominal point (about which we are estimating the normal) $p_0$, and it's $k$ nearest neighboard $p_i, i\in[1,k].$ Let us form the $k \times 3$ data matrix: $$X = [ p_1 - p_0, p_2 - p_0, ..., p_k - p_0 ].$$ Then the principal components of this data matrix are given by the <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a> of $X$ (or the eigenvectors of $X^T X$); if $X = U\Sigma V^T$, then the columns of $V$ are the principal directions, with the last column (associated with the smallest singular value) representing the normal direction. The length can be normalized to one; but keep in mind that the sign of this vector is ambiguous -- the $k$ nearest points don't give any information about what direction might be into the object or out of the object. This is a case where knowing the original camera direction can help -- for example, PCL offers a method <code>flipNormalTowardsViewpoint</code> that post-processes the normals.</p>
<figure>
<img src="http://www.pointclouds.org/assets/images/contents/documentation/features_normal.png" />
<figcaption>Normal estimation using $k$ nearest neighbords (image linked from the <a href="http://docs.pointclouds.org/trunk/group__features.html"></a>PCL documentation</a>)</figcaption>
</figure>
<todo>Add an example; or even support normals in the DepthImageToPointCloud. Example could be as simple as grabbing one point from the mustard point cloud, and finding it's normal (with everything plotted in matplotlib).</todo>
<p>What was interesting and surprising (to me) about this is not that the least-squares solution works. What is surprising is that this operation is considered commonplace and cheap -- for every point in the point cloud, we find $k$ nearest neighbors and do the normal estimation... even for very large point clouds. Making this performant typically requires data structures like the <a href="https://en.wikipedia.org/wiki/K-d_tree">k-d tree</a> to make the nearest neighbor queries efficient.</p>
<p>Another standard feature that can be extracted from the $k$ nearest neighbors is the local curvature. The math is quite similar. <todo>add it here</todo></p>
</subsection>
<subsection><h1>Plane segmentation</h1>
<p>Moving beyond local properties like surface normals and local curvature, we might choose to process a point cloud into primitive shapes -- like boxes or spheres. Another practical and natural segmentation is segmenting surfaces into planes... which we can accomplish with heuristic algorithms which group neighboring points with similar surface normals. We used this effectively in the DARPA Robotics Challenge.</p>
<todo>Insert DRC plane segmentation videos here, from https://drive.google.com/drive/folders/1gYMJ0djBXbevWDBpekkK58pcbtZFr0A0 </todo>
</subsection>
</section>
-->
<section><h1>Point cloud registration with known correspondences</h1>
<p>Let us begin to address the primary objective of the chapter -- we have a
known object somewhere in the robot's workspace, we've obtained a point
cloud from our depth cameras. How do we estimate the pose of the object,
$X^O$?</p>
<p><img style="height:90px;float:left;margin-right:10px"
src="data/icp_model_points_sketch.png"/>One very natural and well-studied
formulation of this problem comes from the literature on <a
href="https://en.wikipedia.org/wiki/Point_set_registration">point cloud
registration</a> (also known as point set registration or scan matching).
Given two point clouds, point cloud registration aims to find a rigid
transform that optimally aligns the two point clouds. For our purposes,
that suggests that our "model" for the object should also take the form of a
point cloud (at least for now). We'll describe that object with a list of
<i>model points</i>, $m_i$, with their pose described in the object frame:
$^OX^{m_i}.$ </p>
<p>Our second point cloud will be the <i>scene points</i>, $s_i$, that we
obtain from our depth camera, transformed (via the camera intrinsics) into
the camera coordinates, $^CX^{s_i}$. I'll use $N_m$ and $N_s$ for the
number of model and scene points, respectively.</p>
<p>Let us assume, for now, that we also know $X^C.$ Is this reasonable? In
the case of a wrist-mounted camera, this would amount to solving the forward
kinematics of the arm. In an environment-mounted camera, this is about
knowing the pose of the cameras in the world. Small errors in estimating
those poses, though, can lead to large artifacts in the estimated point
clouds. We therefore take great care to perform <a
href="station.html">camera extrinsics calibration</a>; I've linked to our
calibration code in the <a href="station.html">appendix</a>. Note that if
you have a mobile manipulation platform (an arm mounted on a moving base),
then all of these discussions still apply, but you likely will perform all
of your registration in the robot's frame instead of the world frame.</p>
<p>The second <b>major assumption</b> that we will make in this section is
"known correspondences". This is <i>not</i> a reasonable assumption in
practice, but it helps us get started. The assumption is that we know which
scene point corresponds with which model point. I'll use a correspondence
vector $c \in [1,N_m]^{N_s}$, where $c_i = j$ denotes that scene point $s_i$
corresponds with model point $m_j$. Note that we do not assume that these
correspondences are one-to-one. Every scene point corresponds to some model
point, but the converse is not true, and multiple scene points may
correspond to the same model point.</p>
<p>As a result, the point cloud registration problem is simply an (inverse)
kinematics problem. We can write the model points and the scene points in a
common frame (here the world frame), $$p^{m_{c_i}} = X^O \: {}^Op^{m_{c_i}}
= X^C \: {}^Cp^{s_i},$$ leaving a single unknown transform, $X^O$, for which
we must solve. In the previous chapter I argued for using differential
kinematics instead of inverse kinematics; why is my story different here?
Differential kinematics can still be useful for perception, for example if
we want to track the pose of an object after we acquire its initial pose.
But unlike the robot case, where we can read the joint sensors to get our
current state, in perception we need to solve this harder problem at least
once.</p>
<p>What does a solution for $X^O$ look like? Each point correspondence gives
us three constraints, since $p^{s_i} \in \Re^3.$ The exact number of unknowns
depends on the particular representation we choose for the pose, but almost
always this problem is dramatically over-constrained. Treating each point
as a hard constraint on the relative pose would also be very susceptible to
measurement noise. As a result, it will serve us better to try to find a
pose that describes the data, e.g., in a least-squares sense: $$\min_{X
\in \mathrm{SE}(3)} \sum_{i=1}^{N_s} \| X \: {}^Op^{m_{c_i}} - X^C \:
{}^Cp^{s_i}\|^2.$$ Here I have used the notation $\mathrm{SE}(3)$ for the
"<a href="http://planning.cs.uiuc.edu/node147.html">special Euclidean
group</a>," which is the standard notation saying the $X$ must be a valid
rigid transform.</p>
<p>To proceed, let's pick a particular representation for the pose to work
with. I will use 3x3 rotation matrices here; the approach I'll describe
below also has an equivalent in quaternion form<elib>Horn87</elib>. To
write the optimization above using the coefficients of the rotation matrix
and the translation as decision variables, I have: \begin{align} \min_{p,
R} \quad& \sum_{i=1}^{N_s} \| p + R \: {}^Op^{m_{c_i}} - X^C
\: {}^Cp^{s_i} \|^2, \\ \subjto \quad& R^T = R^{-1}, \quad \det(R) = +1.
\nonumber \end{align} The constraints are needed because not every 3x3
matrix is a valid rotation matrix; satisfying these constraints ensures that
$R$ <i>is</i> a valid rotation matrix. In fact, the constraint $R^T =
R^{-1}$ is almost enough by itself; it implies that $\det(R) = \pm 1.$ But a
matrix that satisfies that constraint with $\det(R) = -1$ is called an "<a
href="https://en.wikipedia.org/wiki/Improper_rotation">improper
rotation</a>", or a rotation plus a reflection across the axis of
rotation.</p>
<p>Let's think about the type of optimization problem we've formulated so
far. Given our decision to use rotation matrices, the term $p +
R{}^Op^{m_{c_i}} - X^C {^Cp^{s_i}}$ is linear in the decision variables
(think $Ax \approx b$), making the objective a convex quadratic. How about
the constraints? The first constraint is a quadratic equality constraint
(to see it, rewrite it as $RR^T=I$ which gives 9 constraints, each quadratic
in the elements of $R$). The determinant constraint is cubic in the
elements of $R$.</p>
<example id="2D_rotation_only"><h1>Rotation-only point registration in 2D</h1>
<p>I would like you to have a graphical sense for this optimization
problem, but it's hard to plot the objective function with 9+3 decision
variables. To whittle it down to the essentials that I can plot, let's
consider the case where the scene points are known to differ from the
model points by a rotation only (this is famously known as <a
href="https://en.wikipedia.org/wiki/Wahba%27s_problem">the Wahba
problem</a>). To further simplify, let's consider the problem in 2D
instead of 3D.</p>
<p>In 2D, we can actually linearly parameterize a rotation matrix with
just two variables: $$R = \begin{bmatrix} a & -b \\ b & a\end{bmatrix}.$$
With this parameterization, the constraints $RR^T=I$ and the determinant
constraints are identical; they both require that $a^2 + b^2 = 1.$</p>
<figure><img style="width:80%"
src="data/wahba2d.png"/><figcaption>The $x$ and $y$ axes of this
plot correspond to the parameters $a$ and $b$, respectively, that define
the 2D rotation matrix, $R$. The green quadratic bowl is the objective
$\sum_i \| R \: {}^O p^{m_{c_i}} - p^{s_i}\|^2,$ and the red open cylinder is
the rotation matrix constraint. <a
href="https://www.geogebra.org/3d/pn628maw">Click here for the interactive
version.</a></figcaption></figure>
<p>Is that what you expected? I generated this plot using just two
model/scene points; adding more would only change the shape of the
quadratic, not its minimum. And on the <a
href="https://www.geogebra.org/3d/pn628maw">interactive version</a> of the
plot, I've added a slider so that you can control the parameter, $\theta$,
that represents the ground truth rotation described by $R$. Try it
out!</p>
<p>In the case with no noise in the measurements (e.g. the scene points
are exactly the model points modified by a rotation matrix), then the
minimum of the objective function already satisfies the constraint. But
if you add just a little noise to the points, then this is no longer true,
and the constraint starts to play an important role.</p>
</example>
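<p>Since the 2D rotation-only problem is so small, it even admits a
closed-form solution: the objective is linear in $(a,b)$, so the minimum over
the circle $a^2 + b^2 = 1$ is just the normalized coefficient vector. Here is
a short numpy sketch (my own illustration, not code from the notebook):</p>
<pre><code class="language-python">
import numpy as np

def solve_wahba_2d(p_Om, p_s):
    """p_Om, p_s: 2xN arrays of corresponding model and scene points."""
    # Minimizing sum ||R m - s||^2 over R is maximizing a*dot + b*cross
    # subject to a^2 + b^2 = 1.
    dot = np.sum(p_Om[0] * p_s[0] + p_Om[1] * p_s[1])
    cross = np.sum(p_Om[0] * p_s[1] - p_Om[1] * p_s[0])
    a, b = np.array([dot, cross]) / np.hypot(dot, cross)
    return np.array([[a, -b], [b, a]])

theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
p_Om = np.random.rand(2, 10)
R_est = solve_wahba_2d(p_Om, R_true @ p_Om)   # recovers R_true (no noise)
</code></pre>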
<p>The geometry should be roughly the same in 3D, though clearly in much
higher dimensions. But I hope that the plot makes it perhaps a little less
surprising that this problem has an elegant numerical solution based on the
singular value decomposition (SVD).</p>
<p>What about translations? There is a super important insight that allows
us to decouple the optimization of rotations from the optimization of
translations. The insight is this: <i>the <b>relative</b> position between
points is affected by rotation, but not by translation</i>. Therefore, if we
write the points relative to some canonical point on the object, we can
solve for rotations alone. Once we have the rotation, then we can back out
the translations easily. For our least squares objective, there is even a
"correct" canonical point to use -- the <i>central point</i> (like the
center of mass if all points have equal mass) under the Euclidean
metric.</p>
<p>Therefore, to obtain the solution to the problem, \begin{align}
\min_{p\in\Re^3,R\in\Re^{3\times3}} \quad & \sum_{i=1}^{N_s} \| p + R
\: {}^Op^{m_{c_i}} - p^{s_i}\|^2, \\ \subjto \quad & RR^T =
I,\end{align} first we define the central model and scene points, $\bar{m}$
and $\bar{s}$, with positions given by $${}^Op^\bar{m} = \frac{1}{N_s}
\sum_{i=1}^{N_s} {}^Op^{m_{c_i}}, \qquad p^\bar{s} = \frac{1}{N_s}
\sum_{i=1}^{N_s} p^{s_i}.$$ If you are curious, these come from taking the
gradient of the <a
href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrangian</a>
with respect to $p$, and setting it equal to zero: $$\sum_{i=1}^{N_s} 2 (p +
R \: {}^Op^{m_{c_i}} - p^{s_i}) = 0 \Rightarrow p^* = \frac{1}{N_s}
\sum_{i=1}^{N_s} p^{s_i} - R^*\left(\frac{1}{N_s} \sum_{i=1}^{N_s}
{}^Op^{m_{c_i}}\right),$$ but don't forget the geometric interpretation
above. This is just another way to see that we can substitute the
right-hand side in for $p$ in our objective and solve for $R$ by itself.</p>
<p>To find $R$, compose the data matrix $$W = \sum_{i=1}^{N_s} (p^{s_i} -
p^\bar{s}) ({}^Op^{m_{c_i}} - {}^Op^\bar{m})^T, $$ and use SVD to compute $W
= U \Sigma V^T$. The optimal solution is \begin{gather*} R^* = U D V^T,\\
p^* = p^\bar{s} - R^* {}^Op^\bar{m},\end{gather*} where $D$ is the diagonal
matrix with entries $[1, 1, \det(UV^T)]$ <elib>Myronenko09</elib>. This may
seem like magic, but replacing the singular values in SVD with ones gives
the optimal projection back onto the orthonormal matrices, and the diagonal
matrix ensures that we do not get an improper rotation. There are many
derivations available in the literature, but many of them drop the
determinant constraint instead of handling the improper rotation. See
<elib>Haralick89</elib> (section 3) for one of my early favorites.</p>
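<p>Here is a minimal numpy sketch of that closed-form solution (the notebook
has its own implementation; the variable names here are mine):</p>
<pre><code class="language-python">
import numpy as np

def least_squares_transform(p_Om, p_s):
    """p_Om: 3xN model points in the object frame; p_s: 3xN corresponding
    scene points in the world frame. Returns (R, p) with p + R @ p_Om ~ p_s."""
    m_bar = p_Om.mean(axis=1, keepdims=True)
    s_bar = p_s.mean(axis=1, keepdims=True)
    W = (p_s - s_bar) @ (p_Om - m_bar).T
    U, Sigma, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guards against improper rotations
    R = U @ D @ Vt
    p = s_bar - R @ m_bar
    return R, p
</code></pre>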
<example class="drake"><h1>Point cloud registration with known
correspondences</h1>
<figure><img style="width:70%; margin-top:-20px; margin-bottom:-20px"
src="data/pose_from_known_correspondences.svg"/> </figure>
<p>In the example code, I've made a random object (based on a random set
of points in the $x$-$y$ plane), and perturbed it by a random 2D
transform. The blue points are the model points in the model coordinates,
and the red points are the scene points. The green dashed lines represent
the (perfect) correspondences. On the right, I've plotted both points
again, but this time using the estimated pose to put them both in the
world frame. As expected, the algorithm perfectly recovers the ground
truth transformation.</p>
<p><a
href="https://deepnote.com/project/Ch-4-Geometric-Pose-Estimation-zGNA9TdORJqhlYOaPO3sSg/%2Fpose.ipynb"
style="background:none; border:none;" target="pose">
<img
src="https://deepnote.com/buttons/launch-in-deepnote-white.svg"></a></p>
</example>
<todo>Add exercise on the reflection case.</todo>
<todo>Confirm or disprove: i think that this is just the least squares
solution to the unconstrained problem, then projected back to the
orthonormal matrices (with the SVD).</todo>
<p>What is important to understand is that once the correspondences are
given, we have an efficient and robust numerical solution for estimating the
pose.</p>
</section>
<section><h1>Iterative Closest Point (ICP)</h1>
<p>So how do we get the correspondences? In fact, if we were given the pose
of the object, then figuring out the correspondences is actually easy: the
corresponding point on the model is just the nearest neighbor / closest
point to the scene point when they are transformed into a common frame.</p>
<p>This observation suggests a natural iterative scheme, where we start with
some initial guess of the object pose and compute the correspondences via
closest points, then use those correspondences to update the estimated pose.
This is one of the famous and often used (despite its well-known
limitations) algorithms for point cloud registration: the iterative closest
point (ICP) algorithm.</p>
<p>To be precise, let's use $\hat{X}^O$ to denote our estimate of the object
pose, and $\hat{c}$ to denote our estimated correspondences. The
"closest-point" step is given by $$\forall i, \hat{c}_i = \argmin_{j \in
N_m}\| \hat{X}^O \: {}^Op^{m_j} - p^{s_i} \|^2.$$ In words, we want to
find the point in the model that is the closest in Euclidean distance to the
transformed scene points. This is the famous "nearest neighbor" problem,
and we have good numerical solutions (using optimized data structures) for
it. For instance, Open3D uses <a
href="https://github.com/mariusmuja/flann">FLANN</a><elib>Muja09</elib>.</p>
<p>Although solving for the pose and the correspondences jointly is very
difficult, ICP leverages the idea that if we solve for them independently,
then both parts have good solutions. Iterative algorithms like this are a
common approach taken in optimization for e.g. bilinear optimization or
expectation maximization. It is important to understand that this is a
local solution to a non-convex optimization problem. So it is subject to
getting stuck in local minima.</p>
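<p>Here is a sketch of the basic loop, reusing the
<code>least_squares_transform</code> helper from the earlier sketch and
scipy's k-d tree for the nearest-neighbor queries (the notebook's
implementation differs in its details):</p>
<pre><code class="language-python">
import numpy as np
from scipy.spatial import cKDTree

def icp(p_Om, p_s, R0=np.eye(3), p0=np.zeros((3, 1)), max_iters=50):
    """Assumes least_squares_transform from the sketch above is in scope."""
    R, p = R0, p0
    for _ in range(max_iters):
        # Correspondence step: nearest transformed model point for each scene point.
        model_in_world = (R @ p_Om + p).T
        _, c = cKDTree(model_in_world).query(p_s.T)
        # Pose step: closed-form solution given the estimated correspondences.
        R, p = least_squares_transform(p_Om[:, c], p_s)
    return R, p
</code></pre>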
<example class="drake"><h1>Iterative Closest Point</h1>
<figure>
<iframe style="border:0;height:300px;width:400px;" src="data/iterative_closest_point.html?height=240px" pdf="no"></iframe>
<p pdf="only"><a href="data/iterative_closest_point.html">Click here for the animation.</a></p>
</figure>
<p>Here is ICP running on the random 2D objects. Blue are the model
points, red are the scene points, green dashed lines are the
correspondences. I encourage you to run the code yourself.</p>
<p>I've included one of the animations where it happens to converge to the
true optimal solution. But it gets stuck in a local minimum more often
than not! I hope that stepping through will help your intuition.
Remember that once the correspondences are correct, the pose estimation
will recover the exact solution. So every one of those steps before
convergence has at least a few bad correspondences. Look closely!</p>
<p><a
href="https://deepnote.com/project/Ch-4-Geometric-Pose-Estimation-zGNA9TdORJqhlYOaPO3sSg/%2Fpose.ipynb"
style="background:none; border:none;" target="pose">
<img
src="https://deepnote.com/buttons/launch-in-deepnote-white.svg"></a></p>
</example>
<!--<todo>Thanks to the geometric nature of the problem, many of these local
minima are graphical and intuitive. (Will insert a few examples
here).</todo> -->
<p>Intuition about these local minima has motivated a number of ICP
variants, including point-to-plane ICP, normal ICP, ICP that use color
information, feature-based ICP, etc.</p>
<!---
<p>Maybe more fundamental, though, is the fact that I don't think the ICP objective, even if we could optimize it perfectly is quite right. (TODO: add the example of a long thin book). And there is information that we have, which should aid in estimating the pose, which is not present in the point cloud (like the location of the cameras).</p>
-->
</section>
<!--
<section><h1>Pairwise distance</h1>
</section>
-->
<section><h1>Dealing with partial views and outliers</h1>
<p>The example above is sufficient to demonstrate the problems that ICP can
have with local minima. But we have so far been dealing with unreasonably
perfect point clouds. Point cloud registration is an important topic in
many fields, and not all of the approaches you will find in the literature
are well suited to dealing with the messy point clouds we encounter
in robotics.</p>
<example class="drake"><h1>ICP with messy point clouds</h1>
<p>I've added a number of parameters for you to play with in the notebook
to add outliers (scene points that do not actually correspond to any model
point), to add noise, and/or to restrict the points to a partial view.
Please try them by themselves and in combination.</p>
<p><a
href="https://deepnote.com/project/Ch-4-Geometric-Pose-Estimation-zGNA9TdORJqhlYOaPO3sSg/%2Fpose.ipynb"
style="background:none; border:none;" target="pose">
<img
src="https://deepnote.com/buttons/launch-in-deepnote-white.svg"></a></p>
<figure style="margin-top:-40px;">
<iframe style="border:0;height:300px;width:400px;" src="data/icp_outliers.html?height=240px" pdf="no"></iframe>
<p pdf="only"><a href="data/icp_outliers.html">Click here for the animation.</a></p>
<figcaption>A sample run of ICP with outliers (modeled as additional
points drawn from a uniform distribution over the
workspace).</figcaption>
</figure>
<figure>
<iframe style="border:0;height:300px;width:400px;" src="data/icp_noise.html?height=240px" pdf="no"></iframe>
<p pdf="only"><a href="data/icp_noise.html">Click here for the animation.</a></p>
<figcaption>A sample run of ICP with Gaussian noise added to the positions of the scene points.</figcaption>
</figure>
<figure>
<iframe style="border:0;height:300px;width:400px;" src="data/icp_partial_view.html?height=240px" pdf="no"></iframe>
<p pdf="only"><a href="data/icp_partial_view.html">Click here for the animation.</a></p>
<figcaption>A sample run of ICP with scene points limited to a partial view. They aren't all this bad, but when I saw that one, I couldn't resist using it as the example!</figcaption>
</figure>
</example>
<subsection><h1>Detecting outliers</h1>
<p>Once we have a reasonable estimate, $X^O$, then detecting outliers is
straight-forward. Each scene point contributes a term to the objective
function according to the (squared) distance between the scene point and
its corresponding model point. Scene points that have a large distance
can be labeled as outliers and removed from the point cloud, allowing us
to refine the pose estimate without these distractors. You will find that
many ICP implementations, such as <a
href="http://www.open3d.org/docs/release/python_api/open3d.pipelines.registration.registration_icp.html">this
one from Open3D</a>, include a "maximum correspondence distance" parameter
for this purpose.</p>
<p>This leaves us with a "chicken or the egg" problem -- we need a
reasonable estimate of the pose to detect outliers, but for very messy
point clouds it may be hard to get a reasonable estimate in the presence
of those outliers. So where do we begin? One common approach that was
traditionally used with ICP is an algorithm called RANdom SAmple Consensus
(RANSAC) <elib>Fischler81</elib>. In RANSAC, we select a number of random
(typically quite small) subsets of "seed" points, and compute a pose
estimate for each. The subset with the largest number of "inliers" --
points that are below the maximum correspondence distance -- is considered
the winning set and can be used to start the ICP algorithm.
</p>
<todo>RANSAC example</todo>
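<p>Here is a rough sketch of that RANSAC-style initialization (my own
illustration, assuming we already have a set of putative correspondences --
e.g. from features -- and the <code>least_squares_transform</code> helper from
the earlier sketch):</p>
<pre><code class="language-python">
import numpy as np

def ransac_pose(p_Om, p_s, num_samples=100, seed_size=3, max_distance=0.01):
    """p_Om, p_s: 3xN putatively corresponding model and scene points."""
    rng = np.random.default_rng()
    best_R, best_p, best_inliers = np.eye(3), np.zeros((3, 1)), 0
    for _ in range(num_samples):
        # Fit a pose to a small random "seed" subset of the correspondences.
        idx = rng.choice(p_s.shape[1], size=seed_size, replace=False)
        R, p = least_squares_transform(p_Om[:, idx], p_s[:, idx])
        # Count the inliers under this candidate pose.
        distances = np.linalg.norm(R @ p_Om + p - p_s, axis=0)
        num_inliers = np.count_nonzero(distances <= max_distance)
        if num_inliers > best_inliers:
            best_R, best_p, best_inliers = R, p, num_inliers
    return best_R, best_p
</code></pre>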
<p>There is another important and clever idea to know. Remember the
important observation that we can decouple the solution for rotation and
translation by exploiting the fact that the <i>relative</i> position
between points depends on the rotation but is invariant to translation.
We can take that idea one step further and note that the <i>relative
<b>distance</b></i> between points is actually invariant to both rotation
<i>and</i> translation. This has traditionally been used to separate the
estimation of scale from the estimation of translation and rotation (I've
decided to ignore scaling throughout this chapter, since it doesn't make
immediate sense for our proposed problem). In <elib>Yang20a</elib> they
proposed that this can also be used to prune outliers even before we have
an initial pose estimate. If the distance between a <i>pair</i> of
points in the scene is unlike any distance between any pair of points in
the model, then one of those two points is an outlier. Finding the
maximum set of inliers by this metric would correspond to finding the
maximum clique in a graph; one can make strong assumptions to guarantee
that the maximum clique is the best match, or simply admit that this
could be a useful heuristic. If finding maximum cliques is too expensive
for large point clouds, conservative approximations may be used
instead.</p>
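<p>A small sketch of that pairwise-distance consistency check (illustration
only; <elib>Yang20a</elib> builds a much more sophisticated and scalable
version of this idea):</p>
<pre><code class="language-python">
import numpy as np
from scipy.spatial.distance import cdist, pdist

def pairwise_distance_consistency(p_Om, p_s, tolerance=0.01):
    """Returns an N_s x N_s boolean matrix whose (i, k) entry is True if the
    distance between scene points i and k is close to *some* pairwise model
    distance -- i.e. scene points i and k could both be inliers."""
    model_distances = pdist(p_Om.T)            # all pairwise model distances
    scene_distances = cdist(p_s.T, p_s.T)      # N_s x N_s
    gap = np.abs(scene_distances[..., None] - model_distances)
    return gap.min(axis=-1) <= tolerance
</code></pre>
<p>Finding the largest set of scene points that are mutually consistent under
this matrix is exactly the maximum-clique problem mentioned above.</p>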
</subsection>
<subsection><h1>Point cloud segmentation</h1>
<p>Not all outliers are due to bad returns from the depth camera; very
often they are due to other objects in the scene. Even in the most basic
case, our object of interest will be resting on a table or in a bin. If
we know the geometry of the table/bin a priori, then we can subtract away
points from that region. Alternatively, we can <a
href="http://www.open3d.org/docs/release/tutorial/geometry/pointcloud.html#Crop-point-cloud">crop</a>
the point cloud to a region of interest. This can be sufficient for very
simple or uncluttered scenes, and will be just enough for us to accomplish
our goal for this chapter.</p>
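<p>A minimal sketch of cropping a point cloud to an axis-aligned region of
interest (the box bounds here are made-up values; in practice they would come
from the known table or bin geometry):</p>
<pre><code class="language-python">
import numpy as np

def crop_to_box(p_W, lower=(0.4, -0.3, 0.01), upper=(0.8, 0.3, 0.5)):
    """p_W: 3xN points in the world frame. Returns only the points inside the box."""
    lower = np.asarray(lower).reshape(3, 1)
    upper = np.asarray(upper).reshape(3, 1)
    mask = np.all((p_W >= lower) & (p_W <= upper), axis=0)
    return p_W[:, mask]
</code></pre>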
<p>More generally, if we have multiple objects in the scene, we may still
want to find the object of interest (mustard bottle?). If we had truly
robust point cloud registration algorithms, then one could imagine that
the optimal registration would allow us to correspond the model points
with just a subset of the scene points; point cloud registration could
solve the object detection problem! Unfortunately, our algorithms aren't
strong enough.</p>
<figure><img src="data/labelfusion_multi_object_scenes.png"/>
<figcaption>A multi-object scene from <a
href="http://labelfusion.csail.mit.edu/">LabelFusion</a>
<elib>Marion17</elib>.</figcaption>
</figure>
<p>As a result, we almost always use some other algorithm to "segment" the
scene in a number of possible objects, and run registration independently
on each segment. There are numerous geometric approaches to
segmentation<elib>Nguyen13</elib>. But these days we tend to use neural
networks for segmentation, so I won't go into those details here.</p>
<todo> normal estimation, plane fitting, etc.</todo>
</subsection>
<subsection><h1>Generalizing correspondence</h1>
<p>The problem of outliers or multiple objects in the scene challenges our
current approach to correspondences. So far, we've used the notation $c_i
= j$ to represent the correspondence of <i>every</i> scene point to a
model point. But note the asymmetry in our algorithm so far (scene
$\Rightarrow$ model). I chose this direction because of the motivation of
partial views from the mustard bottle example from the chapter opening.
Once we start thinking about tables and other objects in the scene,
though, maybe we should have done model $\Rightarrow$ scene? You should
verify for yourself that this would be a minor change to the ICP
algorithm. But what if we have multiple objects <i>and</i> partial
views?</p>
<p>It would be simple enough to allow $c_i$ to take a special value for
"no correspondence", and indeed that helps. In that case, we would hope
that, at the optimum, the scene points corresponding to the model of
interest would be mapped to their respective model points, and the scene
points from outliers and other objects get labelled as "no
correspondence". The model points that are from the occluded parts of the
object would simply not have any correspondences associated with them.</p>
<p>There is an alternative notation that we can use which is slightly more
general. Let's denote the correspondence matrix $C\in \{0,1\}^{N_s \times
N_m}$, with $C_{ij} = 1$ iff scene point $i$ corresponds with model point
$j$. Using this notation, we can write our estimation step (given known
correspondences) as: $$\min_{X\in \mathrm{SE}(3)} \sum_{i=1}^{N_s}
\sum_{j=1}^{N_m} C_{ij} \|X \: {}^Op^{m_j} - p^{s_i}\|^2.$$ This
subsumes our previous approach: if $c_i = j$ then set $C_{ij} = 1$ and the
rest of the row equal to zero. If the scene point $i$ is an outlier, then
set the entire row to zero. The previous notation was a more compact
representation for the true asymmetry in the "closest point" computation
that we did above, and made it clear that we only needed to compute $N_s$
distances. But this more general notation will take us to the next
round.</p>
</subsection>
<subsection><h1>Soft correspondences</h1>
<p>What if we relax our strict binary notion of correspondences, and think
of them instead as correspondence weights $C_{ij} \in [0,1]$ instead of
$C_{ij} \in \{0, 1\}$? This is the main idea behind the "coherent point
drift" (CPD) algorithm <elib>Myronenko10</elib>. If you read the CPD
literature you will be immersed in a probabilistic exposition using a
Gaussian Mixture Model (GMM) as the representation. The probabilistic
interpretation is useful, but don't let it confuse you; trust me that it's
the same thing. To maximize the likelihood on a Gaussian, step one is
almost always to take the $\log$ because it is a monotonic function, which
gets you right back to our quadratic objective.</p>
<div>The probabilistic interpretation does give a natural mechanism for
setting the correspondence weights on each iteration of the algorithm, by
thinking of them as the probability of the scene points given the model
points: \begin{equation} C_{ij} = \frac{1}{a_i}
\exp\left(-\frac{\left\| X^O \: {}^Op^{m_j} - p^{s_i}\right\|^2}{2\sigma^2}
\right),\end{equation} which is the standard density function for a
Gaussian, where $\sigma$ is the standard deviation and $a_i$ is the proper
normalization constant to make the probabilities sum to one. It is also
natural to encode the probability of an outlier in this formulation (it
simply modifies the normalization constant). <details><summary>Click to
see the normalization constant, but it's really an implementation
detail.</summary> The normalization works out to be \begin{equation}a_i =
\sum_{j=1}^{N_m} \exp\left(-\frac{\left\| X^O \: {}^Op^{m_j} -
p^{s_i}\right\|^2}{2\sigma^2}\right) + (2\pi\sigma^2)^\frac{3}{2}
\frac{w}{1-w}\frac{N_m}{N_s}, \end{equation} with $0 \le w \le 1$ the
probability of a sample point being an outlier
<elib>Myronenko10</elib>.</details></div>
<p>The CPD algorithm is now very similar to ICP, alternating between
assigning the correspondence weights and updating the pose estimate. The
pose estimation step is almost identical to the procedure above, with the
"central" points now $${^Op^{\bar{m}}} = \frac{1}{N_C} \sum_{i,j} C_{ij}
\: {}^Op^{m_j}, \quad p^{\bar{s}} = \frac{1}{N_C} \sum_{i,j} C_{ij} p^{s_i},
\quad N_C = \sum_{i,j} C_{ij},$$ and the data matrix now: $$W = \sum_{i,j}
C_{ij} \left(p^{s_i} - p^{\bar{s}}\right) \left({}^Op^{m_j} -
{}^Op^{\bar{m}}\right)^T.$$ The rest of the updates, for extracting $R$
using SVD and solving for $p$ given $R$, are the same. You won't see the
sums explicitly in the code, because each of those steps has a
particularly nice matrix form if we collect the points into a matrix with
one point per column.</p>
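<p>To make the update concrete, here is a sketch of a single
soft-correspondence iteration (a simplified version of the CPD-style update
described above, with the outlier term and the $\sigma$ update omitted for
brevity):</p>
<pre><code class="language-python">
import numpy as np

def soft_correspondence_update(p_Om, p_s, R, p, sigma=0.05):
    # "E-step": correspondence weights C (N_s x N_m), each row normalized.
    model_in_world = R @ p_Om + p                                    # 3 x N_m
    sq_dist = np.sum((p_s.T[:, None, :] - model_in_world.T[None, :, :])**2, axis=-1)
    C = np.exp(-sq_dist / (2 * sigma**2))
    C /= C.sum(axis=1, keepdims=True)
    # "M-step": the weighted version of the SVD solution.
    N_C = C.sum()
    m_bar = (p_Om @ C.sum(axis=0) / N_C).reshape(3, 1)
    s_bar = (p_s @ C.sum(axis=1) / N_C).reshape(3, 1)
    W = (p_s - s_bar) @ C @ (p_Om - m_bar).T
    U, _, Vt = np.linalg.svd(W)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    p = s_bar - R @ m_bar
    return R, p, C
</code></pre>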
<p>The probabilistic interpretation also gives us a strategy for
determining the covariance of the Gaussians on each iteration.
<elib>Myronenko10</elib> derives the $\sigma$ estimate as: (TODO: finish
converting this part of the derivation)<!-- $$\sigma^2 = \frac{1}{3 N_C}
\sum_{i,j} C_{ij} \left[\left(p^{s_i} - p^{\bar{s}}\right)^T \left(p^{s_i}
- p^{\bar{s}}\right) + \right].$$ --> </p>
<todo>Example with CPD</todo>
<p>The word on the street is that CPD is considerably more robust than ICP
with its hard correspondences and quadratic objective; the Gaussian model
mitigates the effect of outliers by setting their correspondence weight to
nearly zero. But it is also more expensive to compute all of the pairwise
distances for large point clouds. In <elib>Gao18a</elib>, we point out
that a small reformulation (thinking of the scene points as inducing a
distribution over the model points, instead of vice versa) can get us back
to summing over the scene points only. It enjoys many of the robustness
benefits of CPD, but also the performance of algorithms like ICP.</p>
<todo>Example with FilterReg</todo>
</subsection>
<subsection><h1>Nonlinear optimization</h1>
<p>All of the algorithms we've discussed so far have exploited the SVD
solution to the pose estimate given correspondences, and alternate between
estimating the correspondences and estimating the pose. There is another
important class of algorithms that attempt to solve for both
simultaneously. This makes the optimization problem nonconvex, which
suggests they will still suffer from local minima like we saw in the
iterative algorithms. But many authors argue that the solution times
using nonlinear solvers can be on par with the iterative algorithms (e.g.
<elib>Fitzgibbon03</elib>).</p>
<figure><img style="width:80%" src="data/robust_kernels.png"><figcaption>A
variety of objective functions for robust registration. <a
href="https://www.geogebra.org/calculator/hcgacefe">Click here for the
interactive version</a>.</figcaption></figure>
<p>Another major advantage of using nonlinear optimization is that these
solvers can accommodate a wide variety of objective functions to achieve
the same sort of robustness to outliers that we saw from CPD. I've
plotted a few of the popular objective functions above. The primary
characteristic of these functions is that they taper, so that larger
distances eventually do not result in higher cost. Importantly, like in
CPD, this means that we can consider all pairwise distances in our
objective, because if outliers add a constant value (e.g. 1) to the cost,
then they have no gradient with respect to the decision variables and no
impact on the optimal solution. We can therefore write, for our favorite
loss function, $l(x)$, an objective of the form, e.g. $$\min \sum_{i,j}
\left[ l\left(\| X^O \: {}^Op^{m_j} - p^{s_i} \|\right)\right].$$ And what
are the decision variables? We can also exploit the additional
flexibility of the solvers to use minimal parameterizations -- e.g.
$\theta$ for rotations in 2D, or Euler angles for rotations in 3D. The
objective function is more complicated but we can get rid of the
constraints.</p>
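<p>As a concrete (if simplistic) example of this approach, here is a sketch
that hands a 2D version of the problem -- a minimal parameterization
$(\theta, p)$ and a tapering Geman-McClure-style loss over all pairwise
distances -- to a generic nonlinear solver. This is my own illustration, not
the formulation from any particular paper:</p>
<pre><code class="language-python">
import numpy as np
from scipy.optimize import minimize

def robust_pose_2d(p_Om, p_s, sigma=0.1, x0=np.zeros(3)):
    """p_Om: 2xN_m model points; p_s: 2xN_s scene points. Returns (theta, p)."""
    def cost(x):
        theta, p = x[0], x[1:].reshape(2, 1)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        model_in_world = R @ p_Om + p
        sq_dist = np.sum((p_s.T[:, None, :] - model_in_world.T[None, :, :])**2, axis=-1)
        # Robust loss: tapers to 1 for large distances, so outliers add a constant.
        return np.sum(sq_dist / (sq_dist + sigma**2))
    result = minimize(cost, x0)
    return result.x[0], result.x[1:]
</code></pre>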
<example><h1>Nonlinear pose estimation with known correspondences</h1>
<p>To understand precisely what we are giving up, let's consider a
warm-up problem where we use nonlinear optimization on the minimal
parameterizations for the pose estimation problem. As long as we don't
add any constraints, we can still separate the solutions for rotation
from the solutions for translation. So consider the problem:
$$\min_\theta \sum_{i} \left\| \begin{bmatrix} \cos\theta & -\sin\theta
\\ \sin\theta & \cos\theta \end{bmatrix} \left({}^Op^{m_{c_i}} -
{}^Op^{\bar{m}}\right) - \left( p^{s_i} -
p^{\bar{s}}\right)\right\|^2.$$ We now have a complex, nonlinear
objective function. Have we introduced local minima into the
problem?</p>
<p><img
style="height:100px;float:right;margin-left:10px;margin-right:10px"
src="data/model_points_circle.png"/> As a thought experiment, consider
the problem where all of our model points lie on a perfect circle. For
any one scene point $i$, in the rotation-only optimization, the worst
case is when our estimate $\theta$ is 180 degrees ($\pi$ radians) away
from the optimal solution. The cost for that point would be $4r^2$,
where $r$ is the radius of the circle. In fact, using the law of
cosines, we can actually write the squared distance for the point for
any error, $\theta_{err} = \theta - \theta^*,$ as $$\text{distance}^2 =
r^2 + r^2 - 2 r^2 \cos\theta_{err}.$$ And in the case of the circle,
every other model point contributes the same cost. This is not a convex
function, but every minimum is a globally optimal solution (just wrapped
around by $2\pi$).</p>
<p>In fact, even if we have a more complicated, non-circular, geometry,
then this same argument holds. Every point will incur an error as it
rotates around the circle, but may have a different radius. The error
for all of the model points will decrease as $\cos\theta_{err}$ goes to
one. So every minimum is a globally optimal solution. We haven't
actually introduced local minima (yet)!</p>
<!-- TODO: Is this still true in 3D? -->
<p>It is the correspondences, and other constraints that we might add to
the problem, that really introduce local minima.</p>
</example>
<!--
<example><h1>Nonlinear optimization for pose estimation</h1>
</example>
-->
<subsubsection><h1>Precomputing distance functions</h1>
<p>There is an exceptionally nice trick for speeding up the computation
in these nonlinear optimizations, but it does require us to go back to
thinking about the minimum distance from a scene point to the model, as
opposed to the pairwise distance between points. This, I think, is not
a big sacrifice now that we have stopped enumerating correspondences
explicitly. Let's slightly modify our objective above to be of the
form $$\min \sum_{i} \left[ l\left( \min_j \| X^O \: {}^Op^{m_j} -
p^{s_i} \|\right)\right].$$ In words, we can apply an arbitrary robust
loss function, but we will only apply it to the minimum distance
between the scene point and the model.</p>
<p>The nested $\min$ functions look a little intimidating from a