USOO914865OB2
(12) United States Patent (10) Patent No.: US 9,148,650 B2
Chandraker et al. (45) Date of Patent: Sep. 29, 2015
(54) REAL-TIME MONOCULAR VISUAL (58) Field of Classification Search
ODOMETRY USPC ............................................................ 348/46
See application file for complete search history.
(71) Applicant: NEC Laboratories America, Inc., (56) References Cited
Princeton, NJ (US)
U.S. PATENT DOCUMENTS
(72) Inventors: Manmohan Chandraker, Santa Clara,
CA (US); Shiyu Song, Princeton, NJ 6,442,476 B1* 8/2002 Poropat ........................... TO1/23
(US) 8,259.994 B1* 9/2012 Anguelov et al. 382/100
8,761,439 B1* 6/2014 Kumar et al. ...... ... 38.2/103
(73) Assignee: NEC Laboratories America, Inc., 2008/0089556 A1* 4/2008 Salgian et al. ...
2010, 0166294 A1*
... 38.2/103
7, 2010 Marrion et al. ............... 382,154
Princeton, NJ (US)
OTHER PUBLICATIONS
(*) Notice: Subject to any disclaimer, the term of this Tardif et al., “Monocular Visual Odometry in Urban Environments
patent is extended or adjusted under 35
U.S.C. 154(b) by 332 days. Using an Omnidirectional Camera'; Sep. 22-26, 2008; Intelligent
Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Con
(21) Appl. No.: 13/858,040 ference on, 2531-2538.*
ScaramuZZa et al., “Appearance-Guided Monocular Omnidirectional
(22) Filed: Apr. 6, 2013 Visual Odometry for Outdoor Ground Vehicles'. Sep. 16, 2008;
Robotics, IEEE Transactions on; vol. 24, Issue 5: 1015-1026.
(65) Prior Publication Data Nistér et al., “Visual Odometry for Ground Vehicle Applications';
Jan. 26, 2006; Journal of Field Robotics; vol. 23, 1-35.*
US 2014/OO78258 A1 Mar. 20, 2014 Scaramuzza, “l-Point-RANSAC Structure from Motion for Vehicle
Mounted Cameras by Exploiting Non-holonomic Constraints'; Apr.
7, 2011; International Journal of Computer Vision; vol. 95, Issue 1:
Related U.S. Application Data
(60) Provisional application No. 61/725,733, filed on Nov. * cited by examiner
13, 2012, provisional application No. 61/701,877, Primary Examiner — Jamie Atala
filed on Sep. 17, 2012.
Assistant Examiner — Patrick Demosky
(51) Int. C. (74) Attorney, Agent, or Firm — Joseph Kolodka
H04N I3/02 (2006.01) (57) ABSTRACT
G06K 9/00 (2006.01)
G06T 700 (2006.01) Systems and methods are disclosed for multithreaded visual
(52) U.S. C. odometry by acquired with a single camera on-board a
CPC .......... H04N 13/0203 (2013.01); G06T 7/0044 vehicle; using 2D-3D correspondences for continuous pose
(2013.01); G06T 7/0071 (2013.01); H04N estimation; and combining the pose estimation with 2D-2D
13/02 (2013.01); G06T 2207/30244 (2013.01); epipolar search to replenish 3D points.
G06T 2207/30252 (2013.01) 20 Claims, 9 Drawing Sheets
isit: video sequence fross moying vehicle
Monocular sige canner
Steady-state:
Architecture: Pose
Keyframe and
Keyframe + 3
Architectire
Error-Correcting Architectures
.
Recovery
U.S. Patent Sep. 29, 2015 Sheet 1 of 9 US 9,148,650 B2
U.S. Patent Sep. 29, 2015 Sheet 3 of 9 US 9,148,650 B2
&
U.S. Patent Sep. 29, 2015 Sheet 4 of 9 US 9,148,650 B2

U.S. Patent Sep. 29, 2015 Sheet 5 Of 9 US 9,148,650 B2
U.S. Patent Sep. 29, 2015 Sheet 6 of 9 US 9,148,650 B2
;?,
U.S. Patent Sep. 29, 2015 Sheet 7 Of 9 US 9,148,650 B2
U.S. Patent US 9,148,650 B2
3
§
3.
***
{-·?#
U.S. Patent US 9,148,650 B2
US 9,148,650 B2
1. 2
REAL-TIME MONOCULAR VISUAL mous outdoor driving applications. The architecture of the
ODOMETRY system addresses the challenge of robust multithreading even
for scenes with large motions and rapidly changing imagery.
This application is a non-provisional of and claims priority The design is extensible for three or more parallel CPU
to Provisional Application Ser. No. 61/701,877 filed on Sep. threads. The system uses 3D-2D correspondences for robust
17, 2012 and Ser. No. 61/725,733 filed on Nov. 13, 2012, the pose estimation across all threads, followed by local bundle
content of which is incorporated by reference. adjustment in the primary thread. In contrast to prior work,
epipolar search operates in parallel in other threads to gener
BACKGROUND ate new 3D points at every frame. This significantly boosts
10 robustness and accuracy, since only extensively validated 3D
The present invention relates to visual odometry. points with long tracks are inserted at keyframes. Fast-mov
In navigation, odometry is the use of data from the move ing vehicles also necessitate immediate global bundle adjust
ment of actuators to estimate change in position over time ment, which is triggered by the instant keyframe design in
through devices such as rotary encoders to measure wheel parallel with pose estimation in a thread-safe architecture. To
rotations. Visual odometry is the process of determining 15 handle inevitable tracking failures, a recovery method is pro
equivalent odometry information using sequential camera vided. Scale drift is corrected using a mechanism that detects
images to estimate the distance traveled. Visual odometry for (rather than assumes) local planarity of the road by combin
real-world autonomous outdoor driving has gained interest ing information from triangulated 3D points and the inter
for vehicular navigation purposes. While stereo Simulta image planar homography. The system is optimized to output
neous Localization And Mapping (SLAM) systems routinely pose within 50 ms in the worst case, while average case
achieve high accuracy and real-time performance, the chal operation is over 30 fps. Evaluations are presented on the
lenge remains daunting for monocular ones. Yet, monocular challenging KITTI dataset for autonomous driving, where the
systems are attractive for the automobile industry since they system achieves better rotation and translation accuracy than
are cheaper and the calibration effort is lower. Costs of con other systems.
Sumer cameras have steadily declined in recent years, but 25
cameras for practical visual odometry in automobiles are BRIEF DESCRIPTION OF THE DRAWINGS
expensive since they are produced in lesser Volume, must
Support high frame rates and be robust to extreme tempera FIG. 1 shows an exemplary multithreaded architecture
tures, weather and jitters. with 2D-3D correspondences for continuous pose estimation.
The challenges of monocular visual odometry for autono 30 FIG. 2 shows an exemplary pose-guided matching module
mous driving are both fundamental and practical. For to provide fast 3D-2D correspondences.
instance, it has been observed empirically and theoretically FIG.3 shows an exemplary epipolar constrained search.
that forward motion with epipoles within the image is a “high FIG. 4 shows an exemplary local bundle module.
error’ situation for SFM. Vehicle speeds in outdoor environ FIG. 5 shows an exemplary multithreaded keyframe archi
ments can be high, so even with cameras that capture imagery 35 tecture to handle insertion of new 3D points in the main
at high frame rates, large motions may occur between con thread.
secutive frames. This places severe demands on an autono FIG. 6 shows an exemplary collection and refining module
mous driving visual odometry system, necessitating exten that allows bundle adjustment using long tracks.
sive validation and refinement mechanisms that conventional FIG. 7 shows an exemplary real-time global bundle adjust
systems do not require. The timing requirements for visual 40 ment module in a thread-safe architecture with real-time pose
odometry in autonomous driving are equally stringent—a estimation.
pose must be output at every frame in a fixed amount of time. FIG. 8 shows an exemplary scale recovery process per
For instance, traditional systems may produce a spike in formed using a 1-point RANSAC between 3D points recon
timings when keyframes are added, or loop closure is per structed before and after the tracking failure.
formed. 45 FIG. 9 shows an exemplary scale correcting system that
combines scale estimates from 3D points and planar homog
SUMMARY raphy mappings.
In one aspect, Systems and methods are disclosed for mul DESCRIPTION
tithreaded visual odometry by acquired with a single camera 50
on-board a vehicle; using 2D-3D correspondences for con Visual odometry is an inherently sequential operation. This
tinuous pose estimation; and combining the pose estimation is especially true for autonomous navigation as opposed to
with 2D-2D epipolar search to replenish 3D points. indoor or desktop applications, where the possibility of
Advantages of the above aspect may include one or more of repeatedly viewing the same scene structures is high. For
the following. The system makes judicious use of a novel 55 rapidly changing points in the visible field of view, bundle
multithreaded design to ensure that motion estimates (and the adjustment must be per-frame rather than with the interrupt
underlying structure variables) become available only after mechanism of PTAM, else by the time the refined points
extensive validation with long-range constraints and thor become available, they are not useful any more. Thus, design
ough bundle adjustments, but without delay. Thus, the system ing a multithreaded system requires achieving a delicate bal
is optimized for worst-case timing scenarios, rather than the 60 ance between accuracy and latency.
average-case optimization for most traditional systems. In FIG. 1 shows an exemplary multithreaded architecture
particular, the multithreaded system produces pose outputs in with 2D-3D correspondences for continuous pose estimation.
at most 50 ms, regardless of whether a keyframe is added or One embodiment is a multithreaded structure-from-motion
scale correction performed. The average frame rate of the (SFM) architecture 100. The steady-state architecture 100
system is much higher, at above 30 fps. 65 includes a pose local bundle adjustment (LBA) system 101/
The system provides a real-time, accurate, large-scale 103 and an epipolar search unit 102. The multithreaded archi
monocular visual odometry system for real-world autono tecture uses 2D-3D correspondences for continuous pose
US 9,148,650 B2
3 4
estimation in 101/103. This is combined with 2D-2D epipolar refinements, so can be considered highly accurate. The sys
search to replenish 3D points. This architecture allows cam tem architecture at every frame in steady state operation is
era pose estimation and 3D reconstruction using fast moving illustrated in FIG. 1.
imagery acquired with a single camera on-board a vehicle, Around 2000 FAST corners with Shi-Tomasi filtering are
thereby enabling autonomous driving applications. 5 extracted from a typical outdoors image and ORB descriptors
The output of the pose LBA is provided to a keyframe and are computed. Using the pose of the previous frame, the pose
keyframe+ 1 unit 201/202. Unit 201 provides a collection and of the current frame is predicted, assuming constant Velocity.
refinding mechanism allows bundle adjustment using long The system explicitly computes the camera pose at each
tracks, as opposed to two-view estimations for previous frame using correspondences, the motion model is only used
works, while unit 202 handles the keyframe--1 where real 10
as guidance to expedite matching. The existing set of stable
time global bundle adjustment is conducted in a thread-safe 3D points are projected into the image using the predicted
architecture with real-time pose estimation. pose and the ORB descriptor for each is compared to those
The unit tales frame poses and 3D points and performs within a square of side 2r pixels (for exampler-15). Given
refinement on a global bundle to generate refined frame poses these 2D-3D point correspondences, the system computes the
and 3D points that are provided to the feature matching 15
engine of the pose LBA unit 101/103. In parallel, an error actual camera pose using perspective n-point (PnP) pose esti
correcting unit 300 provides error recovery 301 and firewall mation in a RANSAC framework. The particular implemen
302, as discussed in more details below. tation used is EPnP with a model size of 4 points. The
The multithreaded architecture allows elegant extension to RANSAC pose with the largest consensus set is refined using
as many threads as desired. Besides speed advantages, mul a Levenberg-Marquardt nonlinear optimization.
tithreading also greatly contributes to the accuracy and The system can easily handle other choices for matching,
robustness of the system. As an example, consider the epipo in particular, the system has achieved similar results using
lar contrained search. A single-thread version of a system that normalized cross-correlation (NCC) instead of ORB. But
relies on 2D-3D correspondences might update its stable assocaiting a descriptor like ORB with a 3D point can have
point set by performing an epipolar search in the frame pre 25 ancillary benefits, as observed in the following sections.
ceding a keyframe. However, the support for the 3D points Feature and descriptor extraction, pose-guided matching
introduced by this mechanism is limited to just the triplet used and pose estimation are all easily parallelizable across mul
for the circular matching and triangulation. By moving the tiple threads, using a shared memory multiprocessing plat
epipolar search to a separate thread and performing the cir form such as OpenMP. Across three threads, the timings for
cular matching at every frame, the system may supply 3D 30 various components of the pose module are Summarized in
points with tracks of length up to the distance from the pre Table 1 below:
ceding keyframe. Clearly, the set of long tracks provided by
the epipolar thread in the multithread system is far more likely TABLE 1
to be free of outliers. FAST corner detection + Shi-Tomasi. 1 ms
FIG. 2 shows one exemplary pose-guided matching mod 35 ORB descriptor extraction 5 ms
ule 100 (FIG. 1) to provide fast 3D-2D correspondences. At Pose-guided matching 1 ms
steady state, the system has access to a stable set of 3D points. PnP (RANSAC, 500 iterations) 15 ms
The poses of the previous frames have undergone multiple Nonlinear pose refinement 1 ms
non-linear optimizations and are considered accurate. Each
frame is processed by three modules: the pose module esti 40 The pose module 101 is shown in FIG. 3. As shown in FIG.
mates the pose of the current frame and occupies all available 3, the pose module has the following functions: detection of
threads; 2) the epipolar update module uses epipolar con features and ORB descriptors; pose-guided matching (PGM);
strained search (ECS) and triangulation T to prepare new 3d and pose estimation (PnP).
points to be added; and 3) the local bundle adjustment (LBA) If the application scenario involves scenes where the same
unit performs non-linear optimization over a sliding window 45 set of 3D points is viewed for extended periods of time, then
of a few previous frames and takes place in the main thread. the pose module by itself would be sufficient to maintain the
Pose-guided matching with fast 3D-2D correspondences is camera pose. However, in outdoor applications like autono
supported by the architecture of FIG. 2 for every steady state mous navigation, 3D scene points rapidly move out of view
frame. The modules are depicted in their multithreading within a few frames. Thus, the stable set of points used for
arrangement, in correct synchronization order but not to 50 pose computation must be continually updated, which is the
scale. Camera images are provided to a detection module, task entrusted to the epipolar search module 102 of FIG. 1.
then provided to a PGM unit for pose-guided matching, and FIG. 4 shows in more details the epipolar constrained
then to a perspective n-point (PnP) pose estimation unit. The search module 102. As depicted in FIG.4, the epipolar search
outputs of the PnP unit are provided to the LBA unit, then to module is parallelized across two threads and follows pose
a finding unit Rand an update motion model U. The output of 55 estimation at each frame. The mechanism for epipolar search
the PnP unit are also provided to an ECS unit for epipolar is illustrated in FIG. 2. Let the most recent prior keyframe be
constrained search and then to a triangulation unit T. frame 0. After pose computation at frame n, for every feature
To initialize, the system extracts FAST corners with ORB f in the keyframe at location (Xo yo), the system considers a
descriptors and matches between consecutive frames using square of side 2r centered at (Xo yo) in framen. The system
locality sensitive hashing (LSH). With sufficient baseline 60 considers the intersection region of this square with a recti
(around 5 frames), a set of 3D points is initialized by relative linear band p pixels wide, centered around the epipolar line
pose estimation, triangulation and bundle adjustment. Each corresponding to f in framen. The ORB descriptors for all
frame during initialization is processed within 10 ms. FAST corners that lie within this intersection region are com
At steady state, the system has access to a stable set of at pared to the descriptor forf. The closest match, f, is found
least N 3D points that have undergone extensive bundle 65 in terms of Hamming distance. This epipolar matching pro
adjustment in prior frames (in one implementation N-100). cedure is also repeated by computing the closest matchtof, in
The preceding poses have also undergone multiple non-linear frame n-1, call it f. A match is accepted only if f also
US 9,148,650 B2
5 6
matches f. Note that only two sets of matches with respect to descriptors and camera poses from the most recent N frames.
frames (0, n) and (n-1, n) must be computed at the frame n, It also stores images for those N frames, for display and
since the matches between (0, n-1) have already been com debugging purposes. In the system, N=10. The match cache is
puted at frame n-1. a list of tables, one element corresponding to each frame. The
In one embodiment, the parameter r is automatically 5 key into the table is the identity of a 3D point visible in the
determined by the size of the motion, the system uses frame and the stored entries are the identities of the corre
r min{1200||col, 10}, where () is the differential rotation sponding 2D features in various frames.
between frames n-1 and n. Since pose estimates are highly The local bundle adjustment module also has other func
accurate due to continuous refinement by bundle adjustment, tions. After bundle adjustment, the system has a chance to
epipolar lines are deemed accurate and a stringent value of 10 re-find lost 3D points using the optimized pose. Since the
p=3 can be used to impose the epipolar constraint. The Ham system spends considerable effort in maintaining a high
ming distance computation for 256-bit ORB descriptors in a quality set of 3D points for pose computation, it is worthwhile
region of interest is performed as a block, with a fast SSE to incura Small overhead to recover any temporarily lost ones
implementation. To rapidly search for features that lie within (due to image artifacts like blur, specularities or shadows). In
the above region of interest, the detected features in an image 15 fact, a stable 3D point is permanently discarded only when its
are stored in a lookup table data structure. The key into the projection using the current pose falls outside the image
table is the column index of the feature and within each boundaries. Since the bundle adjusted pose is highly accurate,
bucket, features are stored in sorted row order. Across two the system can perform re-finding by matching ORB descrip
threads, this allows circular matching for a triplet of images, tors on FAST corners within a very tight square of side 2r,
with up to 500 features in each, in 10-15 ms. As opposed to pixels (with r, 10). This ensures re-finding is rapidly
brute-force searching, the lookup table results in speedups by achieved within 1 ms.
up to a factor of 10, especially in the autonomous driving One implementation uses the publicly available SBA pack
application where the images traditionally have wide aspect age for bundle adjustment. In parallel, the motion model for
ratios (to cover greater field of view while limiting uninfor predicting the pose of the next frame is also updated in this
mative regions like sky). 25 module. The timings for the parallel epipolar update and local
The features that are circularly matched in frame n are bundle adjustment modules are summarized in Table 2.
triangulated with respect to the most recent keyframe (frame
0). This two-view triangulation requires approximately 2 ms TABLE 2
perframe. The reconstructed 3D point is back-projected in all
the frames 1,..., n-1 and is retained only if a matching ORB 30 Epipolar update and local bundle timings insteady state
descriptor is found within a very tight square of side 2r, pixels parallel modules
(set r=3). This acts as a replacement for a more accurate, but Module Operation Timing
expensive, multiview triangulation and is satisfactory since
epipolar search produces a large number of 3D points, but 2*Epipolar Constrained search 10-15 ms
Update
only the most reliable ones may be used for pose estimation. 35
Triangulation 1-3 ms
However, these 3D points are not added to the stable point 3*Local Bundle Windowed bundle 10-20ms
cloud yet. For that they must first undergo a local bundle adjustment
adjustment and be collected by the main thread at a keyframe, Re-find 3D points 1 ms
which are aspects explained in the following sections. Update motion model Oms
The epipolar constrained search is implemented on a 40
thread of its own to produce per-frame 2D-2D correspon FIG. 6 shows an exemplary multithreaded keyframe archi
dences. For current frame n, only 3D points that are validated tecture 201 to handle insertion of new 3D points in the main
against all frames 1 to n-1 are retained. Only persistent 3D thread. The system architecture for keyframes is similar to
points that survive for greater than L frames may be collected that of FIG. 1, with the addition of a collection and refinding
by the next keyframe. 45 module C+R. It collates persistent 3D points tracked over at
The advantage of the above approach is that the system can least frames in the epipolar thread and re-finds them in the
construct long tracks, so when new 3D points are inserted, current frame using the output of the pose module. The LBA
they are guaranteed to be accurate. To boost robustness, each is now different from that for steady state, since its cache has
3D point is validated againstall the frames in real-time, while been updated with 3D points and their corresponding 2D
prior systems could only do this in computational off-cycles. 50 locations in all the relevant frames on the epipolar thread.
If the application scenario involves scenes where the set of The system cannot maintain steady state indefinitely, since
3D points being viewed remains unchanged, then the pose 3D points are gradually lost due to tracking failures or when
module by itself would be sufficient to maintain the camera they move out of the field of view. The latter is an important
pose over extended periods. However, in outdoor applications consideration in "forward moving systems for autonomous
like autonomous navigation, 3D scene points rapidly move 55 driving (as opposed to “browsing systems such as PTAM),
out of view within a few frames. Thus, the stable set of points so the role of keyframes is very important in keeping the
used for pose computation must be continually updated, system alive. The purpose of a keyframe is threefold:
which is the task entrusted to the epipolar search module. Collect 3D points with long tracks from the epipolar
FIG. 4 shows an exemplary local bundle module 103. The thread, refine them with local bundle adjustment and add
local bundle adjustment refines cameras and 3D points. Data 60 to the set of stable points in the main thread.
structures are implemented to collect and refine 3D points Trigger global bundle adjustment based on previous few
from the epipolar thread. keyframes that refines 3D points and keyframe poses.
To refine camera poses and 3D points incorporating infor Provide the frame where newly added 3D points “reside'.
mation from multiple frames, the system implements a slid The modules that define operations at a keyframe are illus
ing window local bundle adjustment. The key data structure is 65 trated in FIG. 6. The pose module remains unchanged from
the local bundle cache, which is composed of a frame cache the steady state. It is followed by a collection stage, where3D
and a match cache. The frame cache stores feature locations, points triangulated at each frame in the epipolar thread are
US 9,148,650 B2
7 8
gathered by the main thread. Only persistent 3D points that only in a few cameras. Thereafter, the system resumes steady
stem from features matched over at least L frames are col state operation until the next keyframe, unless a recovery or
lected (our circular matching for epipolar search ensures this firewall condition is triggered. The following sections explain
is easily achieved by seeking 3D points only from at least L those concepts in detail.
frames after the previous keyframe). Note that this mecha FIG.8 shows an exemplary system architecture for a recov
nism imposes two necessary conditions for a point to be ery frame. Data from the pose module is sent to feature
considered for inclusion into the stable set it must be visible matching (FM), triangulation (T) and then sent to BA-1 and
in at least two keyframes and must be tracked over at least L BA-2 forbundle adjustments. The data from triangulation and
frames. While stringent, these conditions inherently enhance bundle adjustments are then sent to Sunit for scale recovery.
the chances that only reliable 3D points are added into the 10
On occasions, the system might encounter a frame where
main thread. In the system, L-3 regardless of environment. pose-guided matching fails to find any features (due to imag
The collected 3D points must reside on a keyframe for all ing artifacts or a sudden large motion). In such a situation, a
Subsequent operations, so a re-finding operation is performed
by projecting them using the estimated pose for the frame and recovery mode is triggered, as illustrated in FIG. 8. In this
finding the best ORB match in a circular region of radius 10 15 case, the frame where system recovery initiates is n and k can
pixels. Now the existing stable 3D points, the collected 3D be the immediately preceding keyframe. During recovery, the
points from the epipolar thread, their projections in all the frames (n, n-1) are matched by comparing ORB descriptors
frames within the local bundle cache and the corresponding over the entire image using fast LSH and accepting only
cameras undergo local bundle adjustment. The bundle adjust bidirectional matches. Relative pose is computed using the
ment at keyframes differs from steady state operation, but 5-point algorithm in a robust RANSAC framework and inlier
adding long tracks into the bundle adjustment at keyframes is matches are triangulated. However, scale information is lost
a reason why the system can avoid more expensive multiview in the process. So, the system considers 3D points observed
triangulation at each frame in the epipolar thread. The refined between frames (n-1,k). Both the sets of 3D points are moved
3D points are now ready to be added to the stable set. to the coordinate system of frame n-1 and a 1-point RANSAC
The modules that define operations at the frame immedi 25 is performed. The hypothesis for the RANSAC is the ratio of
ately after a keyframe are illustrated in FIG. 7. The pose the norms of the sampled 3D point in the two sets. The
module re-finds the (new) set of stable 3D points. The pose corrected scale factor between frames (n, n-1) is assigned as
module is now split across only two threads, in order to the average ratio in the largest consensus set. To ensure that
accommodate a global bundle adjustment in the main thread. 3D points used for scale recovery are as accurate as possible,
This bundle adjustment involves the previous K keyframes 30 two instances of bundle adjustments are run in parallel—one
and their associated 3D points, in order to introduce long between frames (n, n-1) and another between frames (n-1,
range constraints to better optimize the newly added set of 3D k).
points. For the system, choosing K=5 allows the global The system is designed to keep repeating the recovery
bundle adjustment to finish within 15 ms. There are two mechanism until a stable set of 3D points is found. In the
reasons a more expensive bundle adjustment involving a 35 experiments, the system always recovers a stable set of 3D
much larger set of previous keyframes (or even the whole points after only one frame of recovery. For sequences in the
map) is not necessary to refine 3D points with long-range KITTI dataset, recovery is required on an average once in
constraints. First, the imagery in autonomous driving appli 1500 frames. In one embodiment of unit 301, scale recovery
cations is fast moving and does not involve repetitions, so is performed using a 1-point RANSAC between 3D points
introducing more keyframes into the global bundle adjust 40 reconstructed before and after the tracking failure.
ment yields at best marginal benefits. Second, the goal is FIG. 9 shows an exemplary scale correcting unit 302 that
instantaneous pose output rather than map-building, so even combines scale estimates from 3D points and planar homog
keyframes are not afforded the luxury of delayed output. This raphy mappings. Scale drift can be a significant problem in
is in contrast to parallel systems like where keyframes may monocular visual odometry. Using global knowledge of the
produce a noticeable spike in the per-frame timings. 45 trajectory for Scale correction, such as loop closure, is not an
In FIG. 7, the previous K frames are provided to a GBA or option for practical autonomous driving applications.
global bundle adjustment unit. The GBA unit usually finishes Instead, the system uses the fact that the camera is mounted a
within the time consumed by the pose module. The cache fixed height above the road plane. Unlike prior methods, the
update module reconciles the 3D points modified by both PnP system is accurate enough to be able to hold scale for
and GBA, before it is used by LBA. Following global bundle 50 extended periods, so while the system computes Scale against
adjustment, the 3D coordinates of all the points are updated. the ground at every frame, the scale correction mechanism is
Note that overlapping sets of 3D points are used by both triggered sparingly. The system automatically determines
global bundle adjustment and pose modules in parallel, how when scale correction may be required—in practice, once in
ever, both may also cause this set to change (PnP may reject approximately 100 frames. Not requiring per-frame scale
3D points that are outliers, while bundle adjustment may 55 correction also ensures that the system is able to tide over
move the position of 3D points). To ensure thread safety, an regions where the road Surface may not be planar.
update module is included that reconciles changes in the 3D The scale detection is implemented in a separate thread. At
point cloud from both the prior parallel modules. The local every frame, the system considers a small region of interest
bundle adjustment module, which simply reads in 3D point closest to the car (mid-section of the lower half of the image).
identities, receives this updated set for optimization based on 60 Features within this region that can be matched to the previ
the N frames in the local bundle cache. In parallel with local ous frame are triangulated. This narrow baseline is used since
bundle adjustment, the epipolar search also makes use of the matching is harder on low-textured roads and the features in
updated keyframe pose. While the keyframe pose has seen a this region of interest rapidly move out of the field of view. A
global bundle adjustment, the pose of the Subsequent frame 3-point RANSAC is used to find the best-fitting plane to these
has not. This does not cause any inconsistency in practice 65 triangulated points and the distance of the plane to the camera
since poses tend to be much more stable than points—a cam is computed as h. If the known height of the camera mount
era is constrained by hundreds of points, but a point is visible ing is ho, the scale correction factor is
US 9,148,650 B2
9 10
Real-time detection of pedestrians and cars can also be done
ho for better handling of crowded scenes.
What is claimed is:
The system computes the planar homography between the 1. A method for multithreaded visual odometry, compris
set of road points in the two frames, using a 4-point RANSAC ing acquiring images with a single camera on-board a vehicle:
where hypotheses are generated by linear estimation of alge using 2D-3D correspondences for continuous pose estima
braic error. This is followed by Levenberg-Marquardt based tion with a pose local bundle adjustment (LBA) system
nonlinear optimization of the optimal symmetric reprojection 10
including a feature matching engine using three or more
error over the largest consensus set: parallel CPU threads with pose estimation across all
threads, followed by local bundle adjustment in the pri
2
mary thread;
minX|x - Hyll + |x,-H'x'',
i
(1) using a local bundle cache including a frame cache and a
15 match cache, wherein the frame cache stores feature
locations, descriptors and camera poses from most
where H is the planar homography that maps homogeneous recent frames and images, wherein the match cache
inlier point set X in the previous frame to the corresponding comprises a list of tables, one element corresponding to
set X in the current frame. The form of the homography H is: each frame with a key into the table being an identity of
a 3D point visible in the frame and stored entries are
identities of corresponding 2D features in the frames;
H = R + 1 -in'. 2
(2) combining the pose estimation with 2D-2D epipolar search
h2 to replenish 3D points; and
generating structure-from-motion (SFM).
25 2. The method of claim 1, comprising using the visual
where (R, t) is the relative pose, n is the unit normal of the odometry for autonomous driving applications.
proposed ground plane and his the distance of the plane from 3. The method of claim 1, comprising pose-guided match
the camera. The heighth may be recovered from Husing a ing to provide fast 3D-2D correspondences.
singular value decomposition of H.H. Again, the scale factor 4. The method of claim 1, comprising performing epipolar
may be computed as the ratio 30 constrained search to produce per-frame 2D-2D correspon
dences.
ho
5. The method of claim 4, comprising constructing long
tracks, and inserting new 3D points guaranteed to be accurate.
6. The method of claim 1, comprising validating each 3D
35 point against all frames in real-time.
Having computed Sands at current frame k, the system 7. The method of claim 1, comprising performing local
compares their values and consider them in mutual agreement bundle adjustment to refine cameras and 3D points.
when they differ by less than 5%. When in mutual agreement, 8. The method of claim 7, comprising collecting and refin
it is likely that the points in consideration actually belong to a ing 3D points from an epipolar thread.
planar ground and a potential scale factor, S. mean(S,S), is 40 9. The method of claim 1, comprising inserting new 3D
available. If |S–1|>0.1, the system polls for scale correction. points in a main thread.
First, it seeks to confirm the computed scales by Verifying 10. The method of claim 1, comprising collecting and
that the previous mutually agreed scale factor was within 5% refinding mechanism allows bundle adjustment using long
ofs. If not, it waits for the next mutually agreed scale factor tracks.
to confirm S. Once confirmed, the system inserts a firewall 45 11. The method of claim 1, comprising performing real
whereby the 3D points and camera poses in the bundle cache time global bundle adjustment in a thread-safe architecture
are scale corrected. The frames preceding the bundle cache with real-time pose estimation.
are now fixed and play no further part in future estimation (in 12. The method of claim 1, comprising handling failures
effect, their data structures are emptied). The ground detec due to tracking or scale drift.
tion and firewall estimation require about 30 ms. 50 13. The method of claim 1, comprising using a 1-point
The multithreaded system enables large-scale, real-time, RANSAC between 3D points reconstructed before and after a
monocular visual odometry, targeted towards autonomous tracking failure.
driving applications with fast-changing imagery. The multi 14. The method of claim 1, comprising scale correcting by
threaded design can boost both the speed and accuracy for combining scale estimates from 3D points and planar homog
handling challenging road conditions. The system is opti 55 raphy mappings.
mized to provide pose output in real-time at every frame, 15. The method of claim 1, comprising planar homography
without delays for keyframe insertion or global bundle adjust between a set of road points in two frames, using a 4-point
ment. This is achieved through a per-frame epipolar search RANSAC where hypotheses are generated by linear estima
mechanism that generates redundantly validated 3D points tion of algebraic error and followed by Levenberg-Marquardt
persistent across long tracks and an efficient keyframe archi 60 based nonlinear optimization of the optimal symmetric
tecture to perform online thread-safe global bundle adjust reprojection error over a largest consensus set:
ment in parallel with pose computation. Further, the system is
accurate enough to require only occasional scale correction,
for which an automatic mechanism is provided that detects minX|x - Hyll + |x, - Hall, 2
planarity of the ground to compute reliable scale factors. 65 i
Multithreaded bundle adjustment can be optimized for small
sized problems that arise in autonomous driving applications.
US 9,148,650 B2
11 12
where H is a planar homography that maps homogeneous CPU threads with pose estimation across all threads,
inlier point set X in a previous frame to a corresponding set X" followed by local bundle adjustment in the primary
in a current frame, where a form of a homography H is: thread:
a local bundle cache including a frame cache and a
1
match cache, wherein the frame cache stores feature
H = R + -in'.
in locations, descriptors and camera poses from most
recent frames and images, wherein the match cache
comprises a list of tables, one element corresponding
where (R, t) is a relative pose, n is a unit normal of a proposed 10
to each frame with a key into the table being an iden
ground plane and his a distance of a plane from a camera and tity of a 3D point visible in the frame and stored
heighth is recovered from Husing a singular value decom entries are identities of corresponding 2D features in
position of HH and a scale factor computed as a ratio the frames;
combining the pose estimation with 2D-2D epipolar
15
search to replenish 3D points.
ho 17. The system of claim 16, comprising code for autono
mous driving applications using the visual odometry.
18. The system of claim 16, comprising code for pose
16. A system for multithreaded visual odometry, compris guided matching to provide fast 3D-2D correspondences.
1ng: 19. The system of claim 16, comprising code to perform
a single camera on-board a vehicle; and epipolar constrained search to produce per-frame 2D-2D cor
a multi-threaded processor coupled to the camera, the respondences.
multi-threaded processor using 2D-3D correspon 20. The system of claim 19, comprising code for construct
dences for continuous pose estimation with a pose ing long tracks, and inserting new 3D points guaranteed to be
acCurate.
local bundle adjustment (LBA) system including a
feature matching engine using three or more parallel