GeoNet++: Enhanced Depth & Normal Estimation

The paper presents GeoNet++, a geometric neural network designed for joint depth and surface normal estimation from a single image, enhancing predictions through edge-aware refinement. It introduces depth-to-normal and normal-to-depth modules to improve the quality of predictions by leveraging geometric relationships, resulting in better 3D scene reconstructions. Additionally, a new evaluation metric, 3D geometric metric (3DGM), is proposed to assess the quality of depth predictions in 3D applications, with experimental results demonstrating the effectiveness of GeoNet++ on standard datasets.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1

GeoNet++: Iterative Geometric Neural Network with Edge-Aware Refinement for Joint Depth and Surface Normal Estimation
Xiaojuan Qi, Zhengzhe Liu, Renjie Liao, Philip H. S. Torr, Raquel Urtasun, and Jiaya Jia

arXiv:2012.06980v1, 13 Dec 2020

Abstract—In this paper, we propose a geometric neural network with edge-aware refinement (GeoNet++) to jointly predict both depth and surface normal maps from a single image. Building on top of two-stream CNNs, GeoNet++ captures the geometric relationships between depth and surface normals with the proposed depth-to-normal and normal-to-depth modules. In particular, the “depth-to-normal” module exploits the least square solution of estimating surface normals from depth to improve their quality, while the “normal-to-depth” module refines the depth map based on the constraints on surface normals through kernel regression. Boundary information is exploited via an edge-aware refinement module. GeoNet++ effectively predicts depth and surface normals with strong 3D consistency and sharp boundaries resulting in better reconstructed 3D scenes. Note that GeoNet++ is generic and can be used in other depth/normal prediction frameworks to improve the quality of 3D reconstruction and pixel-wise accuracy of depth and surface normals. Furthermore, we propose a new 3D geometric metric (3DGM) for evaluating depth prediction in 3D. In contrast to current metrics that focus on evaluating pixel-wise error/accuracy, 3DGM measures whether the predicted depth can reconstruct high-quality 3D surface normals. This is a more natural metric for many 3D application domains. Our experiments on NYUD-V2 [1] and KITTI [2] datasets verify that GeoNet++ produces fine boundary details, and the predicted depth can be used to reconstruct high-quality 3D surfaces. Code has been made publicly available.

Index Terms—Depth estimation, surface normal estimation, 3D point cloud, 3D geometric consistency, 3D reconstruction, edge-aware,
convolutional neural network (CNN), geometric neural network.

• X. Qi is with the Department of Electrical and Electronic Engineering, University of Hong Kong, Hong Kong.
• Z. Liu is with DJI corporation, Shenzhen, China.
• R. Liao and R. Urtasun are with Uber ATG, University of Toronto, and Vector Institute, Toronto, Canada.
• P. Torr is with the Department of Engineering Science, University of Oxford, Oxford, United Kingdom.
• J. Jia is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong.
• X. Qi and Z. Liu share the first-authorship.

1 INTRODUCTION

We tackle the important problem of jointly estimating depth and surface normals from a single RGB image. This 2.5D geometric information is beneficial to various computer vision tasks, including structure from motion (SfM), 3D reconstruction, pose estimation, object recognition, and scene classification. Depth and surface normals are typically employed in many application domains that require 3D understanding of the scene, e.g., robotics, virtual reality, and human-computer interactions, to name a few.

There exists a large body of work on depth [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] and surface normal estimation [6], [14], [15], [16], [13] from a single image. Most previous methods independently perform depth and normal estimation, potentially leading to inconsistent predictions and poor 3D surface reconstructions. As shown in the wall regions of Fig. 2 (d), the predicted depth map could be distorted in planar regions. Utilizing the fact that the surface normal does not change in such regions could help denoise planar surfaces.

This motivates us to exploit the geometric relationship between depth and surface normals. We use the example in Fig. 1 as an illustration. On the one hand, the surface normal is determined by the tangent plane to the 3D points, which can be estimated from their depth; on the other hand, depth is constrained by the local surface of the tangent plane determined by the surface normal. Several approaches have tried to incorporate geometric relationships into traditional models via hand-crafted features [3], [17]. However, little research has been done in the context of neural networks. This is the focus of our paper.

Fig. 1. The geometric relationship between depth and surface normals. The point cloud is obtained by casting depth values into 3D via the pinhole camera model. Surface normals are estimated from the point cloud by solving a system of linear equations; depth is constrained by the local plane determined by neighboring points and their surface normals.

One possible design is to build a convolutional neural network (CNN) to directly learn such geometric relationships from data. However, our experiments in Sec. 8 demonstrate that existing CNN architectures (e.g., VGG-16) cannot predict good normals from depth. We found that the training always converges to very poor local minima, even with carefully tuned architectures and
hyperparameters. Another challenge stems from pooling operations and large receptive fields, which make current architectures perform poorly near object boundaries. We refer the reader to Fig. 2 (b) where the results are blurry on the black bounding box. This phenomenon is amplified when viewing the results in 3D. As shown in Fig. 2 (d), the points inside the red bounding box are scattered in 3D due to the blurry boundaries. It is therefore problematic for robotic applications where obstacle detection and avoidance are needed for safety.

[Figure 2 panels: (a) Input, (b) Depth (DORN [18]), (c) Normal (DORN [18]), (d) 3D (DORN [18]), (e) 3D (GT), (f) Depth (Ours), (g) Normal (Ours), (h) 3D (Ours)]

Fig. 2. Visual illustrations of depth/normal maps and the reconstructed 3D point cloud. (a) is the input image. (b) is the depth map from a state-of-the-art approach DORN [18]. (c) is the surface normal derived from (b). (d) is the corresponding point cloud visualization of (b). (e) shows the ground-truth point cloud. (f) is the depth map from our approach. (g) shows the surface normal derived from (f). (h) shows the corresponding point cloud visualization of (f). The normal maps (DORN [18] and Ours) are computed from the corresponding point cloud shown in (d) and (h) respectively using the least-squares fitting provided in [1] followed by TV-denoising.

The above facts motivate us to design a new architecture that explicitly incorporates and enforces 3D geometric constraints considering object boundaries. Towards this goal, we propose Geometric Neural Network with Edge-Aware Refinement (GeoNet++), which integrates geometric constraints and boundary information into a CNN. This contrasts with previous works [15], [5], [18], which focus on designing new network architectures [5] or loss functions more tailored for the task [18].

The overall system (see Fig. 5) has a two-stream backbone CNN, which predicts initial depth and surface normals from a single image respectively. With initial depth and surface normals, GeoNet++ (see Fig. 3) is utilized to incorporate geometric constraints by modeling depth-to-normal (see Fig. 4 (a) – (L)) and normal-to-depth (see Fig. 4 (b) – (L)) mapping, and to introduce boundary information with edge-aware refinement (see Fig. 4 (c)). Our “depth-to-normal” module relies on least-squares and residual sub-modules, while the “normal-to-depth” module updates the depth estimates via kernel regression. Guided by the learned propagation weights, our “edge-aware refinement module” sharpens boundary predictions and smooths out noisy estimations. Our framework enforces the final depth and surface normal prediction to follow the underlying 3D constraints, which directly improves 3D surface reconstruction quality. Note that GeoNet++ can be integrated into other CNN backbones for depth or surface normal prediction. Importantly, the overall system can be trained end-to-end.

Our final contribution is a new geometric-related evaluation metric for depth prediction which measures the quality of the 3D surface reconstruction. This metric directly measures the local 3D surface quality by casting the predicted depth into 3D point clouds. It is better correlated with the end goals of the 3D tasks. Experimental results on NYUD-V2 [1] and KITTI [2] datasets show that our GeoNet++ achieves decent performance, while being more efficient.

Difference from our Conference Paper: This manuscript significantly improves the conference version [19]: (i) we introduce an edge-aware propagation network to improve the prediction at boundaries, facilitating the generation of better point clouds; (ii) we develop an iterative scheme to progressively improve the quality of predicted depth and surface normals; (iii) we propose a new evaluation metric to measure 3D surface reconstruction accuracy; (iv) we conduct additional experiments and analysis on the KITTI [2] dataset; (v) we empirically show that GeoNet++ can be incorporated into previous methods [5], [12], [18] to further improve the results, especially the 3D reconstruction quality; (vi) from both qualitative and quantitative perspectives, our results are significantly better compared to [19], especially in 3D metrics.

The rest of the paper is organized as follows. Sec. 2 reviews the literature on depth and surface normal prediction. In Sec. 3, we elaborate on our GeoNet++ model. We conduct experiments and show more detailed analysis in Sections 4 – 8. We draw our conclusion in Sec. 9.

2 RELATED WORK

The 2.5D geometry estimation from a single image has been intensively studied. Previous works can be roughly divided into two categories based on whether deep neural networks have been used.

Traditional methods do not use deep neural networks and mainly focus on exploiting low-level image cues and geometric
constraints. For example, [20] estimates the mean depth of the scene by recognizing the structures presented in the image, and inferring the scene scale. Based on Markov random fields (MRF), Saxena et al. [3] predicted a depth map given hand-crafted features of a single image. Vanishing points and lines are utilized in [21] for recovering the surface layout [22]. Liu et al. [4] leveraged predicted semantic segmentation to incorporate geometric constraints. A scale-dependent classifier was proposed in [23] to jointly learn semantic segmentation and depth estimation. Shi et al. [24] showed that estimating defocus blur is beneficial for recovering the depth map. Favaro et al. [25] proposed to learn a set of projection operators from blurred images, which are further utilized to estimate the 3D geometry of the scene from novel blurred images. In [17], a unified optimization problem was formed aiming at recovering the intrinsic scene properties, e.g., shape, illumination, and reflectance from shading. Relying on specially designed features, the above methods directly incorporate geometric constraints.

Many deep learning methods were recently proposed for single-image depth and/or surface normal prediction. Eigen et al. [5] directly predicted the depth map by feeding the image to a CNN. Shelhamer et al. [26] proposed a fully convolutional network (FCN) to learn the intrinsic decomposition of a single image, which involves inferring the depth map as the first intermediate step. Recently, Ma et al. [27] incorporated the physical rule in multi-image intrinsic decomposition for single image intrinsic decomposition. In [6], a unified coarse-to-fine hierarchical network was adopted for depth/normal prediction. Continuous conditional random fields (CRFs) were proposed in [12] to fuse information derived from CNN outputs. Fu et al. [18] introduced ordinal regression loss to help the optimization process and achieve better performance. In [11], continuous CRFs were built on top of CNN to smooth super-pixel-based depth prediction. For predicting single-image surface normals, Wang et al. [14] incorporated local, global, and vanishing point information in designing the network architecture. Reconstruction loss has been exploited in [28] for unsupervised depth estimation from a single image. Follow-up work [29] introduced a left-right consistency constraint. A skip-connected architecture has also been proposed in [15] to fuse hidden representations of different layers for surface normal estimation.

All these methods regard depth and surface normal predictions as independent tasks, thus ignoring their basic geometric relationship that also influences the quality of the reconstructed surface. Recently, a few works [30], [31], [32], [8] jointly reason about multiple tasks. Wang et al. [30] designed CRFs to fuse semantic segmentation prediction and depth estimation. Xu et al. [31] proposed a hierarchical framework to first predict depth, surface normal, edge maps, and semantic segmentation, and then fuse them together for the final depth and semantic map prediction. Zhang et al. [32] learned depth prediction and semantic segmentation with the recursive refinement to progressively refine the predicted depth and semantic segmentation. All these approaches focus on modifying either CNN architectures or loss functions to make the model better fit the data without explicitly considering the geometric property. In contrast, our approach explicitly incorporates geometric constraints by designing edge-aware geometric modules, which is orthogonal to previous works.

The most related work to ours is that of [30], which has a CRF with a 4-stream CNN, considering the consistency of predicted depth and surface normal in planar regions. Nevertheless, it may fail when planar regions are uncommon in images. Moreover, the iterative inference in CRF and the Monte Carlo sampling strategy make the approach suffer from the heavy computational cost. In comparison, our GeoNet++ exploits the geometric relationship between depth and surface normal for general situations without making any planar or curvature assumption. Our model is more efficient compared to the iterative inference of CRF in [30].

Recently, the depth-normal consistency was utilized for depth completion [33] from a single image and unsupervised depth-normal estimation [34] from monocular videos. In this paper, we focus on deploying geometric constraints to improve depth and surface normal estimation from a single image and analyzing its influence on 3D surface reconstruction.

3 GEONET++

In this section, we first introduce the overall architecture of our GeoNet++, and then elaborate on its components.

3.1 Overall Architecture

The overall architecture of GeoNet++ is illustrated in Fig. 3. Based on the initial depth map predicted by the backbone CNNs (Sec. 4.1), we apply the depth-to-normal module (Sec. 3.2) to transfer the initial depth map to the normal map as shown in Fig. 4 (a) – (L). This module refines the surface normals with the initial depth map considering geometric constraints. Similarly, given the initial surface normal estimation, we generate the depth using the normal-to-depth module (Sec. 3.3). This enhances the depth prediction by incorporating the inherent geometric constraints to the estimation of depth from normals. The depth/normal maps generated with the above components are then adjusted via the depth (normal) ensemble module (Sec. 3.4). Furthermore, by removing noisy results and refining boundary predictions, the edge refinement network as described in Sec. 3.5 further improves the predictions and generates the refined results as shown in Fig. 3 (d)-(e). Finally, GeoNet++ can be applied iteratively by taking the refined results from the previous iteration as inputs as described in Sec. 3.6.

3.2 Depth-to-Normal Module

Learning geometrically consistent surface normals from depth via directly applying neural networks is surprisingly hard as discussed in Sec. 8. To this end, we propose a depth-to-normal transformation module that explicitly incorporates depth-normal consistency into deep neural networks. We start our discussion with the least-squares module, viewed as a fixed-weight neural network. We then describe the residual sub-module that aims at smoothing and combining the results with initial normals as in Fig. 4 (a) – (L).

Pinhole camera model. As a common practice, we adopt the pinhole camera model. We denote $(u_i, v_i)$ as the location of pixel $i$ in the 2D image. Its corresponding location in 3D space is $(x_i, y_i, z_i)$, where $z_i$ is the depth. Based on the geometry of perspective projection, we have

$$x_i = (u_i - c_x)\, z_i / f_x, \qquad y_i = (v_i - c_y)\, z_i / f_y, \tag{1}$$

where $f_x$ and $f_y$ are the focal lengths along the $x$ and $y$ directions respectively, and $c_x$ and $c_y$ are the coordinates of the principal point.

Least square sub-module. We formulate the inference of surface normals from a depth map as a least-squares problem. Specifically, for any pixel $i$, given its depth $z_i$, we first compute its 3D coordinates $(x_i, y_i, z_i)$ from its 2D coordinates $(u_i, v_i)$ relying on
the pinhole camera model. To compute the surface normal of pixel $i$, we need to determine the tangent plane, which crosses pixel $i$ in 3D space. We follow the traditional assumption that pixels within a local neighborhood lie on the same tangent plane. In particular, we define the set of neighboring pixels, including pixel $i$ itself, as

$$N_i = \{(x_j, y_j, z_j) \mid |u_i - u_j| < \beta,\ |v_i - v_j| < \beta,\ |z_i - z_j| < \gamma z_i\},$$

where $\beta$ and $\gamma$ are hyperparameters controlling the size of the neighborhood along the $x$, $y$, and depth axes respectively. With these pixels on the tangent plane, the surface normal estimate $n = [n_x, n_y, n_z]$ should satisfy the over-determined linear system of equations

$$An = b, \quad \text{subject to } \|n\|_2^2 = 1, \tag{2}$$

where

$$A = \begin{bmatrix} x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots \\ x_K & y_K & z_K \end{bmatrix} \in \mathbb{R}^{K \times 3}, \tag{3}$$

and $b \in \mathbb{R}^{K \times 1}$ is a constant vector. $K$ is the size of $N_i$, i.e., the set of neighboring points. The least-squares solution of this problem, which minimizes $\|An - b\|^2$, can be computed in closed form as

$$n = \frac{(A^\top A)^{-1} A^\top \mathbf{1}}{\|(A^\top A)^{-1} A^\top \mathbf{1}\|_2}, \tag{4}$$

where $\mathbf{1} \in \mathbb{R}^{K}$ is a vector with all-one elements. It is not surprising that Eq. (4) can be regarded as a fixed-weight neural network, which predicts surface normals given the depth map.

Residual sub-module. This least-squares module occasionally produces noisy surface normal estimation (see Fig. 4 (a): “Rough Normal”) due to issues like noise and improper neighborhood size. To further improve the quality, we propose a residual module, which consists of a 3-layer CNN with skip-connections as shown in Fig. 4 (a). The goal is to smooth the noisy estimation from the least-squares module.

Overall architecture. The architecture of the depth-to-normal module is illustrated in Fig. 4 (a) – (L). By explicitly leveraging the geometric relationship between depth and surface normals, our network circumvents the aforementioned difficulty in learning geometrically consistent depth and surface normals. Note that the module can be incorporated and jointly fine-tuned with other networks that predict depth maps from raw images.

[Figure 3 diagram: (a) Initial Depth → Depth to Normal → Normal Ensemble → Edge Refinement → (d) Refined Normal; (b) Initial Normal → Normal to Depth → Depth Ensemble → Edge Refinement → (e) Refined Depth; (c) Input Image; both branches are marked “Iterative”.]

Fig. 3. The overall structure of GeoNet++. GeoNet++ takes as inputs initial depth estimation (a), initial normal estimation (b), and an input image (c). The initial depth and surface normal are first refined with depth-to-normal and normal-to-depth modules. Then, the depth (normal) ensemble module is adopted to combine results from the initial estimation. It is followed by the edge refinement module, which reduces noise and refines boundaries. GeoNet++ can be applied multiple times by iteratively taking the refined normal and depth as inputs.

3.3 Normal-to-Depth Module

Now we turn our attention to the normal-to-depth module. For any pixel $i$, given its surface normal $(n_{ix}, n_{iy}, n_{iz})$ and an initial estimate of depth $z_i$, the goal is to refine its depth.

First, note that given the 3D point $(x_i, y_i, z_i)$ and its surface normal $(n_{ix}, n_{iy}, n_{iz})$, we can uniquely determine the tangent plane $P_i$, which satisfies the following equation

$$n_{ix}(x - x_i) + n_{iy}(y - y_i) + n_{iz}(z - z_i) = 0. \tag{5}$$

As explained in Sec. 3.2, we assume that pixels within a small neighborhood of $i$ lie on this tangent plane $P_i$. This neighborhood $M_i$ is defined as

$$M_i = \{(x_j, y_j, z_j) \mid n_j^\top n_i > \alpha,\ |u_i - u_j| < \beta,\ |v_i - v_j| < \beta\},$$

where $\beta$ is a hyperparameter controlling the size of the neighborhood along the $x$ and $y$ axes, $\alpha$ is a threshold that rules out spatially close points that are not approximately coplanar, and $(u_i, v_i)$ are the coordinates of pixel $i$ in the 2D image.

For any pixel $j \in M_i$, if the depth $z_j$ is given, we can compute the depth estimate of pixel $i$ as $z'_{ji}$ relying on Eqs. (1) and (5) as

$$z'_{ji} = \frac{n_{jx} x_j + n_{jy} y_j + n_{jz} z_j}{(u_i - c_x)\, n_{jx} / f_x + (v_i - c_y)\, n_{jy} / f_y + n_{jz}}. \tag{6}$$

To refine the depth of pixel $i$, we then use kernel regression to aggregate the estimation from all pixels in the neighborhood as

$$\hat{z}_i = \frac{\sum_{j \in M_i} K(n_j, n_i)\, z'_{ji}}{\sum_{j \in M_i} K(n_j, n_i)}, \tag{7}$$

where $\hat{z}_i$ is the refined depth, $n_i = [n_{ix}, n_{iy}, n_{iz}]$ and $K$ is the kernel function. We use linear kernels (i.e., cosine similarity) to measure the similarity between $n_i$ and $n_j$, i.e., $K(n_j, n_i) = n_j^\top n_i$. In this case, the smaller the angle between normals $n_i$ and $n_j$ is, which means the higher probability that pixels $i$ and $j$ are
in the same tangent plane, the more contribution the estimate $z'_{ji}$ makes to the estimate of $\hat{z}_i$.

The above process is illustrated in Fig. 4 (b) – (L). It can be viewed as a voting process where every pixel $j \in M_i$ gives a “vote” to determine the depth of pixel $i$. By utilizing the geometric relationship between surface normal and depth, we efficiently improve the quality of the depth estimate without any weights to learn.

[Figure 4 panels: (a) Depth-to-normal (L) and normal ensemble (R) modules: Initial Depth → Least Square Module → Rough Normal → Residual Module → Geo-refined Normal, fused with Initial Normal into Geo-EN-refined Normal. (b) Normal-to-depth (L) and depth ensemble (R) modules: Initial Depth and Initial Normal → Depth Estimation → Kernel Regression → Geo-refined Depth → Geo-EN-refined Depth. L: left shaded box, R: right shaded box. (c) Edge-aware refinement module for depth: Geo-EN-Refined Depth and Input → Weight Map Predictor (Canny Edge Extractor, Canny Edge, Residual Maps, Weight Maps) → Recursive Propagator → Final Depth. (d) Edge-aware refinement module for surface normal: same layout, from Geo-EN-Refined Normal to Final Normal. In (c) and (d), residual (weight) maps include “left to right”, “right to left”, “top to bottom”, “bottom to top”.]

Fig. 4. GeoNet++ components. (a) The depth-to-normal module (L) first estimates “Rough Normal” from the “Initial Depth” with least-squares fitting; normals are then refined by the residual module producing “Geo-refined Normal”; a normal ensemble network (R) is utilized to fuse the initial and Geo-refined normals generating “Geo-EN-refined Normal”. (b) The normal-to-depth module (L) takes the “Initial Depth” and “Initial Normal” as inputs; the normal map helps propagate the initial depth prediction to neighbors; depth estimates are aggregated by the kernel regression module producing “Geo-refined Depth”. The depth ensemble module (R) taking “Geo-refined Depth” and “Initial Depth” as inputs further improves prediction generating “Geo-EN-refined Depth”. (c) The edge-aware refinement module first constructs direction-aware propagation “Weight Maps” by combining low-level edges with “Residual Maps”; the recursive propagator utilizes the learned weight maps to refine “Geo-EN-refined Depth” producing “Final Depth”; (d) the edge-aware refinement module for surface normal. Please zoom in to see more details.

3.4 Depth (Normal) Ensemble Module

To further enhance the prediction quality, the “Initial Depth (Normal)” from the backbone network and the “Geo-refined Depth” from the geometric refinement are combined together with the ensemble module illustrated in Fig. 4 (a) – (R) for surface normal and Fig. 4 (b) – (R) for depth. In the following, we detail the depth ensemble module. The normal ensemble module shares a similar architecture.

The depth ensemble module takes as inputs “Initial Depth” from the backbone network and “Geo-refined Depth” (Fig. 4 (b)) from the geometric module, and produces a refined depth, “Geo-EN-refined Depth”, as shown in Fig. 4 (b). To enlarge the receptive field of the ensemble module, the input is first processed with 3 convolution layers with a dilation rate of 2, kernel size 3 × 3, and channel number 128. This is followed by another 2 dilation-free convolution layers with kernel size 3 × 3 and channel number 128.
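The geometric steps of Secs. 3.2 and 3.3 can be sketched per pixel as below. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the function names (`backproject`, `window`, `normal_at`, `refine_depth_at`), the hyperparameter values for β, γ, and α, and the synthetic plane are all hypothetical, and `np.linalg.lstsq` stands in for the closed form of Eq. (4), which it matches when $A$ has full column rank.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Eq. (1): lift an H x W depth map to an H x W x 3 map of 3D points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v is the row index, u the column index
    return np.stack([(u - cx) * depth / fx, (v - cy) * depth / fy, depth], axis=-1)


def window(u0, v0, h, w, beta):
    """Pixels (u, v) inside the image with |u - u0| < beta and |v - v0| < beta."""
    us = range(max(0, u0 - beta + 1), min(w, u0 + beta))
    vs = range(max(0, v0 - beta + 1), min(h, v0 + beta))
    return [(u, v) for v in vs for u in us]


def normal_at(points, u0, v0, beta=3, gamma=0.05):
    """Eqs. (2)-(4): least-squares normal at pixel (u0, v0) over the set N_i."""
    h, w, _ = points.shape
    z0 = points[v0, u0, 2]
    A = np.array([points[v, u] for u, v in window(u0, v0, h, w, beta)
                  if abs(points[v, u, 2] - z0) < gamma * z0])
    # lstsq minimises ||A n - 1||^2, i.e. (A^T A)^{-1} A^T 1 for full-rank A
    n, *_ = np.linalg.lstsq(A, np.ones(len(A)), rcond=None)
    return n / np.linalg.norm(n)


def refine_depth_at(depth, normals, u0, v0, fx, fy, cx, cy, beta=3, alpha=0.95):
    """Eqs. (6)-(7): kernel-regression depth at pixel (u0, v0) over the set M_i."""
    h, w = depth.shape
    points = backproject(depth, fx, fy, cx, cy)
    ray = np.array([(u0 - cx) / fx, (v0 - cy) / fy, 1.0])  # denominator terms of Eq. (6)
    num = den = 0.0
    for u, v in window(u0, v0, h, w, beta):
        n_j = normals[v, u]
        k = float(n_j @ normals[v0, u0])  # linear kernel K(n_j, n_i)
        if k > alpha:  # membership test of M_i
            num += k * float(n_j @ points[v, u]) / float(n_j @ ray)  # vote z'_ji
            den += k
    return num / den


# Synthetic check (assumed values): an 11 x 11 view of the plane n_true . p = 2.
fx = fy = 100.0
cx = cy = 5.0
n_true = np.array([0.2, -0.1, 1.0])
n_true /= np.linalg.norm(n_true)
v, u = np.mgrid[0:11, 0:11]
depth = 2.0 / (n_true[0] * (u - cx) / fx + n_true[1] * (v - cy) / fy + n_true[2])

n_est = normal_at(backproject(depth, fx, fy, cx, cy), 5, 5)

noisy = depth.copy()
noisy[5, 5] += 0.1  # corrupt only the centre pixel
normals = np.broadcast_to(n_true, (11, 11, 3))
z_ref = refine_depth_at(noisy, normals, 5, 5, fx, fy, cx, cy)
```

On this plane the estimated normal coincides with `n_true`, and the refined depth at the corrupted pixel moves from an error of 0.1 to 0.1/25 = 0.004, because 24 of the 25 equally weighted votes come from uncorrupted neighbors. The paper applies these operations densely as fixed-weight layers; the per-pixel loops here only serve readability.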
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 6

3.5 Edge-aware Refinement Module

Depth
We design an edge-aware refinement module (see Fig. 4 (c) & Fig. 4 Backbone
(d)) to further enhance the prediction inspired by [35]. This module (b) Initial Depth GeoNet++ (d) Final Normal
enhances the boundary prediction and removes noisy predictions
Normal
by gradually aggregating the information from neighboring pixels. (a) Input Image Backbone
This process is guided by a set of learned weight maps (see Fig. 4: (c) Initial Normal (e) Final Depth
“Weight Maps”). The edge-aware refinement contains two sub-
modules, i.e., the weight map predictor and the recursive propagator. Fig. 5. Our end-to-end trainable full system with backbone architecture
This module uses the same architecture for both the depth and and GeoNet++.
normal prediction so that we only elaborate on the details for the
depth as below.
normal maps from previous iterations can further serve as the
Weight map predictor. The weight map predictor contains a inputs to GeoNet++ for iterative refinement. Note that we only
canny edge operator extracting low-level edge information and a apply this iteratively during the inference. In the training phase,
residual network to learn edge-aware propagation weights. The GeoNet++ is only applied once to reduce the memory consumption
output is fused by element-wise summation as shown in Fig. 4 (c). and improve the training efficiency.
The residual network contains 3 convolutional layers with ReLU
nonlinearity, dilation rate 2, kernel size 3 × 3, and channel number End-to-end network training. Our full system is shown in
32, followed by 3 convolution layers without dilation using the Fig. 5. The backbone network produces the initial depth and
same parameter setting as above. Finally, a convolution layer with surface normal maps, which are further refined with GeoNet++ by
kernel size 1 × 1 and channel number 4 is adopted to produce the incorporating the geometric constraint and the edge information.
edge weight maps W ∈ RH×W ×4 , where H and W are image The whole system can be trained end-to-end. In the following, we
spatial size, and the 4 channels represent propagation weights in explain our loss function for training the full system. We denote the
gt
four directions, i.e., left to right (L→R), right to left (R→L), top initial, refined, and ground-truth depth of pixel i as zi , ẑi and zi
gt
to bottom (T →B ), and bottom to top (B→T ). The learned edge respectively. Similarly, we denote ni , n̂i , and ni initial, refined,
weight maps for depth refinement are illustrated in Fig. 4 (c). The and ground-truth surface normals respectively. The overall loss
corresponding maps for surface normal are shown in Fig. 4 (d), function is the summation of two losses, one for the depth and
where a larger value means a higher chance to be near boundaries. one for the normals, L = ldepth + lnormal . The depth loss ldepth is
expressed as
Recursive propagator. The recursive propagator takes as inputs !
the edge weight maps W and the input signal X (depth or normal 1 X gt 2
X gt 2
maps in our case), and recursively refines the input signal for T ldepth = zi − zi 2 + η ẑi − zi 2 , (9)
M i i
times by applying the following operations
L→R : 1,t
S(i,j) t
= (1 − W(i,j,1) )Xi−1,j t
+ W(i,j,1) Xi,j , with M the total number of pixels. The surface normal loss is
2,t 1,t 1,t
!
R→L : S(i,j) = (1 − W(i,j,2) )Si+1,j + W(i,j,2) Si,j , 1 X gt 2
X gt 2
3,t 2,t 2,t
lnormal = ni − ni 2 + λ n̂i − ni 2 . (10)
T →B : S(i,j) = (1 − W(i,j,3) )Si,j−1 + W(i,j,3) Si,j , M i i
4,t 3,t 3,t
B→T: $S^{4,t}_{(i,j)} = (1 - W_{(i,j,4)})\, S^{4,t}_{(i,j+1)} + W_{(i,j,4)}\, S^{3,t}_{(i,j)}$,
$X^{(t+1)}_{(i,j)} = S^{4,t}_{(i,j)}$, (8)
where $X^{0} = X$. $S^{1,t}$, $S^{2,t}$, $S^{3,t}$, and $S^{4,t}$ represent intermediate results after “L→R”, “R→L”, “T→B”, and “B→T” propagation at step $t$ respectively.

At each step, the recursive propagator takes as input the previous iteration result $X^{t}$ and produces a new estimation $X^{t+1}$ employing a weighted summation of depth diffused from neighboring pixels and the current depth values. The weights are determined by the learned edge-aware weight maps $W$. In regions near boundaries, the learned weights $W_{i,j}$ are large to avoid blurring and preserve sharp results. On the other hand, the learned weights are small in non-boundary regions, which removes noisy predictions. The edge-aware weights enable us to separate computationally intensive two-dimensional propagation into four one-dimensional propagations, which is more efficient without sacrificing the quality [35], [36]. The above propagation procedure is executed T times to incorporate long-range dependencies. We use T = 3 in our experiments.

3.6 Iterative Inference and Training Details

Iterative inference. GeoNet++ can be applied iteratively to further improve the results as shown in Fig. 3. The refined depth and

Here λ and η are hyperparameters which balance the contribution of individual terms. We will explain their values in the following section.

4 EXPERIMENTAL SETUP
4.1 Backbone Networks

We validate the effectiveness of our proposed GeoNet++ on top of several baseline architectures.

Baseline network. In most experiments, we utilize a modified VGG-16 [37], i.e., deeplab-LargeFOV [38] with dilated convolution [38] and global pooling [39], [40] for initial depth and surface normal prediction. This is our baseline backbone network for comparison with VGG-based methods. It is also adopted for ablation studies. We utilize this baseline network to produce initial depth and/or surface normal in the experiments if not specified.

State-of-the-art approaches. To further evaluate the effectiveness of the system, we also adopt state-of-the-art methods to produce the initial prediction of depth and surface normal. We experiment with Multi-scale CNN V1 [5], Multi-scale CNN V2 [6], FCRN [10], Multi-scale CRF [12], and DORN [18] for initial depth estimation. For initial normal estimation, we employ the initial normal map from SkipNet [15].
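The edge-aware recursive propagation of Eq. (8) can be illustrated with a minimal NumPy sketch. It is not the released TensorFlow implementation: the function names `propagate_once` and `refine`, the row/column indexing, the choice to leave the first pixel of each pass unchanged, and the layout of the weight tensor (one channel per direction, ordered L→R, R→L, T→B, B→T) are assumptions made for illustration. Each pass blends the value already propagated from the preceding neighbor with the output of the previous pass.

```python
import numpy as np


def propagate_once(x, w):
    """Run the four directional passes of one propagation step.

    x: (H, W) array, the current estimate X^t.
    w: (H, W, 4) array of edge-aware weights in [0, 1]; channel k is used by
       pass k (0: L->R, 1: R->L, 2: T->B, 3: B->T). This layout is assumed.
    """
    s = np.array(x, dtype=np.float64)  # copy, so the input is not modified
    height, width = s.shape
    # L->R: blend the already-updated left neighbor with the pass input.
    for j in range(1, width):
        s[:, j] = (1.0 - w[:, j, 0]) * s[:, j - 1] + w[:, j, 0] * s[:, j]
    # R->L: same recursion, scanning from the right border.
    for j in range(width - 2, -1, -1):
        s[:, j] = (1.0 - w[:, j, 1]) * s[:, j + 1] + w[:, j, 1] * s[:, j]
    # T->B: scan rows from the top border.
    for i in range(1, height):
        s[i, :] = (1.0 - w[i, :, 2]) * s[i - 1, :] + w[i, :, 2] * s[i, :]
    # B->T: scan rows from the bottom border; the result is X^(t+1).
    for i in range(height - 2, -1, -1):
        s[i, :] = (1.0 - w[i, :, 3]) * s[i + 1, :] + w[i, :, 3] * s[i, :]
    return s


def refine(x, w, steps=3):
    """Apply the propagation step `steps` times (T = 3 in the paper)."""
    out = np.array(x, dtype=np.float64)
    for _ in range(steps):
        out = propagate_once(out, w)
    return out
```

With all weights equal to 1, `refine` returns its input unchanged, which corresponds to the behavior described for boundary regions; smaller weights diffuse values along each scan line.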

TABLE 1
Performance of surface normal prediction on NYUD-V2 test set. “Baseline” refers to using VGG-16 network with global pooling to directly predict surface normal from raw images. “SkipNet [15] + GeoNet++” means building GeoNet++ on top of the normal result of [15]. “Baseline + Loss” indicates that we only use a geometry-aware loss function as [34].

                                  Error                   Accuracy
Method                            mean   median  rmse    11.25°  22.5°   30°
3DP [41]                          35.3   31.2    -       16.4    36.6    48.2
3DP (MW) [41]                     36.3   19.2    -       39.2    52.9    57.8
UNFOLD [42]                       35.2   17.9    -       40.5    54.1    58.9
Discr. [43]                       33.5   23.1    -       27.7    49.0    58.7
Multi-scale CNN V2 [6]            23.7   15.5    -       39.2    62.0    71.1
Deep3D [14]                       26.9   14.8    -       42.0    61.2    68.2
SURGE [30]                        20.6   12.2    -       47.3    68.9    76.6
SkipNet [15]                      19.8   12.0    28.2    47.9    70.0    77.8
SkipNet [15] + GeoNet++           19.6   11.6    28.3    48.9    71.2    78.7
Our Baseline                      19.4   12.5    27.0    46.0    70.3    78.9
Our Baseline + GeoNet [19]        19.0   11.8    26.9    48.4    71.5    79.5
Our Baseline + Loss               19.0   11.8    26.9    48.3    71.4    79.5
Our Baseline + GeoNet++           18.5   11.2    26.7    50.2    73.2    80.7

4.2 Datasets

We evaluate the effectiveness of our method on the NYUD-V2 [1] and KITTI [44] datasets.

NYUD-V2 dataset. This dataset contains 464 video sequences of indoor scenes, which are further divided into 249 sequences for training and 215 for testing. We sample 30,816 frames from the training video sequences as the training data. For the training set, we use the in-painting method of [45] to fill in invalid or missing pixels in the ground-truth depth map. We then generate a ground-truth surface normal map following the procedure of [14].

KITTI dataset. This dataset captures various scenes for autonomous driving. We follow the setting of [6] and use 22,600 images from 32 scenes for training, and 697 images from the other 29 scenes for testing. For the KITTI dataset, we utilize Multi-scale CNN V1 [5] and DORN [18] with author-released models to produce initial depth. We generate the ground-truth normals using the same procedure as in the NYUD-V2 dataset with the provided LiDAR depth.

4.3 Implementation Details

Our GeoNet++ is implemented in TensorFlow v1.2 [46]. For our VGG baseline network, we initialize the two-stream CNNs with networks pre-trained on ImageNet. Other baseline approaches are initialized with their corresponding pre-trained models, which are fixed in the procedure of fine-tuning GeoNet++. We use Adam [47] to optimize the network, and the norm of the gradients is clipped so that it is no larger than 5. The initial learning rate is 1e−4. It is adjusted following the polynomial decay strategy with power parameter 0.9. Random horizontal flip is utilized for augmentation. While flipping images, we multiply the x-direction of surface normal maps with −1.

The whole system is trained with batch-size 1 for 40K iterations on the NYUD-V2 dataset and 80K iterations on the KITTI dataset. Hyperparameters {α, β, γ, λ, η} are set to {0.95, 9, 0.05, 0.01, 0.5} according to validation on 5% randomly split training data. λ is set to a small value due to numerical instability when computing the matrix inverse in the least square module: the gradient of Eq. (4) needs to compute the inverse of matrix $A^{T}A$, which might be inaccurate if the condition number is large. Setting λ = 0.01 mitigates this effect.

4.4 2D Pixel-wise Metrics

Following [5], [10], [12], we adopt four metrics to evaluate the resulting depth map quantitatively. They are root mean square error (rmse), mean $\log_{10}$ error (log 10), mean relative error (rel), and pixel accuracy as percentage of pixels with $\max(z_i / z_i^{gt},\ z_i^{gt} / z_i) < \delta$ for $\delta \in [1.25, 1.25^{2}, 1.25^{3}]$. The evaluation metrics for surface normal prediction [14], [15], [6] are mean of angle error (mean), median of angle error (median), root mean square error (rmse), and pixel accuracy as percentage of pixels with angle error below threshold t where t ∈ [11.25°, 22.5°, 30°].

5 COMPARISON REGARDING 2D METRICS

In this section, we compare our GeoNet++ with existing methods in terms of depth and/or surface normal prediction regarding 2D pixel-wise metrics on NYUD-V2 [1] and KITTI [2] datasets.

5.1 Experiments on NYUD-V2 Dataset

Surface normal prediction. As shown in Tab. 1, our GeoNet consistently outperforms previous approaches regarding all metrics. GeoNet++ further improves the results by incorporating the edge-aware refinement, ensemble modules, and the iterative inference strategy. Since we use the same backbone network architecture VGG-16, the improvement stems from our depth-to-normal network. Our explicit formulation is also more effective than implicitly incorporating the constraints via a loss function similar to [34]. Moreover, while taking the results from [15] and our own baseline depth network as the initial normal and depth prediction respectively, our GeoNet++ improves the surface normal produced by [15]. In particular, the model is more effective in lower threshold regimes of the metric, which are more challenging.

Depth prediction. In the task of depth prediction, we adopt VGG-16 which is most commonly adopted by state-of-the-art methods. As shown in Tab. 2, our GeoNet++ performs better than state-of-the-art approaches regarding all evaluation metrics with the VGG-16 architecture. Furthermore, GeoNet++ outperforms GeoNet due to its edge-aware refinement, ensemble modules, and iterative inference. Among all these methods, SURGE [30] is the only one that shares the same objective, i.e., jointly predicting depth and surface normal. It builds a CRF on top of a VGG-16 network. As shown in Tab. 2, using the same backbone network, our GeoNet++ significantly outperforms SURGE. We argue that this is due to the fact that our model does not impose assumptions on the surface shape and the underlying geometry. Our model also performs favorably compared to modeling the geometric constraint via a loss function similar to [34]. We also test the generalization ability of GeoNet++ by directly taking the depth maps produced by Multi-scale CRF [12], Multi-scale CNN V2 [6], Local Network [48], and DORN [18] as the initial depth, and surface normals produced by our own baseline networks as the initial normal. Even without further fine-tuning, GeoNet++ consistently improves all baseline results as shown in Tab. 2. Note that when FCRN [10] is end-to-end fine-tuned with GeoNet++, the results are further improved.

Visual comparison. Fig. 8 depicts a visual comparison with FCRN [10] and DORN [18] on depth prediction. Our GeoNet++

(a) Images (b) Deep3D [14] (c) MS CNN V2 [6] (d) SkipNet [15] (e) GeoNet [19] (f) Ours (GeoNet++) (g) Ground truth

Fig. 6. Visual comparisons of surface normal predictions using VGG-16 as the backbone architecture.

TABLE 2
Performance of depth prediction on NYUD-V2 test set. “Baseline” means using VGG-16 to directly predict depth from raw images. The backbone
architecture for FCRN [10] and DORN [18] are ResNet-50 and ResNet-101 respectively. The results for DORN [18] are derived by evaluating their
model. * denotes that GeoNet or GeoNet++ with the backbone network are end-to-end finetuned. We do not finetune GeoNet++ with DORN [18]
since the released Caffe code is not compatible with ours. “Baseline + Loss” indicates that we only use a geometry-aware loss function as [34].

                                          Error                   Accuracy
Method                                    rmse    log 10  rel     δ<1.25  δ<1.25²  δ<1.25³
Backbone: AlexNet or non-CNN
DepthTransfer [49]                        1.214   -       0.349   0.447   0.745    0.897
SemanticDepth [23]                        -       -       -       0.542   0.829    0.941
DC-depth [7]                              1.06    0.127   0.335   -       -        -
Global-Depth [50]                         1.04    0.122   0.305   0.525   0.829    0.941
NRF [9]                                   0.744   0.078   0.187   0.801   0.950    0.986
GCL/RCL [51]                              0.802   -       -       0.605   0.890    0.970
CNN + HCRF [8]                            0.907   -       0.215   0.605   0.890    0.970
Backbone: VGG
SURGE [30]                                0.643   -       0.156   0.768   0.951    0.989
FCRN [10]                                 0.790   0.083   0.194   0.629   0.889    0.971
Multi-scale CRF [12]                      0.688   0.073   0.175   0.741   0.934    0.982
Multi-scale CNN V2 [6]                    0.641   -       0.158   0.769   0.950    0.988
Local Network [48]                        0.620   -       0.149   0.806   0.958    0.987
Multi-scale CNN V2 [6] + GeoNet++         0.637   0.067   0.157   0.772   0.951    0.988
Multi-scale CRF [12] + GeoNet++           0.683   0.072   0.173   0.746   0.935    0.983
Local Network [48] + GeoNet++             0.615   0.061   0.147   0.810   0.959    0.987
Our Baseline                              0.626   0.068   0.155   0.768   0.951    0.988
Our Baseline + GeoNet* [19]               0.608   0.065   0.149   0.786   0.956    0.990
Our Baseline + Loss [34]                  0.615   0.065   0.150   0.782   0.954    0.989
Our Baseline + GeoNet++*                  0.600   0.063   0.144   0.791   0.960    0.991
Backbone: ResNet
Multi-scale CRF [12]                      0.586   0.052   0.121   0.811   0.954    0.988
FCRN [10]                                 0.584   0.059   0.136   0.822   0.955    0.971
DORN [18]                                 0.552   0.051   0.115   0.826   0.960    0.985
FCRN [10] + GeoNet++                      0.575   0.058   0.134   0.828   0.957    0.989
FCRN [10] + GeoNet++*                     0.558   0.055   0.129   0.839   0.960    0.990
DORN [18] + GeoNet++                      0.527   0.049   0.113   0.862   0.965    0.989
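For concreteness, the depth metrics of Sec. 4.4 that are reported in Tab. 2 can be sketched in NumPy as below. The threshold accuracy follows the definition given in the text; the formulas used for rmse, log 10, and rel are the usual definitions from the depth-estimation literature and are an assumption here, as are the helper name `depth_metrics` and the convention that pixels with non-positive ground truth are ignored. Predictions are assumed to be strictly positive.

```python
import numpy as np


def depth_metrics(pred, gt):
    """2D pixel-wise depth metrics over pixels with valid ground truth.

    pred, gt: arrays of the same shape holding depth values; pixels where
    gt <= 0 are treated as missing (an assumed convention).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    z, z_gt = pred[valid], gt[valid]
    # Ratio used by the threshold accuracy: max(z / z_gt, z_gt / z).
    ratio = np.maximum(z / z_gt, z_gt / z)
    return {
        "rmse": float(np.sqrt(np.mean((z - z_gt) ** 2))),
        "log10": float(np.mean(np.abs(np.log10(z) - np.log10(z_gt)))),
        "rel": float(np.mean(np.abs(z - z_gt) / z_gt)),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Lower is better for the three error terms, and higher is better for the three threshold accuracies, which are returned as fractions in [0, 1] as in Tab. 2.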

generates more accurate depth maps with regard to the washbasin and small objects on the table in the 2nd and 4th rows respectively. We also show the corresponding surface normal predictions to verify that our GeoNet++ takes advantage of them to improve the depth prediction. We invite the reader to look closely at the wall in the 1st row of the figure. DORN [18] achieves decent depth prediction performance on the NYUD-V2 dataset. However, the visual quality of results still has much room for improvement since they are not piecewise smooth in planar regions. When the produced depth is refined by GeoNet++, the visual quality is significantly improved as illustrated in Fig. 8 (columns e and f). We further compare our normal prediction results with those of other methods, including Deep3D [14], Multi-scale CNN V2 [6], and SkipNet [15] in Fig. 6. GeoNet++ produces results with fine details on, e.g., the chair, washbasin, and wall from the 1st, 2nd, and 3rd rows respectively. More results of joint prediction are shown in Fig. 13. From these figures, it is clear that our model does a better job than previous approaches in terms of geometry estimation.

5.2 Experiments on the KITTI Dataset

We further conduct experiments on the KITTI dataset to verify the effectiveness of our model in outdoor scenes. The KITTI dataset

TABLE 3
Quantitative comparisons of depth predictions on the KITTI dataset. * denotes that we directly evaluate results or model released by respective
authors. “Multi-scale CNN V1* [5] + GeoNet++” indicates that we utilize Multi-scale CNN V1* [5] to produce the initial depth. “DORN [18] + GeoNet++”
represents that we utilize DORN [18] to produce the initial depth.

                                          Error                   Accuracy
Method                                    rmse    log 10  rel     δ<1.25  δ<1.25²  δ<1.25³
Make3D [52]                               8.73    0.361   0.280   0.601   0.820    0.926
Dis-cont Depth [11]                       6.99    0.289   0.217   0.647   0.882    0.961
LRC [29]                                  4.94    0.206   0.114   0.861   0.949    0.976
Semi-depth [53]                           4.62    0.189   0.113   0.862   0.960    0.986
Multi-scale CRF [12]                      4.69    -       0.125   0.816   0.951    0.983
Multi-scale CNN V1* [5]                   8.03    0.337   0.350   0.474   0.827    0.945
DORN [18]*                                4.06    0.175   0.101   0.891   0.965    0.986
Multi-scale CNN V1* [5] + GeoNet++        7.96    0.321   0.341   0.500   0.838    0.948
DORN [18] + GeoNet++                      4.10    0.172   0.094   0.897   0.968    0.986

[Fig. 7 panel labels: RMSE: 0.844, RMSE: 0.860, RMSE: 0.000]

TABLE 4
Performance of surface normal prediction on the KITTI dataset.

                Error                    Accuracy
Method          mean    median  rmse     11.25°  22.5°   30°
Baseline        15.21   8.17    23.76    60.08   78.95   85.23
Ours            14.87   7.79    23.46    61.24   79.52   85.60
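The surface normal metrics of Sec. 4.4, as reported in Tab. 4, can be sketched as follows. This is an illustrative NumPy version rather than the evaluation code used for the tables: the helper name `normal_metrics`, the dictionary keys, and the assumption that every pixel carries a valid, non-zero normal are ours.

```python
import numpy as np


def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Angular error statistics between predicted and ground-truth normals.

    pred, gt: arrays of shape (..., 3); vectors are normalized internally.
    Returns mean/median/rmse of the angle error in degrees and the
    percentage of pixels whose error is below each threshold.
    """
    pred = np.asarray(pred, dtype=np.float64).reshape(-1, 3)
    gt = np.asarray(gt, dtype=np.float64).reshape(-1, 3)
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    # Clip guards arccos against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    out = {
        "mean": float(np.mean(err)),
        "median": float(np.median(err)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }
    for t in thresholds:
        out[f"acc_{t:g}"] = float(100.0 * np.mean(err < t))
    return out
```

The accuracies are returned in percent to match the scale used in the normal tables.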

(a) DORN [18] (b) Ours (c) Ground truth

Fig. 7. The first row shows the depth prediction results with the corresponding root mean square error listed below. The second row shows the corresponding 3D point clouds. The third row shows the estimated surface normal from the corresponding point clouds.

only provides sparse depth annotations. We finetune our GeoNet++ on the training set of KITTI for 80K iterations with batch size 1. The initial learning rate is 1e−4 and is adjusted following a polynomial decay strategy with a power parameter of 0.9. Quantitative comparisons for depth and surface normals are shown in Tables 3 and 4 respectively. Our method outperforms the state-of-the-art. Visual comparisons are shown in Fig. 9. Our GeoNet++ improves boundary prediction results in non-planar regions (see Fig. 9 first row: person and pole regions), and produces smooth results in planar regions (wall and road regions in the 1st row of Fig. 9). The reconstructed normals from GeoNet++ have less noise (2nd row of Fig. 9). Corresponding point cloud visualization (wall regions and persons in the 3rd row of Fig. 9) further validates that GeoNet++ improves 3D reconstruction.

5.3 Running Time Analysis

We test our GeoNet++ on a PC with an Intel i7-6950 CPU and a single TitanX GPU. When using VGG-16 as the backbone network, it takes 0.87 seconds to produce the final depth and normal estimates for an image with size 480 × 640. In comparison, Local Network [48] takes around 24s to predict the depth map of the same-sized image; Multi-scale CRF [54] takes around 2.25s to process an image; SURGE [30]¹ also takes longer since it has to go through four VGG-16 networks and requires multiple mean-field inference steps.

¹ We do not have the exact running time as there is no released code.

6 BENCHMARK DEPTH PREDICTION IN 3D

Previous metrics for depth prediction from a single image only focus on 2D pixel-wise metrics without considering their usefulness in reconstructing real 3D surfaces, which is crucial in real-world applications. The reconstruction of the 3D surface heavily depends on the corresponding 3D point cloud; 2D pixel-wise metrics cannot directly measure the reconstruction quality in 3D. As shown in Fig. 7, depth from DORN [18] (0.844) has slightly lower RMSE than ours (0.860), while the point cloud generated from our prediction is visibly more structured than the one of DORN (Fig. 7 second row). Our normal estimation is also better (Fig. 7 third row). The above examples indicate the necessity of a metric regarding the 3D reconstruction quality.

Here we propose a complementary 3D geometric metric (3DGM) to evaluate “how much the predicted depth helps high-quality 3D surface reconstruction”. We also evaluate state-of-the-art approaches using this 3D geometric metric. We utilize our GeoNet++ to further refine these results to verify that GeoNet++ can generally improve the 3D reconstruction quality. In the following, we elaborate on our proposed 3D geometric metric and show more 3D visual examples.

3D geometric metric. We now introduce the 3D geometric metric (3DGM) for evaluating depth results by measuring their usefulness in reconstructing 3D surfaces. Specifically, to compute the metric, we first cast the predicted depth map into its 3D position. Then, the corresponding surface normals are derived with the provided development kit and further denoised with TV-denoising

(a) Images (b) Ground truth (c) FCRN [10] (d) FCRN [10] + Ours (e) DORN (f) DORN + Ours (g) Ours (normal)

Fig. 8. Visual illustrations on depth prediction. “FCRN [10] + Ours” indicates that we utilize FCRN [10] as the depth backbone in our system.
“DORN [18] + Ours” indicates that DORN [18] is utilized as the depth backbone in our system. Please zoom in to see more details.

Input DORN [18] (Depth) DORN [18] + GeoNet++ (Depth)

Ours (Normal) DORN [18] (Normal) DORN [18] + GeoNet++ (Normal)

GT (Depth) DORN [18] (Point Clouds) DORN [18] + GeoNet++ (Point Clouds)

Fig. 9. “DORN [18] + GeoNet++” represents that we utilize DORN [18] to produce the initial depth. The first row shows the depth prediction results. The
second row (column 1) shows normal map directly predicted by our approach for reference. The second row (columns 2-3) shows the corresponding
normal directly estimated from the generated depth. The third row (column 1) shows the depth ground truth from LIDAR (invalid fields are filled with
method [45] for visualization). The third row (columns 2-3) shows the corresponding point clouds.

following the method of [55]. We finally compare it with the surface normal estimated from the ground truth depth produced by Kinect or LIDAR. We evaluate methods [6], [10], [18], [12] regarding the new 3D metric on both NYUD-V2 and KITTI datasets. Quantitative results on the NYUD-V2 dataset are shown in Tab. 5. GeoNet++ consistently improves the 3D geometric metric by a large margin regardless of the choice of backbone architectures. Similar results on the KITTI dataset are obtained as shown in Tab. 6.

Qualitative visualization in 3D. The 3D qualitative comparisons of predicted depth maps on the NYUD-V2 dataset are shown in Fig. 10. GeoNet++ consistently improves the 3D point cloud reconstruction quality. Planar regions are well preserved and details of small objects are clearer. DORN [18] is a top-performing approach for depth prediction in terms of 2D metrics. However, under the 3D metric, the geometric constraint is not well satisfied, resulting in problematic 3D visualization as well. After refined by

TABLE 5
Quantitative comparisons on the NYUD-V2 dataset in terms of 3D geometric metric (3DGM).

                                      Error                    Accuracy
Method                                mean    median  rmse     11.25°  22.5°   30°
Our Baseline                          42.39   37.61   50.81    12.09   28.97   39.68
Our Baseline + GeoNet++               35.78   30.10   43.86    16.07   37.28   49.84
Multi-scale CNN [6]                   34.52   26.20   44.32    20.35   43.84   55.58
Multi-scale CNN [6] + GeoNet++        29.61   21.54   39.14    26.61   51.71   63.11
Multi-scale CRF [12]                  35.47   28.63   44.47    19.17   40.47   51.93
Multi-scale CRF [12] + GeoNet++       31.79   25.30   40.12    21.65   45.14   57.28
FCRN [10]                             30.32   21.74   40.28    28.46   51.26   61.61
FCRN [10] + GeoNet++                  28.27   19.88   38.19    30.06   54.67   65.46
DORN [18]                             32.94   25.49   42.94    25.61   45.55   56.10
DORN [18] + GeoNet++                  29.39   20.74   39.88    30.08   52.95   63.45
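The 3DGM pipeline behind Tab. 5 (cast depth to 3D, derive normals, compare them with normals derived from ground-truth depth) can be illustrated with the following NumPy sketch. It rests on several assumptions that are not taken from the paper: a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, normals computed from cross products of central differences of the back-projected points (standing in for the development kit and the TV-denoising of [55]), and dense, valid depth everywhere. The function names are illustrative.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Cast a depth map into an (H, W, 3) map of 3D points (pinhole model)."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def normals_from_points(points):
    """Unit normals at interior pixels from two central-difference tangents."""
    du = points[1:-1, 2:] - points[1:-1, :-2]  # tangent along image columns
    dv = points[2:, 1:-1] - points[:-2, 1:-1]  # tangent along image rows
    n = np.cross(du, dv)
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12)


def geometric_metric(pred_depth, gt_depth, intrinsics,
                     thresholds=(11.25, 22.5, 30.0)):
    """Angle statistics between normals derived from predicted and GT depth."""
    n_pred = normals_from_points(backproject(pred_depth, *intrinsics))
    n_gt = normals_from_points(backproject(gt_depth, *intrinsics))
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    out = {
        "mean": float(np.mean(err)),
        "median": float(np.median(err)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }
    for t in thresholds:
        out[f"acc_{t:g}"] = float(100.0 * np.mean(err < t))
    return out
```

Because the normals are derived from the depth map itself, a prediction with low pixel-wise error but noisy local structure is penalized, which is the behavior the metric is designed to expose. Border pixels are dropped by the central differences in this sketch.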

TABLE 6
Quantitative comparisons on the KITTI dataset in terms of 3DGM.

                                      Error                    Accuracy
Method                                mean    median  rmse     11.25°  22.5°   30°
Multi-scale CNN [6]                   38.00   29.44   47.00    11.42   39.53   50.74
Multi-scale CNN [6] + GeoNet++        36.63   28.25   45.29    11.18   40.97   52.39
DORN [18]                             22.49   13.00   33.25    45.53   66.65   74.71
DORN [18] + GeoNet++                  21.89   12.31   32.59    46.84   68.99   76.73

TABLE 7
Ablation studies on surface normal estimation on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.

                      Error                    Accuracy
Method                mean    median  rmse     11.25°  22.5°   30°
Baseline              19.4    12.5    27.0     46.0    70.3    78.9
With GeoNet [19]      19.0    11.8    26.9     48.4    71.5    79.5
With Ensemble         18.9    11.8    26.9     48.3    71.9    79.8
With Edge-aware       18.6    11.3    26.7     50.0    72.7    80.4
Full Model            18.5    11.2    25.7     50.2    73.2    80.7
Canny (Low)           18.8    11.4    26.9     49.4    72.2    80.0
Canny (Mid)           18.6    11.3    26.7     50.0    72.7    80.4
Canny (High)          18.7    11.4    26.8     50.0    72.7    80.4

TABLE 8
Ablation studies on depth prediction on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.

                      Error                    Accuracy
Method                rmse    log 10  rel      δ<1.25  δ<1.25²  δ<1.25³
Our Baseline          0.626   0.068   0.155    0.768   0.951    0.988
With GeoNet [19]      0.608   0.065   0.149    0.786   0.956    0.990
With Ensemble         0.605   0.064   0.147    0.789   0.957    0.990
With Edge-aware       0.605   0.064   0.146    0.789   0.958    0.990
Full Model            0.600   0.063   0.144    0.791   0.960    0.991
Canny (Low)           0.606   0.064   0.147    0.789   0.957    0.990
Canny (Mid)           0.605   0.064   0.146    0.789   0.958    0.990
Canny (High)          0.605   0.064   0.147    0.790   0.957    0.990

TABLE 9
Ablation studies regarding 3DGM on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.

                      Error                    Accuracy
Method                mean    median  rmse     11.25°  22.5°   30°
Our Baseline          42.39   37.61   50.81    12.09   28.97   39.68
With GeoNet [19]      35.02   29.12   43.33    17.60   39.04   51.36
With Ensemble         35.16   28.78   43.82    17.95   39.58   51.87
With Edge-aware       34.96   29.14   43.09    17.05   38.87   51.37
Full Model            33.24   26.28   42.24    19.60   43.19   56.09
GeoNet++, the 3D quality has been significantly improved. This further validates that 2D metrics are insufficient to fully measure the depth quality, and GeoNet++ smooths prediction in planar regions considering geometric constraints, and refines boundaries with the weighted propagation.

7 ABLATION STUDIES

We evaluate the effectiveness of each component of GeoNet++ both quantitatively and qualitatively on the NYUD-V2 dataset. Tab. 8 shows the effectiveness of different components, including depth-to-normal, normal-to-depth, ensemble module, and edge-aware refinement module in terms of 2D pixel-wise metrics. Tab. 7 shows the influence of different components for surface normal prediction. The ensemble module improves the performance via fusing predictions from the geometric module and the backbone network. The edge-aware refinement module improves the output by reducing the noise and making the boundary predictions more accurate. Quantitative results in terms of 3DGM are shown in Tab. 9. Visual comparisons are given in Fig. 12. As can be seen, GeoNet, including depth-to-normal and normal-to-depth modules, smooths the prediction in planar regions (the wall region in Fig. 12 (c)) and meanwhile preserves the details of small objects (the pillow and counter in Fig. 12 (c)). The ensemble module further enhances the result by combining the initial and the geometric predictions, making it closer to ground truth (Fig. 12 (d)). The edge-aware module refines the boundary (bed and counter in Fig. 12 (e)). For the Canny edge detection, we use the function in OpenCV to compute the edge maps. The “minValue” threshold is set to be the mean of pixel intensities of the dataset (around 100) and the “maxValue” is two times the mean value (around 200). Here, we validate the robustness of our pipeline to the parameters of the Canny edge detector by changing the thresholds to be extremely low values (i.e., “minValue”: 0 and “maxValue”: 0) and extremely high values (i.e., “minValue”: 255 and “maxValue”: 255). “Mid” corresponds to the values used in our experiments (i.e., “minValue”: 100 and “maxValue”: 200). Experimental results in Tab. 8 and Tab. 7 show that the performance is robust to parameters of the Canny edge detector thanks to the learnable residual module. We also show visual comparisons in Fig. 11. Extremely low edge thresholds will lead to noisy reconstructions (see Fig. 11 white points). With high thresholds, the result will have more flying pixels in the boundary regions compared to our settings. Our experimental observation indicates that the visual quality is generally stable with a large range of parameters around the “Mid” setting.

8 CNNs AND GEOMETRIC CONSTRAINTS

In this section, we verify our motivation by testing if CNNs can directly learn the mapping from depth to surface normal, i.e., implicitly learn the geometric constraints, so that the generated depth naturally produces high-quality surface reconstructions. To this end, we train CNNs, which take the ground-truth depth and surface normal maps as inputs and supervision respectively. We tried architectures including the first 4 layers, the first 7 layers, and the full version of VGG-16 network. Before feeding to the above

TABLE 10
Performance evaluation of depth-to-normal on the NYUD-V2 test set. VGG stands for the VGG-16 network. “LS” means our least square module. “D-N” is our depth-to-normal network without the last 1 × 1 convolution layer. Ground-truth depth maps are used as input.

              Error                    Accuracy
Method        mean    median  rmse     11.25°  22.5°   30°
4-layer       39.5    37.6    44.0     6.1     21.4    35.5
7-layer       39.8    38.2    44.3     6.5     21.0    34.2
VGG           47.8    47.3    52.1     2.8     11.8    20.7
LS            11.5    6.4     18.8     70.0    86.7    91.3
D-N           8.2     3.0     15.5     80.0    90.3    93.5

(a) Input (b) DORN [18] (c) DORN [18] + GeoNet++ (d) FCRN [10] (e) FCRN [10] + GeoNet++ (f) GT

Fig. 10. Visual comparison of 3D point clouds on the NYUD-V2 dataset. “FCRN [10] + GeoNet++” indicates that we utilize FCRN [10] as the depth
backbone in our system. “DORN [18] + GeoNet++” indicates that DORN [18] serves as the depth backbone for our system.

(a) Image (b) Point Cloud (Low) (c) Edge (Low) (d) Point Cloud (Mid) (e) Edge (Mid) (f) Point Cloud (High) (g) Edge (High)

Fig. 11. Comparisons of different canny parameters for surface reconstruction. Zoom in to see clearer.

(a) Input & GT (b) Baseline (c) With GeoNet (d) With Ensemble (e) With Edge-aware (f) Full Model

Fig. 12. Qualitative comparisons for ablation studies. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all modules with iterative
inference. First row: depth prediction results; Second row: surface normal generated from predicted depth in the first row; Third row: point cloud
visualization; Fourth row: surface normal directly predicted from the image.

(a) Image (b) GT (D) (c) VGG (D) (d) Loss (e) Ours (D) (f) GT (N) (g) VGG (N) (h) Loss (N) (i) Ours (N)

Fig. 13. Visual comparisons on joint prediction of depth and surface normal with VGG-16 as the backbone architecture. GT stands for “ground truth”.
“(D)” and “(N)” represent depth and surface normal respectively. Loss indicates that the geometric constraint is only adopted in constructing the loss.

networks, the depth map is transformed into a 3-channel image encoding {x, y, z} coordinates respectively.

We provide the test performance on the NYUD-V2 dataset in Tab. 10. All variants of CNNs converge to very poor local minima. We also show the test performance of the surface normal predicted by our depth-to-normal network. In particular, since the depth-to-normal module contains least-square and residual modules, we also show the surface normal map predicted by the least square module only, denoted as “LS”. Tab. 10 reveals that the “LS” module alone is significantly better than the vanilla CNN baselines in all metrics. Moreover, with the residual module, the performance of our module gets further boosted.

These preliminary experiments lead to the following important findings:

1) Learning a mapping from depth to normal directly via vanilla CNNs hardly respects the underlying geometric relation.
2) Despite its simplicity, the least square module is very effective in incorporating geometric constraints into neural networks, thus achieving better performance.
3) Our depth-to-normal network further improves the quality compared to the least-square module alone.

8.1 Geometric Consistency

We verify if the depth and surface normal maps predicted by our geometric model, i.e., GeoNet, are consistent. To this end, we first train a standard depth-to-normal module using ground-truth depth and surface normal maps and regard it as an accurate transformation. Given the predicted depth map, we compute the transformed surface normal map using this trained standard module.

With these preparations, we conduct experiments under the following 4 settings. (1) Comparison between transformed normal and predicted normal both generated by the baseline network. (2) Comparison between transformed normal and predicted normal both generated by our GeoNet. (3) Comparison between transformed normal generated by the baseline network and the ground-truth normal. (4) Comparison between transformed normal generated by GeoNet and the ground-truth normal. Here we also use the VGG-16 network as the backbone network.

These results are shown in Tab. 11. The “Pred” rows of the table show that our GeoNet++ can generate predictions of depth and surface normal which are more consistent than those of the baseline CNN. From the “GT” rows of the table, it is also clear that, compared to the baseline CNN, the predictions yielded from our GeoNet++ are consistently closer to the ground-truth.

TABLE 11
Depth-to-normal consistency evaluation on the NYUD-V2 test set. “Pred” means that we transform predicted depth to surface normal and compare it with the predicted surface normal. “GT” means that we transform predicted depth to surface normal and compare it with the ground-truth surface normal. “Baseline” and “GeoNet” indicate that predictions are from baseline and our model respectively. The backbone network of our baseline is VGG-16.

                  Error                    Accuracy
Method            mean    median  rmse     11.25°  22.5°   30°
Pred-Baseline     42.2    39.8    48.9     9.8     25.2    35.9
Pred-GeoNet       34.9    31.4    41.4     15.3    35.0    47.7
GT-Baseline       47.8    47.3    52.1     2.8     11.8    20.7
GT-GeoNet         36.8    32.1    44.5     15.0    34.5    46.7

9 CONCLUSION

We have proposed Geometric Neural Network with Edge-Aware Refinement (GeoNet++) to jointly predict depth and surface normal from a single image. Our GeoNet++ involves depth-to-normal and normal-to-depth modules. It effectively enforces the geometric constraints that the prediction should obey regarding depth and surface normal. They make the final prediction geometrically consistent and more accurate. The ensemble network slightly adjusts the results by fusing geometric refined predictions and initial predictions from backbone networks. The edge-aware refinement network updates predictions in the planar and boundary regions. The iterative inference is finally adopted to improve the prediction. Our extensive experiments show that GeoNet++ achieves state-of-the-art results in terms of both 2D metrics and a newly proposed 3D geometric metric. In the future, we would like to apply our GeoNet++ to geometric estimation tasks, such as 3D reconstruction, stereo matching, and SLAM.

ACKNOWLEDGEMENTS

The work was supported in part by HKU Start-up Fund, Seed Fund for Basic Research, the ERC grant ERC-2012-AdG321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to thank the Royal Academy of Engineering and FiveAI. RL was supported by Connaught International Scholarship and RBC Fellowship.

REFERENCES

[1] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 746–760.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” Int. J. Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[3] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in Proc. Advances Neural Inf. Process. Syst., 2006, pp. 1161–1168.
[4] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 1253–1260.
[5] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 2366–2374.
[6] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2650–2658.
[7] M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in Proc. IEEE Int. Conf. Comput. Vis., 2014, pp. 716–723.
[8] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2800–2809.
[9] A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 5506–5514.
[10] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proc. 4th Int. Conf. 3D Vis., 2016, pp. 239–248.
[27] W.-C. Ma, H. Chu, B. Zhou, R. Urtasun, and A. Torralba, “Single image intrinsic decomposition without a single intrinsic image,” in Proc. 15th Eur. Conf. Comput. Vis., 2018, pp. 201–217.
[28] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in Proc. 14th Eur. Conf. Comput. Vis. Springer, 2016, pp. 740–756.
[29] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 270–279.
[30] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille, “Surge: Surface regularized geometry estimation from a single image,” in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 172–180.
[31] D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 675–684.
[32] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint task-recursive learning for semantic segmentation and depth estimation,” in Proc. 15th Eur. Conf. Comput. Vis., 2018, pp. 235–251.
[33] Y. Zhang and T. Funkhouser, “Deep depth completion of a single rgb-d image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 175–185.
[34] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia, “Unsupervised learning of geometry from videos with edge-aware depth-normal consistency,” in Proc. AAAI Conf. Artificial Intell., 2018.
[35] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4545–4554.
[36] E. S. Gastal and M. M. Oliveira, “Domain transform for edge-aware image and video processing,” in ACM Trans. Graph., vol. 30, no. 4. ACM, 2011, p. 69.
[37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
[11] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single large-scale image recognition,” in Proc. Int. Conf. Learn. Representations,
monocular images using deep convolutional neural fields,” IEEE Trans. 2015.
Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2024–2039, 2015.
[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
[12] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale continu- “Semantic image segmentation with deep convolutional nets and fully
ous crfs as sequential deep networks for monocular depth estimation,” in connected crfs,” in Proc. Int. Conf. Learn. Representations, 2015.
Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 5354–5362. [39] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see
[13] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and better,” in Proc. Int. Conf. Learn. Representations, 2016.
surface normal estimation from monocular images using regression on [40] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
deep features and hierarchical crfs,” in Proc. IEEE Conf. Comput. Vis. network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp.
Pattern Recog., 2015, pp. 1119–1127. 2881–2890.
[14] X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for surface [41] D. Fouhey, A. Gupta, and M. Hebert, “Data-driven 3d primitives for single
normal estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., image understanding,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp.
2015, pp. 539–547. 3392–3399.
[15] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2d-3d alignment [42] D. F. Fouhey, A. Gupta, and M. Hebert, “Unfolding an indoor origami
via surface normal prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern world,” in Proc. 13th Eur. Conf. Comput. Vis. Springer, 2014, pp.
Recog., 2016, pp. 5965–5974. 687–702.
[16] A. Bansal, X. Chen, B. Russell, A. G. Ramanan et al., “Pixelnet: [43] B. Zeisl, M. Pollefeys et al., “Discriminatively trained dense surface
Representation of the pixels, by the pixels, and for the pixels,” arXiv normal estimation,” in Proc. 13th Eur. Conf. Comput. Vis. Springer,
e-prints, 2017. 2014, pp. 468–484.
[17] J. T. Barron and J. Malik, “Shape, illumination, and reflectance from [44] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger,
shading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 8, pp. “Sparsity invariant cnns,” in Proc. Int. Conf. 3D Vis. IEEE, 2017, pp.
1670–1687, 2015. 11–20.
[18] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal [45] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,”
regression network for monocular depth estimation,” in Proc. IEEE Conf. ACM Trans. Graph., vol. 23, no. 3, pp. 689–694, 2004.
Comput. Vis. Pattern Recog., 2018, pp. 2002–2011. [46] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
[19] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for
network for joint depth and surface normal estimation,” in Proc. IEEE large-scale machine learning,” in 12th Sympos. Operat. Sys. Design and
Conf. Comput. Vis. Pattern Recog., 2018, pp. 283–291. Implement.), 2016, pp. 265–283.
[20] A. Torralba and A. Oliva, “Depth estimation from image structure,” IEEE [47] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in
Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1226–1238, 2002. Proc. Int. Conf. Learn. Representations, 2015.
[21] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from [48] A. Chakrabarti, J. Shao, and G. Shakhnarovich, “Depth from a single
an image,” Int. J. Comput. Vision, vol. 75, no. 1, pp. 151–172, 2007. image by harmonizing overcomplete local network predictions,” in Proc.
[22] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun, “Box in the box: Advances Neural Inf. Process. Syst., 2016, pp. 2658–2666.
Joint 3d layout and object reasoning from single images,” in Proc. IEEE [49] K. Karsch, C. Liu, and S. B. Kang, “Depth extraction from video using
Int. Conf. Comput. Vis., 2013, pp. 353–360. non-parametric sampling,” in Proc. 12th Eur. Conf. Comput. Vis. Springer,
[23] L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” 2012, pp. 775–788.
in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 89–96. [50] W. Zhuo, M. Salzmann, X. He, and M. Liu, “Indoor scene structure
[24] J. Shi, X. Tao, L. Xu, and J. Jia, “Break ames room illusion: depth from analysis for single image depth estimation,” in Proc. IEEE Conf. Comput.
general single images,” ACM Trans. Graph., vol. 34, no. 6, p. 225, 2015. Vis. Pattern Recog., 2015, pp. 614–622.
[25] P. Favaro and S. Soatto, “A geometric approach to shape from defocus,” [51] M. H. Baig and L. Torresani, “Coupled depth learning,” in Proc. IEEE
IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 406–417, 2005. Winter Conf. Appl. Comput. Vis., 2016, pp. 1–10.
[26] E. Shelhamer, J. T. Barron, and T. Darrell, “Scene intrinsics and depth [52] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure
from a single image,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, from a single still image,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31,
2015, pp. 37–44. no. 5, pp. 824–840, 2009.
[53] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6647–6655.
[54] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Monocular depth estimation using multi-scale continuous crfs as sequential deep networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 6, pp. 1426–1440, 2018.
[55] B. Z. Ladicky, Lubor and M. Pollefeys, “Discriminatively trained dense surface normal estimation,” in Proc. 13th Eur. Conf. Comput. Vis., 2014, p. 4.

Jiaya Jia received his Ph.D. degree in Computer Science from Hong Kong University of Science and Technology in 2004. From March 2003 to August 2004, he was a visiting scholar at Microsoft. Then, he joined the Department of Computer Science and Engineering at The Chinese University of Hong Kong (CUHK) in 2004 as an assistant professor and was promoted to Associate Professor in 2010. He conducted collaborative research at Adobe Research in 2007. He was promoted to Professor in 2015. He was the Distinguished Scientist and Founding Executive Director of Tencent YouTu X-Lab. He is an IEEE Fellow.

Xiaojuan Qi received her bachelor's degree in Electronic Science and Technology from Shanghai Jiao Tong University (SJTU) in 2014, and the Ph.D. degree in Computer Science and Engineering from the Chinese University of Hong Kong in 2018. She was a postdoc at the University of Oxford. She is now an assistant professor at the University of Hong Kong.

Zhengzhe Liu received his bachelor's degree in Information Engineering from Shanghai Jiao Tong University (SJTU) in 2014, and the MPhil degree in Computer Science and Engineering from the Chinese University of Hong Kong in 2017.

Renjie Liao received his bachelor's degree from the School of Automation Science and Electrical Engineering at Beihang University (formerly Beijing University of Aeronautics and Astronautics), and an MPhil degree in Computer Science and Engineering from the Chinese University of Hong Kong. He is now a Ph.D. student in the Department of Computer Science, University of Toronto.

Philip H.S. Torr received his Ph.D. degree from Oxford University. After working for another three years at Oxford as a research fellow, he worked for six years in Microsoft Research, first in Redmond, then in Cambridge, founding the vision side of the Machine Learning and Perception Group. He then became a Professor in Computer Vision and Machine Learning at Oxford Brookes University. He is now a professor at Oxford University. He is a Fellow of the Royal Academy of Engineering and an ELLIS Fellow.

Raquel Urtasun received her Ph.D. degree from the Computer Science department at École Polytechnique Fédérale de Lausanne (EPFL) in 2006 and did her postdoc at MIT and UC Berkeley. She was also a visiting professor at ETH Zurich during the spring semester of 2010. Then, she was an Assistant Professor at the Toyota Technological Institute at Chicago (TTIC). She is currently Uber ATG Chief Scientist and the Head of Uber ATG Toronto. She is also an Associate Professor in the Department of Computer Science at the University of Toronto, a Canada Research Chair in Machine Learning and Computer Vision, and a co-founder of the Vector Institute for AI.
