GeoNet++: Enhanced Depth & Normal Estimation
arXiv:2012.06980v1 [[Link]] 13 Dec 2020
Abstract—In this paper, we propose a geometric neural network with edge-aware refinement (GeoNet++) to jointly predict both depth
and surface normal maps from a single image. Building on top of two-stream CNNs, GeoNet++ captures the geometric relationships
between depth and surface normals with the proposed depth-to-normal and normal-to-depth modules. In particular, the “depth-to-normal”
module exploits the least square solution of estimating surface normals from depth to improve their quality, while the “normal-to-depth”
module refines the depth map based on the constraints on surface normals through kernel regression. Boundary information is exploited
via an edge-aware refinement module. GeoNet++ effectively predicts depth and surface normals with strong 3D consistency and sharp
boundaries resulting in better reconstructed 3D scenes. Note that GeoNet++ is generic and can be used in other depth/normal prediction
frameworks to improve the quality of 3D reconstruction and pixel-wise accuracy of depth and surface normals. Furthermore, we propose
a new 3D geometric metric (3DGM) for evaluating depth prediction in 3D. In contrast to current metrics that focus on evaluating pixel-wise
error/accuracy, 3DGM measures whether the predicted depth can reconstruct high-quality 3D surface normals. This is a more natural
metric for many 3D application domains. Our experiments on NYUD-V2 [1] and KITTI [2] datasets verify that GeoNet++ produces fine
boundary details, and the predicted depth can be used to reconstruct high-quality 3D surfaces. Code has been made publicly available.
Index Terms—Depth estimation, surface normal estimation, 3D point cloud, 3D geometric consistency, 3D reconstruction, edge-aware,
convolutional neural network (CNN), geometric neural network.
1 INTRODUCTION
Fig. 2. Visual illustrations of depth/normal maps and the reconstructed 3D point cloud. (a) is the input image. (b) is the depth map from a state-of-the-art
approach DORN [18]. (c) is the surface normal derived from (b). (d) is the corresponding point cloud visualization of (b). (e) shows the
ground-truth point cloud. (f) is the depth map from our approach. (g) shows the surface normal derived from (f). (h) shows the corresponding point
cloud visualization of (f). The normal maps (DORN [18] and Ours) are computed from the corresponding point cloud shown in (d) and (h) respectively
using the least square fitting provided in [1] followed by TV-denoising.
hyperparameters. Another challenge stems from pooling operations and large receptive fields, which make current architectures perform poorly near object boundaries. We refer the reader to Fig. 2 (b), where the results are blurry in the black bounding box. This phenomenon is amplified when viewing the results in 3D. As shown in Fig. 2 (d), the points inside the red bounding box are scattered in 3D due to the blurry boundaries. This is problematic for robotic applications where obstacle detection and avoidance are needed for safety.
The above facts motivate us to design a new architecture that explicitly incorporates and enforces 3D geometric constraints while considering object boundaries. Towards this goal, we propose the Geometric Neural Network with Edge-Aware Refinement (GeoNet++), which integrates geometric constraints and boundary information into a CNN. This contrasts with previous works [15], [5], [18], which focus on designing new network architectures [5] or loss functions more tailored for the task [18].
The overall system (see Fig. 5) has a two-stream backbone CNN, which predicts initial depth and surface normals from a single image respectively. With initial depth and surface normals, GeoNet++ (see Fig. 3) is utilized to incorporate geometric constraints by modeling depth-to-normal (see Fig. 4 (a) – (L)) and normal-to-depth (see Fig. 4 (b) – (L)) mapping, and to introduce boundary information with edge-aware refinement (see Fig. 4 (c)). Our “depth-to-normal” module relies on least-square and residual sub-modules, while the “normal-to-depth” module updates the depth estimates via kernel regression. Guided by the learned propagation weights, our “edge-aware refinement module” sharpens boundary predictions and smooths out noisy estimations. Our framework enforces the final depth and surface normal prediction to follow the underlying 3D constraints, which directly improves 3D surface reconstruction quality. Note that GeoNet++ can be integrated into other CNN backbones for depth or surface normal prediction. Importantly, the overall system can be trained end-to-end.
Our final contribution is a new geometry-related evaluation metric for depth prediction, which measures the quality of the 3D surface reconstruction. This metric directly measures the local 3D surface quality by casting the predicted depth into 3D point clouds. It is better correlated with the end goals of the 3D tasks.
Experimental results on NYUD-V2 [1] and KITTI [2] datasets show that our GeoNet++ achieves decent performance, while being more efficient.
Difference from our Conference Paper: This manuscript significantly improves the conference version [19]: (i) we introduce an edge-aware propagation network to improve the prediction at boundaries, facilitating the generation of better point clouds; (ii) we develop an iterative scheme to progressively improve the quality of predicted depth and surface normals; (iii) we propose a new evaluation metric to measure 3D surface reconstruction accuracy; (iv) we conduct additional experiments and analysis on the KITTI [2] dataset; (v) we empirically show that GeoNet++ can be incorporated into previous methods [5], [12], [18] to further improve the results, especially the 3D reconstruction quality; (vi) from both qualitative and quantitative perspectives, our results are significantly better compared to [19], especially in 3D metrics.
The rest of the paper is organized as follows. Sec. 2 reviews the literature on depth and surface normal prediction. In Sec. 3, we elaborate on our GeoNet++ model. We conduct experiments and show more detailed analysis in Sections 4 – 8. We draw our conclusion in Sec. 9.
2 RELATED WORK
The 2.5D geometry estimation from a single image has been intensively studied. Previous works can be roughly divided into two categories based on whether deep neural networks have been used.
Traditional methods do not use deep neural networks and mainly focus on exploiting low-level image cues and geometric
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
constraints. For example, [20] estimates the mean depth of the scene by recognizing the structures presented in the image, and inferring the scene scale. Based on Markov random fields (MRF), Saxena et al. [3] predicted a depth map given hand-crafted features of a single image. Vanishing points and lines are utilized in [21] for recovering the surface layout [22]. Liu et al. [4] leveraged predicted semantic segmentation to incorporate geometric constraints. A scale-dependent classifier was proposed in [23] to jointly learn semantic segmentation and depth estimation. Shi et al. [24] showed that estimating defocus blur is beneficial for recovering the depth map. Favaro et al. [25] proposed to learn a set of projection operators from blurred images, which are further utilized to estimate the 3D geometry of the scene from novel blurred images. In [17], a unified optimization problem was formed aiming at recovering the intrinsic scene properties, e.g., shape, illumination, and reflectance from shading. Relying on specially designed features, the above methods directly incorporate geometric constraints.
Many deep learning methods were recently proposed for single-image depth and/or surface normal prediction. Eigen et al. [5] directly predicted the depth map by feeding the image to a CNN. Shelhamer et al. [26] proposed a fully convolutional network (FCN) to learn the intrinsic decomposition of a single image, which involves inferring the depth map as the first intermediate step. Recently, Ma et al. [27] incorporated the physical rule in multi-image intrinsic decomposition for single image intrinsic decomposition. In [6], a unified coarse-to-fine hierarchical network was adopted for depth/normal prediction. Continuous conditional random fields (CRFs) were proposed in [12] to fuse information derived from CNN outputs. Fu et al. [18] introduced an ordinal regression loss to help the optimization process and achieve better performance. In [11], continuous CRFs were built on top of a CNN to smooth super-pixel-based depth prediction. For predicting single-image surface normals, Wang et al. [14] incorporated local, global, and vanishing point information in designing the network architecture. Reconstruction loss has been exploited in [28] for unsupervised depth estimation from a single image. Follow-up work [29] introduced a left-right consistency constraint. A skip-connected architecture has also been proposed in [15] to fuse hidden representations of different layers for surface normal estimation.
All these methods regard depth and surface normal predictions as independent tasks, thus ignoring their basic geometric relationship that also influences the quality of the reconstructed surface. Recently, a few works [30], [31], [32], [8] jointly reason about multiple tasks. Wang et al. [30] designed CRFs to fuse semantic segmentation prediction and depth estimation. Xu et al. [31] proposed a hierarchical framework to first predict depth, surface normal, edge maps, and semantic segmentation, and then fuse them together for the final depth and semantic map prediction. Zhang et al. [32] learned depth prediction and semantic segmentation with recursive refinement to progressively refine the predicted depth and semantic segmentation. All these approaches focus on modifying either CNN architectures or loss functions to make the model better fit the data without explicitly considering the geometric property. In contrast, our approach explicitly incorporates geometric constraints by designing edge-aware geometric modules, which is orthogonal to previous works.
The most related work to ours is that of [30], which has a CRF with a 4-stream CNN, considering the consistency of predicted depth and surface normal in planar regions. Nevertheless, it may fail when planar regions are uncommon in images. Moreover, the iterative inference in CRF and the Monte Carlo sampling strategy make the approach suffer from heavy computational cost. In comparison, our GeoNet++ exploits the geometric relationship between depth and surface normal for general situations without making any planar or curvature assumption. Our model is more efficient compared to the iterative inference of CRF in [30].
Recently, the depth-normal consistency was utilized for depth completion [33] from a single image and unsupervised depth-normal estimation [34] from monocular videos. In this paper, we focus on deploying geometric constraints to improve depth and surface normal estimation from a single image and analyzing its influence on 3D surface reconstruction.
3 GEONET++
In this section, we first introduce the overall architecture of our GeoNet++, and then elaborate on its components.
3.1 Overall Architecture
The overall architecture of GeoNet++ is illustrated in Fig. 3. Based on the initial depth map predicted by the backbone CNNs (Sec. 4.1), we apply the depth-to-normal module (Sec. 3.2) to transfer the initial depth map to the normal map as shown in Fig. 4 (a) – (L). This module refines the surface normals with the initial depth map considering geometric constraints. Similarly, given the initial surface normal estimation, we generate the depth using the normal-to-depth module (Sec. 3.3). This enhances the depth prediction by incorporating the inherent geometric constraints into the estimation of depth from normals. The depth/normal maps generated with the above components are then adjusted via the depth (normal) ensemble module (Sec. 3.4). Furthermore, by removing noisy results and refining boundary predictions, the edge refinement network described in Sec. 3.5 further improves the predictions and generates the refined results as shown in Fig. 3 (d)-(e). Finally, GeoNet++ can be applied iteratively by taking the refined results from the previous iteration as inputs, as described in Sec. 3.6.
3.2 Depth-to-Normal Module
Learning geometrically consistent surface normals from depth via directly applying neural networks is surprisingly hard, as discussed in Sec. 8. To this end, we propose a depth-to-normal transformation module that explicitly incorporates depth-normal consistency into deep neural networks. We start our discussion with the least square module, viewed as a fixed-weight neural network. We then describe the residual sub-module that aims at smoothing and combining the results with initial normals as in Fig. 4 (a) – (L).
Pinhole camera model. As a common practice, we adopt the pinhole camera model. We denote $(u_i, v_i)$ as the location of pixel $i$ in the 2D image. Its corresponding location in 3D space is $(x_i, y_i, z_i)$, where $z_i$ is the depth. Based on the geometry of perspective projection, we have
$$x_i = (u_i - c_x)\, z_i / f_x, \qquad y_i = (v_i - c_y)\, z_i / f_y, \tag{1}$$
where $f_x$ and $f_y$ are the focal lengths along the $x$ and $y$ directions respectively, and $c_x$ and $c_y$ are the coordinates of the principal point.
Least square sub-module. We formulate the inference of surface normals from a depth map as a least-square problem. Specifically, for any pixel $i$, given its depth $z_i$, we first compute its 3D coordinates $(x_i, y_i, z_i)$ from its 2D coordinates $(u_i, v_i)$ relying on
Fig. 3. The overall structure of GeoNet++. GeoNet++ takes as inputs the initial depth estimation (a), the initial normal estimation (b), and an input image
(c). The initial depth and surface normal are first refined with the depth-to-normal and normal-to-depth modules. Then, the depth (normal) ensemble
module is adopted to combine results from the initial estimation. It is followed by the edge refinement module, which reduces noise and refines
boundaries. GeoNet++ can be applied multiple times by iteratively taking the refined normal and depth as inputs.
the pinhole camera model. To compute the surface normal of pixel $i$, we need to determine the tangent plane, which crosses pixel $i$ in 3D space. We follow the traditional assumption that pixels within a local neighborhood lie on the same tangent plane. In particular, we define the set of neighboring pixels, including pixel $i$ itself, as
$$N_i = \{(x_j, y_j, z_j) \mid |u_i - u_j| < \beta,\ |v_i - v_j| < \beta,\ |z_i - z_j| < \gamma z_i\},$$
where $\beta$ and $\gamma$ are hyperparameters controlling the size of the neighborhood along the $x$, $y$, and depth axes respectively. With these pixels on the tangent plane, the surface normal estimate $\mathbf{n} = [n_x, n_y, n_z]$ should satisfy the over-determined linear system of equations
$$A\mathbf{n} = \mathbf{b}, \quad \text{subject to } \|\mathbf{n}\|_2^2 = 1, \tag{2}$$
where
$$A = \begin{bmatrix} x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots \\ x_K & y_K & z_K \end{bmatrix} \in \mathbb{R}^{K \times 3}, \tag{3}$$
and $\mathbf{b} \in \mathbb{R}^{K \times 1}$ is a constant vector. $K$ is the size of $N_i$, i.e., the set of neighboring points. The least square solution of this problem, which minimizes $\|A\mathbf{n} - \mathbf{b}\|^2$, can be computed in closed form as
$$\mathbf{n} = \frac{(A^\top A)^{-1} A^\top \mathbf{1}}{\|(A^\top A)^{-1} A^\top \mathbf{1}\|_2}, \tag{4}$$
where $\mathbf{1} \in \mathbb{R}^{K}$ is a vector with all-one elements. It is not surprising that Eq. (4) can be regarded as a fixed-weight neural network, which predicts surface normals given the depth map.
Residual sub-module. This least-square module occasionally produces noisy surface normal estimation (see Fig. 4 (a): “Rough Normal”) due to issues like noise and improper neighborhood size. To further improve the quality, we propose a residual module, which consists of a 3-layer CNN with skip-connections as shown in Fig. 4 (a). The goal is to smooth the noisy estimation from the least square module.
Overall architecture. The architecture of the depth-to-normal module is illustrated in Fig. 4 (a) – (L). By explicitly leveraging the geometric relationship between depth and surface normals, our network circumvents the aforementioned difficulty in learning geometrically consistent depth and surface normals. Note that the module can be incorporated and jointly fine-tuned with other networks that predict depth maps from raw images.
3.3 Normal-to-Depth Module
Now we turn our attention to the normal-to-depth module. For any pixel $i$, given its surface normal $(n_{ix}, n_{iy}, n_{iz})$ and an initial estimate of depth $z_i$, the goal is to refine its depth.
First, note that given the 3D point $(x_i, y_i, z_i)$ and its surface normal $(n_{ix}, n_{iy}, n_{iz})$, we can uniquely determine the tangent plane $P_i$, which satisfies the following equation
$$n_{ix}(x - x_i) + n_{iy}(y - y_i) + n_{iz}(z - z_i) = 0. \tag{5}$$
As explained in Sec. 3.2, we assume that pixels within a small neighborhood of $i$ lie on this tangent plane $P_i$. This neighborhood $M_i$ is defined as
$$M_i = \{(x_j, y_j, z_j) \mid \mathbf{n}_j^\top \mathbf{n}_i > \alpha,\ |u_i - u_j| < \beta,\ |v_i - v_j| < \beta\},$$
where $\beta$ is a hyperparameter controlling the size of the neighborhood along the $x$ and $y$ axes, $\alpha$ is a threshold to rule out spatially close points that are not approximately coplanar, and $(u_i, v_i)$ are the coordinates of pixel $i$ in the 2D image.
For any pixel $j \in M_i$, if the depth $z_j$ is given, we can compute the depth estimate of pixel $i$ as $z'_{ji}$ relying on Eqs. (1) and (5) as
$$z'_{ji} = \frac{n_{jx} x_j + n_{jy} y_j + n_{jz} z_j}{(u_i - c_x)\, n_{jx} / f_x + (v_i - c_y)\, n_{jy} / f_y + n_{jz}}. \tag{6}$$
To refine the depth of pixel $i$, we then use kernel regression to aggregate the estimation from all pixels in the neighborhood as
$$\hat{z}_i = \frac{\sum_{j \in M_i} K(\mathbf{n}_j, \mathbf{n}_i)\, z'_{ji}}{\sum_{j \in M_i} K(\mathbf{n}_j, \mathbf{n}_i)}, \tag{7}$$
where $\hat{z}_i$ is the refined depth, $\mathbf{n}_i = [n_{ix}, n_{iy}, n_{iz}]$, and $K$ is the kernel function. We use linear kernels (i.e., cosine similarity) to measure the similarity between $\mathbf{n}_i$ and $\mathbf{n}_j$, i.e., $K(\mathbf{n}_j, \mathbf{n}_i) = \mathbf{n}_j^\top \mathbf{n}_i$. In this case, the smaller the angle between normals $\mathbf{n}_i$ and $\mathbf{n}_j$ is, which means the higher probability that pixels $i$ and $j$ are
(a) Depth-to-normal (L) and normal ensemble (R) modules. L: left shaded box, R: right shaded box.
(b) Normal-to-depth (L) and depth ensemble (R) modules. L: left shaded box, R: right shaded box.
(c) Edge-aware refinement module for depth. Residual (weight) maps include “left to right”, “right to left”, “top to bottom”, “bottom to top”.
(d) Edge-aware refinement module for surface normal. Residual (weight) maps include “left to right”, “right to left”, “top to bottom”, “bottom to top”.
Fig. 4. GeoNet++ components. (a) The depth-to-normal module (L) first estimates “Rough Normal” from the “Initial Depth” with least square fitting;
normals are then refined by the residual module producing “Geo-refined Normal”; a normal ensemble network (R) is utilized to fuse the initial and
Geo-refined normals generating “Geo-EN-refined normal”. (b) The normal-to-depth module (L) takes the “Initial Depth” and “Initial Normal” as inputs;
the normal map helps propagate the initial depth prediction to neighbors; depth estimates are aggregated by the kernel regression module producing
“Geo-refined Depth”. The depth ensemble module (R) taking “Geo-refined Depth” and “Initial Depth” as inputs further improves prediction generating
“Geo-EN-refined Depth”. (c) The edge-aware refinement module first constructs direction-aware propagation “Weight Maps” by combining low-level
edges with “Residual Maps”; the recursive propagator utilizes the learned weight maps to refine “Geo-EN-refined Depth” producing “Final Depth”; (d)
the edge-aware refinement module for surface normal. Please zoom in to see more details.
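As a rough illustration of the normal-to-depth step in Fig. 4 (b), the following Python sketch evaluates the per-neighbor depth estimate of Eq. (6) and the kernel-weighted average of Eq. (7) for a single pixel. It is only a sketch under stated assumptions, not the released implementation: the function names, the argument layout, the guard against near-zero denominators, and the fallback to the initial depth when no neighbor votes are all hypothetical, and the spatial window controlled by β is assumed to have been applied by the caller.

```python
def depth_vote(u_i, v_i, point_j, normal_j, fx, fy, cx, cy, eps=1e-8):
    """Depth of pixel i implied by the tangent plane at neighbor j (Eq. (6)).

    Returns None when the viewing ray of pixel i is (nearly) parallel to the
    plane; this guard is an assumption added for numerical safety.
    """
    xj, yj, zj = point_j
    njx, njy, njz = normal_j
    numerator = njx * xj + njy * yj + njz * zj
    denominator = (u_i - cx) * njx / fx + (v_i - cy) * njy / fy + njz
    if abs(denominator) < eps:
        return None
    return numerator / denominator


def refine_depth(u_i, v_i, z_init, normal_i, neighbors, alpha, fx, fy, cx, cy):
    """Kernel-weighted average of neighbor votes (Eq. (7)).

    `neighbors` is a list of (point_j, normal_j) pairs already inside the
    spatial window; pairs whose normal similarity is not above `alpha` are
    skipped, mirroring the definition of M_i.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for point_j, normal_j in neighbors:
        # Linear kernel: dot product of unit normals (cosine similarity).
        weight = sum(a * b for a, b in zip(normal_j, normal_i))
        if weight <= alpha:
            continue
        vote = depth_vote(u_i, v_i, point_j, normal_j, fx, fy, cx, cy)
        if vote is None:
            continue
        weighted_sum += weight * vote
        weight_total += weight
    if weight_total == 0.0:
        return z_init  # assumption: keep the initial depth if nobody votes
    return weighted_sum / weight_total
```

For instance, with neighbors sampled from the plane x + z = 3 and unit focal lengths, every vote for the pixel at (u, v) = (0.5, 0) equals the true depth of 2, so the weighted average does as well.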
in the same tangent plane, the more contribution the estimate $z'_{ji}$ makes to the estimate of $\hat{z}_i$.
The above process is illustrated in Fig. 4 (b) – (L). It can be viewed as a voting process where every pixel $j \in M_i$ gives a “vote” to determine the depth of pixel $i$. By utilizing the geometric relationship between surface normal and depth, we efficiently improve the quality of the depth estimate without any weights to learn.
3.4 Depth (Normal) Ensemble Module
To further enhance the prediction quality, the “Initial Depth (Normal)” from the backbone network and the “Geo-refined Depth” from the geometric refinement are combined with the ensemble module illustrated in Fig. 4 (a) – (R) for surface normal and Fig. 4 (b) – (R) for depth. In the following, we detail the depth ensemble module. The normal ensemble module shares a similar architecture.
The depth ensemble module takes as inputs “Initial Depth” from the backbone network and “Geo-refined Depth” (Fig. 4 (b)) from the geometric module, and produces a refined depth, “Geo-EN-refined Depth”, as shown in Fig. 4 (b). To enlarge the receptive field of the ensemble module, the input is first processed with 3 convolution layers with a dilation rate of 2, kernel size 3 × 3, and channel number 128. This is followed by another 2 dilation-free convolution layers with kernel size 3 × 3 and channel number 128.
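To make the effect of the dilated layers concrete, the short Python sketch below computes the receptive field of the stack just described (three 3 × 3 convolutions with dilation 2 followed by two 3 × 3 convolutions without dilation). It assumes stride 1 throughout, which is not stated in the text above, and the helper name is hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in pixels, per axis) of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs; each layer widens
    the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Three dilated 3x3 layers (dilation 2), then two plain 3x3 layers.
ensemble_stack = [(3, 2)] * 3 + [(3, 1)] * 2
print(receptive_field(ensemble_stack))  # 1 + 3*4 + 2*2 = 17
```

Under the same stride-1 assumption, five dilation-free 3 × 3 layers would cover only 11 pixels per axis, which illustrates why dilation is a cheap way to widen the context seen by the ensemble module.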
(a) Images (b) Deep3D [14] (c) MS CNN V2 [6] (d) SkipNet [15] (e) GeoNet [19] (f) Ours (GeoNet++) (g) Ground truth
Fig. 6. Visual comparisons of surface normal predictions using VGG-16 as the backbone architecture.
TABLE 2
Performance of depth prediction on the NYUD-V2 test set. “Baseline” means using VGG-16 to directly predict depth from raw images. The backbone
architectures for FCRN [10] and DORN [18] are ResNet-50 and ResNet-101 respectively. The results for DORN [18] are derived by evaluating their
model. * denotes that GeoNet or GeoNet++ is fine-tuned end-to-end with the backbone network. We do not fine-tune GeoNet++ with DORN [18]
since the released Caffe code is not compatible with ours. “Baseline + Loss” indicates that we only use a geometry-aware loss function as in [34].
generates more accurate depth maps with regard to the washbasin and small objects on the table in the 2nd and 4th rows respectively. We also show the corresponding surface normal predictions to verify that our GeoNet++ takes advantage of them to improve the depth prediction. We refer the reader to look closely at the wall in the 1st row of the figure. DORN [18] achieves decent depth prediction performance on the NYUD-V2 dataset. However, the visual quality of its results still has much room for improvement since they are not piecewise smooth in planar regions. When the produced depth is refined by GeoNet++, the visual quality is significantly improved as illustrated in Fig. 8 (columns e and f). We further compare our normal prediction results with those of other methods, including Deep3D [14], Multi-scale CNN V2 [6], and SkipNet [15] in Fig. 6. GeoNet++ produces results with nice details on, e.g., the chair, washbasin, and wall from the 1st, 2nd, and 3rd rows respectively. More results of joint prediction are shown in Fig. 13. From these figures, it is clear that our model does a better job than previous approaches in terms of geometry estimation.
5.2 Experiments on the KITTI Dataset
We further conduct experiments on the KITTI dataset to verify the effectiveness of our model in outdoor scenes. The KITTI dataset
TABLE 3
Quantitative comparisons of depth predictions on the KITTI dataset. * denotes that we directly evaluate results or model released by respective
authors. “Multi-scale CNN V1* [5] + GeoNet++” indicates that we utilize Multi-scale CNN V1* [5] to produce the initial depth. “DORN [18] + GeoNet++”
represents that we utilize DORN [18] to produce the initial depth.
TABLE 4
Performance of surface normal prediction on the KITTI dataset.
Error Accuracy
mean median rmse 11.25◦ 22.5◦ 30◦
Local Network [48] takes around 24s to predict the depth map of
the same-sized image; Multi-scale CRF [54] takes around 2.25s
to process an image; SURGE [30] also takes longer since
it has to go through four VGG-16 networks and requires multiple
mean-field inference steps.
(a) Images (b) Ground truth (c) FCRN [10] (d) FCRN [10] + Ours (e) DORN (f) DORN + Ours (g) Ours (normal)
Fig. 8. Visual illustrations on depth prediction. “FCRN [10] + Ours” indicates that we utilize FCRN [10] as the depth backbone in our system.
“DORN [18] + Ours” indicates that DORN [18] is utilized as the depth backbone in our system. Please zoom in to see more details.
GT (Depth) DORN [18] (Point Clouds) DORN [18] + GeoNet++ (Point Clouds)
Fig. 9. “DORN [18] + GeoNet++” represents that we utilize DORN [18] to produce the initial depth. The first row shows the depth prediction results. The
second row (column 1) shows normal map directly predicted by our approach for reference. The second row (columns 2-3) shows the corresponding
normal directly estimated from the generated depth. The third row (column 1) shows the depth ground truth from LIDAR (invalid fields are filled with
method [45] for visualization). The third row (columns 2-3) shows the corresponding point clouds.
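Point clouds such as those in Fig. 9 are obtained by back-projecting a depth map into 3D. A minimal Python sketch of this step under the pinhole model of Eq. (1) is given below. The function name and the convention of skipping non-positive depths as invalid are assumptions made for illustration, not details taken from the paper.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map (list of rows) into a list of 3D points.

    Pixel (u, v) with depth z maps to ((u - cx) * z / fx, (v - cy) * z / fy, z),
    following Eq. (1). Entries with z <= 0 are treated as invalid and skipped
    (an assumption of this sketch).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 example with one invalid (zero) depth value.
toy_depth = [[2.0, 0.0],
             [4.0, 2.0]]
cloud = backproject_depth(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because x and y scale with z, a small depth error near an object boundary moves the point along its viewing ray, which is one way to read the scattered points visible in the point cloud visualizations.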
TABLE 6
Quantitative comparisons on the KITTI dataset in terms of 3DGM.
                                   Error                     Accuracy
Method                             mean    median  rmse      11.25°  22.5°   30°
Multi-scale CNN [6]                38.00   29.44   47.00     11.42   39.53   50.74
Multi-scale CNN [6] + GeoNet++     36.63   28.25   45.29     11.18   40.97   52.39
DORN [18]                          22.49   13.00   33.25     45.53   66.65   74.71
DORN [18] + GeoNet++               21.89   12.31   32.59     46.84   68.99   76.73
TABLE 7
Ablation studies on surface normal estimation on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.
                                   Error                     Accuracy
Method                             mean    median  rmse      11.25°  22.5°   30°
Baseline                           19.4    12.5    27.0      46.0    70.3    78.9
With GeoNet [19]                   19.0    11.8    26.9      48.4    71.5    79.5
With Ensemble                      18.9    11.8    26.9      48.3    71.9    79.8
With Edge-aware                    18.6    11.3    26.7      50.0    72.7    80.4
Full Model                         18.5    11.2    25.7      50.2    73.2    80.7
Canny (Low)                        18.8    11.4    26.9      49.4    72.2    80.0
Canny (Mid)                        18.6    11.3    26.7      50.0    72.7    80.4
Canny (High)                       18.7    11.4    26.8      50.0    72.7    80.4
TABLE 8
Ablation studies on depth prediction on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.
                                   Error                     Accuracy
Method                             rmse    log10   rel       1.25    1.25^2  1.25^3
Our Baseline                       0.626   0.068   0.155     0.768   0.951   0.988
With GeoNet [19]                   0.608   0.065   0.149     0.786   0.956   0.990
With Ensemble                      0.605   0.064   0.147     0.789   0.957   0.990
With Edge-aware                    0.605   0.064   0.146     0.789   0.958   0.990
Full Model                         0.600   0.063   0.144     0.791   0.960   0.991
Canny (Low)                        0.606   0.064   0.147     0.789   0.957   0.990
Canny (Mid)                        0.605   0.064   0.146     0.789   0.958   0.990
Canny (High)                       0.605   0.064   0.147     0.790   0.957   0.990
TABLE 9
Ablation studies regarding 3DGM on the NYUD-V2 dataset. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all the modules with iterative inference.
                                   Error                     Accuracy
Method                             mean    median  rmse      11.25°  22.5°   30°
Our Baseline                       42.39   37.61   50.81     12.09   28.97   39.68
With GeoNet [19]                   35.02   29.12   43.33     17.60   39.04   51.36
With Ensemble                      35.16   28.78   43.82     17.95   39.58   51.87
With Edge-aware                    34.96   29.14   43.09     17.05   38.87   51.37
Full Model                         33.24   26.28   42.24     19.60   43.19   56.09
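The column labels of Tables 6, 7, and 9 appear to follow the usual surface normal protocol: mean, median, and root-mean-square angular error in degrees, plus the percentage of pixels whose error falls below 11.25°, 22.5°, and 30°. The Python sketch below shows how such statistics could be computed from pairs of unit normals. It is illustrative only: the function names are hypothetical, a strict less-than comparison is assumed for the accuracy thresholds, and masking of invalid pixels is omitted.

```python
import math
import statistics


def angular_error_deg(n_pred, n_ref):
    """Angle in degrees between two unit normals."""
    dot = sum(a * b for a, b in zip(n_pred, n_ref))
    dot = max(-1.0, min(1.0, dot))  # clamp against rounding error
    return math.degrees(math.acos(dot))


def normal_metrics(errors_deg, thresholds=(11.25, 22.5, 30.0)):
    """Summarize per-pixel angular errors (degrees) into error/accuracy stats."""
    count = len(errors_deg)
    summary = {
        "mean": sum(errors_deg) / count,
        "median": statistics.median(errors_deg),
        "rmse": math.sqrt(sum(e * e for e in errors_deg) / count),
    }
    for t in thresholds:
        # Accuracy: percentage of pixels with error strictly below t degrees.
        summary[t] = 100.0 * sum(e < t for e in errors_deg) / count
    return summary


toy_summary = normal_metrics([0.0, 10.0, 20.0, 40.0])
```

For 3DGM, the same statistics would be applied to normals reconstructed from the predicted depth rather than to directly predicted normals, as described in the abstract.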
GeoNet++, the 3D quality has been significantly improved. This further validates that 2D metrics are insufficient to fully measure the depth quality, and that GeoNet++ smooths predictions in planar regions considering geometric constraints and refines boundaries with the weighted propagation.
7 ABLATION STUDIES
We evaluate the effectiveness of each component of GeoNet++ both quantitatively and qualitatively on the NYUD-V2 dataset. Tab. 8 shows the effectiveness of different components, including depth-to-normal, normal-to-depth, ensemble module, and edge-aware refinement module in terms of 2D pixel-wise metrics. Tab. 7 shows the influence of different components on surface normal prediction. The ensemble module improves the performance by fusing predictions from the geometric module and the backbone network. The edge-aware refinement module improves the output by reducing the noise and making the boundary predictions more accurate. Quantitative results in terms of 3DGM are shown in Tab. 9. Visual comparisons are given in Fig. 12. As can be seen, GeoNet, including the depth-to-normal and normal-to-depth modules, smooths the prediction in planar regions (the wall region in Fig. 12 (c)) while preserving the details of small objects (the pillow and counter in Fig. 12 (c)). The ensemble module further enhances the result by combining the initial and the geometric predictions, making it closer to ground truth (Fig. 12 (d)). The edge-aware module refines the boundary (bed and counter in Fig. 12 (e)). For the Canny edge detection, we use the function in OpenCV to compute the edge maps. The “minValue” threshold is set to the mean of pixel intensities of the dataset (around 100) and the “maxValue” is two times the mean value (around 200). Here, we validate the robustness of our pipeline to the parameters of the Canny edge detector by changing the thresholds to extremely low values (i.e., “minValue”: 0 and “maxValue”: 0) and extremely high values (i.e., “minValue”: 255 and “maxValue”: 255). “Mid” corresponds to the values used in our experiments (i.e., “minValue”: 100 and “maxValue”: 200). Experimental results in Tab. 8 and Tab. 7 show that the performance is robust to the parameters of the Canny edge detector, owing to the learnable residual module. We also show visual comparisons in Fig. 11. Extremely low edge thresholds lead to noisy reconstructions (see the white points in Fig. 11). With high thresholds, the result has more flying pixels in the boundary regions compared to our settings. Our experimental observation indicates that the visual quality is generally stable over a large range of parameters around the “Mid” setting.
8 CNNS AND GEOMETRIC CONSTRAINTS
In this section, we verify our motivation by testing if CNNs can directly learn the mapping from depth to surface normal, i.e., implicitly learn the geometric constraints, so that the generated depth naturally produces high-quality surface reconstructions. To this end, we train CNNs, which take the ground-truth depth and surface normal maps as inputs and supervision respectively. We tried architectures including the first 4 layers, the first 7 layers, and the full version of the VGG-16 network. Before feeding to the above
TABLE 10
Performance evaluation of depth-to-normal on the NYUD-V2 test set. VGG stands for the VGG-16 network. “LS” means our least square module. “D-N” is our depth-to-normal network without the last 1 × 1 convolution layer. Ground-truth depth maps are used as input.
            Error                     Accuracy
Method      mean    median  rmse      11.25°  22.5°   30°
4-layer     39.5    37.6    44.0      6.1     21.4    35.5
7-layer     39.8    38.2    44.3      6.5     21.0    34.2
VGG         47.8    47.3    52.1      2.8     11.8    20.7
LS          11.5    6.4     18.8      70.0    86.7    91.3
D-N         8.2     3.0     15.5      80.0    90.3    93.5
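The “LS” row of Table 10 corresponds to the closed-form fit of Eq. (4). The following Python sketch shows one way to evaluate that fit for a single neighborhood by solving the 3 × 3 normal equations with Cramer's rule. It is a hedged illustration rather than the paper's implementation: the function names are hypothetical, the neighborhood N_i is assumed to have been gathered already, and degenerate neighborhoods simply raise an error.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def least_square_normal(points):
    """Unit normal from Eq. (4): normalize (A^T A)^{-1} A^T 1.

    `points` holds the K neighboring 3D points, i.e. the rows of A in Eq. (3).
    """
    # Build A^T A (3x3) and A^T 1 (3-vector) without forming A explicitly.
    ata = [[sum(p[r] * p[c] for p in points) for c in range(3)] for r in range(3)]
    at1 = [sum(p[r] for p in points) for r in range(3)]
    det = det3(ata)
    if abs(det) < 1e-12:
        raise ValueError("degenerate neighborhood: A^T A is singular")
    solution = []
    for k in range(3):
        # Cramer's rule: replace column k of A^T A with A^T 1.
        replaced = [row[:] for row in ata]
        for r in range(3):
            replaced[r][k] = at1[r]
        solution.append(det3(replaced) / det)
    norm = math.sqrt(sum(s * s for s in solution))
    return tuple(s / norm for s in solution)


# Four points on the fronto-parallel plane z = 2.
flat_normal = least_square_normal([(0, 0, 2), (1, 0, 2), (0, 1, 2), (1, 1, 2)])
```

In this toy case the recovered normal is (0, 0, 1) up to rounding. Since the right-hand side is the all-one vector, the estimate satisfies n·p > 0 for points on the fitted plane, so the sign convention is fixed by the formulation itself.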
(a) Input (b) DORN [18] (c) DORN [18] + GeoNet++ (d) FCRN [10] (e) FCRN [10] + GeoNet++ (f) GT
Fig. 10. Visual comparison of 3D point clouds on the NYUD-V2 dataset. “FCRN [10] + GeoNet++” indicates that we utilize FCRN [10] as the depth
backbone in our system. “DORN [18] + GeoNet++” indicates that DORN [18] serves as the depth backbone for our system.
(a) Image (b) Point Cloud (Low) (c) Edge (Low) (d) Point Cloud (Mid) (e) Edge (Mid) (f) Point Cloud (High) (g) Edge (High)
Fig. 11. Comparisons of different Canny parameters for surface reconstruction. Zoom in for a clearer view.
(a) Input & GT (b) Baseline (c) With GeoNet (d) With Ensemble (e) With Edge-aware (f) Full Model
Fig. 12. Qualitative comparisons for ablation studies. GeoNet: depth-to-normal and normal-to-depth modules. Full Model: all modules with iterative
inference. First row: depth prediction results; Second row: surface normal generated from predicted depth in the first row; Third row: point cloud
visualization; Fourth row: surface normal directly predicted from the image.
(a) Image (b) GT (D) (c) VGG (D) (d) Loss (D) (e) Ours (D) (f) GT (N) (g) VGG (N) (h) Loss (N) (i) Ours (N)
Fig. 13. Visual comparisons on joint prediction of depth and surface normal with VGG-16 as the backbone architecture. GT stands for “ground truth”.
“(D)” and “(N)” represent depth and surface normal respectively. Loss indicates that the geometric constraint is only adopted in constructing the loss.
networks, the depth map is transformed into a 3-channel image encoding {x, y, z} coordinates
respectively.

We provide the test performance on the NYUD-V2 dataset in Tab. 10. All variants of CNNs converge
to very poor local minima. We also show the test performance of the surface normal predicted by
our depth-to-normal network. In particular, since the depth-to-normal module contains
least-square and residual modules, we also show the surface normal map predicted by the least
square module only, denoted as “LS”. Tab. 10 reveals that the “LS” module alone is significantly
better than the vanilla CNN baselines in all metrics. Moreover, with the residual module, the
performance of our module is further boosted.

These preliminary experiments lead to the following important findings:
1) Learning a mapping from depth to normal directly via vanilla CNNs hardly respects the
   underlying geometric relation.
2) Despite its simplicity, the least square module is very effective in incorporating geometric
   constraints into neural networks, thus achieving better performance.
3) Our depth-to-normal network further improves the quality compared to the least-square module
   alone.

8.1 Geometric Consistency
We verify whether the depth and surface normal maps predicted by our geometric model, i.e.,
GeoNet, are consistent. To this end, we first train a standard depth-to-normal module using
ground-truth depth and surface normal maps and regard it as an accurate transformation. Given the
predicted depth map, we compute the transformed surface normal map using this trained standard
module.

With these preparations, we conduct experiments under the following four settings. (1) Comparison
between transformed normal and predicted normal both generated by the baseline network.
(2) Comparison between transformed normal and predicted normal both generated by our GeoNet.
(3) Comparison between transformed normal generated by the baseline network and the ground-truth
normal. (4) Comparison between transformed normal generated by GeoNet and the ground-truth normal.
Here we also use the VGG-16 network as the backbone network.

TABLE 11
Depth-to-normal consistency evaluation on the NYUD-V2 test set. “Pred” means that we transform
predicted depth to surface normal and compare it with the predicted surface normal. “GT” means
that we transform predicted depth to surface normal and compare it with the ground-truth surface
normal. “Baseline” and “GeoNet” indicate that predictions are from baseline and our model
respectively. The backbone network of our baseline is VGG-16.

                      Error                     Accuracy
                mean   median   rmse     11.25°   22.5°   30°
Pred-Baseline   42.2   39.8     48.9     9.8      25.2    35.9
Pred-GeoNet     34.9   31.4     41.4     15.3     35.0    47.7
GT-Baseline     47.8   47.3     52.1     2.8      11.8    20.7
GT-GeoNet       36.8   32.1     44.5     15.0     34.5    46.7

These results are shown in Tab. 11. The “Pred” rows of the table show that our GeoNet++ can
generate predictions of depth and surface normal which are more consistent than those of the
baseline CNN. From the “GT” rows of the table, it is also clear that, compared to the baseline
CNN, the predictions from our GeoNet++ are consistently closer to the ground truth.

9 CONCLUSION
We have proposed Geometric Neural Network with Edge-Aware Refinement (GeoNet++) to jointly predict
depth and surface normal from a single image. Our GeoNet++ involves depth-to-normal and
normal-to-depth modules. It effectively enforces the geometric constraints that the prediction
should obey regarding depth and surface normal. They make the final prediction geometrically
consistent and more accurate. The ensemble network slightly adjusts the results by fusing
geometrically refined predictions and initial predictions from backbone networks. The edge-aware
refinement network updates predictions in the planar and boundary regions. The iterative
inference is finally adopted to improve the prediction. Our extensive experiments show that
GeoNet++ achieves state-of-the-art results in terms of both 2D metrics and a newly proposed 3D
geometric metric. In the future, we would like to apply our GeoNet++ to geometric estimation
tasks, such as 3D reconstruction, stereo matching, and SLAM.

ACKNOWLEDGEMENTS
The work was supported in part by HKU Start-up Fund, Seed Fund for Basic Research, the ERC grant
ERC-2012-AdG321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1.
We would also like to thank the Royal Academy of Engineering and FiveAI. RL was supported by
Connaught International Scholarship and RBC Fellowship.
REFERENCES
[1] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 746–760.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” Int. J. Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[3] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in Proc. Advances Neural Inf. Process. Syst., 2006, pp. 1161–1168.
[4] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 1253–1260.
[5] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 2366–2374.
[6] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2650–2658.
[7] M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in Proc. IEEE Int. Conf. Comput. Vis., 2014, pp. 716–723.
[8] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2800–2809.
[9] A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 5506–5514.
[10] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proc. 4th Int. Conf. 3D Vis., 2016, pp. 239–248.
[11] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2024–2039, 2015.
[12] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale continuous crfs as sequential deep networks for monocular depth estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 5354–5362.
[13] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1119–1127.
[14] X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for surface normal estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 539–547.
[15] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2d-3d alignment via surface normal prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 5965–5974.
[16] A. Bansal, X. Chen, B. Russell, A. G. Ramanan et al., “Pixelnet: Representation of the pixels, by the pixels, and for the pixels,” arXiv e-prints, 2017.
[17] J. T. Barron and J. Malik, “Shape, illumination, and reflectance from shading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 8, pp. 1670–1687, 2015.
[18] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2002–2011.
[19] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 283–291.
[20] A. Torralba and A. Oliva, “Depth estimation from image structure,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1226–1238, 2002.
[21] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from an image,” Int. J. Comput. Vision, vol. 75, no. 1, pp. 151–172, 2007.
[22] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun, “Box in the box: Joint 3d layout and object reasoning from single images,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 353–360.
[23] L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 89–96.
[24] J. Shi, X. Tao, L. Xu, and J. Jia, “Break ames room illusion: depth from general single images,” ACM Trans. Graph., vol. 34, no. 6, p. 225, 2015.
[25] P. Favaro and S. Soatto, “A geometric approach to shape from defocus,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 406–417, 2005.
[26] E. Shelhamer, J. T. Barron, and T. Darrell, “Scene intrinsics and depth from a single image,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2015, pp. 37–44.
[27] W.-C. Ma, H. Chu, B. Zhou, R. Urtasun, and A. Torralba, “Single image intrinsic decomposition without a single intrinsic image,” in Proc. 15th Eur. Conf. Comput. Vis., 2018, pp. 201–217.
[28] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in Proc. 14th Eur. Conf. Comput. Vis. Springer, 2016, pp. 740–756.
[29] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 270–279.
[30] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille, “Surge: Surface regularized geometry estimation from a single image,” in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 172–180.
[31] D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 675–684.
[32] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint task-recursive learning for semantic segmentation and depth estimation,” in Proc. 15th Eur. Conf. Comput. Vis., 2018, pp. 235–251.
[33] Y. Zhang and T. Funkhouser, “Deep depth completion of a single rgb-d image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 175–185.
[34] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia, “Unsupervised learning of geometry from videos with edge-aware depth-normal consistency,” in Proc. AAAI Conf. Artificial Intell., 2018.
[35] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4545–4554.
[36] E. S. Gastal and M. M. Oliveira, “Domain transform for edge-aware image and video processing,” in ACM Trans. Graph., vol. 30, no. 4. ACM, 2011, p. 69.
[37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in Proc. Int. Conf. Learn. Representations, 2015.
[39] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” in Proc. Int. Conf. Learn. Representations, 2016.
[40] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2881–2890.
[41] D. Fouhey, A. Gupta, and M. Hebert, “Data-driven 3d primitives for single image understanding,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3392–3399.
[42] D. F. Fouhey, A. Gupta, and M. Hebert, “Unfolding an indoor origami world,” in Proc. 13th Eur. Conf. Comput. Vis. Springer, 2014, pp. 687–702.
[43] B. Zeisl, M. Pollefeys et al., “Discriminatively trained dense surface normal estimation,” in Proc. 13th Eur. Conf. Comput. Vis. Springer, 2014, pp. 468–484.
[44] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in Proc. Int. Conf. 3D Vis. IEEE, 2017, pp. 11–20.
[45] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM Trans. Graph., vol. 23, no. 3, pp. 689–694, 2004.
[46] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th Sympos. Operat. Sys. Design and Implement., 2016, pp. 265–283.
[47] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations, 2015.
[48] A. Chakrabarti, J. Shao, and G. Shakhnarovich, “Depth from a single image by harmonizing overcomplete local network predictions,” in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 2658–2666.
[49] K. Karsch, C. Liu, and S. B. Kang, “Depth extraction from video using non-parametric sampling,” in Proc. 12th Eur. Conf. Comput. Vis. Springer, 2012, pp. 775–788.
[50] W. Zhuo, M. Salzmann, X. He, and M. Liu, “Indoor scene structure analysis for single image depth estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 614–622.
[51] M. H. Baig and L. Torresani, “Coupled depth learning,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2016, pp. 1–10.
[52] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, 2009.
[53] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6647–6655.
[54] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Monocular depth estimation using multi-scale continuous crfs as sequential deep networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 6, pp. 1426–1440, 2018.
[55] L. Ladicky, B. Zeisl, and M. Pollefeys, “Discriminatively trained dense surface normal estimation,” in Proc. 13th Eur. Conf. Comput. Vis., 2014, p. 4.

Jiaya Jia received his Ph.D. degree in Computer Science from Hong Kong University of Science and
Technology in 2004. From March 2003 to August 2004, he was a visiting scholar at Microsoft. Then,
he joined the Department of Computer Science and Engineering at The Chinese University of Hong
Kong (CUHK) in 2004 as an assistant professor and was promoted to Associate Professor in 2010. He
conducted collaborative research at Adobe Research in 2007. He was promoted to Professor in 2015.
He was the Distinguished Scientist and Founding Executive Director of Tencent YouTu X-Lab. He is
an IEEE Fellow.