0% found this document useful (0 votes)
15 views22 pages

Confidence Intervals for Neural Networks

This document discusses methods for constructing confidence intervals (CIs) for neural network models based on least squares estimation and linear Taylor expansion. It presents the theoretical basis for approximating CIs for nonlinear models using a linear approach. The authors show that this approach provides a tool to estimate model performance and detect potential ill-conditioning issues. They apply the method to both simulated and real-world data, finding it compares favorably to other analytic approaches while being more efficient than bootstrap methods.

Uploaded by

thyagosmesme
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views22 pages

Confidence Intervals for Neural Networks

This document discusses methods for constructing confidence intervals (CIs) for neural network models based on least squares estimation and linear Taylor expansion. It presents the theoretical basis for approximating CIs for nonlinear models using a linear approach. The authors show that this approach provides a tool to estimate model performance and detect potential ill-conditioning issues. They apply the method to both simulated and real-world data, finding it compares favorably to other analytic approaches while being more efficient than bootstrap methods.

Uploaded by

thyagosmesme
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Neural

Networks

PERGAMON Neural Networks 13 (2000) 463484


[Link]/locate/neunet

Construction of confidence intervals for neural networks based on least


squares estimation
I. Rivals*, L. Personnaz
Laboratoire dElectronique, Ecole Superieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex 05, France
Received 14 October 1997; accepted 5 July 1999

Abstract
We present the theoretical results about the construction of confidence intervals for a nonlinear regression based on least squares
estimation and using the linear Taylor expansion of the nonlinear model output. We stress the assumptions on which these results are
based, in order to derive an appropriate methodology for neural black-box modeling; the latter is then analyzed and illustrated on simulated
and real processes. We show that the linear Taylor expansion of a nonlinear model output also gives a tool to detect the possible ill-
conditioning of neural network candidates, and to estimate their performance. Finally, we show that the least squares and linear Taylor
expansion based approach compares favorably with other analytic approaches, and that it is an efficient and economic alternative to the
nonanalytic and computationally intensive bootstrap methods. 2000 Elsevier Science Ltd. All rights reserved.
Keywords: Nonlinear regression; Neural networks; Least squares estimation; Linear Taylor expansion; Confidence intervals; Ill-conditioning detection; Model
selection; Approximate leave-one-out score

1. Introduction in the past years (Efron & Tibshirani, 1993). These


nonanalytic methods have been proposed to build CIs for
For any modeling problem, it is very important to be able neural networks (Heskes, 1997; Paass, 1993; Tibshirani,
to estimate the reliability of a given model. This problem 1996), but with the shortcoming of requiring a large number
has been investigated to a great extent in the framework of of trainings.
linear regression theory, leading to well-established results This paper presents an economic alternative to the
and commonly used methods to build confidence intervals construction of CIs using neural networks. This approach
(CIs) for the regression, that is the process output expecta- being built on the linear least squares (LS) theory applied to
tion (Seber, 1977); more recently, these results have been the linear Taylor expansion (LTE) of the output of nonlinear
extended to nonlinear models (Bates & Watts, 1988; Seber models, we first recall how to establish CIs for linear models
& Wild, 1989). In the neural network modeling studies in Section 2, and then approximate CIs for nonlinear models
however, these results are seldom exploited, and generally in Section 3. In Section 4, we exploit these known theore-
only an average estimate of the neural model reliability is tical results for practical modeling problems involving
given through the mean square model error on a test set; but neural models. We show that the LTE of a nonlinear
in an application, one often wishes to know a CI at any input model output not only provides a CI at any input value of
value of interest. Nevertheless, thanks to the increase of interest, but also gives a tool to detect the possible ill-condi-
computer power, the use of bootstrap methods has increased tioning of the model, and as in Monari (1999) and Monari
and Dreyfus (submitted), to estimate its performance
* Corresponding author. Tel.: 33-1-40-79-45-45; fax: 33-1-40-79-44-
through an approximate leave-one-out (LOO) score. A
25. real-world illustration is given through an industrial appli-
E-mail address: [Link]@[Link] (I. Rivals). cation, the modeling of the elasticity of a complex material
Current address: Equipe de Statistique Appliquee, Ecole Superieure de from some of its structural descriptors. Section 5 compares
Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex the LS LTE approach to other analytic approaches, and
05, France.
discusses its advantages with respect to bootstrap
Abbreviations: CI: confidence interval; LOO: leave-one-out; LS: least
squares; LTE: linear Taylor expansion; SISO: single inputsingle output;
approaches.
MISO: multi inputsingle output; MSTE: mean square training error: We consider single-output models, since each output of a
MSPE: mean square performance error. multi-output model can be handled separately. We deal with
0893-6080/00/$ - see front matter 2000 Elsevier Science Ltd. All rights reserved.
PII: S0893-608 0(99)00080-5
464 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

Nomenclature
We distinguish between random variables and their values (or realizations) by using upper- and lowercase letters; all
vectors are column vectors, and are denoted by boldface letters; nonrandom matrices are denoted by light lowercase letters
x Nonrandom n-input vector
Yp Yp x Random scalar output depending on x
EYp x Mathematical expectation, or regression function, of Yp given x
W Random variable with zero expectation denoting additive noise
s2 Variance of W
{x k ; ykp }k1 to N Data set of N inputoutput pairs, where the {xk } are nonrandom n-vectors, and the {ykp } are the
corresponding realizations of the random outputs {Ypk Yp xk }
{x u; u Rn } Family of linear functions of x parameterized by u
k T

up Unknown true q-parameter vector (q n in the linear case)


x x1 x2 xN T Nonrandom (N,n) input matrix
Y p Yp1 Yp2 YpN T Random N-vector of the outputs of the data set
W W 1 W 2 W N T Random N-vector with EW 0
J(u ) Value of the least squares cost function
Q LS Least squares estimator of u p
u LS Least squares estimate of u p
R Y p xQ LS Least squares residual random N-vector in the linear case
r Value of R
mx Range of x (linear manifold)
px Orthogonal projection matrix on mx
S2 Estimator of s 2
2
s Value of S 2
{ f x; u; u R q } Family of nonlinear functions of x parameterized by u
f x; u N-vector f x1 ; uf xk ; uf xN ; uT
R Y p f x; Q LS Least squares residual random N-vector
j j 1 j 2 j N T Unknown nonrandom (N,q) matrix with j k 2f xk ; u=2uuup
m(j ) Range of j
pj Orthogonal projection matrix on m(j )
z z 1 z2 zN T Matrix approximating j with z k 2f xk ; u=2uuuLS
m(z) Range of z
pz Orthogonal projection matrix on m(z)
IN (N,N) identity matrix
uk
LS Leave-one-out (the kth example) least squares estimate of up
{ek }k1 to N Leave-one-out errors
nh Number of hidden neurons of a neural network
H Random Hessian matrix of the cost function
h Value of the Hessian matrix of the cost function
var d f x1 Q LS ref Reference variance estimate
var d f x; Q LS LTE LTE estimate of the variance of a nonlinear model output
var d f x; Q LS Hessian Hessian estimate of the variance of a nonlinear model output
d
var f x; Q LS sandwich Sandwich estimate of the variance of a nonlinear model output
Abbreviations
CI confidence interval
LOO leave-one-out
LS least squares
LTE linear Taylor expansion
SISO single input - single output
MISO multi input - single output
MSTE mean square training error
MSPE mean square performance error
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 465

where up is an unknown n-parameter vector. Model (2)


associated to the data set leads to:
Y p xup W 3
where x x1 x2 xN T is the nonrandom (N,n) input
matrix, Y p Yp1 Yp2 YpN T and W W 1 W 2 W N T
are random N-vectors, with EW 0: Geometrically, this
means that EY p x xup belongs to the solution surface,
Fig. 1. Geometric representation of the linear LS solution (true assumed the linear manifold m(x) of the observation space RN
model). spanned by the columns of the input matrix 2 (the range of
x). We assume that m(x) is of dimension n, that is rankx
static modeling problems for the case of a nonrandom (noise n: In other words, the model is identifiable, i.e. the data set
free) n-input vector x x1 x2 xn T ; and a noisy is appropriately chosen, possibly using experimental
measured output yp which is considered as the actual design.
value of a random variable Yp Yp x depending on x. We
assume that there exists an unknown regression function 2.1. The linear least squares solution
EYp x such that for any fixed value of x:
The LS estimate uLS of up minimizes the empirical
Yp x EYp x Wx 1 quadratic cost function:
where Wx is thus a random variable with zero expectation. X
N

A family of parameterized functions {f x; u; x Rn ; u Ju 1


2 ykp xk T u2 1
2 yp xuT yp xu 4
k1
Rq } is called an assumed model. This assumed model is said
to be true if there exists a value u p of u such that, x in the The estimate u LS is a realization of the LS estimator Q LS
input domain of interest, f x; up EYp x: In the follow- whose expression is:
ing, a data set of N inputoutput pairs {xk ; ykp }k1 to N is
Q LS xT x1 xT Y p up xT x1 xT W 5
available, where the xk xk1 xk2 xkn T are nonrandom
n-vectors, and the {ykp } are the corresponding realizations As the assumed model is true, this estimator is unbiased.
of the random variables {Ypk Yp xk }. 1 The goal of the The orthogonal projection matrix on mx is px
modeling procedure is not only to estimate the regression xxT x1 xT : It follows from Eq. (5) that the unbiased LS
EYp x in the input domain of interest with the output of a estimator of EY p x is:
model, but also to compute the value of a CI for the regres-
xQ LS xup px W 6
sion, that is the value of a random interval with a chosen
probability to contain the regression. For the presentation of that is the sum of EY p x and of the projection of W on
the results of linear and nonlinear regression estimation, we m(x), as shown in Fig. 1. Let R denote the residual random
deal with the true model (a model which is linear in the N-vector R Y p xQ LS ; that is the vector of the errors on
parameters in Section 2, a nonlinear one in Section 3), i.e. the data set, then:
we consider that a family of functions containing the regres-
sion is known. In Section 4, we consider the general realistic R IN px W 7
case of neural black-box modeling where a preliminary Under the assumption that the {W k } are identically distrib-
selection among candidate neural models is first performed uted and uncorrelated (homoscedastic), i.e. the noise covar-
because a true model is not known a priori. iance matrix is KW s 2 IN ; it follows from Eq. (5) that
the variance of the LS estimator of the regression for any
input x of interest is: 3
2. Confidence intervals for linear models
varxT Q LS s 2 xT xT x1 x 8
We consider a true linear assumed model, that is the
associated family of linear functions {xT u; x Rn ; u Using Eq. (7), we obtain the unbiased estimator
Rn } contains the regression; Eq. (1) can thus be uniquely RT R
rewritten as: S2
Nn
Yp x xT up Wx 2 of s 2; the corresponding (unbiased) estimate of the variance
1
We recall that we distinguish between random variables and their 2
m(x) is sometimes called the expectation surface (Seber & Wild,
values (or realizations) by using upper and lowercase letters, e.g. Ypk and 1989); as a matter of fact, the solution surface coincides with the expecta-
ykp ; all vectors are column vectors, and are denoted by boldface letters, e.g. tion surface only when the assumed model is true.
the n-vectors x and {xk }; nonrandom matrices are denoted by light lower- 3
We recall that x (boldface) is the (n, 1) input vector of interest, and that
case letters (except the unambiguous identity matrix). x is the experimental (N, n) input matrix.
466 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

a)
5

-5
-3 -2 -1 0 1 2 3
b)
0.08

0.06

0.04

0.02

0
-3 -2 -1 0 1 2 3

Fig. 2. CI for process #1, a simulated linear SISO process (true assumed model with n 2 parameters): (a) regression (thin line), data set (crosses), model
output and 99% CI (thick lines); and (b) true variance (thin line) and LS estimate of the variance (thick line) of x TQ LS.

of x T Q LS is thus: EY p x and s appear:

varxd
T
Q LS s2 xT xT x1 x 9 x T Q LS EYp x
p ! N0; 1 12
where s is the value of S. s x T xT x1 x

2.2. Confidence intervals for a linear regression Thus, using the Pearson variable (11), which is independent
from Eq. (12) according to Theorem 3, we obtain the follow-
If the {W k } are homoscedastic gaussian variables, that is ing Student variable:
W ! NN 0; s 2 IN :
xT Q LS EYp x
Theorem 1. p ! Student N n 13
S xT xT x1 x
1
Q LS up ! Nn 0; s x x
2 T
10
A 100(1 a )% CI for EYp x is thus:
  q
Theorem 2. a
xT uLS ^ tNn 1 s xT xT x1 x 14
2
T
R R
! x2Nn 11
s2 where tNn is the inverse of the Student(N n) cumulative
distribution.
Note that Eq. (14) allows to compute a CI corresponding
Theorem 3. Q LS is statistically independent from RT R: to any input vector, and that it is much more informative
than average values such as that the mean square error on
The proof of the above theorems follows from Fig. 1 and the data set, or the mean of the variance estimate over the
from the FisherCochrane theorem (Goodwin & Payne, data set; 4 as a matter of fact, the latter invariably equals
1977), see for instance (Seber, 1977). s2 n=N:
The goal is to build a CI for the regression value 4
The mean of the variance estimate over the training data set is:
EYp x xT up ; for any input vector x of interest. The 1 PN 2 k T T 1 k s2 PN s2
variance of the measurements s 2 being unknown, let us k1 s x x x x k1 px kk tracepx : As px is the
N N N
build a normalized centered gaussian variable where both orthogonal projection matrix on a n-dimensional subspace, tracepx n:
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 467

Fig. 3. Geometric representation of the nonlinear LS solution and of its LTE approximation (true assumed model).

2.3. Example of a simulated linear SISO process surface, the manifold m f x; u { f x; u; u Rq } of


(process #1) RN :

We consider a simulated single input-single output 3.1. The linear Taylor expansion of the nonlinear least
(SISO) linear process: squares solution

ykp up1 up2 xk wk k 1 to N 15 A LS estimate u LS of u p minimizes the empirical cost


function: 6
We take up1 1; up2 1; s 2 0:5; N 30: The inputs
X
N
{xk } of the data set are uniformly distributed in [3; 3], Ju 1
ykp f xk ; u2 1
yp f x; uT yp f x; u
as shown in Fig. 2a. The family of functions {u1 u2 x; u 2
k1
2

R2 } is considered, that is the assumed model is true, and we 18


choose a confidence level of 99% t28 1% 2:76: The LS
estimation leads to s2 0:29; i.e. underestimates the noise The estimate u LS is a realization of the LS estimator Q LS.
variance. Fig. 2b displays the estimate (9) of the variance of Efficient algorithms are at our disposal for the minimization
xT Q LS ; and the true variance (8). The estimator S 2 of the of the cost function (18): they can lead to an absolute mini-
noise variance being unbiased, the difference between the mum, but they do not give an analytic expression of the
estimated variance (9) and the (usually unknown) true estimator that could be used to build CIs. In order to take
variance (8) is only due to the particular values of the advantage of the results concerning linear models, it is
measurement noise. Fig. 2(a) shows the regression worthwhile considering a linear approximation of Q LS
EYp x; the data set, the model output and the 99% CI for which is obtained by writing the LTE of f(x, u ) around
the regression computed with Eq. (14). f(x, u p):
!
Xq
2f x; u
f x; u f x; u p u upr
r1
2 u r u up r
3. Approximate confidence intervals for nonlinear
models
f x; up j T u up 19
We consider a family of nonlinear functions {f x; u; x where
Rn ; u Rq } which contains the regression, that is the
2f x; u
assumed model is true; Eq. (1) can thus be rewritten as: j
2u uup
Yp x f x; up Wx 16
Thus, with the matrix notation:
where u p is an unknown q-parameter vector. We denote by f x; u f x; up ju u p 20
f x; up the unknown vector f x1 ; up f xk ; up
f xN ; up T defined on the data set, thus: where

Y p f x; up W 17 j j 1 j 2 j N T
6
As in Section 2, x denotes the (N,n) input matrix, 5 and Yp In the case of a multilayer neural network, the minimum value of the
cost function can be obtained for several values of the parameter vector;
and W are random N-vectors with EW 0: Geometri-
but, since the only function-preserving transformations are neuron
cally, this means that EY p x belongs to the solution exchanges, as well as sign flips for odd activation functions like the hyper-
bolic tangent [Sussman 1992], we will legitimately consider the neighbor-
5
In the case of a nonlinear model, x is merely a two-dimensional array. hood of one of these values only.
468 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

and matrix j , the vector j is not available, and its value may
be approximated by:
2f x ; u
k

j
k
2f x; u
2u uu z 26
2u uuLS
p

The (N, q) matrix j is the nonrandom and unknown (since


u p is unknown) Jacobian matrix of f. Using Eq. (20), one From Eqs. (23), (25) and (26), the LTE estimate of the
obtains, similarly to the linear case, the following approx- variance of the LS estimator of the regression for an input
imation of Q LS (see Appendix A.1 for a detailed derivation x is thus:
of Eqs. (21) and (23)): dQ LS LTE s2 zT zT z1 z
var fx; 27
1 T
Q LS up j j T
j W 21
The range m(j ) of j is tangent to the manifold m( f(x, u ) 3.2. Approximate confidence intervals for a nonlinear
at u u p ; this manifold is assumed to be of dimension q, regression
hence rankj q: The matrix p j jj T j 1 j T is the
orthogonal projection matrix on m(j ). From Eqs. (20) and If W ! NN 0; s 2 IN ; and for large N, it follows from the
(21), the LS estimator of EY p x can be approximately above relations and from the linear LS theory (Seber &
expressed by: Wild, 1989) that:
f x; Q LS f x; up pj W 22 Theorem 4.
i.e. it is approximately the sum of EY p x and of the projec-
Q LS ! Nq up ; s 2 j T j 1 28
tion of W on m(j ), as illustrated in Fig. 3. If KW s 2 IN
(homoscedasticity), the variance of the model output, that is
the LS estimator of the regression, for an input x is
approximately: Theorem 5.
1
var f x; Q LS s j j j
2 T T
j 23 RT R
! x2Nq 29
In the following, approximation (23) will be termed the s2
LTE approximation of the model output variance. Let R
denote the LS residual vector R Y p f x; uLS ; thus:
R IN pj W 24 Theorem 6. Q LS is approximately statistically indepen-
dent from R T R:
Under the assumption of appropriate regularity condi-
tions on f, and for large N, the curvature of the solution Using Eqs. (23) and (28), let us again build a quasi-normal-
surface 7 m( f(x, u )) is small; thus, using Eq. (24), one ized and centered gaussian variable where both EYp x and
obtains the asymptotically (i.e. when N ! ) unbiased esti- s appear:
mator S2 RT R=N q of s 2. In Eq. (23), the matrix j
takes the place of matrix x in the linear case. But, as opposed f x; Q LS EYp x
q ! N0; 1 30
to x, j is unknown since it is a function of the unknown
s j T j T j 1 j
parameter u p. The (N, q) matrix j may be approximated by
z z 1 z2 zN T Thus, the variable (29) being approximately independent
from Eq. (30) according to Theorem 6, we have:
where
f x; Q LS EYp x
2f xk ; u q ! Student N q 31
z
k
; S j T j T j 1 j
2u uuLS

that is A 100(1 a )% approximate CI for EY p x is thus:


  q
a
2f xk ; u f x; uLS ^ tNq 1 s zT zT z1 z 32
zr
k
25 2
2 u r u u LS
Note that, when N is large, the Student distribution is close
In the following, we assume that rankz q: Like the to the normal distribution, and thus tNq 1 a2 n1 a2 ;
7
where n is the inverse of the normal cumulative distribution.
The curvature is usually decomposed in two components: (i) the intrin-
sic curvature, which measures the degree of bending and twisting of the
Like in the linear case, Eq. (32) allows to compute a CI at
solution surface m( f(x, u )), and (ii) the parameter-effects curvature, which any input x of interest, which gives much more information
describes the degree of curvature induced by the choice of the parameters u . than the value of the mean variance estimate over the data
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 469

a)

-1
-3 -2 -1 0 1 2 3
x 10 -3 b)
6

0
-3 -2 -1 0 1 2 3

Fig. 4. CI for process #2, a simulated neural SISO process (the assumed model, a two hidden neurons network with q 7 parameters, is true): (a) regression
(thin line), the N 50 examples of the data set (crosses), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and LTE (thick line)
estimates of the variance of f(x,Q LS).

set: as a matter of fact, the latter always approximately Thus, if N is large: (i) as in the linear case, the estimator of
equals s2 q=N: the noise variance S 2 is unbiased, and the difference between
From a practical point of view, the construction of a CI s 2 and s 2 is only due to the particular values of the noise;
for a neural model output at any input x of interest involves and (ii) the variance of f(x, Q LS) is small, and u LS is likely to
once and for all the computation of the matrix z, that is the be close to u p: z and z are thus likely to be good approxima-
N q partial derivatives of the model output with respect to tions of, respectively, j and j . A reliable estimate of a CI
the parameters evaluated at u LS for the data inputs may thus be obtained from the LTE variance estimate (27).
{xk }k1 to N ; and, for each x value, that of z, i.e. the deriva- On the other hand, if N is too small: (i) as opposed to the
tives at x. In the case of a neural model, these derivatives are linear case, even if the assumed model is true, the estimator
easily obtained with the backpropagation algorithm. of the noise variance S 2 is biased; and (ii) the variance of f(x,
All the previous results and the above considerations are Q LS) is large, and u LS is likely to differ from u p: z and z risk
valid provided an absolute minimum of the cost function to be poor approximations of j and j . Thus, if N is diag-
(18) is reached. In real-life, several estimations of the para- nosed as too small, one cannot have confidence in the
meters must thus be made starting from different initial confidence intervals (32), and additional data should be
values, the estimate corresponding to the lowest minimum gathered.
being kept in order to have a high probability to obtain an
absolute minimum. In the examples of this work, the algo- 3.3.2. Quantitative analysis
rithm used for the minimization of the cost function is the As detailed for example by Antoniadis, Berruyer and
Levenberg algorithm, as described for example in Bates and Carmona (1992), Bates and Watts (1988) and Seber and
Watts (1988), and several trainings are performed. The Wild (1989) different measures of the curvature can be
probability of getting trapped in a relative minimum computed, and can be used in each particular case to eval-
increasing with the number of parameters of the network uate the accuracy of the LTE. In Section 4.3 dealing with
and decreasing when the size of the data set increases, the neural network modeling, we give indications on how to
number of trainings is chosen accordingly. judge if N is large enough for the approximate CI to be
accurate.
3.3. Quality of the approximate confidence intervals In order to evaluate the accuracy of the LTE variance
estimate (27) when dealing with simulated processes, we
3.3.1. Theoretical analysis introduce an estimate of the unknown true variance of f(x,
The quality of the approximate CI depends essentially on Q LS) that is not biased by curvature effects, the reference
the curvature of the solution surface m f x; u; which variance estimate var d f x 1 Q LS ref : This estimate is
depends on the regularity of f and on the value of N. In computed on a large number M of other sets of N outputs
practice, f is often regular enough for a first-order approx- corresponding to the inputs of the training set, and whose
imation to be satisfactory, provided that N is large enough. values are obtained with different realizations (simulated
470 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

simulated; (2) the LTE variance estimate (27), which is the


estimate that can be computed in real-life; and (3) the refer-
a) ence variance estimate (33), which tends to the true variance
3 when M tends to infinity because it is not biased by curva-
2
ture effects, and which can be computed only when the
process is simulated.
1
0
-1 3.4. Illustrative examples

As we are concerned with neural models, and since, in


2
-2 this section, the assumed model is true, the following
0
0 examples bring into play neural processes, that is
2 -2 to say processes whose regression function is the
b) output of a neural network; the more realistic case of
0.04 arbitrary processes for whom a family of nonlinear
functions (a neural network with a given architecture)
containing the regression is unknown is tackled in the next
0.02 section.

0 3.4.1. Example of a simulated neural SISO process


2
-2 (process #2)
0 0
2 -2 We consider a SISO process simulated by a neural
x 10 -3 c) network with one hidden layer of two hidden neurons
with hyperbolic tangent activation function and a linear
5 output neuron:
0
y pk up1 up2 tanh up3 up4 xk
-5
up5 tanh up6 up7 xk wk k 1 to N 34
2
-2
0 0
We take up 1; 2; 1; 2; 1; 1; 3T ; s 2 102 ; N
2 -2
50: The inputs {x k} of the data set are uniformly distributed
Fig. 5. CI for process #3, a simulated neural MISO process (the assumed in [3; 3], as shown in Fig. 4a. The family of functions
model, a two hidden neurons network with q 9 parameters, is true): (a) {u1 u2 tanh u3 u4 x u5 tanh u6 u7 x; u R7 } is
the N 100 inputs of the data set (circles) and regression; (b) LTE estimate considered, that is the assumed model is true, and we choose
of the variance of f(x,Q LS); and (c) difference between the reference and the a confidence level of 99% t43 1% 2:58: The minimiza-
LTE estimates of the variance of f(x,Q LS).
tion of the cost function with the Levenberg algorithm leads
to s2 1:02 10 2. Fig. 4b displays the LTE estimate of
the variance of f(x, Q LS) (27), and the reference estimate
(33) computed over M 10 000 sets. Fig. 4a shows the
values) of the noise W. The ith LS estimate f x; uiLS of regression, the data set used for the LS estimation, and the
EYp x is computed with data set i (i 1 to M), and the corresponding model output and 99% CI (32). The model
reference estimate of the variance at input x is computed as: being true and the size N 50 of the data set being rela-
1 XM tively large with respect to the number of parameters and to
var d
f x1 Q LS ref f x; ui
LS f x ;
2
the noise variance, we observe that: (i) s 2 s 2 ; (ii) the
M i1
model output is close to the regression, leading to good
approximations of j and of j by z and z. Thus, (i) and (ii)
1 XM
where f x f x; ui
LS (33) lead to an accurate LTE estimate of the variance, and hence
M i1 of the CI.

In the nonlinear case, we thus use three notions related to the


(true) variance of f(x,Q LS): (1) the LTE variance approxi- 3.4.2. Example of a simulated neural MISO process
mation (23), which is a good approximation of the true (process #3)
variance if the curvature is small, as we show in Section We consider a MISO process simulated by a neural
3.4.3, and which can be computed only when the process is network with two inputs, one hidden layer of two tanh
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 471

x 10 -3 a)

0
-3 -2 -1 0 1 2 3
x 10 -4 b)
2

-1
-3 -2 -1 0 1 2 3

Fig. 6. Accuracy of the LTE approximation of the variance for process #2: (a) reference estimate of the variance of f(x,Q LS) (thin line), LTE approximation
obtained with the true values u p and s 2 (thick line); and (b) difference between the reference estimate of the variance of f(x,Q LS) and the LTE approximation
obtained with u p and s 2.

hidden neurons and a linear output neuron: We take up 1; 1; 0; 1; 1; 2; 0; 1; 1T ; s 2


1
10 ; N 100: The inputs {xk1 } and {xk2 } of the data set
y pk up1 up2 tanh up3 up4 xk1 up5 xk2 are uniformly distributed in [3; 3]. As for process #2, the
assumed model is true, i.e. the neural network associated to
up6 tanh up7 up8 xk1 up9 xk2 wk k 1 to N 35 Eq. (35) is used; the minimization of the cost function with

a)

0.04

0.03

0.02

0.01

2
-2
0 0
2 -2

x 10 -3 b)

-5

2
-2
0 0
2 -2

Fig. 7. Accuracy of the LTE approximation of the variance for process #3: (a) reference estimate of the variance of f(x,Q LS); and (b) difference between the
reference estimate of the variance of f(x,Q LS) and the LTE approximation obtained with u p and s 2.
472 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

the Levenberg algorithm leads to s2 9:73 10 2. Fig. 5a computation of the approximate Cl (32). We propose to
shows the inputs of the data set and the regression; Fig. 5b discard too large models by a systematic detection of ill-
displays the LTE estimate of the variance of f(x, Q LS) (27); conditioning, and to perform the selection among the
Fig. 5c displays the difference between the reference approved, i.e. well-conditioned models using an approxi-
variance estimate (33) computed over M 10 000 sets, mate value of their LOO score whose computation does
and the LTE variance estimate (27). As s 2 is slightly smaller not require further training. Both the ill-conditioning detec-
than s 2, the variance is globally slightly underestimated. tion and the estimation of the LOO score of a neural candi-
However except in the domain around the corner (3, 3) date are based on the LTE of its output.
where the density of the inputs is lower, and where the slope
of the output surface is steep, the LTE variance estimate is 4.1. III-conditioning detection for model approval
satisfactory. A reliable estimate of the CI may thus be obtained.
A too large neural model, trained up to convergence with
a simple LS cost-function, will generally overfit. Overfitting
3.4.3. Accuracy of the LTE variance approximation is often avoided by using early stopping of the training
(processes #2 and #3) algorithm or by adding regularization terms in the cost func-
Let us now show on the example of processes #2 and #3 tion, e.g. weight decay (Bishop, 1995). Unfortunately, as
that the curvature of the solution surface is small enough for only the estimator whose value corresponds to an absolute
the LTE approximation of the variance (23) to be satisfac- minimum of the quadratic cost function (18) without weight
tory. For both processes, we have computed approximation decay terms is unbiased, both early stopping and weight
(23), using the values of j and of j (at u p) and the value of decay introduce a bias in the estimation of the regression:
s 2 used for the noise simulation. As shown in Fig. 6a and b the corresponding estimates thus lead to questionable CIs
for process #2, the LTE approximation of the variance (23) for the regression.
is very close to the reference variance estimate. As a matter To detect and discard too large networks, we propose,
of fact, the difference between them (Fig. 6b) is only due to after the training of each candidate up to a (hopefully) abso-
the curvature, which is small as N 50 is large with respect lute minimum of the cost function (18), to check the condi-
to the complexity of the regression. Expression (23) also tioning of their matrix z (see Rivals & Personnaz, 1998).
leads to satisfactory results in the case of process #3 as The fact that z be ill-conditioned is the symptom that some
shown in Fig. 7a and b N 100; two inputs). This tends parameters are useless, since the elements of z represent the
to show that a first-order approximation of the variance is sensibility of the model output with respect to the para-
often sufficient, and that it is not worth to bother with a meters. A typical situation is the saturation of a tanh
higher-order approximation. Seber and Wild (1989) intro- hidden neuron, a situation which generates in the matrix z
duces a quadratic approximation of the LS estimator using a column of 1 or 1 that corresponds to the parameter
the curvature of the solution surface. This approximation between the output of the saturated hidden neuron and the
uses arrays of projected second derivatives, the intrinsic linear output neuron, and columns of zeros that correspond
and parameter-effects curvatures arrays; but their to the parameters between the network inputs and the satu-
presentation is beyond the scope of this paper. rated hidden neuron. 8 In practice, we propose to perform a
singular value factorization of z, and to compute its condi-
tion number, that is the ratio of its largest to its smallest
4. Confidence intervals for neural networks
singular value, see, e.g. Golub & Van Loan, 1983. The
In the previous sections, the model used for the construc- matrix z can be considered as very ill-conditioned when
tion of CIs is true. For real-world black-box modeling its condition number reaches the inverse of the computer
problems however, a family of functions that contains the precision, which is of the order of 10 16.
regression is not known a priori. The first task is thus to Further, in order to be able to compute the approximate
select the less complex family of functions which contains CI (32) which involve (z Tz) 1, the cross-product Jacobian
a function approximating the regression to a certain degree matrix z Tz must also be well conditioned. As the condition
of accuracy in the input domain delimited by the data set. In number of z Tz is the square of the condition number of z, the
practice, several families of increasing complexity (for networks whose matrix z has a condition number much
example neural networks with one layer of an increasing larger than 10 8 cannot be approved.
number nh of hidden units, and a linear output neuron) are There are other studies of the ill-conditioning of neural
considered, and the data set is used both to estimate their networks, but they deal with their training rather than with
parameters, and to perform the selection between the candi- their approval, like in the work by Zhou and Si (1998) where
dates. In order to retain the less complex family containing a an algorithm avoiding the Jacobian rank deficiency is
good approximation of the regression, it is of interest to 8
Such a situation might also correspond to a relative minimum; to check
perform the selection only between neural candidates the conditioning of z helps thus also to discard neural networks trapped in
which are not unnecessarily large, and which are (that is relative minima, and leads to retrain the neural candidate starting from
their matrix z is) sufficiently well-conditioned to allow the different initial weights.
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 473

a)

4
selected network

2
2
0
0 1 2 3 4 5 6 7 8 nh

b)

selected network
4

2
2
0
0 1 2 3 4 5 6 7 8 nh

Fig. 8. Schematic evolution of the MSTE (crosses) and MSPE (circles) as a function of the number of hidden neurons of the neural network candidates, the
network with the smallest MSPE being selected: (a) large data set: the ratio MSPE/MSTE of the selected network (six hidden neurons) is roughly equal to 1,
hint that the data set size N is large; and (b) small data set: the ratio MSPE/MSTE of the selected network (three hidden neurons) is roughly equal to 2, hint that
the data set size N is small.

presented, or by Saarinen, Bramley and Cybenko (1993) without performing these N time-consuming trainings of
where the Hessian rank deficiency is studied during the each candidate network, as proposed in Monari (1999)
training. In our view, rank deficiency is not very relevant and Monari and Dreyfus (submitted)).
during the training because with a Levenberg algorithm, the In the case of a linear model, it is well known (Efron &
matrix to be inverted is made well conditioned by the Tibshirani, 1993) that the kth LOO error e k can be directly
addition of a scalar matrix l lq, l 0; to the cross-product derived from the corresponding residual r k:
Jacobian.
rk
ek k 1 to N 36
4.2. Approximate leave-one-out scores for model selection 1 px kk
where, we recall, px denotes the orthogonal projection
The selection among the networks which have been matrix on the range of x. Expression (36) holds irrespective
approved can be performed with statistical tests (Rivals & of whether or not the assumed model is true.
Personnaz, 1998; Urbani, Roussel-Ragot, Personnaz & In the case of a nonlinear model, we show (see Appendix
Dreyfus, 1994). Another approach, cross validation, consists B) that a useful approximation of the kth LOO error can be
in partitioning the data set in training and test sets, and in obtained using the LTE of the model output at u LS:
selecting the smallest network leading to the smallest mean
square error on the test sets. 9 One of the drawbacks of cross rk
ek k 1 to N 37
validation is to require a successful training of the candidate 1 pz kk
models on many test sets, that is N successful trainings in the
case of LOO cross validation. Let us denote by e k the error where pz denotes the orthogonal projection matrix on the
obtained on the left out example k with the model trained on range of z. The approximation (37) is thus similar to Eq.
the N 1 remaining examples (kth test set). In this section, (36). 10 Like in the linear case, expression (37) holds
we derive an approximate expression of e k, expression independently on the assumed model being true or not.
which allows an economic estimation of the LOO score Hence the LOO score:
1 XN
9
Note that statistical tests may advantageously be used complementarily LOO score ek 2 38
N k1
to cross validation in order to take a decision (Rivals and Personnaz, 1999);
these tests can also be established by applying LS theory to the LTE of
10
nonlinear models (Bates and Watts, 1988), but this exceeds the scope of this An expression similar to (37) is proposed in Hansen and Larsen (1993),
paper. but unfortunately, it is not valid even in the linear case.
474 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

a)
1.5

0.5

-0.5
-5 0 5
b)
0.01

0.005

0
-5 0 5

Fig. 9. CI for process #4, a simulated nonlinear SISO process, in the case of a data set of size N 200 (the selected model is a four hidden neurons network
with q 13 parameters): (a) regression (thin line), data set (small points), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and
LTE (thick line) estimates of the variance of f(x,Q LS).

This LOO score can be used as an estimate of the mean f(x,u LS) thus underfits. The approximate CIs are thus
square performance error, and we thus denote it as MSPE, questionable, and additional data should be gathered.
as opposed to 2=NJuLS ; the mean square training error
(MSTE). The interested reader will find in the work of A good indicator of whether the data set size N is large
Monari (1999) and Monari and Dreyfus (submitted) a enough or not is the ratio MSPE/MSTE of the selected
systematic model selection procedure based on both the candidate: if its value is close to 1, then N is probably
approximate LOO score and the distribution of the values large enough, whereas a large value is the symptom of a
of the {[pz]kk}. Nevertheless, another performance measure too small data set size, as shown in Fig. 8 (and as illustrated
could be chosen as well (a 10-fold cross validation score, a numerically in the following examples).
mean square error on an independent set, etc.).
4.4. Example of a simulated nonlinear SISO process
4.3. Accuracy of the approximate confidence intervals (process #4)
The quality of the selected model f(x, u LS), and thus of the This first example is based on a simulated process. Like in
associated approximate CI, depends essentially on the size the previous sections, a reference estimate of the variance of
N of the available data set with respect to the complexity of the output of a neural network is made, using M 1000
the unknown regression function and to the noise variance other sets; to ensure that an absolute minimum is reached
s 2. on each of the M sets, 530 trainings (depending on the
network size) with the Levenberg algorithm for different
1. N is large: it is likely that the selected family initializations of the weights are performed, and the weights
{f x; u; u R q } contains the regression E(Ypx), i.e. giving the smallest value of the cost function (18) are kept.
that the LS estimator is asymptotically unbiased, that We consider the SISO process simulated with:
the model f(x,u LS) is a good approximation of E(Ypx)
in the domain delimited by the dataset, and that the ykp sinc 2xk 5 wk k 1 to N 39
curvature is small. In this case, reliable CIs can be
where sinc denotes the cardinal sine function; we take s 2
computed with Eq. (32).
102 :
2. N is small: it is likely that the selected family
First, a data set of N 200 inputoutput pairs is computed,
{f x; u; u Rq } is too small 11 to contain E(Ypx), i.e.
with input values uniformly distributed in [5; 5]. As a
that the LS estimator is biased, and that the model
family of nonlinear functions (a neural network with a
11
It will generally not be too large since the approval procedure proposed
given architecture) containing the regression is not known
in Section 4.1 prevents from selecting a neural network with useless a priori, neural networks with a linear output neuron and a
parameters. layer of nh tanh hidden neurons are trained. The numerical
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 475

a)
1.5

0.5

-0.5
-5 0 5
b)
0.03

0.02

0.01

0
-5 0 5

Fig. 10. CI for process #4, a simulated nonlinear SISO process, in the case of a data set of size N 30 (the selected model is a two hidden neurons network with
q 7 parameters): (a) regression (thin line), data set (circles), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and LTE (thick
line) estimates of the variance of f(x,Q LS).

results are summarized in Table 1. We list the number of computed, the numerical results being summarized in
parameters q, the MSTE (i.e. the smallest MSTE obtained Table 2. The data set being much smaller, the candidates
with the network for its different random weight initializa- cannot be approved as soon as nh 4 : for nh 5;
tions), the condition number of z, and, if the latter is not too condz 1015 : The optimal number of neurons
large, the MSPE (corresponding approximate LOO score nopt
h 30 2 is selected on the basis of the MSPE. The
computed with Eqs. (37) and (38)) and the ratio MSPE/ ratio MSPE/MSTE of the selected network equals 2.1,
MSTE. The candidates with more than six hidden neurons symptom that N is relatively small, and that the selected
cannot be approved, because condz 108 : for nh 7; family of networks is likely not to contain the regression
condz 1011 : The optimal number of neurons (case 2 of Section 4.3). The results obtained for the
nopt
h 200 4 is selected on the basis of the MSPE. selected neural network are shown in Fig. 10. The family
The fact that the corresponding ratio MSPE/MSTE is of functions implemented by a network with two hidden
close to 1 is the symptom that N is large enough, so units is obviously too small to contain a good approxima-
that the selected family of networks contains a good tion of the regression, and though the estimate of the
approximation of the regression, and that the curvature output variance is good (it is close to the reference
is small (case 1 of Section 4.3). The results obtained for variance estimate), since the output of the neural network
the selected neural network are shown in Fig. 9. The differs from the regression, the CI is less accurate than in
model output is close to the regression, the LTE the case where N 200: Note that in the input domain [0,
variance estimate (27) is close to the reference variance 5] where the model underfits, the variance remains
estimate (33), and the CI is thus accurate. constant and low. This is due to the fact that, in this
Second, a data set of N 30 inputoutput pairs is domain, the model output is insensitive to most para-
meters of the network (this is usually the case when,
Table 1
Results obtained on the modeling of the simulated SISO process #4 using
neural networks, in the case N 200 Table 2
Results obtained on the modeling of the simulated SISO process #4 using
nh q MSTE Cond(z) MSPE MSPE=MSTE neural networks, in the case N 30

1 4 1:4 102 10 1:4 102 1.0 nh q MSTE Cond(z) MSPE MSPE=MSTE


2 7 1:2 102 10 3 1:3 102 1.1
3 10 9:7 103 10 6 1:1 102 1.1 1 4 2:4 102 10 1 2:7 102 1.1
4 13 8:5 103 10 2 9:8 103 1.1 2 7 1.1 10 2 10 6 2:3 102 2.1
5 16 8:4 103 10 6 9:9 103 1.2 3 10 8:1 103 10 3 2:4 102 3.0
6 19 8:2 103 10 7 1:0 102 1.2 4 13 7:1 103 10 4 4:3 101 6:1 103
7 22 7.9 10 3 10 11 5 16 5.0 10 3 10 15
476 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

10

x2
80
160
60 140 x1
120
40 100
80

Fig. 11. Industrial modeling problem (the selected model is a two hidden neurons network with q 9 parameters): model output, and the N 69 examples of
the data set (circles).

a)

10 20 30 40 50 60

b)
0.5

-0.5
10 20 30 40 50 60

c)
0.25

0.2

0.15
s
0.1

0.05
0 10 20 30 40 50 60 70

Fig. 12. Industrial modeling problem (a) the N 69 outputs of the data set presented in increasing order of their values; (b) the corresponding residuals
(circles) and approximate LOO errors (crosses); and (c) half width of the 95% approximate Cl at the N 69 examples of the data set, and LS estimate s of the
noise standard-deviation s (dotted line).
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 477

Table 3
x2 Results obtained on the modeling of the industrial process using neural
networks
80 nh q MSTE Cond(z) MSPE MSPE=MSTE

1 5 5:2 102 10 4 6:6 102 1.3


70 2 9 1:6 102 10 5 2:1 102 1.3
3 13 1:5 102 10 4 1:7 101 1.1 10 1
4 17 1.2 10 2 10 12
60

50 dates with more than three hidden neurons cannot be


approved: for nh 4; condz 1012 : The optimal number
of neurons nopth 69 2 is selected on the basis of the
40 MSPE. The fact that the corresponding ratio MSPE/MSTE
equals 1.3 indicates that N is large enough, so that the
selected family of networks contains probably a good
30 approximation of the regression, and that the curvature is
small (case 1 of Section 4.3). The function implemented by
the selected network is shown in Fig. 11.
80 100 120 140 160 x1
The N 69 output values of the training set are presented
Fig. 13. Industrial modeling problem: isocontours of the LTE estimate of in the increasing order in Fig. 12a, and the corresponding
the standard deviation of f(x, Q LS), and the N 69 inputs of the data set residuals and approximate LOO errors in Fig. 12b: both
(circles). appear quite uncorrelated and homoscedastic. A CI with a
level of significance of 95% is then computed with Eq. (32);
the half width of the 95% CI on the N 69 examples of the
like here, the output of a network does not vary 12): the data set is shown in Fig. 12c. In order to check the confi-
elements of the zs in this domain are thus constant and dence, which can be attached to the model, the variance of
small, hence a small and constant variance at the corre- its output must be examined in the whole input domain of
sponding xs. interest. Fig. 13 shows the isocontours of the LTE estimate
p
of the standard deviation of the model output s zT zT z1 z
4.5. Industrial modeling problem in the input domain defined by the training set. The compu-
tation of the LTE variance estimate thus allows not only to
We apply here the presented methodology (LS parameter construct a CI at any input of interest, but also to diagnose
estimation, model approval, model selection, CI construc- that, at the top right corner of the input domain, the model
tion) to an industrial example first tackled in (Rivals & standard deviation is larger than that of the noise itself (the
Personnaz, 1998), that is the modeling of a mechanical highest isocontour value equals that of the estimate of the
property of a complex material from three structural noise standard deviation s 1:39 10 1). Little confidence
descriptors. We have been provided with a data set of N can thus be attached to the model output in this input
69 examples; the inputs and outputs are normalized for the domain, where more data should be gathered. On the
LS estimations. Thanks to repetitions in the data, and contrary, there is a large region on the left of the diagram
assuming homoscedasticity, the mean square pure error where there are very few training examples, but where the
(Draper & Smith, 1998) gives a good estimate of the noise LTE estimate of the standard deviation is surprisingly rather
variance: s c2 3:38 10 2. Using this reliable estimate,
small; like for the modeling of process #4 in Section 4.4, this
statistical tests establish the significance of two inputs. An is due to the fact that the model output is less sensitive to
affine model with these two inputs gives the estimate most parameters of the network in this region (the model
s2 2:38 10 1 of the variance, hence the necessity of output varies very little, see Fig. 11).
nonlinear modeling.
Neural networks with a linear output neuron and a layer
of nh tanh hidden neurons are trained. The numerical
5. Comparisons
results are summarized in Table 3. It shows that the candi-
In this section, we discuss the advantages of the LS LTE
12
approach to the construction of confidence intervals for
The output of a neural network with one layer of tanh hidden units
remains constant in a given domain of its inputs when the tanh activation
neural networks with respect to other analytic approaches
functions of all hidden units are saturated in this domain: the output of the and to the bootstrap methods, and compare them on
network is thus insensitive to all the parameters of the hidden units. simulated examples.
478 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

a)
1.5

0.5

-0.5

-1

-1.5
-3 -2 -1 0 1 2 3

x 10 -3 b)
2.5
Sandwich estimate
Hessian estimate
2 LTE estimate

Reference estimate
1.5
LTE approximation

0.5

0
-3 -2 -1 0 1 2 3

Fig. 14. Comparison of different estimates of the variance of a nonlinear model output for process #5, a simulated neural SISO process (the assumed model, a
single nonlinear neuron with q 2 parameters, is true): (a) regression with a gentle slope (thin line), the N 30 examples of the data set (crosses), model
output (thick line); and (b) LTE approximation and estimates of the variance of f(x, Q LS).

5.1. Comparison to other analytic approaches approaches are conceptually very different: the Bayesian
approach treats the unknown parameters as random vari-
5.1.1. Maximum likelihood approach ables, whereas they are considered as certain in the frequen-
In the case of gaussian homoscedastic data, likelihood tist approach. Nevertheless, as outlined by Bishop (1995),
theory leads to the same approximate variance (23), but MacKay (1992a,b), the Bayesian approach leads to a poster-
two different estimators of it are commonly encountered ior distribution of the parameters with a covariance matrix
(see Appendix A.2): whose expression is very similar to that of the covariance
c2 zT zT z1 z matrix of the least-squares estimator of the parameters, and
dQ LS LTE s
var fx; thus to CIs which are similar to those presented in this paper.
i.e. the same estimate as Eq. (27), and also: We thus make a brief comparison between the CIs these two
approaches lead to.
var fx; c2 zT hu 1 z
dQ LS Hessian s 40 The most important difference is that the estimator which
LS
is considered here is the one whose estimate minimizes the
which necessitates the computation of the Hessian. Efficient cost function (18), whereas in the Bayesian approach, a
methods for computing the Hessian are presented by cost-function with an additional weight-decay regulariza-
Buntine and Weigend (1994). tion term is minimized; the presence of this weight-decay
term stems from the assumption of a gaussian prior for the
5.1.2. Bayesian approach parameters.
The Bayesian approach is an alternative approach to the Nevertheless, the least squares cost function (18) can be
sampling theory (or the frequentist approach) for modeling seen as the limit where the regularization term is zero, which
problems, and also leads to the design of CIs. These two corresponds to an uninformative prior for the parameters. In
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 479

a)
1.5

0.5

-0.5

-1

-1.5
-3 -2 -1 0 1 2 3

b)
0.025

0.02

0.015

Sandwich estimate
0.01 Reference estimate Hessian estimate
LTE approximation LTE estimate

0.005

0
-3 -2 -1 0 1 2 3

Fig. 15. Comparison of different estimates of the variance of a nonlinear model output for process #6, a simulated neural SISO process (the assumed model, a
single nonlinear neuron with q 2 parameters, is true): (a) regression with a steep slope (thin line), the N 30 examples of the data set (crosses), model
output (thick line); and (b) LTE approximation and estimates of the variance of f(x, Q LS).

this case (that is Eq. (18) is minimized as in this paper), most probable value of the parameter, that is here u LS. A
there is another small difference in the Bayesian approach as LTE of the estimator output leads then to the following
outlined by Bishop (1995) and MacKay (1992a,b). Under estimate of its variance at input x:
hypotheses which we cannot detail here, the Bayesian
approach leads to a posterior parameter distribution with var fx; c2 zT hu 1 z
dQ LS Hessian s LS
the approximate covariance matrix s 2 huLS 1 ; h(u LS)
being the Hessian of the cost function evaluated at the i.e. it also leads to estimate (40).

0.015

0.01

0.005

0
-3 -2 -1 0 1 2 3

Fig. 16. Comparison of the LS LTE and bootstrap pairs approach estimates of the variance for process #2. reference (thin line), LTE (thick line), and bootstrap
pairs (dotted line) estimates of the variance of f(x, Q LS).
480 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

5.1.3. Sandwich estimator 5.2. Comparison to bootstrap approaches


The sandwich estimate of the variance of a nonlinear
model output can be derived in various frameworks (a possi- The bootstrap works by creating many pseudo replicates
ble derivation in the frequentist approach is given in Appen- of the data set, the bootstrap sets, and reestimating the LS
dix A.3): solution (retraining the neural network) on each bootstrap
set; the variance of the neural model output, and the asso-
var fx; c2 zT hu 1 zT zhu 1 z 41
d uLS sandwich s ciated CI, are then computed over the trained networks,
LS LS
typically a hundred (Efron & Tibshirani, 1993). In the
The sandwich estimator is known to be robust to model bootstrap pairs approach for example, a bootstrap set is
incorrectness, i.e. the considered family of functions is too created by sampling with replacement from the data set
small (see, e.g. Efron & Tibshirani, 1993; Ripley, 1995). (Efron & Tibshirani, 1993). The first advantage of the LS
LTE approach is to require only one successful training of
5.1.4. Numerical comparison (processes #5 and #6) the network on the data set to compute the LTE estimate of
Here, we perform a numerical comparison of the three the variance of its output, whereas the bootstrap methods
variance estimates considered above on a very simple require a hundred successful trainings of the network on the
example. We consider a SISO process simulated by a single different bootstrap sets.
tanh neuron: Studies on bootstrap where only one training with a
random initialization of the weights was performed for
ykp tanh up1 up2 xk wk k 1 to N 42 each bootstrap set show a pathological overestimation of
the variance. This can be seen in Tibshirani (1996), exam-
with s 2 0:01; N 30: For this comparison, the noise ples 2 and 3; but the overestimation of the bootstrap is not
variance s 2 is estimated with s 2 in the three (LTE, Hessian, detected in this work because the reference estimate is also
and sandwich) output variance estimates. overestimated for the same reasons (one single training per
We first simulate a process with u p1 0; up2 1 set). As pointed out by Refenes, Zapranis and Utans (1997),
(process #5). The corresponding results are shown in Fig. a way to reduce this overestimation is to start each training
14. The variance reference estimate is computed on M on a bootstrap set with the weights giving the smallest value
10 000 data sets. The LTE approximation (23) of the of the cost function (18) (that is on the original data set); but
variance is almost perfect. The LTE (27), Hessian (40), even so, the bootstrap method becomes untractable for large
and sandwich (41) estimates are comparable: the parameter networks, and/or for multi input processes.
estimates being accurate uLS1 3:63 102 ; uLS2 0:996; The claim that bootstrap methods are especially efficient for
the fact that they are overestimated is almost only due to the problems with small data sets (see, e.g. Heskes, 1997) may be
noise variance estimate s2 1:32 10 2. Nevertheless, the subject to criticism. As an illustration, the variance was esti-
shape of the LTE estimate is closer to the reference estimate mated for process #2 using the bootstrap pairs approach on
than that of the two others. 300 bootstrap sets, the network weights being initialized twice
We then simulate a process with: up1 0; up2 5 for each training, once with the true ones, and once with those
(process #6). The corresponding results are shown in Fig. obtained by training the network on the whole data set. As
15. The function being steeper, the curvature is larger, and shown in Fig. 16, though the size of the data set is not very
the LTE approximation (23) of the variance is a little less small N 50; the bootstrap variance estimate is far away
accurate. The three estimates are still very similar but, here, from the reference estimate. Increasing the number of boot-
their overestimation is due not only to the noise variance strap sets up to 1000 did not improve the variance estimate.
estimate s2 1:25 102 ; but also to the bias of the para- In fact, the bootstrap is especially suited to the estimation
meter estimates uLS1 3:79 102 ; uLS2 6:58: of the variance of estimators defined by a formula, like for
The computational cost of the LTE estimate being lower example an estimator of a correlation coefficient (Efron &
(is does not necessitate the computation of the Hessian Tibshirani, 1993): for each bootstrap set, an estimate is
matrix), there is no reason to prefer one of the two other computed using the formula, and the estimate of the
estimates. As a matter of fact, since the Hessian depends on variance is easily obtained. However the bootstrap is defi-
the data set, it is the realization of a random matrix. Thus, in nitely not the best method if each estimation associated to a
the maximum likelihood as well as in the Bayesian bootstrap set involves an iterative algorithm like the training
approach, it is often recommended to take the expectation of a neural network, which is the case for the construction of
of the Hessian, and to evaluate it at the available u LS, i.e. to a CI with a neural model. However, if the data set is large
replace it by the cross-product Jacobian z Tz (Seber & Wild, enough, and if the training time is considered unimportant,
1989): estimates (40) and (41) then reduce to estimate (27). the bootstrap pairs approach is a solution in the case of
As mentioned above, the sandwich variance estimator is heteroscedasticity (that is K(W) is not scalar anymore),
known to be robust to model incorrectness, a property whereas the LS LTE approach, as well as the bootstrap
which is not tested with this simple setting, but this is residuals approach (Efron & Tibshirani, 1993), are no
beyond the scope of this paper. longer valid.
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 481

6. Conclusion This leads to

We have given a thorough analysis of the LS LTE Ju 1


2 yp f x; up ju u p T
approach to the construction of CIs for a nonlinear regres-
 yp f x; up ju u p
sion using neural network models, and put emphasis on its
enlightening geometric interpretation. We have stressed the 1
yp f x; up ju p T yp f x; up ju p
2
underlying assumptions, in particular the fact that the
approval and selection procedures must have led to a parsi- uT j T yp f x; up ju p 1
uT j T ju
2
monious, well-conditioned model containing a good
approximation of the regression. Our whole methodology An approximate expression of the gradient of the cost-
(LS parameter estimation, model approval, model selection, function follows
CI construction) has been illustrated on representative 2J
examples, bringing into play simulated processes and an j T yp f x; up ju p j T ju A4
2u
industrial one.
We have also shown that, as opposed to the computation- Hence an approximate expression of the least-squares
ally intensive bootstrap methods, the LS LTE approach to estimate of the parameters
the estimation of CIs is both accurate and economical in u LS up j T j 1 j T yp f x; up A5
terms of computer power, and that it leads to CIs which
are comparable to those obtained by other analytic And hence the corresponding approximation of the least-
approaches under similar assumptions, at a lower computa- squares estimator (i.e. the random vector Q LS) of the para-
tional cost. meters (expression (21) in the main text)
A rigorous assessment of the accuracy of the results
Q LS up j T j 1 j T Y p f x; up up j T j 1 j T W
obtained with the LS LTE approach, as well as with any
statistical approach dealing with nonlinear models and A6
assuming the local planarity of the solution surface, remains Using the linear Taylor expansion (A1), we obtain an
an open problem: it could be enlightened by a specific study approximation of the variance of the LS estimator of the
of the curvature of the solution surface of neural networks. regression for any input x (expression (23) in the main text):
var f x; Q LS s 2 j T j T j 1 j A7
Acknowledgements The derivatives involved in j and j being performed at
the unknown u u p ; they may be estimated by the deriva-
We thank an anonymous reviewer whose constructive tives at u uLS ; that is by replacing j by z and j by z.
comments contributed to improve the quality of this Hence the LTE variance estimate presented in the paper:
paper. We also thank Howard Gutowitz whose advice we
dQ LS LTE s2 zT zT z1 z rT r T T 1
greatly appreciated. var fx; z z z z A8
Nq

Appendix A. Estimates of a nonlinear model output


A.2. LTE variance estimates in maximum likelihood theory
variance
For comparison, we sum up the results obtained with
In order to make this paper self-contained, we provide maximum likelihood theory (see, e.g. Efron & Tibshirani,
derivations of the different variance estimates. 1993; Tibshirani, 1996). We make the same assumptions as
for sampling theory, i.e. that the nonlinear assumed model is
A.1. LTE variance estimate in sampling theory
true and that KW s 2 IN (homoscedasticity), and we
The well-known approximation (Seber & Wild, 1989) we consider a gaussian distributed noise. In this case, the log
use in this paper is based on a single expansion, the LTE of likelihood function is:
the nonlinear model output for an input x at the true para- 1
meter value u p: Lu yp f x; uT yp f x; u cte A9
2s 2
f x; u f x; up j T u up A1 The parameters that maximize Eq. (A9) are those that mini-
mize Eq. (A3), i.e. uML uLS :
This expansion leads, for the data set, to
It can be shown (Seber & Wild, 1989) that the covariance
f x; u f x; up ju u p A2 matrix of Q ML is given asymptotically by the inverse of the
Fisher information matrix evaluated at u p. The Fisher infor-
We now use Eq. (A2) in the expression of the cost-function
mation matrix being the mathematical expectation of the
Ju 1
2 yp f x; uT yp f x; u A3 random matrix Mu of the second derivatives of the log
482 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

likelihood function, we have We can thus estimate the variance with

var fx; c2 zT zT z1 z
dQ LS LTE s A16
22 L
Mu ij
2u i 2uj In likelihood theory, the variance of the noise is estimated
! with
1 XN
2f xk ; u 2f xk ; u 22 f xk ; u
Ypk f xk ; u RT R Nq 2
s k1
2 2ui 2uj 2ui 2uj s s2 ;
N N
A10
but we will skip over this minor difference; Eq. (A16) is thus
The assumed model being true, i.e. f x ; up
EYpk k identical to Eq. (A8).
EW k 0; the Fisher information matrix evaluated at u p It is also proposed to estimate the Fisher information
is given by: matrix E(M(u p)) with the observed information matrix
m(u LS); this leads to estimate the variance with

22 L c2 zT hu 1 z
dQ LS Hessian s
var fx; A17
EMup ij LS
2ui 2uj uup
In contrary to estimate (A16), estimate (A17) necessitates
" #
2f x ; u 2f xk ; u
X
N k the computation of the Hessian.
1

s2 k1
2ui uup 2uj uup A.3. Sandwich variance estimate
A11
Let us propose a derivation of this estimate in the
sampling theory. A second expansion is needed, the LTE
1 T of the gradient at the true parameter value u p:
EMup j j
s2 2J 2J 22 J
u up
Thus, the covariance matrix of Q ML Q LS is approxi- 2u uuLS 2u uup 2u2uT uu LS p
mately given by:
2J
hup uLS up A18
KQ ML EMup 1 s 2 j T j 1 A12 2u uup
where h(u p) is the value of the random Hessian matrix (see
Using the linear Taylor expansion (A1), the maximum like- Section A.2) evaluated at u p. Hence an approximate expres-
lihood approximation of the variance of the output in the sion of the LS estimate of the parameters
gaussian case is obtained
2J
uLS up hup 1 A19
2u uup
var f x; Q LS s 2 j T j T j 1 j A13
In Eq. (A19), we can replace the gradient by its expression

Hence, the maximum likelihood approximate variance 2J
j T yp f x; up j T W A20
(A13) is identical to the sampling theory approximate 2u uup
variance (A7).
Hence the corresponding approximation of the least-squares
estimator (random vector Q LS) of the parameters
Remark. The Hessian matrix h is the value of the random
matrix H with elements: Q LS up Hup 1 j T W A21

22 J Using the LTE of the model output (A1), we obtain:


Hu ij
2ui 2uj
f x; Q LS f x; up j T Hup 1 j T W A22
!
X
N
2f x ; u 2f x ; u
k k
2 f x ; u
2 k
Neglecting the random character of H (H being replaced by
Ypk f xk ; u
k1
2ui 2uj 2ui 2uj h), the output variance can be approximated by
A14 var f x; Q LS s 2 j T hup 1 j T j hu p 1 j A23

Thus This leads to propose the sandwich estimate

1 1 dQ LS sandwich s2 zT huLS 1 zT z huLS 1 z


var fx;
EMup EHup 2 j T j A15
s2 s A24
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 483

This estimate also necessitates the computation of the which is similar to the expression of the linear LOO error
Hessian of the cost-function. (36)

rk
Appendix B. Derivation of an approximate LOO error ek k 1 to N B8
1 pz kk

The following derivation is inspired from the work of In practice, the diagonal terms of pz are computed using
Antoniadis et al. (1992) and is valid irrespective of whether the singular value factorization of z uSv T ; where u is an
or not the assumed model is true. We denote by uk LS the LS
orthogonal (N,N) matrix, S is a diagonal (N,q) matrix, and v
estimate on the kth LOO set {xi ; yip }i1to N;ik : We have the is an orthogonal (q,q) matrix (see, Golub & Van Loan,
kth residual r k and the kth LOO error e k: 1983). Then:
8
< rk ykp f xk ; uLS X
q

B1 p z kk u2ki k 1 to N B9
: ek yk f xk ; uk i1
p LS

Let us denote by yk The diagonal elements of pz that differ from 1 by a threshold


p the (N 1)-vector obtained by deletion
of the kth component of the measured output vector yp ; by consistent with the computer precision are considered as
z (k) the N 1; q matrix obtained by deletion of the kth row equal to 1 (theoretically, the values of the [pz]kk are
of z, by x (k) the N 1; q matrix obtained by deletion of the comprised between 1/N and 1).
kth row of x. The LOO estimate uk LS minimizes the cost-
function References
k
J u 1
2 yk
p
k
f x ; u T
yk
p
k
f x ; u B2
Antoniadis, A., Berruyer, J., & Carmona, R. (1992). Regression non
We first approximate f xk ; u by its LTE at u LS: lineaire et applications. Paris: Economica.
Bates, D. M., & Watts, D. G. (1988). Nonlinear regression analysis and its
f xk ; u f xk ; uLS zk u uLS B3 applications. New York: Wiley.
Bishop, M. (1995). Neural networks for pattern recognition. Oxford: Clar-
Hence the approximation of uk
LS endon Press.
 T 1 T Buntine, W., & Weigend, A. (1994). Computing second derivatives in
uk
LS uLS zk zk zk yk
p f x
k
; uLS B4 feedforward neural networks: a review. IEEE Transactions on Neural
Networks, 5 (3), 480488.
Draper, N. R., & Smith, H. (1998). Applied regression analysis. New York:
In the previous expression, we have
Wiley.
zk yk k
T Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New
p f x ; uLS z yp f x; uLS z r
T k k
York: Chapman.
Golub, G. H., & Van Loan, C. F. (1983). Matrix computations. Baltimore:
B5 John Hopkins University Press.
zT r zk rk zk rk Goodwin, G. C., & Payne, R. L. (1977). Dynamic system identification;
experiment design and data analysis. New York: Academic Press.
Hansen, L. K., & Larsen, J. (1993). Linear unlearning for cross validation.
because the columns of z are orthogonal to the residual
Advances in Computational Mathematics, 5, 286290.
vector r. Using the matrix inversion lemma, we can express Heskes, T. (1997). Practical confidence and prediction intervals. In M.
zk zk 1 in Eq. (B4) in terms of (z T z) 1
T
Mozer, M. Jordan & T. Petsche, Advances in neural information
processing systems, vol. 9. Cambridge, MA: MIT Press, 176182.
 1 zT z1 zk zk zT z1
T
MacKay, D. J. C. (1992a). Bayesian interpolation. Neural Computation, 4,
zk zk zT z1
T

415447.
1 zk zT z1 zk
T

MacKay, D. J. C. (1992b). A practical Bayesian framework for backprop


B6 networks. Neural Computation, 4, 448472.
zT z1 zk zk zT z1
T
Monari, G. (1999). Selection de modeles non lineaires par leave-one-out;
1
z z T
etude theorique et application des reseaux de neurones au procede de
1 pz kk soudage par points. These de Doctorat de lUniversite Paris 6.
Monari, G., & Dreyfus, G. Local linear least squares: performing leave-
where pz denotes the orthogonal projection matrix on the one-out without leaving anything out. Submitted for publication.
range of z. Paass, G. (1993). Assessing and improving neural network predictions by
Replacing Eqs. (B5) and (B6) into Eq. (B4), we finally the bootstrap algorithm. In S. J. Hanson, J. D. Cowan & C. L. Giles,
Advances in neural information processing systems (pp. 186203), 5.
obtain Cambridge, MA: MIT Press.
Refenes, A.-P. N., Zapranis, A. D., & Utans, J. (1997). Neural model
rk
uk T 1 k
LS uLS z z z B7 identification, variable selection and model adequacy. In A. S.
1 pz kk Weigend, Y. Abu-Mostafa & A.-P. N. Refenes, Decision technologies
for financial engineering (pp. 243261). Singapore: World Scientific.
Expanding e k at u LS and replacing Eq. (B7) into this expan- Ripley, B. D. (1995). Pattern recognition and neural networks. Cambridge:
sion, we obtain an approximate expression of the LOO error Cambridge University Press.
484 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484

Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in Sussman, H. J. (1992). Uniqueness of the weights for minimal feedforward
neural modeling using a linear Taylor expansion. Proceedings of the nets with a given inputoutput map. Neural Networks, 5, 589593.
International Workshop on Advanced Black-Box Techniques for Tibshirani, R. J. (1996). A comparison of some error estimates for neural
Nonlinear Modeling (pp. 1722), 810 July, Leuwen. models. Neural Computation, 8, 152163.
Rivals, I., & Personnaz, L. (1999). On cross-validation for model selection. Urbani, D., Roussel-Ragot, P., Personnaz, L., & Dreyfus, G. (1994). The
Neural Computation, 11, 863870. selection of neural models of non-linear dynamical systems by statis-
Saarinen, S., Bramley, R., & Cybenko, G. (1993). Ill-conditioning in neural tical tests. Neural Networks for Signal Processing, Proceedings of the
network training problems. SIAM Journal on Scientific and Statistical 1994 IEEE Workshop.
Computing, 14, 693714. Zhou, G., & Si, J. (1998). A systematic and effective supervised learning
Seber, G. A. F. (1977). Linear regression analysis. New York: Wiley. mechanism based on Jacobian rank deficiency. Neural Computation,
Seber, G. A. F., & Wild, C. (1989). Nonlinear regression. New York: 10, 10311045.
Wiley.

You might also like