Confidence Intervals for Neural Networks
Confidence Intervals for Neural Networks
Networks
Abstract
We present the theoretical results about the construction of confidence intervals for a nonlinear regression based on least squares
estimation and using the linear Taylor expansion of the nonlinear model output. We stress the assumptions on which these results are
based, in order to derive an appropriate methodology for neural black-box modeling; the latter is then analyzed and illustrated on simulated
and real processes. We show that the linear Taylor expansion of a nonlinear model output also gives a tool to detect the possible ill-
conditioning of neural network candidates, and to estimate their performance. Finally, we show that the least squares and linear Taylor
expansion based approach compares favorably with other analytic approaches, and that it is an efficient and economic alternative to the
nonanalytic and computationally intensive bootstrap methods. 2000 Elsevier Science Ltd. All rights reserved.
Keywords: Nonlinear regression; Neural networks; Least squares estimation; Linear Taylor expansion; Confidence intervals; Ill-conditioning detection; Model
selection; Approximate leave-one-out score
Nomenclature
We distinguish between random variables and their values (or realizations) by using upper- and lowercase letters; all
vectors are column vectors, and are denoted by boldface letters; nonrandom matrices are denoted by light lowercase letters
x Nonrandom n-input vector
Yp Yp x Random scalar output depending on x
EYp x Mathematical expectation, or regression function, of Yp given x
W Random variable with zero expectation denoting additive noise
s2 Variance of W
{x k ; ykp }k1 to N Data set of N inputoutput pairs, where the {xk } are nonrandom n-vectors, and the {ykp } are the
corresponding realizations of the random outputs {Ypk Yp xk }
{x u; u Rn } Family of linear functions of x parameterized by u
k T
a)
5
-5
-3 -2 -1 0 1 2 3
b)
0.08
0.06
0.04
0.02
0
-3 -2 -1 0 1 2 3
Fig. 2. CI for process #1, a simulated linear SISO process (true assumed model with n 2 parameters): (a) regression (thin line), data set (crosses), model
output and 99% CI (thick lines); and (b) true variance (thin line) and LS estimate of the variance (thick line) of x TQ LS.
varxd
T
Q LS s2 xT xT x1 x 9 x T Q LS EYp x
p ! N0; 1 12
where s is the value of S. s x T xT x1 x
2.2. Confidence intervals for a linear regression Thus, using the Pearson variable (11), which is independent
from Eq. (12) according to Theorem 3, we obtain the follow-
If the {W k } are homoscedastic gaussian variables, that is ing Student variable:
W ! NN 0; s 2 IN :
xT Q LS EYp x
Theorem 1. p ! Student N n 13
S xT xT x1 x
1
Q LS up ! Nn 0; s x x
2 T
10
A 100(1 a )% CI for EYp x is thus:
q
Theorem 2. a
xT uLS ^ tNn 1 s xT xT x1 x 14
2
T
R R
! x2Nn 11
s2 where tNn is the inverse of the Student(N n) cumulative
distribution.
Note that Eq. (14) allows to compute a CI corresponding
Theorem 3. Q LS is statistically independent from RT R: to any input vector, and that it is much more informative
than average values such as that the mean square error on
The proof of the above theorems follows from Fig. 1 and the data set, or the mean of the variance estimate over the
from the FisherCochrane theorem (Goodwin & Payne, data set; 4 as a matter of fact, the latter invariably equals
1977), see for instance (Seber, 1977). s2 n=N:
The goal is to build a CI for the regression value 4
The mean of the variance estimate over the training data set is:
EYp x xT up ; for any input vector x of interest. The 1 PN 2 k T T 1 k s2 PN s2
variance of the measurements s 2 being unknown, let us k1 s x x x x k1 px kk tracepx : As px is the
N N N
build a normalized centered gaussian variable where both orthogonal projection matrix on a n-dimensional subspace, tracepx n:
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 467
Fig. 3. Geometric representation of the nonlinear LS solution and of its LTE approximation (true assumed model).
We consider a simulated single input-single output 3.1. The linear Taylor expansion of the nonlinear least
(SISO) linear process: squares solution
Y p f x; up W 17 j j 1 j 2 j N T
6
As in Section 2, x denotes the (N,n) input matrix, 5 and Yp In the case of a multilayer neural network, the minimum value of the
cost function can be obtained for several values of the parameter vector;
and W are random N-vectors with EW 0: Geometri-
but, since the only function-preserving transformations are neuron
cally, this means that EY p x belongs to the solution exchanges, as well as sign flips for odd activation functions like the hyper-
bolic tangent [Sussman 1992], we will legitimately consider the neighbor-
5
In the case of a nonlinear model, x is merely a two-dimensional array. hood of one of these values only.
468 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484
and matrix j , the vector j is not available, and its value may
be approximated by:
2f x ; u
k
j
k
2f x; u
2u uu z 26
2u uuLS
p
a)
-1
-3 -2 -1 0 1 2 3
x 10 -3 b)
6
0
-3 -2 -1 0 1 2 3
Fig. 4. CI for process #2, a simulated neural SISO process (the assumed model, a two hidden neurons network with q 7 parameters, is true): (a) regression
(thin line), the N 50 examples of the data set (crosses), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and LTE (thick line)
estimates of the variance of f(x,Q LS).
set: as a matter of fact, the latter always approximately Thus, if N is large: (i) as in the linear case, the estimator of
equals s2 q=N: the noise variance S 2 is unbiased, and the difference between
From a practical point of view, the construction of a CI s 2 and s 2 is only due to the particular values of the noise;
for a neural model output at any input x of interest involves and (ii) the variance of f(x, Q LS) is small, and u LS is likely to
once and for all the computation of the matrix z, that is the be close to u p: z and z are thus likely to be good approxima-
N q partial derivatives of the model output with respect to tions of, respectively, j and j . A reliable estimate of a CI
the parameters evaluated at u LS for the data inputs may thus be obtained from the LTE variance estimate (27).
{xk }k1 to N ; and, for each x value, that of z, i.e. the deriva- On the other hand, if N is too small: (i) as opposed to the
tives at x. In the case of a neural model, these derivatives are linear case, even if the assumed model is true, the estimator
easily obtained with the backpropagation algorithm. of the noise variance S 2 is biased; and (ii) the variance of f(x,
All the previous results and the above considerations are Q LS) is large, and u LS is likely to differ from u p: z and z risk
valid provided an absolute minimum of the cost function to be poor approximations of j and j . Thus, if N is diag-
(18) is reached. In real-life, several estimations of the para- nosed as too small, one cannot have confidence in the
meters must thus be made starting from different initial confidence intervals (32), and additional data should be
values, the estimate corresponding to the lowest minimum gathered.
being kept in order to have a high probability to obtain an
absolute minimum. In the examples of this work, the algo- 3.3.2. Quantitative analysis
rithm used for the minimization of the cost function is the As detailed for example by Antoniadis, Berruyer and
Levenberg algorithm, as described for example in Bates and Carmona (1992), Bates and Watts (1988) and Seber and
Watts (1988), and several trainings are performed. The Wild (1989) different measures of the curvature can be
probability of getting trapped in a relative minimum computed, and can be used in each particular case to eval-
increasing with the number of parameters of the network uate the accuracy of the LTE. In Section 4.3 dealing with
and decreasing when the size of the data set increases, the neural network modeling, we give indications on how to
number of trainings is chosen accordingly. judge if N is large enough for the approximate CI to be
accurate.
3.3. Quality of the approximate confidence intervals In order to evaluate the accuracy of the LTE variance
estimate (27) when dealing with simulated processes, we
3.3.1. Theoretical analysis introduce an estimate of the unknown true variance of f(x,
The quality of the approximate CI depends essentially on Q LS) that is not biased by curvature effects, the reference
the curvature of the solution surface m f x; u; which variance estimate var d f x 1 Q LS ref : This estimate is
depends on the regularity of f and on the value of N. In computed on a large number M of other sets of N outputs
practice, f is often regular enough for a first-order approx- corresponding to the inputs of the training set, and whose
imation to be satisfactory, provided that N is large enough. values are obtained with different realizations (simulated
470 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484
x 10 -3 a)
0
-3 -2 -1 0 1 2 3
x 10 -4 b)
2
-1
-3 -2 -1 0 1 2 3
Fig. 6. Accuracy of the LTE approximation of the variance for process #2: (a) reference estimate of the variance of f(x,Q LS) (thin line), LTE approximation
obtained with the true values u p and s 2 (thick line); and (b) difference between the reference estimate of the variance of f(x,Q LS) and the LTE approximation
obtained with u p and s 2.
a)
0.04
0.03
0.02
0.01
2
-2
0 0
2 -2
x 10 -3 b)
-5
2
-2
0 0
2 -2
Fig. 7. Accuracy of the LTE approximation of the variance for process #3: (a) reference estimate of the variance of f(x,Q LS); and (b) difference between the
reference estimate of the variance of f(x,Q LS) and the LTE approximation obtained with u p and s 2.
472 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484
the Levenberg algorithm leads to s2 9:73 10 2. Fig. 5a computation of the approximate Cl (32). We propose to
shows the inputs of the data set and the regression; Fig. 5b discard too large models by a systematic detection of ill-
displays the LTE estimate of the variance of f(x, Q LS) (27); conditioning, and to perform the selection among the
Fig. 5c displays the difference between the reference approved, i.e. well-conditioned models using an approxi-
variance estimate (33) computed over M 10 000 sets, mate value of their LOO score whose computation does
and the LTE variance estimate (27). As s 2 is slightly smaller not require further training. Both the ill-conditioning detec-
than s 2, the variance is globally slightly underestimated. tion and the estimation of the LOO score of a neural candi-
However except in the domain around the corner (3, 3) date are based on the LTE of its output.
where the density of the inputs is lower, and where the slope
of the output surface is steep, the LTE variance estimate is 4.1. III-conditioning detection for model approval
satisfactory. A reliable estimate of the CI may thus be obtained.
A too large neural model, trained up to convergence with
a simple LS cost-function, will generally overfit. Overfitting
3.4.3. Accuracy of the LTE variance approximation is often avoided by using early stopping of the training
(processes #2 and #3) algorithm or by adding regularization terms in the cost func-
Let us now show on the example of processes #2 and #3 tion, e.g. weight decay (Bishop, 1995). Unfortunately, as
that the curvature of the solution surface is small enough for only the estimator whose value corresponds to an absolute
the LTE approximation of the variance (23) to be satisfac- minimum of the quadratic cost function (18) without weight
tory. For both processes, we have computed approximation decay terms is unbiased, both early stopping and weight
(23), using the values of j and of j (at u p) and the value of decay introduce a bias in the estimation of the regression:
s 2 used for the noise simulation. As shown in Fig. 6a and b the corresponding estimates thus lead to questionable CIs
for process #2, the LTE approximation of the variance (23) for the regression.
is very close to the reference variance estimate. As a matter To detect and discard too large networks, we propose,
of fact, the difference between them (Fig. 6b) is only due to after the training of each candidate up to a (hopefully) abso-
the curvature, which is small as N 50 is large with respect lute minimum of the cost function (18), to check the condi-
to the complexity of the regression. Expression (23) also tioning of their matrix z (see Rivals & Personnaz, 1998).
leads to satisfactory results in the case of process #3 as The fact that z be ill-conditioned is the symptom that some
shown in Fig. 7a and b N 100; two inputs). This tends parameters are useless, since the elements of z represent the
to show that a first-order approximation of the variance is sensibility of the model output with respect to the para-
often sufficient, and that it is not worth to bother with a meters. A typical situation is the saturation of a tanh
higher-order approximation. Seber and Wild (1989) intro- hidden neuron, a situation which generates in the matrix z
duces a quadratic approximation of the LS estimator using a column of 1 or 1 that corresponds to the parameter
the curvature of the solution surface. This approximation between the output of the saturated hidden neuron and the
uses arrays of projected second derivatives, the intrinsic linear output neuron, and columns of zeros that correspond
and parameter-effects curvatures arrays; but their to the parameters between the network inputs and the satu-
presentation is beyond the scope of this paper. rated hidden neuron. 8 In practice, we propose to perform a
singular value factorization of z, and to compute its condi-
tion number, that is the ratio of its largest to its smallest
4. Confidence intervals for neural networks
singular value, see, e.g. Golub & Van Loan, 1983. The
In the previous sections, the model used for the construc- matrix z can be considered as very ill-conditioned when
tion of CIs is true. For real-world black-box modeling its condition number reaches the inverse of the computer
problems however, a family of functions that contains the precision, which is of the order of 10 16.
regression is not known a priori. The first task is thus to Further, in order to be able to compute the approximate
select the less complex family of functions which contains CI (32) which involve (z Tz) 1, the cross-product Jacobian
a function approximating the regression to a certain degree matrix z Tz must also be well conditioned. As the condition
of accuracy in the input domain delimited by the data set. In number of z Tz is the square of the condition number of z, the
practice, several families of increasing complexity (for networks whose matrix z has a condition number much
example neural networks with one layer of an increasing larger than 10 8 cannot be approved.
number nh of hidden units, and a linear output neuron) are There are other studies of the ill-conditioning of neural
considered, and the data set is used both to estimate their networks, but they deal with their training rather than with
parameters, and to perform the selection between the candi- their approval, like in the work by Zhou and Si (1998) where
dates. In order to retain the less complex family containing a an algorithm avoiding the Jacobian rank deficiency is
good approximation of the regression, it is of interest to 8
Such a situation might also correspond to a relative minimum; to check
perform the selection only between neural candidates the conditioning of z helps thus also to discard neural networks trapped in
which are not unnecessarily large, and which are (that is relative minima, and leads to retrain the neural candidate starting from
their matrix z is) sufficiently well-conditioned to allow the different initial weights.
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 473
a)
4
selected network
2
2
0
0 1 2 3 4 5 6 7 8 nh
b)
selected network
4
2
2
0
0 1 2 3 4 5 6 7 8 nh
Fig. 8. Schematic evolution of the MSTE (crosses) and MSPE (circles) as a function of the number of hidden neurons of the neural network candidates, the
network with the smallest MSPE being selected: (a) large data set: the ratio MSPE/MSTE of the selected network (six hidden neurons) is roughly equal to 1,
hint that the data set size N is large; and (b) small data set: the ratio MSPE/MSTE of the selected network (three hidden neurons) is roughly equal to 2, hint that
the data set size N is small.
presented, or by Saarinen, Bramley and Cybenko (1993) without performing these N time-consuming trainings of
where the Hessian rank deficiency is studied during the each candidate network, as proposed in Monari (1999)
training. In our view, rank deficiency is not very relevant and Monari and Dreyfus (submitted)).
during the training because with a Levenberg algorithm, the In the case of a linear model, it is well known (Efron &
matrix to be inverted is made well conditioned by the Tibshirani, 1993) that the kth LOO error e k can be directly
addition of a scalar matrix l lq, l 0; to the cross-product derived from the corresponding residual r k:
Jacobian.
rk
ek k 1 to N 36
4.2. Approximate leave-one-out scores for model selection 1 px kk
where, we recall, px denotes the orthogonal projection
The selection among the networks which have been matrix on the range of x. Expression (36) holds irrespective
approved can be performed with statistical tests (Rivals & of whether or not the assumed model is true.
Personnaz, 1998; Urbani, Roussel-Ragot, Personnaz & In the case of a nonlinear model, we show (see Appendix
Dreyfus, 1994). Another approach, cross validation, consists B) that a useful approximation of the kth LOO error can be
in partitioning the data set in training and test sets, and in obtained using the LTE of the model output at u LS:
selecting the smallest network leading to the smallest mean
square error on the test sets. 9 One of the drawbacks of cross rk
ek k 1 to N 37
validation is to require a successful training of the candidate 1 pz kk
models on many test sets, that is N successful trainings in the
case of LOO cross validation. Let us denote by e k the error where pz denotes the orthogonal projection matrix on the
obtained on the left out example k with the model trained on range of z. The approximation (37) is thus similar to Eq.
the N 1 remaining examples (kth test set). In this section, (36). 10 Like in the linear case, expression (37) holds
we derive an approximate expression of e k, expression independently on the assumed model being true or not.
which allows an economic estimation of the LOO score Hence the LOO score:
1 XN
9
Note that statistical tests may advantageously be used complementarily LOO score ek 2 38
N k1
to cross validation in order to take a decision (Rivals and Personnaz, 1999);
these tests can also be established by applying LS theory to the LTE of
10
nonlinear models (Bates and Watts, 1988), but this exceeds the scope of this An expression similar to (37) is proposed in Hansen and Larsen (1993),
paper. but unfortunately, it is not valid even in the linear case.
474 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484
a)
1.5
0.5
-0.5
-5 0 5
b)
0.01
0.005
0
-5 0 5
Fig. 9. CI for process #4, a simulated nonlinear SISO process, in the case of a data set of size N 200 (the selected model is a four hidden neurons network
with q 13 parameters): (a) regression (thin line), data set (small points), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and
LTE (thick line) estimates of the variance of f(x,Q LS).
This LOO score can be used as an estimate of the mean f(x,u LS) thus underfits. The approximate CIs are thus
square performance error, and we thus denote it as MSPE, questionable, and additional data should be gathered.
as opposed to 2=NJuLS ; the mean square training error
(MSTE). The interested reader will find in the work of A good indicator of whether the data set size N is large
Monari (1999) and Monari and Dreyfus (submitted) a enough or not is the ratio MSPE/MSTE of the selected
systematic model selection procedure based on both the candidate: if its value is close to 1, then N is probably
approximate LOO score and the distribution of the values large enough, whereas a large value is the symptom of a
of the {[pz]kk}. Nevertheless, another performance measure too small data set size, as shown in Fig. 8 (and as illustrated
could be chosen as well (a 10-fold cross validation score, a numerically in the following examples).
mean square error on an independent set, etc.).
4.4. Example of a simulated nonlinear SISO process
4.3. Accuracy of the approximate confidence intervals (process #4)
The quality of the selected model f(x, u LS), and thus of the This first example is based on a simulated process. Like in
associated approximate CI, depends essentially on the size the previous sections, a reference estimate of the variance of
N of the available data set with respect to the complexity of the output of a neural network is made, using M 1000
the unknown regression function and to the noise variance other sets; to ensure that an absolute minimum is reached
s 2. on each of the M sets, 530 trainings (depending on the
network size) with the Levenberg algorithm for different
1. N is large: it is likely that the selected family initializations of the weights are performed, and the weights
{f x; u; u R q } contains the regression E(Ypx), i.e. giving the smallest value of the cost function (18) are kept.
that the LS estimator is asymptotically unbiased, that We consider the SISO process simulated with:
the model f(x,u LS) is a good approximation of E(Ypx)
in the domain delimited by the dataset, and that the ykp sinc 2xk 5 wk k 1 to N 39
curvature is small. In this case, reliable CIs can be
where sinc denotes the cardinal sine function; we take s 2
computed with Eq. (32).
102 :
2. N is small: it is likely that the selected family
First, a data set of N 200 inputoutput pairs is computed,
{f x; u; u Rq } is too small 11 to contain E(Ypx), i.e.
with input values uniformly distributed in [5; 5]. As a
that the LS estimator is biased, and that the model
family of nonlinear functions (a neural network with a
11
It will generally not be too large since the approval procedure proposed
given architecture) containing the regression is not known
in Section 4.1 prevents from selecting a neural network with useless a priori, neural networks with a linear output neuron and a
parameters. layer of nh tanh hidden neurons are trained. The numerical
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 475
a)
1.5
0.5
-0.5
-5 0 5
b)
0.03
0.02
0.01
0
-5 0 5
Fig. 10. CI for process #4, a simulated nonlinear SISO process, in the case of a data set of size N 30 (the selected model is a two hidden neurons network with
q 7 parameters): (a) regression (thin line), data set (circles), model output and 99% approximate CI (thick lines); and (b) reference (thin line) and LTE (thick
line) estimates of the variance of f(x,Q LS).
results are summarized in Table 1. We list the number of computed, the numerical results being summarized in
parameters q, the MSTE (i.e. the smallest MSTE obtained Table 2. The data set being much smaller, the candidates
with the network for its different random weight initializa- cannot be approved as soon as nh 4 : for nh 5;
tions), the condition number of z, and, if the latter is not too condz 1015 : The optimal number of neurons
large, the MSPE (corresponding approximate LOO score nopt
h 30 2 is selected on the basis of the MSPE. The
computed with Eqs. (37) and (38)) and the ratio MSPE/ ratio MSPE/MSTE of the selected network equals 2.1,
MSTE. The candidates with more than six hidden neurons symptom that N is relatively small, and that the selected
cannot be approved, because condz 108 : for nh 7; family of networks is likely not to contain the regression
condz 1011 : The optimal number of neurons (case 2 of Section 4.3). The results obtained for the
nopt
h 200 4 is selected on the basis of the MSPE. selected neural network are shown in Fig. 10. The family
The fact that the corresponding ratio MSPE/MSTE is of functions implemented by a network with two hidden
close to 1 is the symptom that N is large enough, so units is obviously too small to contain a good approxima-
that the selected family of networks contains a good tion of the regression, and though the estimate of the
approximation of the regression, and that the curvature output variance is good (it is close to the reference
is small (case 1 of Section 4.3). The results obtained for variance estimate), since the output of the neural network
the selected neural network are shown in Fig. 9. The differs from the regression, the CI is less accurate than in
model output is close to the regression, the LTE the case where N 200: Note that in the input domain [0,
variance estimate (27) is close to the reference variance 5] where the model underfits, the variance remains
estimate (33), and the CI is thus accurate. constant and low. This is due to the fact that, in this
Second, a data set of N 30 inputoutput pairs is domain, the model output is insensitive to most para-
meters of the network (this is usually the case when,
Table 1
Results obtained on the modeling of the simulated SISO process #4 using
neural networks, in the case N 200 Table 2
Results obtained on the modeling of the simulated SISO process #4 using
nh q MSTE Cond(z) MSPE MSPE=MSTE neural networks, in the case N 30
10
x2
80
160
60 140 x1
120
40 100
80
Fig. 11. Industrial modeling problem (the selected model is a two hidden neurons network with q 9 parameters): model output, and the N 69 examples of
the data set (circles).
a)
10 20 30 40 50 60
b)
0.5
-0.5
10 20 30 40 50 60
c)
0.25
0.2
0.15
s
0.1
0.05
0 10 20 30 40 50 60 70
Fig. 12. Industrial modeling problem (a) the N 69 outputs of the data set presented in increasing order of their values; (b) the corresponding residuals
(circles) and approximate LOO errors (crosses); and (c) half width of the 95% approximate Cl at the N 69 examples of the data set, and LS estimate s of the
noise standard-deviation s (dotted line).
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 477
Table 3
x2 Results obtained on the modeling of the industrial process using neural
networks
80 nh q MSTE Cond(z) MSPE MSPE=MSTE
a)
1.5
0.5
-0.5
-1
-1.5
-3 -2 -1 0 1 2 3
x 10 -3 b)
2.5
Sandwich estimate
Hessian estimate
2 LTE estimate
Reference estimate
1.5
LTE approximation
0.5
0
-3 -2 -1 0 1 2 3
Fig. 14. Comparison of different estimates of the variance of a nonlinear model output for process #5, a simulated neural SISO process (the assumed model, a
single nonlinear neuron with q 2 parameters, is true): (a) regression with a gentle slope (thin line), the N 30 examples of the data set (crosses), model
output (thick line); and (b) LTE approximation and estimates of the variance of f(x, Q LS).
5.1. Comparison to other analytic approaches approaches are conceptually very different: the Bayesian
approach treats the unknown parameters as random vari-
5.1.1. Maximum likelihood approach ables, whereas they are considered as certain in the frequen-
In the case of gaussian homoscedastic data, likelihood tist approach. Nevertheless, as outlined by Bishop (1995),
theory leads to the same approximate variance (23), but MacKay (1992a,b), the Bayesian approach leads to a poster-
two different estimators of it are commonly encountered ior distribution of the parameters with a covariance matrix
(see Appendix A.2): whose expression is very similar to that of the covariance
c2 zT zT z1 z matrix of the least-squares estimator of the parameters, and
dQ LS LTE s
var fx; thus to CIs which are similar to those presented in this paper.
i.e. the same estimate as Eq. (27), and also: We thus make a brief comparison between the CIs these two
approaches lead to.
var fx; c2 zT hu 1 z
dQ LS Hessian s 40 The most important difference is that the estimator which
LS
is considered here is the one whose estimate minimizes the
which necessitates the computation of the Hessian. Efficient cost function (18), whereas in the Bayesian approach, a
methods for computing the Hessian are presented by cost-function with an additional weight-decay regulariza-
Buntine and Weigend (1994). tion term is minimized; the presence of this weight-decay
term stems from the assumption of a gaussian prior for the
5.1.2. Bayesian approach parameters.
The Bayesian approach is an alternative approach to the Nevertheless, the least squares cost function (18) can be
sampling theory (or the frequentist approach) for modeling seen as the limit where the regularization term is zero, which
problems, and also leads to the design of CIs. These two corresponds to an uninformative prior for the parameters. In
I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484 479
a)
1.5
0.5
-0.5
-1
-1.5
-3 -2 -1 0 1 2 3
b)
0.025
0.02
0.015
Sandwich estimate
0.01 Reference estimate Hessian estimate
LTE approximation LTE estimate
0.005
0
-3 -2 -1 0 1 2 3
Fig. 15. Comparison of different estimates of the variance of a nonlinear model output for process #6, a simulated neural SISO process (the assumed model, a
single nonlinear neuron with q 2 parameters, is true): (a) regression with a steep slope (thin line), the N 30 examples of the data set (crosses), model
output (thick line); and (b) LTE approximation and estimates of the variance of f(x, Q LS).
this case (that is Eq. (18) is minimized as in this paper), most probable value of the parameter, that is here u LS. A
there is another small difference in the Bayesian approach as LTE of the estimator output leads then to the following
outlined by Bishop (1995) and MacKay (1992a,b). Under estimate of its variance at input x:
hypotheses which we cannot detail here, the Bayesian
approach leads to a posterior parameter distribution with var fx; c2 zT hu 1 z
dQ LS Hessian s LS
the approximate covariance matrix s 2 huLS 1 ; h(u LS)
being the Hessian of the cost function evaluated at the i.e. it also leads to estimate (40).
0.015
0.01
0.005
0
-3 -2 -1 0 1 2 3
Fig. 16. Comparison of the LS LTE and bootstrap pairs approach estimates of the variance for process #2. reference (thin line), LTE (thick line), and bootstrap
pairs (dotted line) estimates of the variance of f(x, Q LS).
480 I. Rivals, L. Personnaz / Neural Networks 13 (2000) 463484
var fx; c2 zT zT z1 z
dQ LS LTE s A16
22 L
Mu ij
2u i 2uj In likelihood theory, the variance of the noise is estimated
! with
1 XN
2f xk ; u 2f xk ; u 22 f xk ; u
Ypk f xk ; u RT R Nq 2
s k1
2 2ui 2uj 2ui 2uj s s2 ;
N N
A10
but we will skip over this minor difference; Eq. (A16) is thus
The assumed model being true, i.e. f x ; up
EYpk k identical to Eq. (A8).
EW k 0; the Fisher information matrix evaluated at u p It is also proposed to estimate the Fisher information
is given by: matrix E(M(u p)) with the observed information matrix
m(u LS); this leads to estimate the variance with
22 L c2 zT hu 1 z
dQ LS Hessian s
var fx; A17
EMup ij LS
2ui 2uj uup
In contrary to estimate (A16), estimate (A17) necessitates
" #
2f x ; u 2f xk ; u
X
N k the computation of the Hessian.
1
s2 k1
2ui uup 2uj uup A.3. Sandwich variance estimate
A11
Let us propose a derivation of this estimate in the
sampling theory. A second expansion is needed, the LTE
1 T of the gradient at the true parameter value u p:
EMup j j
s2 2J 2J 22 J
u up
Thus, the covariance matrix of Q ML Q LS is approxi- 2u uuLS 2u uup 2u2uT uu LS p
mately given by:
2J
hup uLS up A18
KQ ML EMup 1 s 2 j T j 1 A12 2u uup
where h(u p) is the value of the random Hessian matrix (see
Using the linear Taylor expansion (A1), the maximum like- Section A.2) evaluated at u p. Hence an approximate expres-
lihood approximation of the variance of the output in the sion of the LS estimate of the parameters
gaussian case is obtained
2J
uLS up hup 1 A19
2u uup
var f x; Q LS s 2 j T j T j 1 j A13
In Eq. (A19), we can replace the gradient by its expression
Hence, the maximum likelihood approximate variance 2J
j T yp f x; up j T W A20
(A13) is identical to the sampling theory approximate 2u uup
variance (A7).
Hence the corresponding approximation of the least-squares
estimator (random vector Q LS) of the parameters
Remark. The Hessian matrix h is the value of the random
matrix H with elements: Q LS up Hup 1 j T W A21
This estimate also necessitates the computation of the which is similar to the expression of the linear LOO error
Hessian of the cost-function. (36)
rk
Appendix B. Derivation of an approximate LOO error ek k 1 to N B8
1 pz kk
The following derivation is inspired from the work of In practice, the diagonal terms of pz are computed using
Antoniadis et al. (1992) and is valid irrespective of whether the singular value factorization of z uSv T ; where u is an
or not the assumed model is true. We denote by uk LS the LS
orthogonal (N,N) matrix, S is a diagonal (N,q) matrix, and v
estimate on the kth LOO set {xi ; yip }i1to N;ik : We have the is an orthogonal (q,q) matrix (see, Golub & Van Loan,
kth residual r k and the kth LOO error e k: 1983). Then:
8
< rk ykp f xk ; uLS X
q
B1 p z kk u2ki k 1 to N B9
: ek yk f xk ; uk i1
p LS
415447.
1 zk zT z1 zk
T
Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in Sussman, H. J. (1992). Uniqueness of the weights for minimal feedforward
neural modeling using a linear Taylor expansion. Proceedings of the nets with a given inputoutput map. Neural Networks, 5, 589593.
International Workshop on Advanced Black-Box Techniques for Tibshirani, R. J. (1996). A comparison of some error estimates for neural
Nonlinear Modeling (pp. 1722), 810 July, Leuwen. models. Neural Computation, 8, 152163.
Rivals, I., & Personnaz, L. (1999). On cross-validation for model selection. Urbani, D., Roussel-Ragot, P., Personnaz, L., & Dreyfus, G. (1994). The
Neural Computation, 11, 863870. selection of neural models of non-linear dynamical systems by statis-
Saarinen, S., Bramley, R., & Cybenko, G. (1993). Ill-conditioning in neural tical tests. Neural Networks for Signal Processing, Proceedings of the
network training problems. SIAM Journal on Scientific and Statistical 1994 IEEE Workshop.
Computing, 14, 693714. Zhou, G., & Si, J. (1998). A systematic and effective supervised learning
Seber, G. A. F. (1977). Linear regression analysis. New York: Wiley. mechanism based on Jacobian rank deficiency. Neural Computation,
Seber, G. A. F., & Wild, C. (1989). Nonlinear regression. New York: 10, 10311045.
Wiley.