0% found this document useful (0 votes)
5 views19 pages

Mathematics: Science China

This paper presents a two-stage shrinkage estimation method for analyzing correlated binary data with a diverging number of parameters, particularly in high-dimensional settings. The authors propose a weighted least-squares (WLS) approach combined with a penalized model for variable selection and parameter estimation, demonstrating its effectiveness through simulation studies and a practical application. The method retains oracle properties and provides consistent estimation even when the working correlation matrix is misspecified.

Uploaded by

yx20021010141592
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views19 pages

Mathematics: Science China

This paper presents a two-stage shrinkage estimation method for analyzing correlated binary data with a diverging number of parameters, particularly in high-dimensional settings. The authors propose a weighted least-squares (WLS) approach combined with a penalized model for variable selection and parameter estimation, demonstrating its effectiveness through simulation studies and a practical application. The method retains oracle properties and provides consistent estimation even when the working correlation matrix is misspecified.

Uploaded by

yx20021010141592
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SCIENCE CHINA

Mathematics
. ARTICLES . February 2013 Vol. 56 No. 2: 359–377
doi: 10.1007/s11425-012-4564-y

Shrinkage estimation analysis of correlated binary


data with a diverging number of parameters
XU PeiRong1 , FU WenJiang2 & ZHU LiXing3,∗

1School
of Finance and Statistics, East China Normal University, Shanghai 200241, China;
2Department of Epidemiology and Biostatistics, Michigan State University, Michigan, MI 48824, USA;
3Department of Mathematics, Hong Kong Baptist University, Kowloon Tong 999077, Hong Kong, China

Email: xupeirong1108@[Link], fuw@[Link], lzhu@[Link]

Received March 2, 2011; accepted March 29, 2012; published online January 10, 2013

Abstract For analyzing correlated binary data with high-dimensional covariates, we, in this paper, propose
a two-stage shrinkage approach. First, we construct a weighted least-squares (WLS) type function using a
special weighting scheme on the non-conservative vector field of the generalized estimating equations (GEE)
model. Second, we define a penalized WLS in the spirit of the adaptive LASSO for simultaneous variable
selection and parameter estimation. The proposed procedure enjoys the oracle properties in high-dimensional
framework where the number of parameters grows to infinity with the number of clusters. Moreover, we prove
the consistency of the sandwich formula of the covariance matrix even when the working correlation matrix is
misspecified. For the selection of tuning parameter, we develop a consistent penalized quadratic form (PQF)
function criterion. The performance of the proposed method is assessed through a comparison with the existing
methods and through an application to a crossover trial in a pain relief study.

Keywords correlated binary data, variable selection, diverging number of parameters, adaptive LASSO,
GEE, oracle properties, sandwich covariance formula, penalized quadratic form function

MSC(2010) 62F07, 62J12

Citation: Xu P R, Fu W J, Zhu L X. Shrinkage estimation analysis of correlated binary data with a diverging
number of parameters. Sci China Math, 2013, 56: 359–377, doi: 10.1007/s11425-012-4564-y

1 Introduction
High-dimensional correlated data, which consists of repeated measurements on a large number of co-
variates, arise frequently from health and medical studies. For example, in a crossover trial comparing
the effect of placebo with that of the low and high doses of analgesic drug for pain relief (see [4]), each
participating woman was given all three treatments sequentially, one dose at each time with different
ordering of the treatments for different subject. In making the inference on the relationship between the
outcome measure of either no effect of pain relief (0) or some effect of pain relief (1) and the covariates:
the current treatment, the previous treatment, and the current period, the final model will be selected
based on the selection and estimation of the effects while taking into consideration the correlation between
the outcomes from different treatments within the same subject. Various statistical methods of model
parameter estimation have been developed, see [4, 15]. Recent studies of variable selection for analyzing
correlated data received much attention for high-dimensional data structure.
∗ Corresponding author


c Science China Press and Springer-Verlag Berlin Heidelberg 2013 [Link] [Link]
360 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

For example, QIC was proposed as a modification to Akaike’s information criteria (AIC), using quasi-
likelihood instead of likelihood [13]; a generalized version of Mallow’s Cp to measure model adequacy of
prediction was introduced [2]; BIQIF was introduced as a Bayesian information type of criterion based
on the quadratic inference function [18]. These are best subset type variable selection procedures which
become computationally intensive when the number of covariates is moderately large. A generalization
of the bridge and LASSO penalties was applied to generalized estimating equations (GEE) models [8];
similarly, the SCAD penalty (see [6]) was applied to the GEE model [5] following the spirit in [8]; a variable
selection was studied for linear mixed effects model for correlated data [1]. However, the aforementioned
papers all assumed that the dimension of covariates is fixed. Therefore, it is important to develop new
statistical methodology and theory of variable selection and estimation for high-dimensional clustered
data.
In this paper, we propose a two-stage shrinkage estimation procedure for variable selection in the
“large n, diverging p” setup for sparse generalized linear models, where p = pn is a function of the num-
ber of clusters n. We focus on correlated binary data, which is driven by the crossover trial analyzed in
Section 5. We first construct a weighted least-squares (WLS) type objective function to approximate a
potential function, although a real potential function does not exist in the GEE model in general. We
then define a penalized WLS model for simultaneous variable selection and parameter estimation. It
is noted that the proposed potential approximate function is path-dependent, requiring no conservative
gradient vector field. We show that the proposed approach achieves the desirable oracle properties, un-
der appropriate conditions for dimension, the sample size and the strength of the signal. In addition, it
inherits the built-in robustness of the GEE model with consistent estimation allowing misspecification of
the working correlation matrix. And the sandwich variance formula for nonzero part of true parameters
still remains valid in this new context. We further recommend a penalized quadratic form (PQF) func-
tion criterion that facilitates the oracle properties. Simulation studies suggest that, with the PQF, the
propsoed method is more efficient than the LASSO-based penalized estimating equations presented by
[8]. Also, the sandwich variance formula achieves satisfactory performance.
The rest of the paper is organized as follows. In Section 2, we propose our variable selection method
for correlated binary data. In Section 3, we establish the oracle properties of the proposed method, the
consistency of the sandwich variance formula, and the consistency of the PQF function criterion. Section 4
reports the Monte Carlo simulation study results. Section 5 applies the new method to a three crossover
trial dataset. Section 6 gives a brief discussion. All technical proofs are relegated to the Appendix.

2 Penalized weighted least-squares (PWLS)


Suppose that there are n clusters and mi subjects within cluster i, i = 1, 2, . . . , n. Let Yij be the binary
response for the subject in cluster i, and Xij be the corresponding pn × 1 covariate vector. Define
Yi = (Yi1 , . . . , Yimi )T , and Xi = (Xi1 , . . . , Ximi )T . Provided the mean of Yij depends only on the
covariate vector of subject j, i.e., E(Yij | Xi ) = E(Yij | Xij ) (see [14]), we consider the regression model
logit(E(Yij | Xij )) = Xij
T
βn , (2.1)

where logit(t) = log(t/(1 − t)), βn is a pn -dimensional unknown parameter vector. Denote πij = E(Yij |
Xij ). Then, the marginal variance of Yij is assumed as Var(Yij | Xij ) = πij (1 − πij ), where a dispersion
parameter may be added in the marginal variance function if overdispersion is suspected to be present.
Without loss of generality, the number of repetitions mi are assumed to be equal across the clusters, for
example, m. Then, let πi (βn ) = (πi1 (βn ), . . . , πim (βn ))T , where πij (βn ) = exp(Xij
T T
βn )/(1 + exp(Xij βn )),
Ai (βn ) be the m × m diagonal matrix with elements πij (βn )(1 − πij (βn )), and R(τ ) be an m × m working
correlation matrix, where τ is a finite dimensional parameters which fully characterizes R(τ ). The GEE
model (see [10]) is defined as

n
1/2 −1 A−1/2 (βn )(Yi − πi (βn )) = 0,
XiT Ai (βn )R (2.2)
i
i=1
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 361

where R  is the estimator of R(τ ). A key advantage of the GEE model is that the GEE estimator βnG
achieves consistent estimation for the model parameters even when the working correlation matrix is
misspecified in “large n, diverging p” setup (see [16]).
Motivated by this appealing fact, we consider a WLS-type function for model (2.1). First, for the
simplicity of notation, let Y = (Y1T , . . . , YnT )T , X = (X1T , . . . , XnT )T , π(βn ) = (π1T (βn ), . . . , πnT (βn ))T ,
1/2  1/2 (βn ). Let V (βn ) and A(βn ) be, respectively, the nm × nm block diagonal
and Vi (βn ) = Ai (βn )RA i
matrix with diagonal elements Vi (βn )’s and Ai (βn )’s. Consider the following approximation to the residual
Y − π(βnG ). Let Z = A(βnG )X βnG − (Y − π(βnG )), we have

Z ≈ A(βnG )X βnG .

Then, the WLS-type function can be constructed as

W LS(βn )  (Z − A(βnG )Xβn )T V −1 (βnG )(Z − A(βnG )Xβn ), (2.3)

and the parameter estimator is the minimizer of this WLS-type function.


Remark 2.1. The main advantage of this approach is that one can construct an approximate potential
function WLS(β) with no requirement of the conservative gradient vector field. Further, since Z =
A(βnG )X βnG − (Y − π(βnG )) and βnG is the solution of (2.2), one has

βnG = arg min WLS(βn ).


{βn ∈Rpn }

It implies that the special approximation leads to the GEE estimator as the minimizer. The approximate
allows the construction of the penalty models, such as the LASSO and the SCAD, in the conventional
way by minimizing the “objective loss function” with the added penalty, as we shall see in the following.
However, it has to be recognized that since the WLS does not yield the same vector field as in the GEE
model (2.2), the integral in the vector field is path dependent. We will show that this approximate leads
to an efficient penalty estimator unlike the penalized GEE estimator as defined in [8].
Following [20], we construct a penalized WLS for parameter estimation and achieve model selection
based on the above objective function in a two-stage procedure as follows:
Step 1. Obtain an initial estimator βnG of the estimating equation (2.2), such as the GEE estimator.
Step 2. Define a WLS-type function (2.3) as an objective function by replacing the parameters βn in
V (βn ) and A(βn ) with the estimate βnG .
Step 3. Define the penalized WLS (PWLS) estimator


pn
βλw = arg min WLS(βn ) + λ j |βj |,
w (2.4)
{βn ∈Rpn } j=1

j = |βnGj |−γ , and γ > 0.


where w
Remark 2.2. Note that PWLS (2.4) yields a sparse solution βλw by setting some parameters to 0.
Hence, variable selection is achieved with the nonzero values of βλw .
Remark 2.3. Note that PWLS (2.4) differs from the penalized estimating equations studied in [8] since
the PWLS estimator is achieved through a special algorithm as in the above and the non-conservativeness
of the GEE vector field implies that the path-dependent vector field leads to different estimators through
different paths. Therefore, PWLS (2.4) differs in many ways from the penalized estimating equations
estimator. We shall show in the next section that PWLS leads to efficient estimation compared with the
penalized GEE estimator.
Remark 2.4. Although the penalty function considered in this paper is the adaptive LASSO, PWLS
can be defined with other penalty functions in a more general class. Specifically, given the WLS-type
362 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

function (2.3), we can consider the variable selection methods by minimizing the PWLS function, taking
the form

pn
P (βn ) = WLS(βn ) + n pλ (|βj |). (2.5)
j=1

Formulation (2.5) includes many popular variable selection methods, including the adaptive LASSO
defined in (2.4), the bridge method (see [7]) using the Lq penalty pλ (|βj |) = λ|β|q , the SCAD penalty
(see [6]), the MCP penalty (see [19]), and the SICA penalty (see [11]).

3 Asymptotic properties when pn → ∞

To study the asymptotic behavior of PWLS estimator when pn → ∞, we first introduce some notations.
Let A = {j : βn0j = 0, j = 1, . . . , pn } be the true sparse model with non-sparsity size sn = |A|, where
βn0 = (β01 , . . . , β0pn )T denotes the true value. Let S = {j1 , . . . , jd } denote an arbitrary candidate model,
which contains the j1 -th, . . . , jd -th (1  d  pn ) predictors. For an arbitrary vector b ∈ Rpn , let bS ∈ R|S|
denote the subvector of b associated with the candidate model S. For an arbitrary d × pn matrix Q, let
QS denote the d × |S| submatrix of Q associated with the candidate model S.

3.1 Oracle properties

In this section, we shall show that with a proper choice of λ, PWLS retains oracle properties when
pn → ∞. The main idea is to find an asymptotic characterization of the solution to the PWLS criterion.
In other words, the first thing we need to do is propose an estimator based on the oracle information of
the true subset of the model we know in advance.
For this reason, we consider the following model:

logit(E(Yij | Xij )) = XijA


T
δn , (3.1)

T T
where δn is an sn -dimensional unknown vector. Let π̃ij (δn ) = exp(XijA δn )/(1 + exp(XijA δn )). Denote

δnG by the solution of the following estimating equations:


n
T
XiA
1/2 −1 Ã−1/2 (δn )(Yi − π̃i (δn )) = 0,
Ãi (δn )R (3.2)
i
i=1

where π̃i (δn ) = (π̃i1 (δn ), . . . , π̃im (δn ))T , Ãi (δn ) is the m × m diagonal matrix with elements π̃ij (δn )(1 −
π̃ij (δn )). It implies immediately that δnG is the GEE estimator of βn0A . Further, let Ṽi (βn ) = Ãi (βn )R 
1/2

1/2 T T T
Ãi (βn ) for i = 1, 2, . . . , n, and π̃(δn ) = (π̃1 (δn ), . . . , π̃n (δn )) . Let Ã(δn ) and Ṽ (δn ) be, respectively, the
nm × nm block diagonal matrixs with diagonal elements Ãi (δn )’s and Ṽi (δn )’s, and Z̃ = Ã(δnG )XA δnG −
(Y − π̃(δnG )). Then, similar to (2.4), we define an estimator δλw as δλw = arg min{δn ∈Rsn } Q(δn ), where

Q(δn ) = (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ) + λ j |βj |.
w (3.3)
j∈A

Throughout the paper, we use c to denote a generic positive constant which may vary from line to line.
Lemma 3.1. Suppose that the conditions (a)–(d) in the Appendix hold. If we assume the model (3.1)
and n−1 s2n = o(1), then for any κ > 0, there exist two positive constants b1 and b2 such that

b1 n  sup √ λmin (Jn (δn ))  sup √ λmax (Jn (δn ))  b2 n,


sn sn
δn −βn0A κ n δn −βn0A κ n

n
Ãi (δn )R̂−1 Ãi (δn )XiA .
T 1/2 1/2
where Jn (δn ) = i=1 XiA
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 363

Theorem 3.2. Under the model (3.1) and the conditions (a)–(e) in the Appendix, we have
  2 
E(δλw − βn0A 2 )  2 λ2 sn max{wj } + cnsn /(b21 n2 ),
j∈A

where b1 is the constant given in Lemma 3.1.


Note that the derived risk bound is nonasymptotic. Theorem 3.2 is very useful for finding the asymp-
totic characterization of the solution of the PWLS criterion, which will be given in the following theorem
specifically.
Theorem 3.3. T
Let us write βn0 = (βn0A , 0T )T . Then, under the conditions (a)–(g) in the Appendix,
T T T
β̃λw = (δλw , 0 ) is the solution to (2.4) with probability tending to 1.
From Theorem 3.3, we know that the PWLS estimator works asymptotically as if it had the oracle infor-
mation that we know the true subset model in advance. This naturally implies that the PWLS may enjoy
n
Ãi (δn )R̄−1 R0 R̄−1
T 1/2
the oracle property, which is confirmed in the next theorem. Let Kn (δn ) = i=1 XiA

Ãi (δn )R̄−1 Ãi (δn )XiA .
1/2 n 1/2 1/2
Ãi (δn )XiA , and J¯n (δn ) = i=1 XiA
T

Theorem 3.4. Under the conditions (a)–(g) in the Appendix, the PWLS estimator satisfies the fol-
lowing properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
 = 0} = A) = 1,
T −1/2 ¯
(2) asymptotic normality: αn Kn (βn0A )Jn (βn0A )(βλwA D
 − βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
fying αn  = 1.
Remark 3.5. Theorem 3.4 shows that, with probability approaching 1, the PWLS criterion can
identify the right subset model, and the PWLS estimator has the optimal convergence rate, when the
number of parameters diverges. Moveover, the PWLS criterion has the model selection consistency and
the asymptotic normality even when the correlation structure is misspecified.
Remark 3.6. Let ΣnA = J¯n−1 (βn0A )Kn (βn0A )J¯n−1 (βn0A ). Together with Theorem 3.4, an application
of the Cramer-Wold device yields that for any q × sn matrix Qn with q fixed and such that Qn ΣnA QT
n
is invertible, we have
−1/2
Qn (βλwA
D
(Qn ΣnA QTn)  − βn0A ) −→ Nq (0, Iq ),

where Iq is a q × q identity matrix.

3.2 Estimation of covariance matrix

Since we can select variables and estimate parameters simultaneously, the covariance matrix of the esti-
mated parameters can be obtained directly. Naturally, the well-known sandwich covariance formula can
be used as an estimator for the covariance of the estimate βλwA   . Let
 , the nonvanishing component of βλw
−1/2
i (δn ) = Ãi (δn )(Yi − π̃i (δn )), i = 1, 2, . . . , n. Similar to that proposed by [10] in the “fixed p” setup and
by [16] in the “large n, diverging p” setup, the sandwich covariance matrix estimator of ΣnA is defined as
 nA = Jn−1 (βλwA
Σ    )Jn−1 (βλwA
 )Kn (βλwA  ), (3.4)

where

n
Jn (δn ) = T
XiA
1/2 −1 Ã1/2 (δn )XiA ,
Ãi (δn )R i
i=1
n
 n (δn ) =
K T
XiA
1/2 −1 i (δn )
Ãi (δn )R T −1 Ã1/2 (δn )XiA .
i (δn )R i
i=1

The following theorem shows the consistency of the sandwich covariance matrix estimator (3.4).
Theorem 3.7. If the conditions (a)–(g) in the Appendix are satisfied, then we have
 nA QT
Qn Σ n − Qn ΣnA Qn = op (n
T −1
),

n = G, where q is fixed, and G is a q × q positive definite matrix.


for any q × sn matrix Qn such that Qn QT
364 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Remark 3.8. In the “large n, diverging p” setup, Theorem 3.7 provides a consistent estimator for
the variance of the nonvanshing part of the PWLS estimator even when the working correlation matrix
is misspecified. Then, based on Remark 3.6 and the above sandwich covariance matrix estimator, an
asymptotic (1 − α)% confidence interval (0 < α < 1) for βn0j (j ∈ A) can be constructed as

βλwj T
 ± zα/2 (ej ΣnA ej )
1/2
,

where ej is an sn × 1 vector with the j-th element equal to 1, and 0 otherwise, zα/2 is the upper α/2
quantile of the standard normal distribution.

3.3 Tuning parameter selection

Note that the pleasing performances of the PWLS estimator heavily hinge on a proper selection of the
tuning parameter. To implement the PWLS method in practice, it is desirable to have an automatic
data-driven method for selecting the tuning parameter λ. We now use a selector that is in spirit similar
to the one proposed by [9].
n 1/2  −1 A1/2 (βn )Xi and Dn (βn ) = n X T
For the sake of simplicity, let Hn (βn ) = i=1 XiT Ai (βn )R i i=1 i
1/2
Ai (βn )R −1 εi (βn )εT (β)R
−1 A1/2 (βn )Xi , where εi (βn ) = A−1/2 (βn )(Yi −πi (βn )). Denote Σ n = Hn−1 (βnG )
i i i
Dn (βnG )Hn−1 (βnG ). Based on the above notations, we define a penalized quadratic form (PQF) function
criterion as
P QFλ = n−1 (βλw − βnG )T Σ
 −1 (βλw − βnG ) + Cn × pn × dfλ /n,
n (3.5)
where dfλ is the number of nonzero coefficients in βλ , and Cn > 0 is required to grow to infinity with the
sample size.
We now establish the consistency of the proposed PQF function criterion (3.5) when the number of
parameters diverges. Let Sλ = {j : βλwj   is the solution of (2.4) given λ. According
 = 0}, where βλw
to whether the resulting model Sλ is underfitted, correctly fitted, or overfitted, we partition R+ into the
following three mutually exclusive regions:

R− = {λ ∈ R+ : Sλ  A},
R0 = {λ ∈ R+ : Sλ = A},
R+ = {λ ∈ R+ : Sλ ⊃ ST , Sλ = A}.

In addition, we define a reference tuning parameter sequence {λn ∈ R+ }∞ n=1 , where λn = n


3/4−(1−3v)(1+γ)/4
.
Since λn satisfies the conditions in Theorem 3.4, it follows that Sλn = A with probability tending to one.
We remark that {λn ∈ R+ }∞ n=1 is constructed only for a theoretical proof. It does not imply that the
optimal tuning parameter must be λn . The following theorem shows the consistency of the proposed
PQF function criterion.
Theorem 3.9. Under the conditions (a)–(h) in the Appendix, P (inf λ∈R− ∪R+ P QFλ > P QFλn ) → 1.
Theorem 3.9 implies that in an almost sure convergence sense, any λ failing to identify the true model
cannot be selected as the optimal tuning parameter. Therefore, the model associated with the optimal λ
must be the true one.
Remark 3.10. The divergence rate of Cn can be arbitrarily slow, although in theory we require it
to diverge to infinity. In our simulations, we set Cn = log(log(pn )) as that suggested by [9]. For all our
numerical experiments below, the simulation results are rather encouraging.

4 Monte Carlo simulations

In this section, we assess the finite sample performance of the proposed PWLS method with PQF function
criterion via a simulation experiment in terms of model selection and estimation accuracy. We compared
our proposed PQF based PWLS method with the LASSO-based penalized estimating equations (PGEE)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 365

with QGCV (see [8]). We consider the following logistic model. The response variable Yij is binary and
its marginal mean is πij with
T
logit(πij ) = Xij βn0 , j = 1, 2, . . . , 10 and i = 1, 2, . . . , n.

where βn0 is a pn -dimensional vector of parameters with pn = 4n1/3 − 5, with s denoting the largest
integer not greater than s. The covariate vectors Xij are i.i.d. from normal distribution Npn (0pn , Σ)
with Σ whose (i, j)-th element is equal to 0.5|i−j| . For all t with 1  t  10, Yit are of an AR(1)
correlation structure with auto-correlation coefficient ρ. The true coefficient vector is taken to be βn0 =
(0.51dn , 0pn −dn ), where dn = pn /5 . We consider three correlations ρ = 0.3, 0.5, 0.7 and three sample
sizes n = 100, 200, 400. The simulations with 200 replications for every case are conducted using R
codes since we use the correlated random binary data generator by [12]. And three different working
correlation structures are considered: independence working correlation matrix (IND), exchangeable
working correlation matrix (CS), and the first-order auto-correlation working correlation matrix (AR(1)).
The summary of simulation results is depicted in Tables 1–3.
To find the subset which minimizes each criterion, we exhaustively search for all possibilities. Thus,
the corresponding results represent the best subset variable selection under the underlying criterion. We
compare the variable selection procedures in terms of model error and model complexity, summarized
with the simulated average mean square error (MSE) obtained by averaging βn0 − βn0 2 /pn over 200
simulated samples, the average number (per simulation) of truly zero coefficients correctly estimated as
zero, i.e., correct deletions (CD), the average number of truly nonzero coefficients erroneously set to zero,
i.e., erroneous deletions (ED), the percentage of times the correct true model exactly identified (PCM),
and the average model size (MS).
Several observations can be made from Tables 1–3. First, the PQF based PWLS method is robust
against the growth of correlation degree. Second, the PQF based PWLS method is also robust in terms
of misspecification of working correlation matrix, which is reflected from Theorem 3.4. Third, compared
with the LASSO-based PGEE, both of them perform satisfactorily in terms of variable selection, while

Table 1 Simulation results by different procedures assuming IND working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS


Mean SE
100 13 2 0.3 PWLS 0.366 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.214 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.303 10.985 (99.90%) 0 (0%) 0.985 2.015 0.122
PGEE 1.174 10.995 (99.95%) 0 (0%) 0.995 2.015 0.122
0.7 PWLS 0.347 10.990 (99.90%) 0 (0%) 0.990 2.010 0.100
PGEE 1.190 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.369 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.323 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.413 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.427 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.397 14.995 (99.97%) 0 (0%) 0.995 3.005 0.071
PGEE 1.370 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.475 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.053 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
0.5 PWLS 0.484 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 1.028 19.975 (99.88%) 0 (0%) 0.975 4.025 0.157
0.7 PWLS 0.462 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 0.981 19.985 (99.90%) 0 (0%) 0.985 4.015 0.122
366 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Table 2 Simulation results by different procedures assuming AR(1) working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS


Mean SE
100 13 2 0.3 PWLS 0.370 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.244 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.332 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
PGEE 1.155 11.000 (100%) 0 (0%) 1.000 2.000 0.000
0.7 PWLS 0.392 11.000 (100%) 0 (0%) 1.000 2.000 0.000
PGEE 1.809 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.371 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.532 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.393 14.990 (99.93%) 0 (0%) 0.990 3.010 0.100
PGEE 1.555 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.417 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.385 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.477 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.587 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.5 PWLS 0.474 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.884 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.7 PWLS 0.457 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.849 20.000 (100%) 0 (0%) 1.000 4.000 0.000

Table 3 Simulation results by different procedures assuming CS working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS


Mean SE
100 13 2 0.3 PWLS 0.368 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.232 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.315 10.990 (99.90%) 0 (0%) 0.990 2.010 0.100
PGEE 1.174 11.000 (100%) 0 (0%) 1.000 2.000 0.000
0.7 PWLS 0.368 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
PGEE 1.255 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.368 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.539 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.408 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.626 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.402 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.628 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.470 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.730 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.5 PWLS 0.485 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 2.122 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.7 PWLS 0.453 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 2.151 20.000 (100%) 0 (0%) 1.000 4.000 0.000

the PQF based PWLS method has much smaller MSE, especially when CS working correlation matrix is
assumed. It confirms the oracle property of the PWLS method empirically.
We next test the accuracy of the sandwich covariance formula. We consider the above model with
ρ = 0.5, the three sample sizes, and IND working correlation structure. For each case, the standard
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 367

Table 4 Standard deviations (SD) (multiplied by 1000) of


estimators assuming IND working correlation matrix

n pn dn Parameter SD SDm SDmad


100 13 2 β̂1 6.711 6.174 0.947
β̂2 6.113 6.036 0.958
200 18 3 β̂1 3.919 3.298 0.342
β̂2 4.601 4.108 0.465
β̂3 3.879 3.314 0.390
400 24 4 β̂1 2.028 1.780 0.147
β̂2 2.523 2.219 0.190
β̂3 2.891 2.197 0.201
β̂4 2.402 1.800 0.139

deviations of the estimated coefficients are averaged among 200 replications. This can be regarded as the
true standard error (SD) and compared with 200 estimated standard error from the sandwich covariance
formula, which are summarized by their median (SDm ) and interquartile range divided by 1.349 (SDmad ).
The detailed results are summarized in Table 4. We observed that the sandwich covariance formula
performs satisfactorily. Similar phenomena are also observed with other working correlation matrix,
which are not reported here.

5 A three-period crossover trial data analysis


We apply the method to the data from a three-period crossover trial of an analgesic drug for relieving
pain from primary dysmenorrhoea (see [4]). Three levels of the analgesic (placebo, low and high doses)
were given to each of the 86 women, who were randomized to one of the six groups receiving treatments
in different orders. The placebo was labeled as treatment A, whilst analgesic with low and high doses
were treatments B and C, respectively. The response was measured for each woman at the end of each
treatment period and was coded as (1) for some relief and (0) for no relief. Six explanatory variables
(covariates) are defined as follows:

1, period 2(3),
x1 (x2 ) =
0, otherwise,
1, treatment B(C),
x3 (x4 ) =
0, otherwise,
1, the previous assignment is B(C),
x5 (x6 ) =
0, otherwise.
We consider 64 candidate models including all possible subsets of the six covariates. Table 5 reports
the estimated coefficients with p-values by using the GEE method and shows the models that were chosen
by PWLS and the LASSO-based PGEE with QGCV. From Table 5, it can be seen that PWLS selects
the best model that includes the covariates x1 , x2 and x5 , which are significant at the significance level
α = 0.05, whereas the LASSO-based PGEE selects the full model as the best model. We also note that
PWLS selects the same best model using either the IND, AR(1), or CS working correlation structures,
of which the results are not presented here for saving space.
Since the true values of parameters are unknown, one has to rely on out-of-sample testing to compare
the model performance in terms of forecasting. We then conduct 200 cross-validation experiments. For
each experiment, we randomly split the entire data set D = {1, . . . , 86} into two parts, where D0 is the
training data set with 70 subjects and D1 is the testing data set with 16 subjects. Consequently, denote
368 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Table 5 Parameter estimate (p-value in parentheses) for analgesic


drug clinical data under the IND working model

GEE PWLS PGEE


x1 2.106 (<0.001) 1.471 2.079
x2 2.068 (<0.001) 1.715 2.037
x3 −0.516 (0.336) - −0.543
x4 −0.342 (0.456) - −0.369
x5 0.929 (0.040) 0.403 0.934
x6 0.800 (0.138) - 0.796

120
Prediction error

100

80

60

40

PWLS PGEE

Different model selection methods


Figure 1 Model comparison in prediction error by 200 random splits of data set

X0 = {Xi : i ∈ D0 }, Y0 = {Yi : i ∈ D0 }, X1 = {Xi : i ∈ D1 }, and Y1 = {Yi : i ∈ D1 }. We use D0 to train


the models, including PWLS with the BIC criterion and the LASSO-based PGEE and test them using
the testing data set D1 . For each observation (Xi , Yi ) in D1 , Ŷi is the prediction of the response. The
methods are compared for the prediction error using the following loss function by [3]

(Yi − Ŷi )T Vi−1 (Yi − Ŷi ),
i∈D1

where Vi is the working covariance matrix estimated together by the entire data set D. For each method,
the prediction errors from 200 random splits are summarized in the boxplot in Figure 1. It can be seen
that PWLS performs better than the LASSO-based PGEE.

6 Extension
The approaches and techniques can be readily extended to general marginal longitudinal generalized
linear model, although the focus of the paper is on correlated binary data. The key step is the existence
of a root-(n/pn)-consistent initial estimator. From [16, Theorem 5.1] we know that the GEE estimator
βnG is root-(n/pn )-consistent for general marginal longitudinal generalized linear model. This is the
reason why we can readily extend the PWLS method to general cases. The generalized PWLS may also
enjoy the oracle property, which is confirmed in the following theorem.
Theorem 6.1. Under the conditions (a)–(g) in the Appendix and [16, Conditions (A5) and (A6)], the
PWLS estimator satisfies the following properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
 = 0} = A) = 1;
T −1/2 ¯
(2) asymptotic normality: α Kn (βn0A )Jn (βn0A )(βλwA  L
 − βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
n
fying αn  = 1,
−1/2
where Kn (δn ) and J¯n (δn ) have the same expressions as in Subsection 3.1 with Ãij (δn ) = Var(Yij | Xij ),
i = 1, . . . , n, j = 1, . . . , mi .
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 369

In addition, if the primary interest is on subject-specific effects, then the mixed-effects models, espe-
cially generalized linear mixed models, are probably more appropriate. Model selection for the mixed-
effects models deserves further investigation and is ongoing.

Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No.
11201306), the Innovation Program of Shanghai Municipal Education Commission (Grant No. 13YZ065), the
Fundamental Research Project of Shanghai Normal University (Grant No. SK201207), the scholarship under the
State Scholarship Fund by the China Scholarship Council in 2011 and the Research Grant Council of Hong Kong,
Hong Kong, China (Grant No. #HKBU2028/10P).

References
1 Bondell H D, Krishna A, Ghosh S K. Joint variable selection for fixed and random effects in linear mixed-effects models.
Biometrics, 2010, 66: 1069–1077
2 Cantoni E, Flemming J M, Ronchetti E. Variable selection for marginal longitudinal generalized linear models. Bio-
metrics, 2005, 61: 507–514
3 Cantoni E, Filed C, Flemming J M, et al. Longitudinal variable selection by cross-validation in the case of many
covariates. Stat Med, 2005, 26: 919–930
4 Diggle P J, Heagerty P J, Liang K Y, et al. Analysis of Longitudinal Data, 2nd edition. New York: Oxford University
Press, 2002
5 Dziak J J, Li R Z. An overview on variable selection for longitudinal data. In: Quantitative Medical Data Analysis
Using Mathematical Tools and Statistical Techniques Chapter One. Singapore: World Scientific Publishing Co., 2006
6 Fan J Q, Li R Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc,
2001, 96: 1348–1359
7 Frank I E, Friedman J H. A statistical view of some chemometrics regression tools (with discussion). Technometrics,
1993, 35: 109–148
8 Fu W J. Penalized estimating equations. Biometrics, 2003, 59: 126–132
9 Leng C L, Li B. Least squares approximation with a diverging number of parameters. Statist Prob Lett, 2010, 80:
254–261
10 Liang K Y, Zeger S L. Longitudinal data analysis using generalised linear models. Biometrics, 1986, 73: 12–22
11 Lv J C, Fan Y Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann
Statist, 2009, 37: 3498–3528
12 Oman S D. Easily simulated multivariate binary distributions with given positive and negative correlations. Comput
Stat Data Anal, 2009, 53: 999–1005
13 Pan W. Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 2001, 57: 120–125
14 Pepe M S, Anderson G L. A cautionary note on inference for marginal regression models with longitudinal data and
general correlated response data. Comm Statist Ser B, 1994, 23: 939–951
15 Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000
16 Wang H, Leng C. Unified lasso estimation via least square approximation. J Amer Statist Assoc, 2007, 102: 1039–1048
17 Wang L. GEE analysis of clustered binary data with diverging number of covariates. Ann Statist, 2011, 39: 389–417
18 Wang L, Qu A. Consistent model selection and data-driven smooth tests for longitudinal data in the estimating
equations approach. J R Statist Soc B, 2009, 71: 177–190
19 Zhang C H. Nearly unbiased variable selection under minimax concave penalty. Ann Statist, 2010, 38: 894–942
20 Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc, 2006, 101: 1418–1429
21 Zou H, Zhang H H. On the adaptive elastic-net with a diverging number of parameters. Ann Statist, 2009, 37:
1733–1751

Appendix
In this section, the proof of theorems will be given. We first introduce the following regularity conditions:

(a) supi,j Xij  = Op ( pn ).
(b) Denote λmin (Q) and λmax (Q) as the minimum and maximum eigenvalues of a matrix Q, respectively.
Then, assume that there exist two positive constants c1 and c2 such that

n 
n
−1 −1
c1  λmin n XiT Xi  λmax n XiT Xi  c2 .
i=1 i=1
370 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

(c) There exist two positive constants c3 and c4 such that 0 < c3  πij (βn0 )  c4  1, for i = 1, 2, . . . , n,
j = 1, 2, . . . , m.
(d) The eigenvalues of the common true correlation matrix R0 are bounded away from zero and +∞.
Hereafter, for a matrix B, define B = [trace(BB T )]1/2 as its Frobenius norm. The estimated working
correlation matrix R  satisfies R  −1 − R̄−1  = Op ( pn /n), where R̄ is a positive-definite matrix with
eigenvalues bounded away from zero and +∞. Note R̄ maybe not the true correlation matrix R0 .
(e) limn→∞ log(pn )/ log(n) = v, for some 0  v < 1/3.
√ √
(f) limn→∞ λ/ n = 0, and limn→∞ λn((1−3v)(1+γ)−1)/2 / n = ∞.
√ √  1/γ
(g) limn→∞ n/(λ pn ) minj∈A |βn0j | = ∞.
(h) limn→∞ n/(pn Cn ) minj∈A |βn0j | = ∞, and limn→∞ Cn p2n /n = 0.
2

Conditions (a)–(e) are general regularity conditions that ensure the GEE estimator of βn0A is a root-
(n/sn )-consistent estimator under model (3.1) (see [16]). Conditions (f) and (g) are similar to [21,
Conditions (A5) and (A6)]. Condition (g) characterizes the strength of the signals, i.e., the nonzero coef-
ficients can vanish but at the rate that can be distinguished by the penalized least squares. Condition (h)
is the same as [9, Condition (v)], which ensures the consistency of the proposed tuning parameter selector.
n
Ãi (δn )R̄−1 Ãi (δn )XiA . Note that
1/2 1/2
Proof of Lemma 3.1. Let J¯n (δn ) = i=1 XiA T


n 
n
λmax T
XiA XiA  λmax XiT Xi .
i=1 i=1

For any bn ∈ Rsn satisfying bn  = 1, we have



n
¯ 1/2 −1 1/2
bT
n Jn (δn )bn = bT T
n XiA Ãi (δn )R̄ Ãi (δn )XiA bn
i=1

n
 λ−1
min (R̄)λmax (Ãi (δn ))λmax
T
XiA XiA bn 2
i=1

= Op (n)

by the conditions (b) and (d). It implies that

sup √ sup |bT ¯


n Jn (δn )bn | = Op (n). (A.1)
sn b =1
δn −βn0A κ n
n

Similar to the proof of [16, Lemma 3.3], we have



sup √ sup |bT ¯
n [Jn (δn ) − Jn (δn )]bn | = Op ( nsn ).
sn b =1
δn −βn0A κ n
n

Therefore,

sup √ sup |bT


n Jn (δn )bn |  sup √ sup |bT ¯
n [Jn (δn ) − Jn (δn )]bn |
sn b =1 sn b =1
δn −βn0A κ n
n δn −βn0A κ n
n

+ sup √ sup |bT ¯


n Jn (δn )bn |
sn b =1
δn −βn0A κ n
n


= Op ( nsn ) + Op (n)
= Op (n).

Note that Jn (δn ) is symmetric. The above conclusion immediately implies that

sup √ λmin (Jn (δn )) = Op (n),


sn
δn −βn0A κ n

sup √ λmax (Jn (δn )) = Op (n).


sn
δn −βn0A κ n
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 371

Consequently, there exist two positive constants b1 and b2 such that

b1 n  sup √ λmin (Jn (δn ))  sup √ λmax (Jn (δn ))  b2 n.


sn sn
δn −βn0A κ n δn −βn0A κ n

Proof of Theorem 3.2. According to the definition of Z̃, it is easy to see that

δnG = (XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA )−1 XA
T
Ã(δnG )Ṽ −1 (δnG )Z̃,

which implies immediately that δnG satisfies that

δnG = arg min (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ), (A.2)
{δn ∈Rsn }

and
T
XA Ã(δnG )Ṽ −1 (δnG ){Z̃ − Ã(δnG )XA δnG } = 0. (A.3)
By the definition of δλw , we know that

(Z̃ − Ã(δnG )XA δλw )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δλw ) + λ j |δλwj
w  |
j∈A

 (Z̃ − Ã(δnG )XA δnG )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δnG ) + λ j |δnGj |.
w
j∈A

Thus, by (A.3), we have



λ j (|δnGj | − |δλwj
w  |)
j∈A

 Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δλw )2 − Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δnG )2
= (δnG − δλw )T XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA (δnG − δλw )
= (δnG − δλw )T Jn (δnG )(δnG − δλw ).
 
Note that j∈A j (|δnGj | − |δλwj
w  |)  j∈A j2 δnG − δλw . Therefore, by Lemma 3.1, we conclude
w
that 
λ j2 δnG − δλw   b1 nδnG − δλw 2 ,
w
j∈A

which results in the inequality



λ √
δnG − δλw   j2  λ sn max{w
w j }/(b1 n). (A.4)
b1 n j∈A
j∈A

And by [16, Theorem 3.6], we know that under the model (3.1), we have δnG − βn0A  = Op ( sn /n).
Consequently,

E(δλw − βn0A 2 )  2Eδλw − δnG 2 + 2EδnG − βn0A 2


  2 
 2 λ2 sn max{w j } + cnsn /(b21 n2 ),
j∈A

where c is a positive constant.


n 1/2 
Proof of Theorem 3.3. Let xij be the j-th column of Xi , j = 1, 2, . . . , pn . Denote ξj = i=1 xT
ij Ai (βnG )
−1 A−1/2 (βnG )(Zi − Ai (βnG )XiA δλw ). By the Karush-Kuhn-Tucker (KKT) optimality condition, it
R i
suffices to show
P (∀ j ∈ Ac , | − 2ξj |  λw
j ) → 1. (A.5)
372 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Let ς = minj∈A |βn0j |, and ςˆ = minj∈A |βnGj |. Note that

ς  ς/2)  P (βnG − βn0   ς/2)


P (ˆ
 4EβnG − βn0 2 /ς 2
 4cpn /(nς 2 )

under the model (2.1), where we have used Markov inequality to obtain the second inequality and [16,
Theorem 3.6] in the last step. On the other hand, let M = (λ/n)1/(1+γ) . An application of Markov
inequality and [16, Theorem 3.6] yields that

P (| − 2ξj | > λw
j , ςˆ > ς/2)
j∈Ac
 
 P (|βnGj | > M ) + j , ςˆ > ς/2, |βnGj |  M )
P (| − 2ξj | > λw
j∈Ac j∈Ac
 
E |βnGj |2 /M 2 + P (| − 2ξj | > λM −γ , ςˆ > ς/2)
j∈Ac j∈Ac

 EβnG − βn0 2 /M 2 + 4M 2γ E ξj2 I(ˆ
ς > ς/2) /λ2
j∈Ac

 cpn /(nM 2 ) + 4M 2γ E ξj2 I(ˆ
ς > ς/2) /λ2 .
j∈Ac

Therefore,

P (∃j ∈ Ac , |ξj | > λw
j )  P (|ξj | > λw
j )
j∈Ac

 P (|ξj | > λw
j , ςˆ > ς/2) + P (ˆ
ς  ς/2)
j∈Ac

4M 2γ  cpn 4cpn
 E ξj2 I(ˆ
ς > ς/2) + + .
λ2 nM 2 nς 2
j∈Ac

Now, we focus on the expectation term of the above inequality. Since βnG is the solution to (2.2), it is
easy to see that
n
1/2 
xT −1 A−1/2 (βnG )(Yi − πi (βnG )) = 0,
ij Ai (βnG )R i
i=1

for j = 1, . . . , pn . Then, based on the definition of Z, we have



n
1/2 
ξj = xT −1 A1/2 (βnG )Xi (βnG − β̃λw ).
ij Ai (βnG )R i
i=1

n 1/2 −1 A1/2 (βn )Xi = (Hn1 (βn ), . . . , Hnpn (βn ))T . Similar to the proof of
Let Hn (βn ) = i=1 XiT Ai (βn )R i
Lemma 3.1, we can prove that there exist two positive constants b3 and b4 such that

b3 n  sup √ λmin (Hn (βn ))  sup √ λmax (Hn (βn ))  b4 n.


pn pn
βn −βn0 κ n βn −βn0 κ n

Then, we have
 
T 
|ξj |2 = |Hnj (βnG )(βnG − β̃λw )|2
j∈Ac j∈Ac

T 
 Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
j∈Ac
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 373


pn
T 
 Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
i=1

= trace(Hn (βnG )HnT (βnG ))βnG − β̃λw 2


 pn (λmax (Hn (βnG )))2 βnG − β̃λw 2
 b24 pn n2 βnG − β̃λw 2 .

On the other hand, by [16, Theorem 3.6], we know that

βnG − β̃λw 2  2βnG − βn0 2 + 2β̃λw − βn0 2


= 2βnG − βn0 2 + 2δλw − βn0A 2
 2cpn /n + 2δλw − βn0A 2 .

Therefore, by Theorem 3.2, we have



E ς > ς/2)  b24 pn n2 E(βnG − β̃λw 2 I(ˆ
ξj2 I(ˆ ς > ς/2))
j∈Ac
 2b24 pn n2 {cpn /n + [λ2 sn (ς/2)−2γ + cnsn ]/(b21 n2 )}
 4b24 pn n2 {cnpn + λ2 pn (ς/2)−2γ }/(b21 n2 ).

Consequently,
 −2γ 
b 2 pn n2 ς pn pn
P (∃ j ∈ A , |ξj | > λw
c
j )  16M 2γ 42 2 2 cnpn + λ2 pn +c 2
+ 4c 2
λ b1 n 2 nM nς
 T1 + T2 + T3 .

By the conditions (e)–(g), it follows that

T1 = O((M 2γ n2 /λ2 )pn (pn /n))


= O([λn(1−3v)(1+γ)/2 /n]−2/(1+γ)n−v ) → 0,
T2 = O((pn /n)(n/λ)2/(1+γ) )
= O([λn(1−3v)(1+γ)/2 /n]−2/(1+γ)(pn /nv )n−2v ) → 0,
T3 = O((pn /n)ς −2 )
= O([λ pn /nς −γ ]2/γ {(pn /n)[λ pn /n]−2/γ })
= O([(λ pn /n)−1/γ ς]−2 [λn(1−3v)(1+γ)/2 /n]−2/γ n−2v(1+1/γ) p−2/γ
n ) → 0.

This proves (A.5) and completes the proof.


Proof of Theorem 3.4. Prove the model consistency first. From Theorem 3.3, we only need to show
 
P min |δλwj
 | > 0 → 1.
j∈A

√ −γ
By (A.4), we have ∀ j ∈ A, |δnGj − δλwj  | < λ sn ςˆ /(b1 n). Invoking triangle inequality yields that

∀j ∈ A, |δλwj  −γ
 | > |δnGj | − λ sn ςˆ /(b1 n). Therefore, we end up with

√ −γ
min |δλwj 
 | > min |δnGj | − λ sn ςˆ /(b1 n).
j∈A j∈A

√ √ √ √
Note that λ sn ςˆ−γ /(b1 n) = O(1/ n)O(λ sn ς −γ / n)O((ˆ
ς /ς)−γ ) and

ς − ς)2 /ς 2
ς /ς)2 ) = 2 + 2E(ˆ
E((ˆ
 2 + 2EβnG − βn0 2 /ς 2
374 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

 2 + 2cpn /(nς 2 )

by [16, Theorem 3.6]. Moreover, by the proof of T3 in Theorem 3.3, we know that pn /(nς 2 ) = op (1),
ς /ς)2 ) = Op (1). Thus, by the condition (g), we have
which implies immediately that E((ˆ
√ √
λ sn ςˆ−γ /(b1 n) = op (1/ n)Op (1). (A.6)

Note that δnG − βn0A  > |βn0j | − |δnGj |, ∀ j ∈ A. Then, minj∈A |δnGj | > minj∈A |βn0j | − δnG − βn0A .
Since δnG − βn0A  = Op ( sn /n) by [16, Theorem 3.6], we conclude that

min |δλwj 
 | > min |βn0j | − δnG − βn0A  − op (1/ n)Op (1)
j∈A j∈A
√ √ √
= ς − ( sn / n)Op (1) − op (1/ n)Op (1),

which implies that P (minj∈A |δλwj


 | > 0) → 1.
Next, we prove the asymptotic normality. Note that

αT −1/2 ¯ n0A )(δλw − βn0A )


n Kn (βn0A )J(β
= αT −1/2
n Kn
¯ n0A )(δλw − δnG ) + αT
(βn0A )J(β −1/2
n Kn
¯ n0A )(δnG − βn0A ).
(βn0A )J(β

From [16, Theorem 3.8], we have

αT −1/2 ¯ n0A )(δnG − βn0A ) −→


L
n Kn (βn0A )J(β N (0, 1)

(βn0A )J¯(βn0A )(δλw − δnG ) = op (1). For any


−1/2
by the condition (e). Then we only need to show αT
n Kn
bn ∈ Rsn satisfying bn  = 1, we know that


n
−1
R0 R̄−1 Ãi (δn )XiA bn
1/2 1/2
bT
n Kn (δn )bn = bT T
n XiA Ãi (δn )R̄
i=1

n
 λmax (R0 )λ−2
max (R̄) bT T
n XiA Ãi (δn )XiA bn
i=1

n
c λmax (Ãi (δn ))bT T
n XiA XiA bn
i=1

n
 cλmax T
XiA XiA bn 2
i=1

 cnc2

by the conditions (b) and (d). The above conclusion immediately implies that

sup √ λmin (Kn (δn )) = Op (n), (A.7)


sn
δn −βn0A κ n

sup √ λmax (Kn (δn )) = Op (n). (A.8)


sn
δn −βn0A κ n

Hence, together with (A.1), (A.4) and (A.6), we have

(αT −1/2
n Kn (βn0A )J¯(βn0A )(δλw − δnG ))2
¯ n0A )(δλw − δnG )2
 Kn−1/2 (βn0A )J(β
 λ−1 2 ¯   − δnG 2
min (Kn (βn0A ))λmax (J(βn0A ))δλw

 Op (n)Op ([λ sn ςˆ−γ /(b1 n)]2 )
= Op (n)op (1/n)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 375

= op (1).

Therefore, αT
−1/2 ¯ n0A )(δλw − δnG ) = op (1). This completes the proof.
n Kn (βn0A )J(β
Proof of Theorem 3.7. It is sufficient to show that

sup |bT  −1
n (ΣnA − ΣnA )bn | = op (n ),
bn =1

for any bn ∈ Rsn . Since

 nA − ΣnA = J −1 (βλwA
Σ    ) − Kn (βn0A )]J −1 (βλwA
 )[Kn (βλwA  )
n n

+ [Jn−1 (βλwA ¯−1 −1 


 ) − Jn (βn0A )]Kn (βn0A )Jn (βλwA
 )

+ J¯n−1 (βn0A )Kn (βn0A )[Jn−1 (βλwA ¯−1


 ) − Jn (βn0A )]

 T1 + T2 + T3 ,
−1
and it is sufficient to show that supbn =1 |bT
n Ti bn | = op (n ), i = 1, 2, 3. Note that

 n (βλwA
max(|λmax (K    ) − Kn (βn0A ))|)
 ) − Kn (βn0A ))|, |λmin (Kn (βλwA
sup |bT
n T 1 bn |  .
bn =1 λ2 (Jn (βλwA
min  ))

 n (βλwA
Then we evaluate the eigenvalues of K  ) − Kn (βn0A ) first. Let Si (βn ) =
T
i (βn ) i (βn ) for i =
1, 2, . . . , n. Note that

|bT    )−K
n [Kn (βλwA
 n (βn0A )]bn |  |bT
n (T11 + T12 + T13 )bn |,

where

n
[Ãi (βλwA −1 Si (βλwA −1 Ã1/2 (βλwA
1/2 1/2
 ) − Ãi (βn0A )]R
T
T11 = XiA  )R i  )XiA
i=1

n
T12 = T
XiA
1/2 −1 [Si (βλwA
Ãi (βn0A )R −1 Ã1/2 (βn0A )XiA
 ) − Si (βn0A )]R i
i=1

n
T13 = T
XiA
1/2 −1 Si (βλwA
Ãi (βn0A )R −1 [Ã (βλwA
 )R
1/2
 ) − Ãi (βn0A )]XiA .
1/2
i
i=1

Since Ãi (βλwA


1/2 1/2
 ) − Ãi (βn0A ) is the diagonal matrix, by the condition (c), we have

1/2  1/2   ) − Ãij (βn0A )|


bT T
 ) − Ãi (βn0A )]  XiA bn  max |Ãij (βλwA
n XiA [Ãi (βλwA
j

 cXiA bn βλwA
 − βn0A  max Xij .
j

Note that

EYi − π̃i (βλwA   ) − π̃i (βn0A )2 }


 )  2{EYi − π̃i (βn0A ) + Eπ̃i (βλwA
2 2

 2mc{1/4 + EβλwA
 − βn0A  }
2

 2mc{1/4 + O(sn /n)}


= O(1),

by Theorem 3.2 and (A.6). It implies that Yi − π̃i (βλwA


 ) = Op (1). Consequently,

 −1 i (βλwA
R −2  −1 
 )  λmin (R)λmax (Ãi (βλwA
2   )2
 ))Yi − π̃i (βλwA

 cOp (1),
376 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

by the condition (d). Therefore, by Remark 3.6 and the conditions (a)–(h), we have

n
1/2 
|bT
1/2  −1 i (βλwA 1/2 
n T11 bn |  bT  ) − Ãi (βn0A )]R  ) Ãi (βλwA
 )XiA bn 
T 2
n XiA [Ãi (βλwA
i=1
n
c XiA bn 2 βλwA
 − βn0A  max Xij 
j
i=1

n
 βλwA
 − βn0A  sup Xij λmax
T
XiA XiA bn 2
i,j i=1

 Op ( sn /n)Op ( pn )Op (n)

 Op (pn n) = op (n).

Thus supbn =1 |bT n T11 bn | = op (n). Similarly, we can also prove that supbn =1 |bn T12 bn | = op (n) and
T

supbn =1 |bn T13 bn | = op (n). Therefore,


T

sup |bT    )−K n (βn0A )]bn | = op (n).


n [Kn (βλwA
bn =1

Similarly, supbn =1 |bT 


n [Kn (βn0A ) − Kn (βn0A )]bn | = op (n). Thus,

sup |bT    ) − Kn (βn0A )]bn | = op (n).


n [Kn (βλwA
bn =1

Note that

n
λmin (Jn (βλwA −1 ) min(Ãij (βλwA
 ))  λmin (R  ))λmin
T
XiA XiA  Op (n),
i,j
i=1

−1
by the conditions (b) and (d). Consequently, we have supbn =1 |bT
n T1 bn | = op (n ). Similarly, note that

Jn−1 (βλwA ¯−1 −1 


 ) − Jn (βn0A ) = [Jn (βλwA
¯−1   )] + [J¯−1 (βλwA
 ) − Jn (βλwA n
¯−1
 ) − Jn (βn0A )],

and

Jn−1 (βλwA ¯−1   ) = J¯n−1 (βλwA


 ) − Jn (βλwA
¯   ) − Jn (βλwA
 )[Jn (βλwA
−1 
 )]Jn (βλwA
 ),

J¯n−1 (βλwA ¯−1 ¯−1 ¯ ¯   )]J¯−1 (βλwA


 ) − Jn (βn0A ) = Jn (βn0A )[Jn (βn0A ) − Jn (βλwA n  ),

−1
and we can also prove that supbn =1 |bT n Ti bn | = op (n ), i = 2, 3. This completes the proof.
n n
Proof of Theorem 3.9. Let H̄n (βn ) = i=1 Xi Ai (βn )R̄−1 Ai (βn )Xi and D̄n (βn ) = i=1 XiT Ai (βn )
T 1/2 1/2 1/2

R̄−1 R0 R̄−1 Ai (βn )Xi . Let Σn be the asymptotic covariance matrix of βnG . Then, according to [16,
1/2

Corollary 3.9], Σn = H̄n−1 (βn0 )D̄n (βn0 )H̄n−1 (βn0 ). Similar to (A.1) and (A.7), we have
−1 −1
n−1 bT −1
n Σn b n  n λmin (D̄n (βn0 ))bT 2
n H̄n (βn0 )bn
 n−1 Op (n−1 )λ2max (H̄n (βn0 ))bn 2
= Op (n−2 )Op (n2 )
= Op (1),

for any bn ∈ Rpn satisfying bn  = 1. It follows that

λmin (n−1 Σ−1


n ) = Op (1) and λmax (n
−1 −1
Σn ) = Op (1). (A.9)

And by [16, Theorem 3.10], we know that Σ  n is the consistent estimator of Σn which implies the con-
sistency of the sample eigenvalues. Further, let βS = arg min{βn ∈Rpn } (Z − A(βnG )Xβn )T V −1 (βnG )(Z −
A(βnG )Xβn ) for any candidate model S.
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 377

For any Sλ = A, we consider two different cases according to whether the model is underfitted or
overfitted. In either case, we use the same techniques of [17] or [9] to prove the conclusion of the theorem.
Case 1 (Underfitted model). Note that λn satisfies the conditions in Theorem 3.4. Consequently, its
associated PQF value is of the order oP (1). By the fact that βS = βn0 for any S  A and the definition
(3.5), we have

inf P QFλ  inf n−1 (βλw − βnG )T Σ


 −1   − βnG )
n (βλw
λ∈R− λ∈R−

 inf n−1 (βS − βnG )T Σ


 −1  
n (βS − βnG )
SA

 −1 ) min βS − βnG 2


 λmin (n−1 Σ n
SA
−1 
 −1
 λmin (n−1 Σ n ) min {2 βS − βn0 2 − βnG − βn0 2 }
SA
 
  −1
λmin (n−1 Σ n ) 2−1 min |βn0j |2 − βnG − βn0 2 .
j∈A

Then, with the probability tending to 1, the right-hand side will be of higher order than Cn p2n /n by [16,
Theorem 3.6], (A.9), and the condition (h). Hence P (inf λ∈R− P QFλ > P QFλn ) → 1.
Case 2 (Overfitted model). Similar to (A.9), we know that λmax (ΣnA ) = Op (n). Let ej be an sn × 1
vector with the j-th element equal to 1, and 0 otherwise. Then, from Remark 3.6, we know that

−1/2  L
(eT
j ΣnA ej )  − βn0j ) −→ N (0, 1),
(βλn wj (A.10)

for each j ∈ A. It implies that, for every j ∈ A, βλn wj


1/2
 − βn0j = Op (λmax (ΣnA )) = Op (n
1/2
). Therefore,
 
βλn w − βn0  = δλn w − βn0A  = Op ( sn /n). For any λ ∈ R+ , we have dfλ − sn  1. Then
−1   T −1 Σ −1   − βnG ) + Cn
np−1
n (P QFλ − P QFλn )  −npn (βλn w
 − βnG ) n n (βλn w

 −np−1
n λmax (n Σn )βλn w − βnG 2 + Cn .
−1  −1

Note that np−1   − βnG 2  2np−1 (βλ w − βn0 2 + βnG − βn0 2 ). Then, by [16, Theorem 3.6]
n βλn w n n

and (A.10), we know that np−1   − βnG 2 = Op (1). Since Cn → ∞ as n → ∞, the right-hand side
n βλn w
diverges to infinity with the probability tending to 1, i.e., P (inf λ∈R+ P QFλ > P QFλn ) → 1. Hence, the
proof is completed.

You might also like