0% found this document useful (0 votes)

5 views19 pages

Mathematics: Science China

This paper presents a two-stage shrinkage estimation method for analyzing correlated binary data with a diverging number of parameters, particularly in high-dimensional settings. The authors propose a weighted least-squares (WLS) approach combined with a penalized model for variable selection and parameter estimation, demonstrating its effectiveness through simulation studies and a practical application. The method retains oracle properties and provides consistent estimation even when the working correlation matrix is misspecified.

Uploaded by

yx20021010141592

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views19 pages

Mathematics: Science China

Uploaded by

yx20021010141592

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

SCIENCE CHINA

Mathematics
. ARTICLES . February 2013 Vol. 56 No. 2: 359–377
doi: 10.1007/s11425-012-4564-y

Shrinkage estimation analysis of correlated binary

data with a diverging number of parameters
XU PeiRong1 , FU WenJiang2 & ZHU LiXing3,∗

1School
of Finance and Statistics, East China Normal University, Shanghai 200241, China;
2Department of Epidemiology and Biostatistics, Michigan State University, Michigan, MI 48824, USA;
3Department of Mathematics, Hong Kong Baptist University, Kowloon Tong 999077, Hong Kong, China

Email: xupeirong1108@[Link], fuw@[Link], lzhu@[Link]

Received March 2, 2011; accepted March 29, 2012; published online January 10, 2013

Abstract For analyzing correlated binary data with high-dimensional covariates, we, in this paper, propose
a two-stage shrinkage approach. First, we construct a weighted least-squares (WLS) type function using a
special weighting scheme on the non-conservative vector field of the generalized estimating equations (GEE)
model. Second, we define a penalized WLS in the spirit of the adaptive LASSO for simultaneous variable
selection and parameter estimation. The proposed procedure enjoys the oracle properties in high-dimensional
framework where the number of parameters grows to infinity with the number of clusters. Moreover, we prove
the consistency of the sandwich formula of the covariance matrix even when the working correlation matrix is
misspecified. For the selection of tuning parameter, we develop a consistent penalized quadratic form (PQF)
function criterion. The performance of the proposed method is assessed through a comparison with the existing
methods and through an application to a crossover trial in a pain relief study.

Keywords correlated binary data, variable selection, diverging number of parameters, adaptive LASSO,
GEE, oracle properties, sandwich covariance formula, penalized quadratic form function

MSC(2010) 62F07, 62J12

Citation: Xu P R, Fu W J, Zhu L X. Shrinkage estimation analysis of correlated binary data with a diverging
number of parameters. Sci China Math, 2013, 56: 359–377, doi: 10.1007/s11425-012-4564-y

1 Introduction
High-dimensional correlated data, which consists of repeated measurements on a large number of co-
variates, arise frequently from health and medical studies. For example, in a crossover trial comparing
the effect of placebo with that of the low and high doses of analgesic drug for pain relief (see [4]), each
participating woman was given all three treatments sequentially, one dose at each time with different
ordering of the treatments for different subject. In making the inference on the relationship between the
outcome measure of either no effect of pain relief (0) or some effect of pain relief (1) and the covariates:
the current treatment, the previous treatment, and the current period, the final model will be selected
based on the selection and estimation of the effects while taking into consideration the correlation between
the outcomes from different treatments within the same subject. Various statistical methods of model
parameter estimation have been developed, see [4, 15]. Recent studies of variable selection for analyzing
correlated data received much attention for high-dimensional data structure.
∗ Corresponding author

c Science China Press and Springer-Verlag Berlin Heidelberg 2013 [Link] [Link]
360 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

For example, QIC was proposed as a modification to Akaike’s information criteria (AIC), using quasi-
likelihood instead of likelihood [13]; a generalized version of Mallow’s Cp to measure model adequacy of
prediction was introduced [2]; BIQIF was introduced as a Bayesian information type of criterion based
on the quadratic inference function [18]. These are best subset type variable selection procedures which
become computationally intensive when the number of covariates is moderately large. A generalization
of the bridge and LASSO penalties was applied to generalized estimating equations (GEE) models [8];
similarly, the SCAD penalty (see [6]) was applied to the GEE model [5] following the spirit in [8]; a variable
selection was studied for linear mixed effects model for correlated data [1]. However, the aforementioned
papers all assumed that the dimension of covariates is fixed. Therefore, it is important to develop new
statistical methodology and theory of variable selection and estimation for high-dimensional clustered
data.
In this paper, we propose a two-stage shrinkage estimation procedure for variable selection in the
“large n, diverging p” setup for sparse generalized linear models, where p = pn is a function of the num-
ber of clusters n. We focus on correlated binary data, which is driven by the crossover trial analyzed in
Section 5. We first construct a weighted least-squares (WLS) type objective function to approximate a
potential function, although a real potential function does not exist in the GEE model in general. We
then define a penalized WLS model for simultaneous variable selection and parameter estimation. It
is noted that the proposed potential approximate function is path-dependent, requiring no conservative
gradient vector field. We show that the proposed approach achieves the desirable oracle properties, un-
der appropriate conditions for dimension, the sample size and the strength of the signal. In addition, it
inherits the built-in robustness of the GEE model with consistent estimation allowing misspecification of
the working correlation matrix. And the sandwich variance formula for nonzero part of true parameters
still remains valid in this new context. We further recommend a penalized quadratic form (PQF) func-
tion criterion that facilitates the oracle properties. Simulation studies suggest that, with the PQF, the
propsoed method is more efficient than the LASSO-based penalized estimating equations presented by
[8]. Also, the sandwich variance formula achieves satisfactory performance.
The rest of the paper is organized as follows. In Section 2, we propose our variable selection method
for correlated binary data. In Section 3, we establish the oracle properties of the proposed method, the
consistency of the sandwich variance formula, and the consistency of the PQF function criterion. Section 4
reports the Monte Carlo simulation study results. Section 5 applies the new method to a three crossover
trial dataset. Section 6 gives a brief discussion. All technical proofs are relegated to the Appendix.

2 Penalized weighted least-squares (PWLS)

Suppose that there are n clusters and mi subjects within cluster i, i = 1, 2, . . . , n. Let Yij be the binary
response for the subject in cluster i, and Xij be the corresponding pn × 1 covariate vector. Deﬁne
Yi = (Yi1 , . . . , Yimi )T , and Xi = (Xi1 , . . . , Ximi )T . Provided the mean of Yij depends only on the
covariate vector of subject j, i.e., E(Yij | Xi ) = E(Yij | Xij ) (see [14]), we consider the regression model
logit(E(Yij | Xij )) = Xij
T
βn , (2.1)

where logit(t) = log(t/(1 − t)), βn is a pn -dimensional unknown parameter vector. Denote πij = E(Yij |
Xij ). Then, the marginal variance of Yij is assumed as Var(Yij | Xij ) = πij (1 − πij ), where a dispersion
parameter may be added in the marginal variance function if overdispersion is suspected to be present.
Without loss of generality, the number of repetitions mi are assumed to be equal across the clusters, for
example, m. Then, let πi (βn ) = (πi1 (βn ), . . . , πim (βn ))T , where πij (βn ) = exp(Xij
T T
βn )/(1 + exp(Xij βn )),
Ai (βn ) be the m × m diagonal matrix with elements πij (βn )(1 − πij (βn )), and R(τ ) be an m × m working
correlation matrix, where τ is a ﬁnite dimensional parameters which fully characterizes R(τ ). The GEE
model (see [10]) is deﬁned as

n
1/2 −1 A−1/2 (βn )(Yi − πi (βn )) = 0,
XiT Ai (βn )R (2.2)
i
i=1
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 361

where R is the estimator of R(τ ). A key advantage of the GEE model is that the GEE estimator βnG
achieves consistent estimation for the model parameters even when the working correlation matrix is
misspeciﬁed in “large n, diverging p” setup (see [16]).
Motivated by this appealing fact, we consider a WLS-type function for model (2.1). First, for the
simplicity of notation, let Y = (Y1T , . . . , YnT )T , X = (X1T , . . . , XnT )T , π(βn ) = (π1T (βn ), . . . , πnT (βn ))T ,
1/2 1/2 (βn ). Let V (βn ) and A(βn ) be, respectively, the nm × nm block diagonal
and Vi (βn ) = Ai (βn )RA i
matrix with diagonal elements Vi (βn )’s and Ai (βn )’s. Consider the following approximation to the residual
Y − π(βnG ). Let Z = A(βnG )X βnG − (Y − π(βnG )), we have

Z ≈ A(βnG )X βnG .

Then, the WLS-type function can be constructed as

W LS(βn ) (Z − A(βnG )Xβn )T V −1 (βnG )(Z − A(βnG )Xβn ), (2.3)

and the parameter estimator is the minimizer of this WLS-type function.

Remark 2.1. The main advantage of this approach is that one can construct an approximate potential
function WLS(β) with no requirement of the conservative gradient vector ﬁeld. Further, since Z =
A(βnG )X βnG − (Y − π(βnG )) and βnG is the solution of (2.2), one has

βnG = arg min WLS(βn ).

{βn ∈Rpn }

It implies that the special approximation leads to the GEE estimator as the minimizer. The approximate
allows the construction of the penalty models, such as the LASSO and the SCAD, in the conventional
way by minimizing the “objective loss function” with the added penalty, as we shall see in the following.
However, it has to be recognized that since the WLS does not yield the same vector field as in the GEE
model (2.2), the integral in the vector field is path dependent. We will show that this approximate leads
to an efficient penalty estimator unlike the penalized GEE estimator as defined in [8].
Following [20], we construct a penalized WLS for parameter estimation and achieve model selection
based on the above objective function in a two-stage procedure as follows:
Step 1. Obtain an initial estimator βnG of the estimating equation (2.2), such as the GEE estimator.
Step 2. Define a WLS-type function (2.3) as an objective function by replacing the parameters βn in
V (βn ) and A(βn ) with the estimate βnG .
Step 3. Define the penalized WLS (PWLS) estimator

pn
βλw = arg min WLS(βn ) + λ j |βj |,
w (2.4)
{βn ∈Rpn } j=1

j = |βnGj |−γ , and γ > 0.

where w
Remark 2.2. Note that PWLS (2.4) yields a sparse solution βλw by setting some parameters to 0.
Hence, variable selection is achieved with the nonzero values of βλw .
Remark 2.3. Note that PWLS (2.4) differs from the penalized estimating equations studied in [8] since
the PWLS estimator is achieved through a special algorithm as in the above and the non-conservativeness
of the GEE vector field implies that the path-dependent vector field leads to different estimators through
different paths. Therefore, PWLS (2.4) differs in many ways from the penalized estimating equations
estimator. We shall show in the next section that PWLS leads to efficient estimation compared with the
penalized GEE estimator.
Remark 2.4. Although the penalty function considered in this paper is the adaptive LASSO, PWLS
can be defined with other penalty functions in a more general class. Specifically, given the WLS-type
362 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

function (2.3), we can consider the variable selection methods by minimizing the PWLS function, taking
the form

pn
P (βn ) = WLS(βn ) + n pλ (|βj |). (2.5)
j=1

Formulation (2.5) includes many popular variable selection methods, including the adaptive LASSO
deﬁned in (2.4), the bridge method (see [7]) using the Lq penalty pλ (|βj |) = λ|β|q , the SCAD penalty
(see [6]), the MCP penalty (see [19]), and the SICA penalty (see [11]).

3 Asymptotic properties when pn → ∞

To study the asymptotic behavior of PWLS estimator when pn → ∞, we ﬁrst introduce some notations.
Let A = {j : βn0j = 0, j = 1, . . . , pn } be the true sparse model with non-sparsity size sn = |A|, where
βn0 = (β01 , . . . , β0pn )T denotes the true value. Let S = {j1 , . . . , jd } denote an arbitrary candidate model,
which contains the j1 -th, . . . , jd -th (1 d pn ) predictors. For an arbitrary vector b ∈ Rpn , let bS ∈ R|S|
denote the subvector of b associated with the candidate model S. For an arbitrary d × pn matrix Q, let
QS denote the d × |S| submatrix of Q associated with the candidate model S.

3.1 Oracle properties

In this section, we shall show that with a proper choice of λ, PWLS retains oracle properties when
pn → ∞. The main idea is to ﬁnd an asymptotic characterization of the solution to the PWLS criterion.
In other words, the ﬁrst thing we need to do is propose an estimator based on the oracle information of
the true subset of the model we know in advance.
For this reason, we consider the following model:

logit(E(Yij | Xij )) = XijA

T
δn , (3.1)

T T
where δn is an sn -dimensional unknown vector. Let π̃ij (δn ) = exp(XijA δn )/(1 + exp(XijA δn )). Denote

δnG by the solution of the following estimating equations:

n
T
XiA
1/2 −1 Ã−1/2 (δn )(Yi − π̃i (δn )) = 0,
Ãi (δn )R (3.2)
i
i=1

where π̃i (δn ) = (π̃i1 (δn ), . . . , π̃im (δn ))T , Ãi (δn ) is the m × m diagonal matrix with elements π̃ij (δn )(1 −
π̃ij (δn )). It implies immediately that δnG is the GEE estimator of βn0A . Further, let Ṽi (βn ) = Ãi (βn )R
1/2

1/2 T T T
Ãi (βn ) for i = 1, 2, . . . , n, and π̃(δn ) = (π̃1 (δn ), . . . , π̃n (δn )) . Let Ã(δn ) and Ṽ (δn ) be, respectively, the
nm × nm block diagonal matrixs with diagonal elements Ãi (δn )’s and Ṽi (δn )’s, and Z̃ = Ã(δnG )XA δnG −
(Y − π̃(δnG )). Then, similar to (2.4), we deﬁne an estimator δλw as δλw = arg min{δn ∈Rsn } Q(δn ), where

Q(δn ) = (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ) + λ j |βj |.
w (3.3)
j∈A

Throughout the paper, we use c to denote a generic positive constant which may vary from line to line.
Lemma 3.1. Suppose that the conditions (a)–(d) in the Appendix hold. If we assume the model (3.1)
and n−1 s2n = o(1), then for any κ > 0, there exist two positive constants b1 and b2 such that

b1 n sup √ λmin (Jn (δn )) sup √ λmax (Jn (δn )) b2 n,

sn sn
δn −βn0A κ n δn −βn0A κ n

n
Ãi (δn )R̂−1 Ãi (δn )XiA .
T 1/2 1/2
where Jn (δn ) = i=1 XiA
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 363

Theorem 3.2. Under the model (3.1) and the conditions (a)–(e) in the Appendix, we have
2
E(δλw − βn0A 2 ) 2 λ2 sn max{wj } + cnsn /(b21 n2 ),
j∈A

where b1 is the constant given in Lemma 3.1.

Note that the derived risk bound is nonasymptotic. Theorem 3.2 is very useful for finding the asymp-
totic characterization of the solution of the PWLS criterion, which will be given in the following theorem
specifically.
Theorem 3.3. T
Let us write βn0 = (βn0A , 0T )T . Then, under the conditions (a)–(g) in the Appendix,
T T T
β̃λw = (δλw , 0 ) is the solution to (2.4) with probability tending to 1.
From Theorem 3.3, we know that the PWLS estimator works asymptotically as if it had the oracle infor-
mation that we know the true subset model in advance. This naturally implies that the PWLS may enjoy
n
Ãi (δn )R̄−1 R0 R̄−1
T 1/2
the oracle property, which is confirmed in the next theorem. Let Kn (δn ) = i=1 XiA

Ãi (δn )R̄−1 Ãi (δn )XiA .
1/2 n 1/2 1/2
Ãi (δn )XiA , and J¯n (δn ) = i=1 XiA
T

Theorem 3.4. Under the conditions (a)–(g) in the Appendix, the PWLS estimator satisfies the fol-
lowing properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
= 0} = A) = 1,
T −1/2 ¯
(2) asymptotic normality: αn Kn (βn0A )Jn (βn0A )(βλwA D
− βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
fying αn = 1.
Remark 3.5. Theorem 3.4 shows that, with probability approaching 1, the PWLS criterion can
identify the right subset model, and the PWLS estimator has the optimal convergence rate, when the
number of parameters diverges. Moveover, the PWLS criterion has the model selection consistency and
the asymptotic normality even when the correlation structure is misspeciﬁed.
Remark 3.6. Let ΣnA = J¯n−1 (βn0A )Kn (βn0A )J¯n−1 (βn0A ). Together with Theorem 3.4, an application
of the Cramer-Wold device yields that for any q × sn matrix Qn with q ﬁxed and such that Qn ΣnA QT
n
is invertible, we have
−1/2
Qn (βλwA
D
(Qn ΣnA QTn) − βn0A ) −→ Nq (0, Iq ),

where Iq is a q × q identity matrix.

3.2 Estimation of covariance matrix

Since we can select variables and estimate parameters simultaneously, the covariance matrix of the esti-
mated parameters can be obtained directly. Naturally, the well-known sandwich covariance formula can
be used as an estimator for the covariance of the estimate βλwA . Let
, the nonvanishing component of βλw
−1/2
i (δn ) = Ãi (δn )(Yi − π̃i (δn )), i = 1, 2, . . . , n. Similar to that proposed by [10] in the “ﬁxed p” setup and
by [16] in the “large n, diverging p” setup, the sandwich covariance matrix estimator of ΣnA is deﬁned as
nA = Jn−1 (βλwA
Σ )Jn−1 (βλwA
)Kn (βλwA ), (3.4)

where

n
Jn (δn ) = T
XiA
1/2 −1 Ã1/2 (δn )XiA ,
Ãi (δn )R i
i=1
n
n (δn ) =
K T
XiA
1/2 −1 i (δn )
Ãi (δn )R T −1 Ã1/2 (δn )XiA .
i (δn )R i
i=1

The following theorem shows the consistency of the sandwich covariance matrix estimator (3.4).
Theorem 3.7. If the conditions (a)–(g) in the Appendix are satisfied, then we have
nA QT
Qn Σ n − Qn ΣnA Qn = op (n
T −1
),

n = G, where q is fixed, and G is a q × q positive definite matrix.

for any q × sn matrix Qn such that Qn QT
364 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Remark 3.8. In the “large n, diverging p” setup, Theorem 3.7 provides a consistent estimator for
the variance of the nonvanshing part of the PWLS estimator even when the working correlation matrix
is misspeciﬁed. Then, based on Remark 3.6 and the above sandwich covariance matrix estimator, an
asymptotic (1 − α)% conﬁdence interval (0 < α < 1) for βn0j (j ∈ A) can be constructed as

βλwj T
± zα/2 (ej ΣnA ej )
1/2
,

where ej is an sn × 1 vector with the j-th element equal to 1, and 0 otherwise, zα/2 is the upper α/2
quantile of the standard normal distribution.

3.3 Tuning parameter selection

Note that the pleasing performances of the PWLS estimator heavily hinge on a proper selection of the
tuning parameter. To implement the PWLS method in practice, it is desirable to have an automatic
data-driven method for selecting the tuning parameter λ. We now use a selector that is in spirit similar
to the one proposed by [9].
n 1/2 −1 A1/2 (βn )Xi and Dn (βn ) = n X T
For the sake of simplicity, let Hn (βn ) = i=1 XiT Ai (βn )R i i=1 i
1/2
Ai (βn )R −1 εi (βn )εT (β)R
−1 A1/2 (βn )Xi , where εi (βn ) = A−1/2 (βn )(Yi −πi (βn )). Denote Σ n = Hn−1 (βnG )
i i i
Dn (βnG )Hn−1 (βnG ). Based on the above notations, we define a penalized quadratic form (PQF) function
criterion as
P QFλ = n−1 (βλw − βnG )T Σ
−1 (βλw − βnG ) + Cn × pn × dfλ /n,
n (3.5)
where dfλ is the number of nonzero coefficients in βλ , and Cn > 0 is required to grow to infinity with the
sample size.
We now establish the consistency of the proposed PQF function criterion (3.5) when the number of
parameters diverges. Let Sλ = {j : βλwj is the solution of (2.4) given λ. According
= 0}, where βλw
to whether the resulting model Sλ is underfitted, correctly fitted, or overfitted, we partition R+ into the
following three mutually exclusive regions:

R− = {λ ∈ R+ : Sλ A},
R0 = {λ ∈ R+ : Sλ = A},
R+ = {λ ∈ R+ : Sλ ⊃ ST , Sλ = A}.

In addition, we deﬁne a reference tuning parameter sequence {λn ∈ R+ }∞ n=1 , where λn = n

3/4−(1−3v)(1+γ)/4
.
Since λn satisﬁes the conditions in Theorem 3.4, it follows that Sλn = A with probability tending to one.
We remark that {λn ∈ R+ }∞ n=1 is constructed only for a theoretical proof. It does not imply that the
optimal tuning parameter must be λn . The following theorem shows the consistency of the proposed
PQF function criterion.
Theorem 3.9. Under the conditions (a)–(h) in the Appendix, P (inf λ∈R− ∪R+ P QFλ > P QFλn ) → 1.
Theorem 3.9 implies that in an almost sure convergence sense, any λ failing to identify the true model
cannot be selected as the optimal tuning parameter. Therefore, the model associated with the optimal λ
must be the true one.
Remark 3.10. The divergence rate of Cn can be arbitrarily slow, although in theory we require it
to diverge to inﬁnity. In our simulations, we set Cn = log(log(pn )) as that suggested by [9]. For all our
numerical experiments below, the simulation results are rather encouraging.

4 Monte Carlo simulations

In this section, we assess the ﬁnite sample performance of the proposed PWLS method with PQF function
criterion via a simulation experiment in terms of model selection and estimation accuracy. We compared
our proposed PQF based PWLS method with the LASSO-based penalized estimating equations (PGEE)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 365

with QGCV (see [8]). We consider the following logistic model. The response variable Yij is binary and
its marginal mean is πij with
T
logit(πij ) = Xij βn0 , j = 1, 2, . . . , 10 and i = 1, 2, . . . , n.

where βn0 is a pn -dimensional vector of parameters with pn = 4n1/3 − 5, with s denoting the largest
integer not greater than s. The covariate vectors Xij are i.i.d. from normal distribution Npn (0pn , Σ)
with Σ whose (i, j)-th element is equal to 0.5|i−j| . For all t with 1 t 10, Yit are of an AR(1)
correlation structure with auto-correlation coefficient ρ. The true coefficient vector is taken to be βn0 =
(0.51dn , 0pn −dn ), where dn = pn /5 . We consider three correlations ρ = 0.3, 0.5, 0.7 and three sample
sizes n = 100, 200, 400. The simulations with 200 replications for every case are conducted using R
codes since we use the correlated random binary data generator by [12]. And three different working
correlation structures are considered: independence working correlation matrix (IND), exchangeable
working correlation matrix (CS), and the first-order auto-correlation working correlation matrix (AR(1)).
The summary of simulation results is depicted in Tables 1–3.
To find the subset which minimizes each criterion, we exhaustively search for all possibilities. Thus,
the corresponding results represent the best subset variable selection under the underlying criterion. We
compare the variable selection procedures in terms of model error and model complexity, summarized
with the simulated average mean square error (MSE) obtained by averaging βn0 − βn0 2 /pn over 200
simulated samples, the average number (per simulation) of truly zero coefficients correctly estimated as
zero, i.e., correct deletions (CD), the average number of truly nonzero coefficients erroneously set to zero,
i.e., erroneous deletions (ED), the percentage of times the correct true model exactly identified (PCM),
and the average model size (MS).
Several observations can be made from Tables 1–3. First, the PQF based PWLS method is robust
against the growth of correlation degree. Second, the PQF based PWLS method is also robust in terms
of misspecification of working correlation matrix, which is reflected from Theorem 3.4. Third, compared
with the LASSO-based PGEE, both of them perform satisfactorily in terms of variable selection, while

Table 1 Simulation results by diﬀerent procedures assuming IND working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS

Mean SE
100 13 2 0.3 PWLS 0.366 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.214 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.303 10.985 (99.90%) 0 (0%) 0.985 2.015 0.122
PGEE 1.174 10.995 (99.95%) 0 (0%) 0.995 2.015 0.122
0.7 PWLS 0.347 10.990 (99.90%) 0 (0%) 0.990 2.010 0.100
PGEE 1.190 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.369 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.323 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.413 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.427 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.397 14.995 (99.97%) 0 (0%) 0.995 3.005 0.071
PGEE 1.370 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.475 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.053 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
0.5 PWLS 0.484 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 1.028 19.975 (99.88%) 0 (0%) 0.975 4.025 0.157
0.7 PWLS 0.462 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 0.981 19.985 (99.90%) 0 (0%) 0.985 4.015 0.122
366 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Table 2 Simulation results by diﬀerent procedures assuming AR(1) working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS

Mean SE
100 13 2 0.3 PWLS 0.370 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.244 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.332 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
PGEE 1.155 11.000 (100%) 0 (0%) 1.000 2.000 0.000
0.7 PWLS 0.392 11.000 (100%) 0 (0%) 1.000 2.000 0.000
PGEE 1.809 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.371 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.532 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.393 14.990 (99.93%) 0 (0%) 0.990 3.010 0.100
PGEE 1.555 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.417 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.385 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.477 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.587 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.5 PWLS 0.474 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.884 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.7 PWLS 0.457 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.849 20.000 (100%) 0 (0%) 1.000 4.000 0.000

Table 3 Simulation results by diﬀerent procedures assuming CS working correlation matrix

n pn dn ρ Criterion MSE CD ED PCM MS

Mean SE
100 13 2 0.3 PWLS 0.368 10.965 (99.70%) 0 (0%) 0.965 2.035 0.184
PGEE 1.232 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
0.5 PWLS 0.315 10.990 (99.90%) 0 (0%) 0.990 2.010 0.100
PGEE 1.174 11.000 (100%) 0 (0%) 1.000 2.000 0.000
0.7 PWLS 0.368 10.995 (99.95%) 0 (0%) 0.995 2.005 0.071
PGEE 1.255 11.000 (100%) 0 (0%) 1.000 2.000 0.000
200 18 3 0.3 PWLS 0.368 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.539 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.5 PWLS 0.408 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.626 15.000 (100%) 0 (0%) 1.000 3.000 0.000
0.7 PWLS 0.402 15.000 (100%) 0 (0%) 1.000 3.000 0.000
PGEE 1.628 15.000 (100%) 0 (0%) 1.000 3.000 0.000
400 24 4 0.3 PWLS 0.470 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 1.730 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.5 PWLS 0.485 19.995 (99.98%) 0 (0%) 0.995 4.005 0.071
PGEE 2.122 20.000 (100%) 0 (0%) 1.000 4.000 0.000
0.7 PWLS 0.453 20.000 (100%) 0 (0%) 1.000 4.000 0.000
PGEE 2.151 20.000 (100%) 0 (0%) 1.000 4.000 0.000

the PQF based PWLS method has much smaller MSE, especially when CS working correlation matrix is
assumed. It conﬁrms the oracle property of the PWLS method empirically.
We next test the accuracy of the sandwich covariance formula. We consider the above model with
ρ = 0.5, the three sample sizes, and IND working correlation structure. For each case, the standard
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 367

Table 4 Standard deviations (SD) (multiplied by 1000) of

estimators assuming IND working correlation matrix

n pn dn Parameter SD SDm SDmad

100 13 2 β̂1 6.711 6.174 0.947
β̂2 6.113 6.036 0.958
200 18 3 β̂1 3.919 3.298 0.342
β̂2 4.601 4.108 0.465
β̂3 3.879 3.314 0.390
400 24 4 β̂1 2.028 1.780 0.147
β̂2 2.523 2.219 0.190
β̂3 2.891 2.197 0.201
β̂4 2.402 1.800 0.139

deviations of the estimated coeﬃcients are averaged among 200 replications. This can be regarded as the
true standard error (SD) and compared with 200 estimated standard error from the sandwich covariance
formula, which are summarized by their median (SDm ) and interquartile range divided by 1.349 (SDmad ).
The detailed results are summarized in Table 4. We observed that the sandwich covariance formula
performs satisfactorily. Similar phenomena are also observed with other working correlation matrix,
which are not reported here.

5 A three-period crossover trial data analysis

We apply the method to the data from a three-period crossover trial of an analgesic drug for relieving
pain from primary dysmenorrhoea (see [4]). Three levels of the analgesic (placebo, low and high doses)
were given to each of the 86 women, who were randomized to one of the six groups receiving treatments
in diﬀerent orders. The placebo was labeled as treatment A, whilst analgesic with low and high doses
were treatments B and C, respectively. The response was measured for each woman at the end of each
treatment period and was coded as (1) for some relief and (0) for no relief. Six explanatory variables
(covariates) are deﬁned as follows:

1, period 2(3),
x1 (x2 ) =
0, otherwise,
1, treatment B(C),
x3 (x4 ) =
0, otherwise,
1, the previous assignment is B(C),
x5 (x6 ) =
0, otherwise.
We consider 64 candidate models including all possible subsets of the six covariates. Table 5 reports
the estimated coefficients with p-values by using the GEE method and shows the models that were chosen
by PWLS and the LASSO-based PGEE with QGCV. From Table 5, it can be seen that PWLS selects
the best model that includes the covariates x1 , x2 and x5 , which are significant at the significance level
α = 0.05, whereas the LASSO-based PGEE selects the full model as the best model. We also note that
PWLS selects the same best model using either the IND, AR(1), or CS working correlation structures,
of which the results are not presented here for saving space.
Since the true values of parameters are unknown, one has to rely on out-of-sample testing to compare
the model performance in terms of forecasting. We then conduct 200 cross-validation experiments. For
each experiment, we randomly split the entire data set D = {1, . . . , 86} into two parts, where D0 is the
training data set with 70 subjects and D1 is the testing data set with 16 subjects. Consequently, denote
368 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Table 5 Parameter estimate (p-value in parentheses) for analgesic

drug clinical data under the IND working model

GEE PWLS PGEE

x1 2.106 (<0.001) 1.471 2.079
x2 2.068 (<0.001) 1.715 2.037
x3 −0.516 (0.336) - −0.543
x4 −0.342 (0.456) - −0.369
x5 0.929 (0.040) 0.403 0.934
x6 0.800 (0.138) - 0.796

120
Prediction error

100

PWLS PGEE

Different model selection methods

Figure 1 Model comparison in prediction error by 200 random splits of data set

X0 = {Xi : i ∈ D0 }, Y0 = {Yi : i ∈ D0 }, X1 = {Xi : i ∈ D1 }, and Y1 = {Yi : i ∈ D1 }. We use D0 to train

the models, including PWLS with the BIC criterion and the LASSO-based PGEE and test them using
the testing data set D1 . For each observation (Xi , Yi ) in D1 , Ŷi is the prediction of the response. The
methods are compared for the prediction error using the following loss function by [3]

(Yi − Ŷi )T Vi−1 (Yi − Ŷi ),
i∈D1

where Vi is the working covariance matrix estimated together by the entire data set D. For each method,
the prediction errors from 200 random splits are summarized in the boxplot in Figure 1. It can be seen
that PWLS performs better than the LASSO-based PGEE.

6 Extension
The approaches and techniques can be readily extended to general marginal longitudinal generalized
linear model, although the focus of the paper is on correlated binary data. The key step is the existence
of a root-(n/pn)-consistent initial estimator. From [16, Theorem 5.1] we know that the GEE estimator
βnG is root-(n/pn )-consistent for general marginal longitudinal generalized linear model. This is the
reason why we can readily extend the PWLS method to general cases. The generalized PWLS may also
enjoy the oracle property, which is conﬁrmed in the following theorem.
Theorem 6.1. Under the conditions (a)–(g) in the Appendix and [16, Conditions (A5) and (A6)], the
PWLS estimator satisfies the following properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
= 0} = A) = 1;
T −1/2 ¯
(2) asymptotic normality: α Kn (βn0A )Jn (βn0A )(βλwA L
− βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
n
fying αn = 1,
−1/2
where Kn (δn ) and J¯n (δn ) have the same expressions as in Subsection 3.1 with Ãij (δn ) = Var(Yij | Xij ),
i = 1, . . . , n, j = 1, . . . , mi .
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 369

In addition, if the primary interest is on subject-specific effects, then the mixed-effects models, espe-
cially generalized linear mixed models, are probably more appropriate. Model selection for the mixed-
effects models deserves further investigation and is ongoing.

Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No.
11201306), the Innovation Program of Shanghai Municipal Education Commission (Grant No. 13YZ065), the
Fundamental Research Project of Shanghai Normal University (Grant No. SK201207), the scholarship under the
State Scholarship Fund by the China Scholarship Council in 2011 and the Research Grant Council of Hong Kong,
Hong Kong, China (Grant No. #HKBU2028/10P).

References
1 Bondell H D, Krishna A, Ghosh S K. Joint variable selection for fixed and random effects in linear mixed-effects models.
Biometrics, 2010, 66: 1069–1077
2 Cantoni E, Flemming J M, Ronchetti E. Variable selection for marginal longitudinal generalized linear models. Bio-
metrics, 2005, 61: 507–514
3 Cantoni E, Filed C, Flemming J M, et al. Longitudinal variable selection by cross-validation in the case of many
covariates. Stat Med, 2005, 26: 919–930
4 Diggle P J, Heagerty P J, Liang K Y, et al. Analysis of Longitudinal Data, 2nd edition. New York: Oxford University
Press, 2002
5 Dziak J J, Li R Z. An overview on variable selection for longitudinal data. In: Quantitative Medical Data Analysis
Using Mathematical Tools and Statistical Techniques Chapter One. Singapore: World Scientific Publishing Co., 2006
6 Fan J Q, Li R Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc,
2001, 96: 1348–1359
7 Frank I E, Friedman J H. A statistical view of some chemometrics regression tools (with discussion). Technometrics,
1993, 35: 109–148
8 Fu W J. Penalized estimating equations. Biometrics, 2003, 59: 126–132
9 Leng C L, Li B. Least squares approximation with a diverging number of parameters. Statist Prob Lett, 2010, 80:
254–261
10 Liang K Y, Zeger S L. Longitudinal data analysis using generalised linear models. Biometrics, 1986, 73: 12–22
11 Lv J C, Fan Y Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann
Statist, 2009, 37: 3498–3528
12 Oman S D. Easily simulated multivariate binary distributions with given positive and negative correlations. Comput
Stat Data Anal, 2009, 53: 999–1005
13 Pan W. Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 2001, 57: 120–125
14 Pepe M S, Anderson G L. A cautionary note on inference for marginal regression models with longitudinal data and
general correlated response data. Comm Statist Ser B, 1994, 23: 939–951
15 Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000
16 Wang H, Leng C. Unified lasso estimation via least square approximation. J Amer Statist Assoc, 2007, 102: 1039–1048
17 Wang L. GEE analysis of clustered binary data with diverging number of covariates. Ann Statist, 2011, 39: 389–417
18 Wang L, Qu A. Consistent model selection and data-driven smooth tests for longitudinal data in the estimating
equations approach. J R Statist Soc B, 2009, 71: 177–190
19 Zhang C H. Nearly unbiased variable selection under minimax concave penalty. Ann Statist, 2010, 38: 894–942
20 Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc, 2006, 101: 1418–1429
21 Zou H, Zhang H H. On the adaptive elastic-net with a diverging number of parameters. Ann Statist, 2009, 37:
1733–1751

Appendix
In this section, the proof of theorems will be given. We ﬁrst introduce the following regularity conditions:
√
(a) supi,j Xij = Op ( pn ).
(b) Denote λmin (Q) and λmax (Q) as the minimum and maximum eigenvalues of a matrix Q, respectively.
Then, assume that there exist two positive constants c1 and c2 such that

n
n
−1 −1
c1 λmin n XiT Xi λmax n XiT Xi c2 .
i=1 i=1
370 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

(c) There exist two positive constants c3 and c4 such that 0 < c3 πij (βn0 ) c4 1, for i = 1, 2, . . . , n,
j = 1, 2, . . . , m.
(d) The eigenvalues of the common true correlation matrix R0 are bounded away from zero and +∞.
Hereafter, for a matrix B, define B = [trace(BB T )]1/2 as its Frobenius norm. The estimated working
correlation matrix R satisfies R −1 − R̄−1 = Op ( pn /n), where R̄ is a positive-definite matrix with
eigenvalues bounded away from zero and +∞. Note R̄ maybe not the true correlation matrix R0 .
(e) limn→∞ log(pn )/ log(n) = v, for some 0 v < 1/3.
√ √
(f) limn→∞ λ/ n = 0, and limn→∞ λn((1−3v)(1+γ)−1)/2 / n = ∞.
√ √ 1/γ
(g) limn→∞ n/(λ pn ) minj∈A |βn0j | = ∞.
(h) limn→∞ n/(pn Cn ) minj∈A |βn0j | = ∞, and limn→∞ Cn p2n /n = 0.
2

Conditions (a)–(e) are general regularity conditions that ensure the GEE estimator of βn0A is a root-
(n/sn )-consistent estimator under model (3.1) (see [16]). Conditions (f) and (g) are similar to [21,
Conditions (A5) and (A6)]. Condition (g) characterizes the strength of the signals, i.e., the nonzero coef-
ﬁcients can vanish but at the rate that can be distinguished by the penalized least squares. Condition (h)
is the same as [9, Condition (v)], which ensures the consistency of the proposed tuning parameter selector.
n
Ãi (δn )R̄−1 Ãi (δn )XiA . Note that
1/2 1/2
Proof of Lemma 3.1. Let J¯n (δn ) = i=1 XiA T

n
n
λmax T
XiA XiA λmax XiT Xi .
i=1 i=1

For any bn ∈ Rsn satisfying bn = 1, we have

n
¯ 1/2 −1 1/2
bT
n Jn (δn )bn = bT T
n XiA Ãi (δn )R̄ Ãi (δn )XiA bn
i=1

n
λ−1
min (R̄)λmax (Ãi (δn ))λmax
T
XiA XiA bn 2
i=1

= Op (n)

by the conditions (b) and (d). It implies that

sup √ sup |bT ¯

n Jn (δn )bn | = Op (n). (A.1)
sn b =1
δn −βn0A κ n
n

Similar to the proof of [16, Lemma 3.3], we have

√
sup √ sup |bT ¯
n [Jn (δn ) − Jn (δn )]bn | = Op ( nsn ).
sn b =1
δn −βn0A κ n
n

Therefore,

sup √ sup |bT

n Jn (δn )bn | sup √ sup |bT ¯
n [Jn (δn ) − Jn (δn )]bn |
sn b =1 sn b =1
δn −βn0A κ n
n δn −βn0A κ n
n

+ sup √ sup |bT ¯

n Jn (δn )bn |
sn b =1
δn −βn0A κ n
n

√
= Op ( nsn ) + Op (n)
= Op (n).

Note that Jn (δn ) is symmetric. The above conclusion immediately implies that

sup √ λmin (Jn (δn )) = Op (n),

sn
δn −βn0A κ n

sup √ λmax (Jn (δn )) = Op (n).

sn
δn −βn0A κ n
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 371

Consequently, there exist two positive constants b1 and b2 such that

b1 n sup √ λmin (Jn (δn )) sup √ λmax (Jn (δn )) b2 n.

sn sn
δn −βn0A κ n δn −βn0A κ n

Proof of Theorem 3.2. According to the deﬁnition of Z̃, it is easy to see that

δnG = (XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA )−1 XA
T
Ã(δnG )Ṽ −1 (δnG )Z̃,

which implies immediately that δnG satisﬁes that

δnG = arg min (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ), (A.2)
{δn ∈Rsn }

and
T
XA Ã(δnG )Ṽ −1 (δnG ){Z̃ − Ã(δnG )XA δnG } = 0. (A.3)
By the deﬁnition of δλw , we know that

(Z̃ − Ã(δnG )XA δλw )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δλw ) + λ j |δλwj
w |
j∈A

(Z̃ − Ã(δnG )XA δnG )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δnG ) + λ j |δnGj |.
w
j∈A

Thus, by (A.3), we have

λ j (|δnGj | − |δλwj
w |)
j∈A

Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δλw )2 − Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δnG )2
= (δnG − δλw )T XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA (δnG − δλw )
= (δnG − δλw )T Jn (δnG )(δnG − δλw ).

Note that j∈A j (|δnGj | − |δλwj
w |) j∈A j2 δnG − δλw . Therefore, by Lemma 3.1, we conclude
w
that
λ j2 δnG − δλw b1 nδnG − δλw 2 ,
w
j∈A

which results in the inequality

λ √
δnG − δλw j2 λ sn max{w
w j }/(b1 n). (A.4)
b1 n j∈A
j∈A

And by [16, Theorem 3.6], we know that under the model (3.1), we have δnG − βn0A = Op ( sn /n).
Consequently,

E(δλw − βn0A 2 ) 2Eδλw − δnG 2 + 2EδnG − βn0A 2

2
2 λ2 sn max{w j } + cnsn /(b21 n2 ),
j∈A

where c is a positive constant.

n 1/2
Proof of Theorem 3.3. Let xij be the j-th column of Xi , j = 1, 2, . . . , pn . Denote ξj = i=1 xT
ij Ai (βnG )
−1 A−1/2 (βnG )(Zi − Ai (βnG )XiA δλw ). By the Karush-Kuhn-Tucker (KKT) optimality condition, it
R i
suﬃces to show
P (∀ j ∈ Ac , | − 2ξj | λw
j ) → 1. (A.5)
372 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

Let ς = minj∈A |βn0j |, and ςˆ = minj∈A |βnGj |. Note that

ς ς/2) P (βnG − βn0 ς/2)

P (ˆ
4EβnG − βn0 2 /ς 2
4cpn /(nς 2 )

Therefore,

P (∃j ∈ Ac , |ξj | > λw
j ) P (|ξj | > λw
j )
j∈Ac

P (|ξj | > λw
j , ςˆ > ς/2) + P (ˆ
ς ς/2)
j∈Ac

4M 2γ cpn 4cpn
E ξj2 I(ˆ
ς > ς/2) + + .
λ2 nM 2 nς 2
j∈Ac

Now, we focus on the expectation term of the above inequality. Since βnG is the solution to (2.2), it is
easy to see that
n
1/2
xT −1 A−1/2 (βnG )(Yi − πi (βnG )) = 0,
ij Ai (βnG )R i
i=1

for j = 1, . . . , pn . Then, based on the deﬁnition of Z, we have

n
1/2
ξj = xT −1 A1/2 (βnG )Xi (βnG − β̃λw ).
ij Ai (βnG )R i
i=1

n 1/2 −1 A1/2 (βn )Xi = (Hn1 (βn ), . . . , Hnpn (βn ))T . Similar to the proof of
Let Hn (βn ) = i=1 XiT Ai (βn )R i
Lemma 3.1, we can prove that there exist two positive constants b3 and b4 such that

b3 n sup √ λmin (Hn (βn )) sup √ λmax (Hn (βn )) b4 n.

pn pn
βn −βn0 κ n βn −βn0 κ n

Then, we have

T
|ξj |2 = |Hnj (βnG )(βnG − β̃λw )|2
j∈Ac j∈Ac

T
Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
j∈Ac
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 373

pn
T
Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
i=1

= trace(Hn (βnG )HnT (βnG ))βnG − β̃λw 2

pn (λmax (Hn (βnG )))2 βnG − β̃λw 2
b24 pn n2 βnG − β̃λw 2 .

On the other hand, by [16, Theorem 3.6], we know that

βnG − β̃λw 2 2βnG − βn0 2 + 2β̃λw − βn0 2

= 2βnG − βn0 2 + 2δλw − βn0A 2
2cpn /n + 2δλw − βn0A 2 .

Therefore, by Theorem 3.2, we have

E ς > ς/2) b24 pn n2 E(βnG − β̃λw 2 I(ˆ
ξj2 I(ˆ ς > ς/2))
j∈Ac
2b24 pn n2 {cpn /n + [λ2 sn (ς/2)−2γ + cnsn ]/(b21 n2 )}
4b24 pn n2 {cnpn + λ2 pn (ς/2)−2γ }/(b21 n2 ).

Consequently,
−2γ
b 2 pn n2 ς pn pn
P (∃ j ∈ A , |ξj | > λw
c
j ) 16M 2γ 42 2 2 cnpn + λ2 pn +c 2
+ 4c 2
λ b1 n 2 nM nς
T1 + T2 + T3 .

By the conditions (e)–(g), it follows that

T1 = O((M 2γ n2 /λ2 )pn (pn /n))

= O([λn(1−3v)(1+γ)/2 /n]−2/(1+γ)n−v ) → 0,
T2 = O((pn /n)(n/λ)2/(1+γ) )
= O([λn(1−3v)(1+γ)/2 /n]−2/(1+γ)(pn /nv )n−2v ) → 0,
T3 = O((pn /n)ς −2 )
= O([λ pn /nς −γ ]2/γ {(pn /n)[λ pn /n]−2/γ })
= O([(λ pn /n)−1/γ ς]−2 [λn(1−3v)(1+γ)/2 /n]−2/γ n−2v(1+1/γ) p−2/γ
n ) → 0.

This proves (A.5) and completes the proof.

Proof of Theorem 3.4. Prove the model consistency ﬁrst. From Theorem 3.3, we only need to show

P min |δλwj
| > 0 → 1.
j∈A

√ −γ
min |δλwj
| > min |δnGj | − λ sn ςˆ /(b1 n).
j∈A j∈A

√ √ √ √
Note that λ sn ςˆ−γ /(b1 n) = O(1/ n)O(λ sn ς −γ / n)O((ˆ
ς /ς)−γ ) and

ς − ς)2 /ς 2
ς /ς)2 ) = 2 + 2E(ˆ
E((ˆ
2 + 2EβnG − βn0 2 /ς 2
374 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

2 + 2cpn /(nς 2 )

by [16, Theorem 3.6]. Moreover, by the proof of T3 in Theorem 3.3, we know that pn /(nς 2 ) = op (1),
ς /ς)2 ) = Op (1). Thus, by the condition (g), we have
which implies immediately that E((ˆ
√ √
λ sn ςˆ−γ /(b1 n) = op (1/ n)Op (1). (A.6)

which implies that P (minj∈A |δλwj

| > 0) → 1.
Next, we prove the asymptotic normality. Note that

αT −1/2 ¯ n0A )(δλw − βn0A )

n Kn (βn0A )J(β
= αT −1/2
n Kn
¯ n0A )(δλw − δnG ) + αT
(βn0A )J(β −1/2
n Kn
¯ n0A )(δnG − βn0A ).
(βn0A )J(β

From [16, Theorem 3.8], we have

αT −1/2 ¯ n0A )(δnG − βn0A ) −→

L
n Kn (βn0A )J(β N (0, 1)

(βn0A )J¯(βn0A )(δλw − δnG ) = op (1). For any

−1/2
by the condition (e). Then we only need to show αT
n Kn
bn ∈ Rsn satisfying bn = 1, we know that

n
−1
R0 R̄−1 Ãi (δn )XiA bn
1/2 1/2
bT
n Kn (δn )bn = bT T
n XiA Ãi (δn )R̄
i=1

n
λmax (R0 )λ−2
max (R̄) bT T
n XiA Ãi (δn )XiA bn
i=1

n
c λmax (Ãi (δn ))bT T
n XiA XiA bn
i=1

n
cλmax T
XiA XiA bn 2
i=1

cnc2

by the conditions (b) and (d). The above conclusion immediately implies that

sup √ λmin (Kn (δn )) = Op (n), (A.7)

sn
δn −βn0A κ n

sup √ λmax (Kn (δn )) = Op (n). (A.8)

sn
δn −βn0A κ n

Hence, together with (A.1), (A.4) and (A.6), we have

(αT −1/2
n Kn (βn0A )J¯(βn0A )(δλw − δnG ))2
¯ n0A )(δλw − δnG )2
Kn−1/2 (βn0A )J(β
λ−1 2 ¯ − δnG 2
min (Kn (βn0A ))λmax (J(βn0A ))δλw
√
Op (n)Op ([λ sn ςˆ−γ /(b1 n)]2 )
= Op (n)op (1/n)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 375

= op (1).

Therefore, αT
−1/2 ¯ n0A )(δλw − δnG ) = op (1). This completes the proof.
n Kn (βn0A )J(β
Proof of Theorem 3.7. It is suﬃcient to show that

sup |bT −1
n (ΣnA − ΣnA )bn | = op (n ),
bn =1

for any bn ∈ Rsn . Since

nA − ΣnA = J −1 (βλwA
Σ ) − Kn (βn0A )]J −1 (βλwA
)[Kn (βλwA )
n n

+ [Jn−1 (βλwA ¯−1 −1

) − Jn (βn0A )]Kn (βn0A )Jn (βλwA
)

+ J¯n−1 (βn0A )Kn (βn0A )[Jn−1 (βλwA ¯−1

) − Jn (βn0A )]

T1 + T2 + T3 ,
−1
and it is suﬃcient to show that supbn =1 |bT
n Ti bn | = op (n ), i = 1, 2, 3. Note that

n (βλwA
Then we evaluate the eigenvalues of K ) − Kn (βn0A ) ﬁrst. Let Si (βn ) =
T
i (βn ) i (βn ) for i =
1, 2, . . . , n. Note that

|bT )−K
n [Kn (βλwA
n (βn0A )]bn | |bT
n (T11 + T12 + T13 )bn |,

where

n
[Ãi (βλwA −1 Si (βλwA −1 Ã1/2 (βλwA
1/2 1/2
) − Ãi (βn0A )]R
T
T11 = XiA )R i )XiA
i=1

n
T12 = T
XiA
1/2 −1 [Si (βλwA
Ãi (βn0A )R −1 Ã1/2 (βn0A )XiA
) − Si (βn0A )]R i
i=1

n
T13 = T
XiA
1/2 −1 Si (βλwA
Ãi (βn0A )R −1 [Ã (βλwA
)R
1/2
) − Ãi (βn0A )]XiA .
1/2
i
i=1

Since Ãi (βλwA

1/2 1/2
) − Ãi (βn0A ) is the diagonal matrix, by the condition (c), we have

1/2 1/2 ) − Ãij (βn0A )|

bT T
) − Ãi (βn0A )] XiA bn max |Ãij (βλwA
n XiA [Ãi (βλwA
j

cXiA bn βλwA
− βn0A max Xij .
j

Note that

EYi − π̃i (βλwA ) − π̃i (βn0A )2 }

) 2{EYi − π̃i (βn0A ) + Eπ̃i (βλwA
2 2

2mc{1/4 + EβλwA
− βn0A }
2

2mc{1/4 + O(sn /n)}

= O(1),

by Theorem 3.2 and (A.6). It implies that Yi − π̃i (βλwA

) = Op (1). Consequently,

−1 i (βλwA
R −2 −1
) λmin (R)λmax (Ãi (βλwA
2 )2
))Yi − π̃i (βλwA

cOp (1),
376 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2

by the condition (d). Therefore, by Remark 3.6 and the conditions (a)–(h), we have

n
1/2
|bT
1/2 −1 i (βλwA 1/2
n T11 bn | bT ) − Ãi (βn0A )]R ) Ãi (βλwA
)XiA bn
T 2
n XiA [Ãi (βλwA
i=1
n
c XiA bn 2 βλwA
− βn0A max Xij
j
i=1

n
βλwA
− βn0A sup Xij λmax
T
XiA XiA bn 2
i,j i=1
√
Op ( sn /n)Op ( pn )Op (n)
√
Op (pn n) = op (n).

Thus supbn =1 |bT n T11 bn | = op (n). Similarly, we can also prove that supbn =1 |bn T12 bn | = op (n) and
T

supbn =1 |bn T13 bn | = op (n). Therefore,

sup |bT )−K n (βn0A )]bn | = op (n).

n [Kn (βλwA
bn =1

Similarly, supbn =1 |bT

n [Kn (βn0A ) − Kn (βn0A )]bn | = op (n). Thus,

sup |bT ) − Kn (βn0A )]bn | = op (n).

n [Kn (βλwA
bn =1

Note that

n
λmin (Jn (βλwA −1 ) min(Ãij (βλwA
)) λmin (R ))λmin
T
XiA XiA Op (n),
i,j
i=1

−1
by the conditions (b) and (d). Consequently, we have supbn =1 |bT
n T1 bn | = op (n ). Similarly, note that

Jn−1 (βλwA ¯−1 −1

) − Jn (βn0A ) = [Jn (βλwA
¯−1 )] + [J¯−1 (βλwA
) − Jn (βλwA n
¯−1
) − Jn (βn0A )],

and

Jn−1 (βλwA ¯−1 ) = J¯n−1 (βλwA

) − Jn (βλwA
¯ ) − Jn (βλwA
)[Jn (βλwA
−1
)]Jn (βλwA
),

J¯n−1 (βλwA ¯−1 ¯−1 ¯ ¯ )]J¯−1 (βλwA

) − Jn (βn0A ) = Jn (βn0A )[Jn (βn0A ) − Jn (βλwA n ),

−1
and we can also prove that supbn =1 |bT n Ti bn | = op (n ), i = 2, 3. This completes the proof.
n n
Proof of Theorem 3.9. Let H̄n (βn ) = i=1 Xi Ai (βn )R̄−1 Ai (βn )Xi and D̄n (βn ) = i=1 XiT Ai (βn )
T 1/2 1/2 1/2

R̄−1 R0 R̄−1 Ai (βn )Xi . Let Σn be the asymptotic covariance matrix of βnG . Then, according to [16,
1/2

Corollary 3.9], Σn = H̄n−1 (βn0 )D̄n (βn0 )H̄n−1 (βn0 ). Similar to (A.1) and (A.7), we have
−1 −1
n−1 bT −1
n Σn b n n λmin (D̄n (βn0 ))bT 2
n H̄n (βn0 )bn
n−1 Op (n−1 )λ2max (H̄n (βn0 ))bn 2
= Op (n−2 )Op (n2 )
= Op (1),

for any bn ∈ Rpn satisfying bn = 1. It follows that

λmin (n−1 Σ−1

n ) = Op (1) and λmax (n
−1 −1
Σn ) = Op (1). (A.9)

And by [16, Theorem 3.10], we know that Σ n is the consistent estimator of Σn which implies the con-
sistency of the sample eigenvalues. Further, let βS = arg min{βn ∈Rpn } (Z − A(βnG )Xβn )T V −1 (βnG )(Z −
A(βnG )Xβn ) for any candidate model S.
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 377

For any Sλ = A, we consider two different cases according to whether the model is underfitted or
overfitted. In either case, we use the same techniques of [17] or [9] to prove the conclusion of the theorem.
Case 1 (Underfitted model). Note that λn satisfies the conditions in Theorem 3.4. Consequently, its
associated PQF value is of the order oP (1). By the fact that βS = βn0 for any S A and the definition
(3.5), we have

inf P QFλ inf n−1 (βλw − βnG )T Σ

−1 − βnG )
n (βλw
λ∈R− λ∈R−

inf n−1 (βS − βnG )T Σ

−1
n (βS − βnG )
SA

−1 ) min βS − βnG 2

λmin (n−1 Σ n
SA
−1
−1
λmin (n−1 Σ n ) min {2 βS − βn0 2 − βnG − βn0 2 }
SA

−1
λmin (n−1 Σ n ) 2−1 min |βn0j |2 − βnG − βn0 2 .
j∈A

Then, with the probability tending to 1, the right-hand side will be of higher order than Cn p2n /n by [16,
Theorem 3.6], (A.9), and the condition (h). Hence P (inf λ∈R− P QFλ > P QFλn ) → 1.
Case 2 (Overﬁtted model). Similar to (A.9), we know that λmax (ΣnA ) = Op (n). Let ej be an sn × 1
vector with the j-th element equal to 1, and 0 otherwise. Then, from Remark 3.6, we know that

−1/2 L
(eT
j ΣnA ej ) − βn0j ) −→ N (0, 1),
(βλn wj (A.10)

for each j ∈ A. It implies that, for every j ∈ A, βλn wj

1/2
− βn0j = Op (λmax (ΣnA )) = Op (n
1/2
). Therefore,

βλn w − βn0 = δλn w − βn0A = Op ( sn /n). For any λ ∈ R+ , we have dfλ − sn 1. Then
−1 T −1 Σ −1 − βnG ) + Cn
np−1
n (P QFλ − P QFλn ) −npn (βλn w
− βnG ) n n (βλn w

−np−1
n λmax (n Σn )βλn w − βnG 2 + Cn .
−1 −1

Note that np−1 − βnG 2 2np−1 (βλ w − βn0 2 + βnG − βn0 2 ). Then, by [16, Theorem 3.6]
n βλn w n n

and (A.10), we know that np−1 − βnG 2 = Op (1). Since Cn → ∞ as n → ∞, the right-hand side
n βλn w
diverges to inﬁnity with the probability tending to 1, i.e., P (inf λ∈R+ P QFλ > P QFλn ) → 1. Hence, the
proof is completed.

Mspqlinference v6 BIOMTRKA Postprintversion
No ratings yet
Mspqlinference v6 BIOMTRKA Postprintversion
23 pages
Bayesian Estimation of Generalized Exponential Distribution
No ratings yet
Bayesian Estimation of Generalized Exponential Distribution
22 pages
Variance-Covariance Matrix On Mixed Estimator of Spline
No ratings yet
Variance-Covariance Matrix On Mixed Estimator of Spline
15 pages
Adaptive Data Analysis in Loopstar Case
No ratings yet
Adaptive Data Analysis in Loopstar Case
6 pages
Shrinkage Estimators for Sparse Vectors
No ratings yet
Shrinkage Estimators for Sparse Vectors
21 pages
Scalable SSANOVA for Big Data Analysis
No ratings yet
Scalable SSANOVA for Big Data Analysis
22 pages
Bayesian Data Selection with SVC
No ratings yet
Bayesian Data Selection with SVC
72 pages
Bayesian Classifiers in MATLAB
No ratings yet
Bayesian Classifiers in MATLAB
21 pages
Conditional Inference for GAMMs
No ratings yet
Conditional Inference for GAMMs
17 pages
Model Selection & Hypothesis Testing Guide
No ratings yet
Model Selection & Hypothesis Testing Guide
6 pages
Bayesian Estimation of Exponential Gamma Distribution
No ratings yet
Bayesian Estimation of Exponential Gamma Distribution
4 pages
Estimating Generalized Inverted Exponential Distribution
No ratings yet
Estimating Generalized Inverted Exponential Distribution
23 pages
Overdispersion Models and Estimation
No ratings yet
Overdispersion Models and Estimation
20 pages
Understanding Bayesian Inference
No ratings yet
Understanding Bayesian Inference
7 pages
Bayesian Parameter Estimation for EE Distribution
No ratings yet
Bayesian Parameter Estimation for EE Distribution
8 pages
Sure Independence Screening for High Dimensions
No ratings yet
Sure Independence Screening for High Dimensions
43 pages
Sim 3315
No ratings yet
Sim 3315
12 pages
Projected PCA in High-Dimensional Factor Models
No ratings yet
Projected PCA in High-Dimensional Factor Models
36 pages
Variable Selection in Multivariate Models
No ratings yet
Variable Selection in Multivariate Models
20 pages
M MBIC
No ratings yet
M MBIC
4 pages
Lower Bounds in Statistical Estimation
No ratings yet
Lower Bounds in Statistical Estimation
45 pages
Adaptive LASSO For General Transformation Models With Right
No ratings yet
Adaptive LASSO For General Transformation Models With Right
15 pages
Schwarz
No ratings yet
Schwarz
5 pages
E-Bayesian vs Hierarchical Estimation
No ratings yet
E-Bayesian vs Hierarchical Estimation
16 pages
Multivariate GLMM for Count Data Analysis
No ratings yet
Multivariate GLMM for Count Data Analysis
22 pages
Huey Dissertation Final
No ratings yet
Huey Dissertation Final
257 pages
Understanding Linear Classification Techniques
No ratings yet
Understanding Linear Classification Techniques
31 pages
Extended Bayesian Criteria for Model Selection
No ratings yet
Extended Bayesian Criteria for Model Selection
27 pages
Conditional Sparse Variable Selection
No ratings yet
Conditional Sparse Variable Selection
33 pages
Parameter Estimation in Pattern Analysis
No ratings yet
Parameter Estimation in Pattern Analysis
15 pages
Understanding Generalized Estimating Equations
No ratings yet
Understanding Generalized Estimating Equations
10 pages
Sparse Regression: Feature Selection Methods
No ratings yet
Sparse Regression: Feature Selection Methods
24 pages
Mixed Models in R: lme4 Package Overview
No ratings yet
Mixed Models in R: lme4 Package Overview
21 pages
08-A Direct Approach For Sparse Quadratic Discriminant Analysis
No ratings yet
08-A Direct Approach For Sparse Quadratic Discriminant Analysis
37 pages
Linear Classification Overview
No ratings yet
Linear Classification Overview
33 pages
Bayesian Estimation of 3-Component Mixture
No ratings yet
Bayesian Estimation of 3-Component Mixture
29 pages
Bayesian Estimation of Common Parameters
No ratings yet
Bayesian Estimation of Common Parameters
12 pages
Modal Regression in High Dimensions
No ratings yet
Modal Regression in High Dimensions
13 pages
Multivariate Outlier Detection Method
No ratings yet
Multivariate Outlier Detection Method
30 pages
A Novel Bayesian Approach For Variable Selection in Linear Regression Models
No ratings yet
A Novel Bayesian Approach For Variable Selection in Linear Regression Models
24 pages
Bayesian Posterior Moments in Regression
No ratings yet
Bayesian Posterior Moments in Regression
25 pages
Bayesian Multinomial Probit Models
No ratings yet
Bayesian Multinomial Probit Models
26 pages
Binary Data Analysis in Clinical Trials
No ratings yet
Binary Data Analysis in Clinical Trials
7 pages
Signal Selector Notes
No ratings yet
Signal Selector Notes
15 pages
Distinguishing Cause From Effect Using Observational Data: Methods and Benchmarks
No ratings yet
Distinguishing Cause From Effect Using Observational Data: Methods and Benchmarks
102 pages
Quantile Regression for Longitudinal Data
No ratings yet
Quantile Regression for Longitudinal Data
16 pages
Inference and Optimal Censoring Schemes For Progres Censored B-S Distr. - (Kundu)
No ratings yet
Inference and Optimal Censoring Schemes For Progres Censored B-S Distr. - (Kundu)
26 pages
Bayesian and Classical Estimations of GE Distribution
No ratings yet
Bayesian and Classical Estimations of GE Distribution
15 pages
Semiparametric Models and GEE Estimators
No ratings yet
Semiparametric Models and GEE Estimators
47 pages
Fan VariableSelectionvia 2001
No ratings yet
Fan VariableSelectionvia 2001
14 pages
Single-Study Causal Inference Methods
No ratings yet
Single-Study Causal Inference Methods
27 pages
Support Vector Machines & Bayesian Learning
No ratings yet
Support Vector Machines & Bayesian Learning
18 pages
Bayesian Inference for Nonparanormal Models
No ratings yet
Bayesian Inference for Nonparanormal Models
40 pages
PRML Slides 3
No ratings yet
PRML Slides 3
57 pages
Computational Limits in Invariant Causal Prediction
No ratings yet
Computational Limits in Invariant Causal Prediction
70 pages
Madas Advanced Level Probability Practice
No ratings yet
Madas Advanced Level Probability Practice
10 pages
Sugeno Fuzzy Logic Controllers Explained
No ratings yet
Sugeno Fuzzy Logic Controllers Explained
33 pages
KTU Data Compression Techniques Overview
100% (1)
KTU Data Compression Techniques Overview
15 pages
Linear Programming and Numerical Methods
No ratings yet
Linear Programming and Numerical Methods
3 pages
Rationals Review 8 - Practice Test
No ratings yet
Rationals Review 8 - Practice Test
2 pages
Understanding Delta Encoding Techniques
No ratings yet
Understanding Delta Encoding Techniques
41 pages
Mathematics Midterm Exam Guide
No ratings yet
Mathematics Midterm Exam Guide
14 pages
Introduction to State Space Control
No ratings yet
Introduction to State Space Control
117 pages
Data Analytics: Definition and Examples
No ratings yet
Data Analytics: Definition and Examples
3 pages
Principal Component Regression (PCR) : Prism Software
No ratings yet
Principal Component Regression (PCR) : Prism Software
28 pages
Data Mining Homework 2 Instructions
No ratings yet
Data Mining Homework 2 Instructions
7 pages
Menu-Driven Array Operations in C
No ratings yet
Menu-Driven Array Operations in C
20 pages
Bootstrap Method for Variance Estimation
No ratings yet
Bootstrap Method for Variance Estimation
10 pages
How Informative Is The Order Book Beyond The BestLevels Machine Learning Perspective
No ratings yet
How Informative Is The Order Book Beyond The BestLevels Machine Learning Perspective
33 pages
CPM and PERT Project Scheduling Overview
100% (1)
CPM and PERT Project Scheduling Overview
48 pages
Phases of Computer Problem Solving
No ratings yet
Phases of Computer Problem Solving
2 pages
Applications of Differentiation Concepts
No ratings yet
Applications of Differentiation Concepts
151 pages
CS 623 Database Management Systems Agenda
No ratings yet
CS 623 Database Management Systems Agenda
29 pages
Quantum Mechanics II: Perturbation Theory
No ratings yet
Quantum Mechanics II: Perturbation Theory
2 pages
LeetCode Problem Tracker Template
No ratings yet
LeetCode Problem Tracker Template
64 pages
Dissertation of Alexey Lindo (Final Version)
No ratings yet
Dissertation of Alexey Lindo (Final Version)
104 pages
DC Motor Speed Control with ANN
No ratings yet
DC Motor Speed Control with ANN
6 pages
ALDEP: Automated Layout Design Overview
100% (1)
ALDEP: Automated Layout Design Overview
15 pages
Decentralized Control of Uncertain Systems
No ratings yet
Decentralized Control of Uncertain Systems
4 pages
Traveling Salesman Problem Overview
No ratings yet
Traveling Salesman Problem Overview
3 pages
Solution Manual Digital Control and State Variable Methods
75% (16)
Solution Manual Digital Control and State Variable Methods
128 pages
Algorithm Design Assignment 1 Guide
No ratings yet
Algorithm Design Assignment 1 Guide
3 pages
Class 12 Indefinite Integrals Solutions
No ratings yet
Class 12 Indefinite Integrals Solutions
205 pages
Program Ms
No ratings yet
Program Ms
9 pages
CBLSTM for Machine Health Monitoring
No ratings yet
CBLSTM for Machine Health Monitoring
18 pages