Mathematics: Science China
Mathematics: Science China
Mathematics
. ARTICLES . February 2013 Vol. 56 No. 2: 359–377
doi: 10.1007/s11425-012-4564-y
1School
of Finance and Statistics, East China Normal University, Shanghai 200241, China;
2Department of Epidemiology and Biostatistics, Michigan State University, Michigan, MI 48824, USA;
3Department of Mathematics, Hong Kong Baptist University, Kowloon Tong 999077, Hong Kong, China
Received March 2, 2011; accepted March 29, 2012; published online January 10, 2013
Abstract For analyzing correlated binary data with high-dimensional covariates, we, in this paper, propose
a two-stage shrinkage approach. First, we construct a weighted least-squares (WLS) type function using a
special weighting scheme on the non-conservative vector field of the generalized estimating equations (GEE)
model. Second, we define a penalized WLS in the spirit of the adaptive LASSO for simultaneous variable
selection and parameter estimation. The proposed procedure enjoys the oracle properties in high-dimensional
framework where the number of parameters grows to infinity with the number of clusters. Moreover, we prove
the consistency of the sandwich formula of the covariance matrix even when the working correlation matrix is
misspecified. For the selection of tuning parameter, we develop a consistent penalized quadratic form (PQF)
function criterion. The performance of the proposed method is assessed through a comparison with the existing
methods and through an application to a crossover trial in a pain relief study.
Keywords correlated binary data, variable selection, diverging number of parameters, adaptive LASSO,
GEE, oracle properties, sandwich covariance formula, penalized quadratic form function
Citation: Xu P R, Fu W J, Zhu L X. Shrinkage estimation analysis of correlated binary data with a diverging
number of parameters. Sci China Math, 2013, 56: 359–377, doi: 10.1007/s11425-012-4564-y
1 Introduction
High-dimensional correlated data, which consists of repeated measurements on a large number of co-
variates, arise frequently from health and medical studies. For example, in a crossover trial comparing
the effect of placebo with that of the low and high doses of analgesic drug for pain relief (see [4]), each
participating woman was given all three treatments sequentially, one dose at each time with different
ordering of the treatments for different subject. In making the inference on the relationship between the
outcome measure of either no effect of pain relief (0) or some effect of pain relief (1) and the covariates:
the current treatment, the previous treatment, and the current period, the final model will be selected
based on the selection and estimation of the effects while taking into consideration the correlation between
the outcomes from different treatments within the same subject. Various statistical methods of model
parameter estimation have been developed, see [4, 15]. Recent studies of variable selection for analyzing
correlated data received much attention for high-dimensional data structure.
∗ Corresponding author
c Science China Press and Springer-Verlag Berlin Heidelberg 2013 [Link] [Link]
360 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2
For example, QIC was proposed as a modification to Akaike’s information criteria (AIC), using quasi-
likelihood instead of likelihood [13]; a generalized version of Mallow’s Cp to measure model adequacy of
prediction was introduced [2]; BIQIF was introduced as a Bayesian information type of criterion based
on the quadratic inference function [18]. These are best subset type variable selection procedures which
become computationally intensive when the number of covariates is moderately large. A generalization
of the bridge and LASSO penalties was applied to generalized estimating equations (GEE) models [8];
similarly, the SCAD penalty (see [6]) was applied to the GEE model [5] following the spirit in [8]; a variable
selection was studied for linear mixed effects model for correlated data [1]. However, the aforementioned
papers all assumed that the dimension of covariates is fixed. Therefore, it is important to develop new
statistical methodology and theory of variable selection and estimation for high-dimensional clustered
data.
In this paper, we propose a two-stage shrinkage estimation procedure for variable selection in the
“large n, diverging p” setup for sparse generalized linear models, where p = pn is a function of the num-
ber of clusters n. We focus on correlated binary data, which is driven by the crossover trial analyzed in
Section 5. We first construct a weighted least-squares (WLS) type objective function to approximate a
potential function, although a real potential function does not exist in the GEE model in general. We
then define a penalized WLS model for simultaneous variable selection and parameter estimation. It
is noted that the proposed potential approximate function is path-dependent, requiring no conservative
gradient vector field. We show that the proposed approach achieves the desirable oracle properties, un-
der appropriate conditions for dimension, the sample size and the strength of the signal. In addition, it
inherits the built-in robustness of the GEE model with consistent estimation allowing misspecification of
the working correlation matrix. And the sandwich variance formula for nonzero part of true parameters
still remains valid in this new context. We further recommend a penalized quadratic form (PQF) func-
tion criterion that facilitates the oracle properties. Simulation studies suggest that, with the PQF, the
propsoed method is more efficient than the LASSO-based penalized estimating equations presented by
[8]. Also, the sandwich variance formula achieves satisfactory performance.
The rest of the paper is organized as follows. In Section 2, we propose our variable selection method
for correlated binary data. In Section 3, we establish the oracle properties of the proposed method, the
consistency of the sandwich variance formula, and the consistency of the PQF function criterion. Section 4
reports the Monte Carlo simulation study results. Section 5 applies the new method to a three crossover
trial dataset. Section 6 gives a brief discussion. All technical proofs are relegated to the Appendix.
where logit(t) = log(t/(1 − t)), βn is a pn -dimensional unknown parameter vector. Denote πij = E(Yij |
Xij ). Then, the marginal variance of Yij is assumed as Var(Yij | Xij ) = πij (1 − πij ), where a dispersion
parameter may be added in the marginal variance function if overdispersion is suspected to be present.
Without loss of generality, the number of repetitions mi are assumed to be equal across the clusters, for
example, m. Then, let πi (βn ) = (πi1 (βn ), . . . , πim (βn ))T , where πij (βn ) = exp(Xij
T T
βn )/(1 + exp(Xij βn )),
Ai (βn ) be the m × m diagonal matrix with elements πij (βn )(1 − πij (βn )), and R(τ ) be an m × m working
correlation matrix, where τ is a finite dimensional parameters which fully characterizes R(τ ). The GEE
model (see [10]) is defined as
n
1/2 −1 A−1/2 (βn )(Yi − πi (βn )) = 0,
XiT Ai (βn )R (2.2)
i
i=1
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 361
where R is the estimator of R(τ ). A key advantage of the GEE model is that the GEE estimator βnG
achieves consistent estimation for the model parameters even when the working correlation matrix is
misspecified in “large n, diverging p” setup (see [16]).
Motivated by this appealing fact, we consider a WLS-type function for model (2.1). First, for the
simplicity of notation, let Y = (Y1T , . . . , YnT )T , X = (X1T , . . . , XnT )T , π(βn ) = (π1T (βn ), . . . , πnT (βn ))T ,
1/2 1/2 (βn ). Let V (βn ) and A(βn ) be, respectively, the nm × nm block diagonal
and Vi (βn ) = Ai (βn )RA i
matrix with diagonal elements Vi (βn )’s and Ai (βn )’s. Consider the following approximation to the residual
Y − π(βnG ). Let Z = A(βnG )X βnG − (Y − π(βnG )), we have
Z ≈ A(βnG )X βnG .
It implies that the special approximation leads to the GEE estimator as the minimizer. The approximate
allows the construction of the penalty models, such as the LASSO and the SCAD, in the conventional
way by minimizing the “objective loss function” with the added penalty, as we shall see in the following.
However, it has to be recognized that since the WLS does not yield the same vector field as in the GEE
model (2.2), the integral in the vector field is path dependent. We will show that this approximate leads
to an efficient penalty estimator unlike the penalized GEE estimator as defined in [8].
Following [20], we construct a penalized WLS for parameter estimation and achieve model selection
based on the above objective function in a two-stage procedure as follows:
Step 1. Obtain an initial estimator βnG of the estimating equation (2.2), such as the GEE estimator.
Step 2. Define a WLS-type function (2.3) as an objective function by replacing the parameters βn in
V (βn ) and A(βn ) with the estimate βnG .
Step 3. Define the penalized WLS (PWLS) estimator
pn
βλw = arg min WLS(βn ) + λ j |βj |,
w (2.4)
{βn ∈Rpn } j=1
function (2.3), we can consider the variable selection methods by minimizing the PWLS function, taking
the form
pn
P (βn ) = WLS(βn ) + n pλ (|βj |). (2.5)
j=1
Formulation (2.5) includes many popular variable selection methods, including the adaptive LASSO
defined in (2.4), the bridge method (see [7]) using the Lq penalty pλ (|βj |) = λ|β|q , the SCAD penalty
(see [6]), the MCP penalty (see [19]), and the SICA penalty (see [11]).
To study the asymptotic behavior of PWLS estimator when pn → ∞, we first introduce some notations.
Let A = {j : βn0j = 0, j = 1, . . . , pn } be the true sparse model with non-sparsity size sn = |A|, where
βn0 = (β01 , . . . , β0pn )T denotes the true value. Let S = {j1 , . . . , jd } denote an arbitrary candidate model,
which contains the j1 -th, . . . , jd -th (1 d pn ) predictors. For an arbitrary vector b ∈ Rpn , let bS ∈ R|S|
denote the subvector of b associated with the candidate model S. For an arbitrary d × pn matrix Q, let
QS denote the d × |S| submatrix of Q associated with the candidate model S.
In this section, we shall show that with a proper choice of λ, PWLS retains oracle properties when
pn → ∞. The main idea is to find an asymptotic characterization of the solution to the PWLS criterion.
In other words, the first thing we need to do is propose an estimator based on the oracle information of
the true subset of the model we know in advance.
For this reason, we consider the following model:
T T
where δn is an sn -dimensional unknown vector. Let π̃ij (δn ) = exp(XijA δn )/(1 + exp(XijA δn )). Denote
δnG by the solution of the following estimating equations:
n
T
XiA
1/2 −1 Ã−1/2 (δn )(Yi − π̃i (δn )) = 0,
Ãi (δn )R (3.2)
i
i=1
where π̃i (δn ) = (π̃i1 (δn ), . . . , π̃im (δn ))T , Ãi (δn ) is the m × m diagonal matrix with elements π̃ij (δn )(1 −
π̃ij (δn )). It implies immediately that δnG is the GEE estimator of βn0A . Further, let Ṽi (βn ) = Ãi (βn )R
1/2
1/2 T T T
Ãi (βn ) for i = 1, 2, . . . , n, and π̃(δn ) = (π̃1 (δn ), . . . , π̃n (δn )) . Let Ã(δn ) and Ṽ (δn ) be, respectively, the
nm × nm block diagonal matrixs with diagonal elements Ãi (δn )’s and Ṽi (δn )’s, and Z̃ = Ã(δnG )XA δnG −
(Y − π̃(δnG )). Then, similar to (2.4), we define an estimator δλw as δλw = arg min{δn ∈Rsn } Q(δn ), where
Q(δn ) = (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ) + λ j |βj |.
w (3.3)
j∈A
Throughout the paper, we use c to denote a generic positive constant which may vary from line to line.
Lemma 3.1. Suppose that the conditions (a)–(d) in the Appendix hold. If we assume the model (3.1)
and n−1 s2n = o(1), then for any κ > 0, there exist two positive constants b1 and b2 such that
n
Ãi (δn )R̂−1 Ãi (δn )XiA .
T 1/2 1/2
where Jn (δn ) = i=1 XiA
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 363
Theorem 3.2. Under the model (3.1) and the conditions (a)–(e) in the Appendix, we have
2
E(δλw − βn0A 2 ) 2 λ2 sn max{wj } + cnsn /(b21 n2 ),
j∈A
Theorem 3.4. Under the conditions (a)–(g) in the Appendix, the PWLS estimator satisfies the fol-
lowing properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
= 0} = A) = 1,
T −1/2 ¯
(2) asymptotic normality: αn Kn (βn0A )Jn (βn0A )(βλwA D
− βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
fying αn = 1.
Remark 3.5. Theorem 3.4 shows that, with probability approaching 1, the PWLS criterion can
identify the right subset model, and the PWLS estimator has the optimal convergence rate, when the
number of parameters diverges. Moveover, the PWLS criterion has the model selection consistency and
the asymptotic normality even when the correlation structure is misspecified.
Remark 3.6. Let ΣnA = J¯n−1 (βn0A )Kn (βn0A )J¯n−1 (βn0A ). Together with Theorem 3.4, an application
of the Cramer-Wold device yields that for any q × sn matrix Qn with q fixed and such that Qn ΣnA QT
n
is invertible, we have
−1/2
Qn (βλwA
D
(Qn ΣnA QTn) − βn0A ) −→ Nq (0, Iq ),
Since we can select variables and estimate parameters simultaneously, the covariance matrix of the esti-
mated parameters can be obtained directly. Naturally, the well-known sandwich covariance formula can
be used as an estimator for the covariance of the estimate βλwA . Let
, the nonvanishing component of βλw
−1/2
i (δn ) = Ãi (δn )(Yi − π̃i (δn )), i = 1, 2, . . . , n. Similar to that proposed by [10] in the “fixed p” setup and
by [16] in the “large n, diverging p” setup, the sandwich covariance matrix estimator of ΣnA is defined as
nA = Jn−1 (βλwA
Σ )Jn−1 (βλwA
)Kn (βλwA ), (3.4)
where
n
Jn (δn ) = T
XiA
1/2 −1 Ã1/2 (δn )XiA ,
Ãi (δn )R i
i=1
n
n (δn ) =
K T
XiA
1/2 −1 i (δn )
Ãi (δn )R T −1 Ã1/2 (δn )XiA .
i (δn )R i
i=1
The following theorem shows the consistency of the sandwich covariance matrix estimator (3.4).
Theorem 3.7. If the conditions (a)–(g) in the Appendix are satisfied, then we have
nA QT
Qn Σ n − Qn ΣnA Qn = op (n
T −1
),
Remark 3.8. In the “large n, diverging p” setup, Theorem 3.7 provides a consistent estimator for
the variance of the nonvanshing part of the PWLS estimator even when the working correlation matrix
is misspecified. Then, based on Remark 3.6 and the above sandwich covariance matrix estimator, an
asymptotic (1 − α)% confidence interval (0 < α < 1) for βn0j (j ∈ A) can be constructed as
βλwj T
± zα/2 (ej ΣnA ej )
1/2
,
where ej is an sn × 1 vector with the j-th element equal to 1, and 0 otherwise, zα/2 is the upper α/2
quantile of the standard normal distribution.
Note that the pleasing performances of the PWLS estimator heavily hinge on a proper selection of the
tuning parameter. To implement the PWLS method in practice, it is desirable to have an automatic
data-driven method for selecting the tuning parameter λ. We now use a selector that is in spirit similar
to the one proposed by [9].
n 1/2 −1 A1/2 (βn )Xi and Dn (βn ) = n X T
For the sake of simplicity, let Hn (βn ) = i=1 XiT Ai (βn )R i i=1 i
1/2
Ai (βn )R −1 εi (βn )εT (β)R
−1 A1/2 (βn )Xi , where εi (βn ) = A−1/2 (βn )(Yi −πi (βn )). Denote Σ n = Hn−1 (βnG )
i i i
Dn (βnG )Hn−1 (βnG ). Based on the above notations, we define a penalized quadratic form (PQF) function
criterion as
P QFλ = n−1 (βλw − βnG )T Σ
−1 (βλw − βnG ) + Cn × pn × dfλ /n,
n (3.5)
where dfλ is the number of nonzero coefficients in βλ , and Cn > 0 is required to grow to infinity with the
sample size.
We now establish the consistency of the proposed PQF function criterion (3.5) when the number of
parameters diverges. Let Sλ = {j : βλwj is the solution of (2.4) given λ. According
= 0}, where βλw
to whether the resulting model Sλ is underfitted, correctly fitted, or overfitted, we partition R+ into the
following three mutually exclusive regions:
R− = {λ ∈ R+ : Sλ A},
R0 = {λ ∈ R+ : Sλ = A},
R+ = {λ ∈ R+ : Sλ ⊃ ST , Sλ = A}.
In this section, we assess the finite sample performance of the proposed PWLS method with PQF function
criterion via a simulation experiment in terms of model selection and estimation accuracy. We compared
our proposed PQF based PWLS method with the LASSO-based penalized estimating equations (PGEE)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 365
with QGCV (see [8]). We consider the following logistic model. The response variable Yij is binary and
its marginal mean is πij with
T
logit(πij ) = Xij βn0 , j = 1, 2, . . . , 10 and i = 1, 2, . . . , n.
where βn0 is a pn -dimensional vector of parameters with pn = 4n1/3 − 5, with s denoting the largest
integer not greater than s. The covariate vectors Xij are i.i.d. from normal distribution Npn (0pn , Σ)
with Σ whose (i, j)-th element is equal to 0.5|i−j| . For all t with 1 t 10, Yit are of an AR(1)
correlation structure with auto-correlation coefficient ρ. The true coefficient vector is taken to be βn0 =
(0.51dn , 0pn −dn ), where dn = pn /5 . We consider three correlations ρ = 0.3, 0.5, 0.7 and three sample
sizes n = 100, 200, 400. The simulations with 200 replications for every case are conducted using R
codes since we use the correlated random binary data generator by [12]. And three different working
correlation structures are considered: independence working correlation matrix (IND), exchangeable
working correlation matrix (CS), and the first-order auto-correlation working correlation matrix (AR(1)).
The summary of simulation results is depicted in Tables 1–3.
To find the subset which minimizes each criterion, we exhaustively search for all possibilities. Thus,
the corresponding results represent the best subset variable selection under the underlying criterion. We
compare the variable selection procedures in terms of model error and model complexity, summarized
with the simulated average mean square error (MSE) obtained by averaging βn0 − βn0 2 /pn over 200
simulated samples, the average number (per simulation) of truly zero coefficients correctly estimated as
zero, i.e., correct deletions (CD), the average number of truly nonzero coefficients erroneously set to zero,
i.e., erroneous deletions (ED), the percentage of times the correct true model exactly identified (PCM),
and the average model size (MS).
Several observations can be made from Tables 1–3. First, the PQF based PWLS method is robust
against the growth of correlation degree. Second, the PQF based PWLS method is also robust in terms
of misspecification of working correlation matrix, which is reflected from Theorem 3.4. Third, compared
with the LASSO-based PGEE, both of them perform satisfactorily in terms of variable selection, while
Table 1 Simulation results by different procedures assuming IND working correlation matrix
Table 2 Simulation results by different procedures assuming AR(1) working correlation matrix
the PQF based PWLS method has much smaller MSE, especially when CS working correlation matrix is
assumed. It confirms the oracle property of the PWLS method empirically.
We next test the accuracy of the sandwich covariance formula. We consider the above model with
ρ = 0.5, the three sample sizes, and IND working correlation structure. For each case, the standard
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 367
deviations of the estimated coefficients are averaged among 200 replications. This can be regarded as the
true standard error (SD) and compared with 200 estimated standard error from the sandwich covariance
formula, which are summarized by their median (SDm ) and interquartile range divided by 1.349 (SDmad ).
The detailed results are summarized in Table 4. We observed that the sandwich covariance formula
performs satisfactorily. Similar phenomena are also observed with other working correlation matrix,
which are not reported here.
1, period 2(3),
x1 (x2 ) =
0, otherwise,
1, treatment B(C),
x3 (x4 ) =
0, otherwise,
1, the previous assignment is B(C),
x5 (x6 ) =
0, otherwise.
We consider 64 candidate models including all possible subsets of the six covariates. Table 5 reports
the estimated coefficients with p-values by using the GEE method and shows the models that were chosen
by PWLS and the LASSO-based PGEE with QGCV. From Table 5, it can be seen that PWLS selects
the best model that includes the covariates x1 , x2 and x5 , which are significant at the significance level
α = 0.05, whereas the LASSO-based PGEE selects the full model as the best model. We also note that
PWLS selects the same best model using either the IND, AR(1), or CS working correlation structures,
of which the results are not presented here for saving space.
Since the true values of parameters are unknown, one has to rely on out-of-sample testing to compare
the model performance in terms of forecasting. We then conduct 200 cross-validation experiments. For
each experiment, we randomly split the entire data set D = {1, . . . , 86} into two parts, where D0 is the
training data set with 70 subjects and D1 is the testing data set with 16 subjects. Consequently, denote
368 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2
120
Prediction error
100
80
60
40
PWLS PGEE
where Vi is the working covariance matrix estimated together by the entire data set D. For each method,
the prediction errors from 200 random splits are summarized in the boxplot in Figure 1. It can be seen
that PWLS performs better than the LASSO-based PGEE.
6 Extension
The approaches and techniques can be readily extended to general marginal longitudinal generalized
linear model, although the focus of the paper is on correlated binary data. The key step is the existence
of a root-(n/pn)-consistent initial estimator. From [16, Theorem 5.1] we know that the GEE estimator
βnG is root-(n/pn )-consistent for general marginal longitudinal generalized linear model. This is the
reason why we can readily extend the PWLS method to general cases. The generalized PWLS may also
enjoy the oracle property, which is confirmed in the following theorem.
Theorem 6.1. Under the conditions (a)–(g) in the Appendix and [16, Conditions (A5) and (A6)], the
PWLS estimator satisfies the following properties:
(1) consistency in model selection: limn→∞ P ({j : βλwj
= 0} = A) = 1;
T −1/2 ¯
(2) asymptotic normality: α Kn (βn0A )Jn (βn0A )(βλwA L
− βn0A ) → N (0, 1) for any αn ∈ R
sn
satis-
n
fying αn = 1,
−1/2
where Kn (δn ) and J¯n (δn ) have the same expressions as in Subsection 3.1 with Ãij (δn ) = Var(Yij | Xij ),
i = 1, . . . , n, j = 1, . . . , mi .
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 369
In addition, if the primary interest is on subject-specific effects, then the mixed-effects models, espe-
cially generalized linear mixed models, are probably more appropriate. Model selection for the mixed-
effects models deserves further investigation and is ongoing.
Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No.
11201306), the Innovation Program of Shanghai Municipal Education Commission (Grant No. 13YZ065), the
Fundamental Research Project of Shanghai Normal University (Grant No. SK201207), the scholarship under the
State Scholarship Fund by the China Scholarship Council in 2011 and the Research Grant Council of Hong Kong,
Hong Kong, China (Grant No. #HKBU2028/10P).
References
1 Bondell H D, Krishna A, Ghosh S K. Joint variable selection for fixed and random effects in linear mixed-effects models.
Biometrics, 2010, 66: 1069–1077
2 Cantoni E, Flemming J M, Ronchetti E. Variable selection for marginal longitudinal generalized linear models. Bio-
metrics, 2005, 61: 507–514
3 Cantoni E, Filed C, Flemming J M, et al. Longitudinal variable selection by cross-validation in the case of many
covariates. Stat Med, 2005, 26: 919–930
4 Diggle P J, Heagerty P J, Liang K Y, et al. Analysis of Longitudinal Data, 2nd edition. New York: Oxford University
Press, 2002
5 Dziak J J, Li R Z. An overview on variable selection for longitudinal data. In: Quantitative Medical Data Analysis
Using Mathematical Tools and Statistical Techniques Chapter One. Singapore: World Scientific Publishing Co., 2006
6 Fan J Q, Li R Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc,
2001, 96: 1348–1359
7 Frank I E, Friedman J H. A statistical view of some chemometrics regression tools (with discussion). Technometrics,
1993, 35: 109–148
8 Fu W J. Penalized estimating equations. Biometrics, 2003, 59: 126–132
9 Leng C L, Li B. Least squares approximation with a diverging number of parameters. Statist Prob Lett, 2010, 80:
254–261
10 Liang K Y, Zeger S L. Longitudinal data analysis using generalised linear models. Biometrics, 1986, 73: 12–22
11 Lv J C, Fan Y Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann
Statist, 2009, 37: 3498–3528
12 Oman S D. Easily simulated multivariate binary distributions with given positive and negative correlations. Comput
Stat Data Anal, 2009, 53: 999–1005
13 Pan W. Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 2001, 57: 120–125
14 Pepe M S, Anderson G L. A cautionary note on inference for marginal regression models with longitudinal data and
general correlated response data. Comm Statist Ser B, 1994, 23: 939–951
15 Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000
16 Wang H, Leng C. Unified lasso estimation via least square approximation. J Amer Statist Assoc, 2007, 102: 1039–1048
17 Wang L. GEE analysis of clustered binary data with diverging number of covariates. Ann Statist, 2011, 39: 389–417
18 Wang L, Qu A. Consistent model selection and data-driven smooth tests for longitudinal data in the estimating
equations approach. J R Statist Soc B, 2009, 71: 177–190
19 Zhang C H. Nearly unbiased variable selection under minimax concave penalty. Ann Statist, 2010, 38: 894–942
20 Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc, 2006, 101: 1418–1429
21 Zou H, Zhang H H. On the adaptive elastic-net with a diverging number of parameters. Ann Statist, 2009, 37:
1733–1751
Appendix
In this section, the proof of theorems will be given. We first introduce the following regularity conditions:
√
(a) supi,j Xij = Op ( pn ).
(b) Denote λmin (Q) and λmax (Q) as the minimum and maximum eigenvalues of a matrix Q, respectively.
Then, assume that there exist two positive constants c1 and c2 such that
n
n
−1 −1
c1 λmin n XiT Xi λmax n XiT Xi c2 .
i=1 i=1
370 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2
(c) There exist two positive constants c3 and c4 such that 0 < c3 πij (βn0 ) c4 1, for i = 1, 2, . . . , n,
j = 1, 2, . . . , m.
(d) The eigenvalues of the common true correlation matrix R0 are bounded away from zero and +∞.
Hereafter, for a matrix B, define B = [trace(BB T )]1/2 as its Frobenius norm. The estimated working
correlation matrix R satisfies R −1 − R̄−1 = Op ( pn /n), where R̄ is a positive-definite matrix with
eigenvalues bounded away from zero and +∞. Note R̄ maybe not the true correlation matrix R0 .
(e) limn→∞ log(pn )/ log(n) = v, for some 0 v < 1/3.
√ √
(f) limn→∞ λ/ n = 0, and limn→∞ λn((1−3v)(1+γ)−1)/2 / n = ∞.
√ √ 1/γ
(g) limn→∞ n/(λ pn ) minj∈A |βn0j | = ∞.
(h) limn→∞ n/(pn Cn ) minj∈A |βn0j | = ∞, and limn→∞ Cn p2n /n = 0.
2
Conditions (a)–(e) are general regularity conditions that ensure the GEE estimator of βn0A is a root-
(n/sn )-consistent estimator under model (3.1) (see [16]). Conditions (f) and (g) are similar to [21,
Conditions (A5) and (A6)]. Condition (g) characterizes the strength of the signals, i.e., the nonzero coef-
ficients can vanish but at the rate that can be distinguished by the penalized least squares. Condition (h)
is the same as [9, Condition (v)], which ensures the consistency of the proposed tuning parameter selector.
n
Ãi (δn )R̄−1 Ãi (δn )XiA . Note that
1/2 1/2
Proof of Lemma 3.1. Let J¯n (δn ) = i=1 XiA T
n
n
λmax T
XiA XiA λmax XiT Xi .
i=1 i=1
= Op (n)
Therefore,
√
= Op ( nsn ) + Op (n)
= Op (n).
Note that Jn (δn ) is symmetric. The above conclusion immediately implies that
Proof of Theorem 3.2. According to the definition of Z̃, it is easy to see that
δnG = (XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA )−1 XA
T
Ã(δnG )Ṽ −1 (δnG )Z̃,
δnG = arg min (Z̃ − Ã(δnG )XA δn )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δn ), (A.2)
{δn ∈Rsn }
and
T
XA Ã(δnG )Ṽ −1 (δnG ){Z̃ − Ã(δnG )XA δnG } = 0. (A.3)
By the definition of δλw , we know that
(Z̃ − Ã(δnG )XA δλw )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δλw ) + λ j |δλwj
w |
j∈A
(Z̃ − Ã(δnG )XA δnG )T Ṽ −1 (δnG )(Z̃ − Ã(δnG )XA δnG ) + λ j |δnGj |.
w
j∈A
Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δλw )2 − Ṽ −1/2 (δnG )(Z̃ − Ã(δnG )XA δnG )2
= (δnG − δλw )T XA
T
Ã(δnG )Ṽ −1 (δnG )Ã(δnG )XA (δnG − δλw )
= (δnG − δλw )T Jn (δnG )(δnG − δλw ).
Note that j∈A j (|δnGj | − |δλwj
w |) j∈A j2 δnG − δλw . Therefore, by Lemma 3.1, we conclude
w
that
λ j2 δnG − δλw b1 nδnG − δλw 2 ,
w
j∈A
And by [16, Theorem 3.6], we know that under the model (3.1), we have δnG − βn0A = Op ( sn /n).
Consequently,
under the model (2.1), where we have used Markov inequality to obtain the second inequality and [16,
Theorem 3.6] in the last step. On the other hand, let M = (λ/n)1/(1+γ) . An application of Markov
inequality and [16, Theorem 3.6] yields that
P (| − 2ξj | > λw
j , ςˆ > ς/2)
j∈Ac
P (|βnGj | > M ) + j , ςˆ > ς/2, |βnGj | M )
P (| − 2ξj | > λw
j∈Ac j∈Ac
E |βnGj |2 /M 2 + P (| − 2ξj | > λM −γ , ςˆ > ς/2)
j∈Ac j∈Ac
EβnG − βn0 2 /M 2 + 4M 2γ E ξj2 I(ˆ
ς > ς/2) /λ2
j∈Ac
cpn /(nM 2 ) + 4M 2γ E ξj2 I(ˆ
ς > ς/2) /λ2 .
j∈Ac
Therefore,
P (∃j ∈ Ac , |ξj | > λw
j ) P (|ξj | > λw
j )
j∈Ac
P (|ξj | > λw
j , ςˆ > ς/2) + P (ˆ
ς ς/2)
j∈Ac
4M 2γ cpn 4cpn
E ξj2 I(ˆ
ς > ς/2) + + .
λ2 nM 2 nς 2
j∈Ac
Now, we focus on the expectation term of the above inequality. Since βnG is the solution to (2.2), it is
easy to see that
n
1/2
xT −1 A−1/2 (βnG )(Yi − πi (βnG )) = 0,
ij Ai (βnG )R i
i=1
n 1/2 −1 A1/2 (βn )Xi = (Hn1 (βn ), . . . , Hnpn (βn ))T . Similar to the proof of
Let Hn (βn ) = i=1 XiT Ai (βn )R i
Lemma 3.1, we can prove that there exist two positive constants b3 and b4 such that
Then, we have
T
|ξj |2 = |Hnj (βnG )(βnG − β̃λw )|2
j∈Ac j∈Ac
T
Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
j∈Ac
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 373
pn
T
Hnj (βnG )Hnj (βnG )βnG − β̃λw 2
i=1
Consequently,
−2γ
b 2 pn n2 ς pn pn
P (∃ j ∈ A , |ξj | > λw
c
j ) 16M 2γ 42 2 2 cnpn + λ2 pn +c 2
+ 4c 2
λ b1 n 2 nM nς
T1 + T2 + T3 .
√ −γ
By (A.4), we have ∀ j ∈ A, |δnGj − δλwj | < λ sn ςˆ /(b1 n). Invoking triangle inequality yields that
√
∀j ∈ A, |δλwj −γ
| > |δnGj | − λ sn ςˆ /(b1 n). Therefore, we end up with
√ −γ
min |δλwj
| > min |δnGj | − λ sn ςˆ /(b1 n).
j∈A j∈A
√ √ √ √
Note that λ sn ςˆ−γ /(b1 n) = O(1/ n)O(λ sn ς −γ / n)O((ˆ
ς /ς)−γ ) and
ς − ς)2 /ς 2
ς /ς)2 ) = 2 + 2E(ˆ
E((ˆ
2 + 2EβnG − βn0 2 /ς 2
374 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2
2 + 2cpn /(nς 2 )
by [16, Theorem 3.6]. Moreover, by the proof of T3 in Theorem 3.3, we know that pn /(nς 2 ) = op (1),
ς /ς)2 ) = Op (1). Thus, by the condition (g), we have
which implies immediately that E((ˆ
√ √
λ sn ςˆ−γ /(b1 n) = op (1/ n)Op (1). (A.6)
Note that δnG − βn0A > |βn0j | − |δnGj |, ∀ j ∈ A. Then, minj∈A |δnGj | > minj∈A |βn0j | − δnG − βn0A .
Since δnG − βn0A = Op ( sn /n) by [16, Theorem 3.6], we conclude that
√
min |δλwj
| > min |βn0j | − δnG − βn0A − op (1/ n)Op (1)
j∈A j∈A
√ √ √
= ς − ( sn / n)Op (1) − op (1/ n)Op (1),
n
−1
R0 R̄−1 Ãi (δn )XiA bn
1/2 1/2
bT
n Kn (δn )bn = bT T
n XiA Ãi (δn )R̄
i=1
n
λmax (R0 )λ−2
max (R̄) bT T
n XiA Ãi (δn )XiA bn
i=1
n
c λmax (Ãi (δn ))bT T
n XiA XiA bn
i=1
n
cλmax T
XiA XiA bn 2
i=1
cnc2
by the conditions (b) and (d). The above conclusion immediately implies that
(αT −1/2
n Kn (βn0A )J¯(βn0A )(δλw − δnG ))2
¯ n0A )(δλw − δnG )2
Kn−1/2 (βn0A )J(β
λ−1 2 ¯ − δnG 2
min (Kn (βn0A ))λmax (J(βn0A ))δλw
√
Op (n)Op ([λ sn ςˆ−γ /(b1 n)]2 )
= Op (n)op (1/n)
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 375
= op (1).
Therefore, αT
−1/2 ¯ n0A )(δλw − δnG ) = op (1). This completes the proof.
n Kn (βn0A )J(β
Proof of Theorem 3.7. It is sufficient to show that
sup |bT −1
n (ΣnA − ΣnA )bn | = op (n ),
bn =1
nA − ΣnA = J −1 (βλwA
Σ ) − Kn (βn0A )]J −1 (βλwA
)[Kn (βλwA )
n n
T1 + T2 + T3 ,
−1
and it is sufficient to show that supbn =1 |bT
n Ti bn | = op (n ), i = 1, 2, 3. Note that
n (βλwA
max(|λmax (K ) − Kn (βn0A ))|)
) − Kn (βn0A ))|, |λmin (Kn (βλwA
sup |bT
n T 1 bn | .
bn =1 λ2 (Jn (βλwA
min ))
n (βλwA
Then we evaluate the eigenvalues of K ) − Kn (βn0A ) first. Let Si (βn ) =
T
i (βn ) i (βn ) for i =
1, 2, . . . , n. Note that
|bT )−K
n [Kn (βλwA
n (βn0A )]bn | |bT
n (T11 + T12 + T13 )bn |,
where
n
[Ãi (βλwA −1 Si (βλwA −1 Ã1/2 (βλwA
1/2 1/2
) − Ãi (βn0A )]R
T
T11 = XiA )R i )XiA
i=1
n
T12 = T
XiA
1/2 −1 [Si (βλwA
Ãi (βn0A )R −1 Ã1/2 (βn0A )XiA
) − Si (βn0A )]R i
i=1
n
T13 = T
XiA
1/2 −1 Si (βλwA
Ãi (βn0A )R −1 [Ã (βλwA
)R
1/2
) − Ãi (βn0A )]XiA .
1/2
i
i=1
cXiA bn βλwA
− βn0A max Xij .
j
Note that
2mc{1/4 + EβλwA
− βn0A }
2
−1 i (βλwA
R −2 −1
) λmin (R)λmax (Ãi (βλwA
2 )2
))Yi − π̃i (βλwA
cOp (1),
376 Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2
by the condition (d). Therefore, by Remark 3.6 and the conditions (a)–(h), we have
n
1/2
|bT
1/2 −1 i (βλwA 1/2
n T11 bn | bT ) − Ãi (βn0A )]R ) Ãi (βλwA
)XiA bn
T 2
n XiA [Ãi (βλwA
i=1
n
c XiA bn 2 βλwA
− βn0A max Xij
j
i=1
n
βλwA
− βn0A sup Xij λmax
T
XiA XiA bn 2
i,j i=1
√
Op ( sn /n)Op ( pn )Op (n)
√
Op (pn n) = op (n).
Thus supbn =1 |bT n T11 bn | = op (n). Similarly, we can also prove that supbn =1 |bn T12 bn | = op (n) and
T
Note that
n
λmin (Jn (βλwA −1 ) min(Ãij (βλwA
)) λmin (R ))λmin
T
XiA XiA Op (n),
i,j
i=1
−1
by the conditions (b) and (d). Consequently, we have supbn =1 |bT
n T1 bn | = op (n ). Similarly, note that
and
−1
and we can also prove that supbn =1 |bT n Ti bn | = op (n ), i = 2, 3. This completes the proof.
n n
Proof of Theorem 3.9. Let H̄n (βn ) = i=1 Xi Ai (βn )R̄−1 Ai (βn )Xi and D̄n (βn ) = i=1 XiT Ai (βn )
T 1/2 1/2 1/2
R̄−1 R0 R̄−1 Ai (βn )Xi . Let Σn be the asymptotic covariance matrix of βnG . Then, according to [16,
1/2
Corollary 3.9], Σn = H̄n−1 (βn0 )D̄n (βn0 )H̄n−1 (βn0 ). Similar to (A.1) and (A.7), we have
−1 −1
n−1 bT −1
n Σn b n n λmin (D̄n (βn0 ))bT 2
n H̄n (βn0 )bn
n−1 Op (n−1 )λ2max (H̄n (βn0 ))bn 2
= Op (n−2 )Op (n2 )
= Op (1),
And by [16, Theorem 3.10], we know that Σ n is the consistent estimator of Σn which implies the con-
sistency of the sample eigenvalues. Further, let βS = arg min{βn ∈Rpn } (Z − A(βnG )Xβn )T V −1 (βnG )(Z −
A(βnG )Xβn ) for any candidate model S.
Xu P R et al. Sci China Math February 2013 Vol. 56 No. 2 377
For any Sλ = A, we consider two different cases according to whether the model is underfitted or
overfitted. In either case, we use the same techniques of [17] or [9] to prove the conclusion of the theorem.
Case 1 (Underfitted model). Note that λn satisfies the conditions in Theorem 3.4. Consequently, its
associated PQF value is of the order oP (1). By the fact that βS = βn0 for any S A and the definition
(3.5), we have
Then, with the probability tending to 1, the right-hand side will be of higher order than Cn p2n /n by [16,
Theorem 3.6], (A.9), and the condition (h). Hence P (inf λ∈R− P QFλ > P QFλn ) → 1.
Case 2 (Overfitted model). Similar to (A.9), we know that λmax (ΣnA ) = Op (n). Let ej be an sn × 1
vector with the j-th element equal to 1, and 0 otherwise. Then, from Remark 3.6, we know that
−1/2 L
(eT
j ΣnA ej ) − βn0j ) −→ N (0, 1),
(βλn wj (A.10)
−np−1
n λmax (n Σn )βλn w − βnG 2 + Cn .
−1 −1
Note that np−1 − βnG 2 2np−1 (βλ w − βn0 2 + βnG − βn0 2 ). Then, by [16, Theorem 3.6]
n βλn w n n
and (A.10), we know that np−1 − βnG 2 = Op (1). Since Cn → ∞ as n → ∞, the right-hand side
n βλn w
diverges to infinity with the probability tending to 1, i.e., P (inf λ∈R+ P QFλ > P QFλn ) → 1. Hence, the
proof is completed.