
Understanding Regression Discontinuity Design

Uploaded by

DavidLeOn

Regression Discontinuity

Juan Ponce J.
Maestría en Economía del Desarrollo
Flacso-Ecuador
Introduction to RD
The logic of RD

• Social programs often use an index to decide who is eligible to enroll in the program and
who is not.
• In these cases, program authorities typically determine a threshold or cutoff score, below
which households are deemed poor and are eligible for the program.
• Examples. Selben and RS in Ecuador. SISBEN in Colombia. CASEN in Chile.
• Other types of programs select beneficiaries using age. Only people above a certain age
are eligible to receive the program. Other examples use test scores.
RD conditions

• For programs that have a continuous eligibility index with a clearly defined cutoff score to
determine who is eligible and who is not, we can use a regression discontinuity design.
Two main conditions are needed:
• A continuous eligibility index. A continuous measure on which the population of
interest can be ranked, such as a poverty index, a test score, age, etc.
• A clearly defined cutoff score, that is, a point on the index above or below which the
population is classified as eligible for the program.
Identification

• Units above and below the cutoff (in a vicinity) are likely to be very similar, except that
one group receives the program and the other does not.
• The RD strategy exploits the discontinuity around the cutoff score to estimate the
counterfactual.
• As we move further away from the eligibility cutoff, eligible and ineligible units will
become more different by construction, but we have a measure of how different they are
based on the eligibility criteria and therefore we can control for those differences.
LATE

• The regression discontinuity measures the difference in post-intervention outcomes
between the units near the eligibility cutoff.
• We can also combine RD with a difference-in-differences design.
• The RD method allows us to estimate the impact of a program without
excluding any eligible population.
• Note that the estimated impact is only valid in the neighborhood around the eligibility
cutoff score: it is a local average treatment effect (LATE).
Limitations of RD

• RD estimates a local average effect around the eligibility cutoff, at the point where treatment and
comparison units are most similar.
• As we get closer to the cutoff, the units just above and just below it
look more similar, so locally the comparison is close to being as good as an experiment.
• RD is useful to evaluate whether the program should be expanded or reduced at the
margin.
• Relatively large evaluation samples are required to obtain sufficient statistical power
when applying RD.
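• There is a tradeoff between bias and efficiency: restricting the analysis to units closer to the
cutoff reduces bias but leaves fewer observations for estimation.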
Frameworks for RD analysis

• We have two types of RD:
• 1. Sharp design. Full compliance with the assignment rule.
• 2. Fuzzy design. Partial compliance with the assignment rule.

• For each type of RD, we have two main frameworks for RD analysis.
• 1. Based on continuity assumptions. Comparability is conceptualized as continuity of
average potential outcomes near the cutoff.
• 2. Based on local randomization assumptions. Comparability is conceptualized as
conditions that mimic a randomized experiment in a neighborhood around the cutoff.
Continuity-based framework for RD analysis
The RD design

• In the RD design, all units in the study receive a score (also known as running variable,
forcing variable, or index), and a treatment is assigned to those units whose score is
above a known cutoff and not assigned to those units whose score is below the cutoff.
• In the RD design, unlike in other nonexperimental studies, the assignment of the
treatment follows a rule that is known (at least to the researcher) and hence empirically
verifiable.
• We call Sharp RD design any RD design where the treatment condition assigned is
identical to the treatment condition actually received for all units. Any RD design where
compliance with treatment assignment is imperfect is referred to as Fuzzy RD design.
Sharp design

• To formalize, we assume that there are n units, indexed by i = 1, 2, …, n. Each unit has a
score or running variable Xi, and c is a known cutoff. Units with Xi ≥ c are assigned to the
treatment condition, and units with Xi < c are assigned to the control condition.
• This treatment assignment, denoted Ti, is defined as Ti = 𝕝(Xi ≥ c), where 𝕝(·) is the
indicator function, and it implies that the probability of treatment assignment as a
function of the score changes discontinuously at the cutoff.
• A defining feature of all RD designs is that the conditional probability of actually receiving
treatment given the score changes discontinuously at the cutoff.
• In a Sharp RD design, this probability changes exactly from zero to one at the cutoff, since
treatment assigned and treatment received are identical.
Potential outcomes

• Assume that each unit has two potential outcomes, Yi(1) and Yi(0), corresponding, respectively, to
the outcomes that would be observed under the treatment or control conditions.

• Treatment effects are defined in terms of comparisons between features of (the distribution of)
both potential outcomes, such as their means, variances or quantiles.

• If unit i receives the treatment, we will observe Yi(1), the unit's outcome under treatment, and Yi(0)
will remain latent or unobserved.

• If i receives the control condition, we will observe Yi(0) but not Yi(1).

• This results in the so-called fundamental problem of causal inference, and implies that the
treatment effect at the individual level is fundamentally unknowable.
Observed outcomes

• The observed outcome is:

• $Y_i = (1 - T_i) \cdot Y_i(0) + T_i \cdot Y_i(1) = \begin{cases} Y_i(0) & \text{if } X_i < c \\ Y_i(1) & \text{if } X_i \ge c \end{cases}$

• In the specific context of the Sharp RD design, the fundamental problem of causal
inference occurs because we only observe the outcome under control, Yi(0), for those
units whose score is below the cutoff, and we only observe the outcome under
treatment, Yi(1), for those units whose score is above the cutoff.
• The next figure plots the average potential outcomes given the score, E[Yi(1)|Xi = x] and
E[Yi(0)|Xi = x], against the score. In statistics, conditional expectation functions such as
these are usually called regression functions.
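
To make the notation concrete, here is a minimal Python sketch that simulates a Sharp RD data set. The data-generating process (a linear control curve, a constant effect of 2, cutoff c = 0) and the function name `simulate_sharp_rd` are assumptions chosen for illustration; they do not come from the slides.

```python
import random

def simulate_sharp_rd(n=1000, c=0.0, tau=2.0, seed=0):
    """Simulate (X_i, T_i, Y_i, Y_i(0), Y_i(1)) under a Sharp RD design.

    Hypothetical DGP, for illustration only:
    Y_i(0) = 1 + 0.5 * X_i + noise and Y_i(1) = Y_i(0) + tau.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)                 # score / running variable
        t = 1 if x >= c else 0                     # T_i = 1(X_i >= c)
        y0 = 1.0 + 0.5 * x + rng.gauss(0.0, 0.1)   # potential outcome under control
        y1 = y0 + tau                              # potential outcome under treatment
        y = (1 - t) * y0 + t * y1                  # observed outcome
        data.append((x, t, y, y0, y1))
    return data

data = simulate_sharp_rd()
# Only one potential outcome is ever observed for each unit:
# Y_i(1) for units with X_i >= c, Y_i(0) for units with X_i < c.
```

In real data the columns `y0` and `y1` do not exist; they are kept here only to show which potential outcome the observed `y` reveals on each side of the cutoff.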
Observed outcomes

• The regression function E[Yi(1)|Xi] is observed for values of the score to the right of the
cutoff, because when Xi ≥ c the observed outcome Yi is equal to the potential outcome
under treatment, Yi(1), for every i. This is represented with the solid red line. However, to
the left of the cutoff, all units are untreated, and therefore E[Yi(1)|Xi] is not observed
(represented by a dashed red line).
• A similar phenomenon occurs for E[Yi(0)|Xi], which is observed for values of the score to
the left of the cutoff (solid blue line), Xi < c, but unobserved for Xi ≥ c (dashed blue line).
• Thus, the observed average outcome given the score is
• $E[Y_i \mid X_i] = \begin{cases} E[Y_i(0) \mid X_i] & \text{if } X_i < c \\ E[Y_i(1) \mid X_i] & \text{if } X_i \ge c \end{cases}$
Lack of common support and extrapolation

• The Sharp RD design exhibits an extreme case of lack of common support, as units in the
control and treatment groups cannot have the same value of the running variable (Xi).
• RD analysis fundamentally relies on extrapolation towards the cutoff point.
• A central goal of empirical RD analysis is to adequately perform (local) extrapolation in
order to compare control and treatment units.
• The average treatment effect at a given value of the score, E[Yi(1)|Xi =x] - E[Yi(0)|Xi = x],
is the vertical distance between the two regression curves at that value.
The Sharp RD impact

• This distance cannot be directly estimated because we never observe both curves for the
same value of x.
• However, a special situation occurs at the cutoff c: this is the only point at which we
“almost” observe both curves, because units with score exactly c are treated while units with
score barely below c (that is, c − ε for a small and positive ε) are controls.
• In the figure, the vertical distance at c is
• $E[Y_i(1) \mid X_i = c] - E[Y_i(0) \mid X_i = c] \equiv \mu_+ - \mu_-$.
• This is precisely the treatment effect that can be estimated with a Sharp RD design.
The Sharp RD impact

• The Sharp RD treatment effect is thus formally defined as
• $\tau_{SRD} = E[Y_i(1) - Y_i(0) \mid X_i = c]$
• This parameter captures the (reduced form) treatment effect for units with score values
Xi = c.
• It answers the following question: what would be the average outcome change for units
with score level Xi = c if we switched their status from control to treated?
• This treatment effect is, by construction, local in nature.
• Since the definition of a Sharp RD design implies that all units with Xi = c are treated, $\tau_{SRD}$
can be interpreted as a (local, RD) average treatment effect on the treated.
The fundamental assumption

• The assumption of comparability between units with very similar values of the score but on
opposite sides of the cutoff is the fundamental concept on which all RD designs are based.

• If the regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x], seen as functions of x, are
continuous at x = c, then in a Sharp RD design we have

• $E[Y_i(1) - Y_i(0) \mid X_i = c] = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$ (1)

• The result in Equation (1) says that, if the average potential outcomes are continuous
functions of the score at c, the difference between the limits of the treated and control
average observed outcomes as the score converges to the cutoff is equal to the average
treatment effect at the cutoff.
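
The reasoning behind Equation (1) can be spelled out in two lines. This is a short sketch that uses only the Sharp RD assignment rule and the continuity assumption stated above:

```latex
\begin{align*}
\lim_{x \downarrow c} E[Y_i \mid X_i = x]
  &= \lim_{x \downarrow c} E[Y_i(1) \mid X_i = x]
   = E[Y_i(1) \mid X_i = c], \\
\lim_{x \uparrow c} E[Y_i \mid X_i = x]
  &= \lim_{x \uparrow c} E[Y_i(0) \mid X_i = x]
   = E[Y_i(0) \mid X_i = c].
\end{align*}
```

In each line the first equality holds because, in a Sharp RD design, Yi = Yi(1) whenever Xi ≥ c and Yi = Yi(0) whenever Xi < c; the second equality is the continuity of the potential outcome regression functions at c. Subtracting the second line from the first gives Equation (1).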
Continuity

• Continuity means that as the score x gets closer and closer to the cutoff c, the average
potential outcome function E[Yi(0)|Xi = x] gets closer and closer to its value at the cutoff,
E[Yi(0)|Xi = c] (and analogously for E[Yi(1)|Xi = x]).
• Thus, continuity gives a formal justification for estimating the Sharp RD effect by focusing
on observations above and below the cutoff in a very small neighborhood around it.
• Continuity offers one justification for using observations just below the cutoff to
approximate the average outcome that units just above the cutoff would have had if they
had received the control condition instead of the treatment.
Local nature of RD effect

• The Sharp RD parameter presented can be interpreted as causal in the sense that it
captures the average difference in potential outcomes under treatment versus control.
• This average difference is calculated at a single point on the support of a continuous
random variable (the score Xi), and as a result captures a causal effect that is local in
nature.
• The RD treatment effect tends to have limited external validity, that is, the RD effect is
generally not representative of the treatment effects that would occur for units with
scores away from the cutoff.
RD Plots

• The typical RD plot presents two summaries:

• (i) a global polynomial fit, represented by a solid line, and

• (ii) local sample means, represented by dots.

• The global polynomial fit is simply a smooth approximation to the unknown regression
functions based on a fourth- or fifth-order polynomial regression fit of the outcome on the
score, fitted separately above and below the cutoff, and using the original raw data.

• The local sample means are created by first choosing disjoint (i.e., non-overlapping) intervals
or “bins” of the score, calculating the mean of the outcome for the observations falling within
each bin, and then plotting the average outcome in each bin against the midpoint of the bin.
Example
Choosing the bins

• There are two different types of bins that can be used in the construction of RD plots: bins that
have equal length, or bins that contain (roughly) the same number of observations but whose
length may differ.

• We refer to these two types as evenly-spaced (ES) and quantile-spaced (QS) bins, respectively.

• In an RD plot with ES bins, each of the local means represented by a dot may be computed using a
different number of observations and thus may be more or less precisely calculated than the other
local means in the plot, affecting comparability.

• In contrast, QS bins contain approximately the same number of observations by construction.

• Moreover, a quantile-spaced RD plot has the advantage of providing a quick visual representation
of the density of observations over the support of the running variable.
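
The two binning schemes can be sketched in a few lines of Python. The helper names `es_bin_means` and `qs_bin_means`, the simulated data and the choice of five bins are assumptions made for illustration; dedicated RD software chooses the number of bins in a data-driven way.

```python
import random
from statistics import mean

def es_bin_means(xs, ys, lo, hi, n_bins):
    """Evenly-spaced bins on [lo, hi): (midpoint, mean outcome, count) per non-empty bin."""
    width = (hi - lo) / n_bins
    groups = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        if lo <= x < hi:
            j = min(int((x - lo) / width), n_bins - 1)
            groups[j].append(y)
    return [(lo + (j + 0.5) * width, mean(g), len(g))
            for j, g in enumerate(groups) if g]

def qs_bin_means(xs, ys, n_bins):
    """Quantile-spaced bins: sort by score, then split into groups of (roughly) equal size."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    out = []
    for j in range(n_bins):
        chunk = pairs[j * n // n_bins:(j + 1) * n // n_bins]
        if chunk:
            mid = (chunk[0][0] + chunk[-1][0]) / 2   # midpoint of the scores observed in the bin
            out.append((mid, mean(y for _, y in chunk), len(chunk)))
    return out

# Hypothetical data: skewed score on (-1, 1), cutoff at 0, jump of 2 at the cutoff.
rng = random.Random(1)
xs = [2 * rng.betavariate(2, 5) - 1 for _ in range(600)]
ys = [1 + 0.5 * x + (2 if x >= 0 else 0) + rng.gauss(0, 0.1) for x in xs]

# Bins are built separately on each side of the cutoff; shown here for the left side.
left = [(x, y) for x, y in zip(xs, ys) if x < 0]
lx, ly = zip(*left)
es_left = es_bin_means(lx, ly, -1.0, 0.0, 5)
qs_left = qs_bin_means(lx, ly, 5)
```

Comparing the third element of each tuple shows the point made above: ES bins can hold very different numbers of observations, while QS bins hold about the same number and instead vary in length.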
Number of bins

• There are two methods to define the number of bins:

• 1. Integrated mean square error (IMSE) method. Minimize an asymptotic approximation to the
integrated mean-squared error (IMSE) of the local means estimator, that is, the sum of the
expansions of the (integrated) variance and squared bias.

• If we choose a large number of bins, we have a small bias because the bins are smaller and the local
constant fit is better; but this reduction in bias comes at a cost, as increasing the number of bins
leads to fewer observations per bin and thus more variability within each bin. The IMSE-optimal J- and J+
(the numbers of bins below and above the cutoff) are the values that balance squared bias and variance
so that the IMSE is (approximately) minimized.

• However, the IMSE-optimal method often results in a very smooth plot where the local means
nearly overlap with the global polynomial fit, and may not be appropriate to capture the local
variability of the data near the cutoff.
Number of bins

• 2. Mimicking Variance Method (MV). The second method to select the number of bins
chooses the values of J- and J+ so that the binned means have an asymptotic (integrated)
variability that is approximately equal to the variability of the raw data. In other words,
the number of bins is chosen so that the overall variability of the binned means “mimics”
the overall variability in the raw scatter plot of the data.
• This method produces a much higher number of bins than we obtained with the IMSE
criterion for both ES and QS bins.
• It's recommended to start with MV bins to better illustrate the variability of the outcome
as a function of the score, ideally comparing ES to QS bins to highlight the distributional
features of the score. Then, if needed, the researcher can select the number of bins to be
IMSE-optimal in order to explore the global features of the regression function.
Estimation

• Two related but distinct goals: point estimation of RD treatment effect (i.e., give a scalar estimate of
the vertical distance between the regression functions at the cutoff), and statistical inference about
the RD treatment effect (i.e., construct valid statistical hypothesis tests and confidence intervals).
• Estimation typically proceeds by using (local to the cutoff) polynomial methods to approximate the
regression function E[Yi|Xi = x] on each side of the cutoff separately. In practical terms, this involves
using least-squares methods to fit a polynomial of the observed outcome on the score.

• When all the observations are used for estimation, these polynomial fits are global or parametric in
nature. In contrast, when estimation employs only observations with scores near the cutoff, the
polynomial fits are local, “flexible,” or “non-parametric.” Our upcoming discussion focuses
exclusively on local polynomial methods, which are by now the standard framework for RD
empirical analysis.
Local Polynomial approach

• Because the running variable is assumed continuous, there are no (or sometimes in
practice very few) observations whose score is c or very nearly so.
• Thus, local extrapolation in RD designs is unavoidable in general.
• In other words, in order to form estimates of the average control response at the cutoff,
E[Yi(0)|Xi =c], and of the average treatment response at the cutoff, E[Yi(1)|Xi = c], we
must rely on observations further away from the cutoff.
• In the Sharp RD design, for example, the treatment effect 𝜏𝑆𝑅𝐷 is the vertical distance
between the E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] at x = c.
• Estimation and inference proceed by first approximating these unknown regression
functions, and then computing the estimated treatment effect and/or the statistical
inference procedure of interest.
Local Polynomial approach

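• A classical result in approximation theory is that a sufficiently smooth function can be well approximated by a
polynomial. Applied to the RD point estimation problem, this principle suggests that the unknown regression functions
E[Yi(0)|Xi = x] and E[Yi(1)|Xi = x] can be approximated by a polynomial function of the score.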
• Early empirical work employed the idea of polynomial approximation globally, that is, tried to approximate these
functions using flexible higher-order polynomials, usually of fourth or fifth order, over the entire support of the
data.
• It is now widely recognized that a global polynomial approach does not deliver point estimators and inference
procedures with good properties for the RD treatment effect, the main object of interest. The reason is that
global polynomial approximations tend to deliver a good approximation overall, but a poor approximation at
boundary points.
• Modern RD empirical work employs local polynomial methods, which focus on approximating the regression
functions only near the cutoff. Because this approach localizes the polynomial fit to the cutoff (discarding
observations sufficiently far away) and employs a low-order polynomial approximation (usually linear or
quadratic), it is substantially more robust and less sensitive to boundary and overfitting problems.
Local Polynomial point estimation

• Local polynomial methods implement linear regression fits using only observations near the cutoff point,
separately for control and treatment units.

• Specifically, this approach uses only observations that are between c - h and c + h, where h > 0 is a so-
called bandwidth that determines the size of the neighborhood around the cutoff where the empirical RD
analysis is conducted. Within this bandwidth, it is common to adopt a weighting scheme to ensure that
the observations closer to c receive more weight than those further away; the weights are determined by
a kernel function K(.).

• Local polynomial estimation consists of the following basic steps.

• 1. Choose a polynomial order p and a kernel function K(.).

• 2. Choose a bandwidth h.
Local Polynomial point estimation

• 3. For observations above the cutoff (i.e., observations with Xi ≥ c), fit a weighted least
squares regression of the outcome Yi on a constant and $(X_i - c), (X_i - c)^2, \ldots, (X_i - c)^p$,
where p is the chosen polynomial order, with weight $K\left(\frac{X_i - c}{h}\right)$ for each
observation.
• The estimated intercept from this local weighted regression, $\hat{\mu}_+$, is an estimate of the
point $\mu_+ = E[Y_i(1) \mid X_i = c]$:

• $\hat{Y}_i = \hat{\mu}_+ + \hat{\mu}_{+,1}(X_i - c) + \hat{\mu}_{+,2}(X_i - c)^2 + \cdots + \hat{\mu}_{+,p}(X_i - c)^p$


Local Polynomial point estimation
• 4. For observations below the cutoff (i.e., observations with Xi < c), fit a weighted least
squares regression of the outcome Yi on a constant and $(X_i - c), (X_i - c)^2, \ldots, (X_i - c)^p$,
where p is the chosen polynomial order, with weight $K\left(\frac{X_i - c}{h}\right)$ for each
observation.
• The estimated intercept from this local weighted regression, $\hat{\mu}_-$, is an estimate of the
point $\mu_- = E[Y_i(0) \mid X_i = c]$:

• $\hat{Y}_i = \hat{\mu}_- + \hat{\mu}_{-,1}(X_i - c) + \hat{\mu}_{-,2}(X_i - c)^2 + \cdots + \hat{\mu}_{-,p}(X_i - c)^p$

• 5. Calculate the Sharp RD point estimate: $\hat{\tau}_{SRD} = \hat{\mu}_+ - \hat{\mu}_-$
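
The five steps can be sketched in plain Python as follows. This is a minimal illustration, not a substitute for dedicated RD software: the function names, the simulated data and the hand-picked bandwidth h = 0.3 are all assumptions, and the weighted least squares fit is solved through its normal equations.

```python
import random

def triangular(u):
    """Triangular kernel K(u) = (1 - |u|) 1(|u| <= 1)."""
    return max(0.0, 1.0 - abs(u))

def solve(a, b):
    """Solve the small linear system a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for j in range(col, k + 1):
                m[r][j] -= f * m[col][j]
    z = [0.0] * k
    for i in reversed(range(k)):
        z[i] = (m[i][k] - sum(m[i][j] * z[j] for j in range(i + 1, k))) / m[i][i]
    return z

def local_poly_intercept(xs, ys, c, h, p, kernel):
    """Weighted least squares of y on 1, (x - c), ..., (x - c)^p; returns the intercept."""
    xtwx = [[0.0] * (p + 1) for _ in range(p + 1)]
    xtwy = [0.0] * (p + 1)
    for x, y in zip(xs, ys):
        w = kernel((x - c) / h)          # weight K((X_i - c) / h)
        if w <= 0.0:
            continue                     # observation outside the bandwidth
        powers = [(x - c) ** k for k in range(p + 1)]
        for a in range(p + 1):
            xtwy[a] += w * powers[a] * y
            for b in range(p + 1):
                xtwx[a][b] += w * powers[a] * powers[b]
    return solve(xtwx, xtwy)[0]

def sharp_rd_estimate(xs, ys, c, h, p=1, kernel=triangular):
    """Steps 3 to 5: fit each side of the cutoff separately and difference the intercepts."""
    above = [(x, y) for x, y in zip(xs, ys) if x >= c]
    below = [(x, y) for x, y in zip(xs, ys) if x < c]
    mu_plus = local_poly_intercept(*zip(*above), c, h, p, kernel)
    mu_minus = local_poly_intercept(*zip(*below), c, h, p, kernel)
    return mu_plus - mu_minus

# Hypothetical data with a true jump of 2 at the cutoff c = 0.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(2000)]
ys = [1 + 0.5 * x + x * x + (2.0 if x >= 0 else 0.0) + rng.gauss(0, 0.1) for x in xs]
tau_hat = sharp_rd_estimate(xs, ys, c=0.0, h=0.3, p=1)
```

Here the bandwidth is fixed by hand only to keep the example short; in applied work it should be chosen in a data-driven way.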


Local polynomial of order 1. Figure
Choice of the Kernel function

• The kernel function K(·) assigns non-negative weights to each transformed observation
$\frac{X_i - c}{h}$, based on the distance between the observation's score Xi and the cutoff c.
• The recommended choice is the triangular kernel function, $K(u) = (1 - |u|)\,\mathbb{1}(|u| \le 1)$,
because when used in conjunction with a bandwidth that optimizes the mean
squared error (MSE), it leads to a point estimator with optimal properties.
• The triangular kernel function assigns zero weight to all observations with score outside
the interval [c – h, c + h], and positive weights to all observations within this interval. The
weight is maximized at Xi = c, and declines symmetrically and linearly as the value of the
score gets farther from the cutoff.
Choice of the Kernel function

• The uniform kernel, $K(u) = \mathbb{1}(|u| \le 1)$, also gives zero weight to observations with score
outside [c − h, c + h], but equal weight to all observations whose scores are within this
interval. Employing a local linear estimation with bandwidth h and uniform kernel is
therefore equivalent to estimating a simple linear regression without weights using only
observations whose distance from the cutoff is at most h.
• A third weighting scheme sometimes encountered in practice is the Epanechnikov kernel,
$K(u) = (1 - u^2)\,\mathbb{1}(|u| \le 1)$, which gives a quadratically decaying weight to observations
with $X_i \in [c - h, c + h]$ and zero weight to the rest.
• In practice, estimation and inference results are typically not very sensitive to the
particular choice of kernel used.
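
The three weighting schemes can be written directly from their formulas. In this small sketch the cutoff, the bandwidth and the example scores are hypothetical values, and the kernels are left unnormalized as in the slides, since a constant factor does not change a weighted least squares fit.

```python
def triangular(u):
    """K(u) = (1 - |u|) 1(|u| <= 1): weight declines linearly with distance."""
    return max(0.0, 1.0 - abs(u))

def uniform(u):
    """K(u) = 1(|u| <= 1): equal weight inside the bandwidth."""
    return 1.0 if abs(u) <= 1 else 0.0

def epanechnikov(u):
    """K(u) = (1 - u^2) 1(|u| <= 1): weight declines quadratically with distance."""
    return max(0.0, 1.0 - u * u)

c, h = 0.0, 0.5   # hypothetical cutoff and bandwidth
for x in (0.0, 0.25, 0.45, 0.6):
    u = (x - c) / h   # transformed observation (X_i - c) / h
    print(x, triangular(u), uniform(u), epanechnikov(u))
```

The score 0.6 lies outside [c − h, c + h], so all three kernels give it zero weight.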
Different Kernel weights
Choosing the polynomial order

• A more consequential decision is the choice of the local polynomial order, which must consider
various factors.

• First, a polynomial of order zero (a constant fit) has undesirable theoretical properties at boundary
points, which is precisely where RD estimation must occur.

• Second, for a given bandwidth, increasing the order of the polynomial generally improves the
accuracy of the approximation but also increases the variability of the treatment effect estimator.

• Third, as mentioned above, higher-order polynomials tend to produce overfitting of the data and
lead to unreliable results near boundary points.

• Combined, these factors have led researchers to prefer the local linear RD estimator, which by now
is the default point estimator in most applications.
Bandwidth selection

• The bandwidth h controls the width of the neighborhood around the cutoff that is used to fit the
local polynomial that approximates the unknown regression functions.

• h directly affects the properties of local polynomial estimation and inference procedures, and
empirical findings are often sensitive to its particular value.

• Choosing a smaller h will reduce the misspecification error (also known as “smoothing bias”) of the
local polynomial approximation, but will simultaneously tend to increase the variance of the
estimated coefficients because fewer observations will be available for estimation. On the other
hand, a larger h will result in more smoothing bias if the unknown function differs considerably
from the polynomial model used for approximation, but will reduce the variance because the
number of observations in the interval [c – h, c + h] will be larger.

• For this reason, the choice of bandwidth is said to involve a “bias-variance trade-off.”
Bandwidth selection

• Since RD empirical results are often sensitive to the choice of bandwidth, it
is important to select h in a data-driven, automatic way to avoid
specification searching and ad hoc decisions.
• The most popular approach in practice seeks to minimize the MSE of the
local polynomial RD point estimator, $\hat{\tau}_{SRD}$, given a choice of polynomial
order and kernel function.
• Since the MSE of an estimator is the sum of its squared bias and its
variance, this approach effectively chooses h to optimize a bias-variance
trade-off.
Bandwidth selection

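• The general form of the approximate (conditional) MSE for the RD treatment effect is
• $MSE(\hat{\tau}_{SRD}) = \operatorname{Bias}^2(\hat{\tau}_{SRD}) + \operatorname{Var}(\hat{\tau}_{SRD}) = \mathcal{B}^2 + \mathcal{V}$
• where the approximate (conditional) squared bias and variance of the estimator are
• $\mathcal{B}^2 = h^{2(p+1)} B^2$ and $\mathcal{V} = \frac{1}{nh} V$

• The MSE-optimal bandwidth is given by
• $h_{MSE} = \left( \frac{V}{2(p+1) B^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}$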
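
As a quick numerical check, reading the approximate MSE as h^(2(p+1)) B² + V/(nh), the closed-form bandwidth can be compared with a brute-force grid search. The constants B and V below are hypothetical; in practice they are unknown and have to be estimated from the data.

```python
def mse(h, n, p, B, V):
    """Approximate MSE: squared bias h^(2(p+1)) B^2 plus variance V / (n h)."""
    return h ** (2 * (p + 1)) * B ** 2 + V / (n * h)

def h_mse(n, p, B, V):
    """Closed-form minimizer of mse(h) over h > 0."""
    return (V / (2 * (p + 1) * B ** 2)) ** (1 / (2 * p + 3)) * n ** (-1 / (2 * p + 3))

n, p, B, V = 1000, 1, 2.0, 4.0   # hypothetical constants, for illustration only
h_star = h_mse(n, p, B, V)

# Brute-force check on a grid of candidate bandwidths.
grid = [0.01 * k for k in range(1, 200)]
h_grid = min(grid, key=lambda h: mse(h, n, p, B, V))
```

The formula also shows the rate at which the optimal bandwidth shrinks with the sample size: for a local linear fit (p = 1), h_MSE is proportional to n^(-1/5).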
Bias in local approximation
Inference

• It is incoherent to simultaneously select a bandwidth according to a bias-variance trade-off
and then proceed as if the bias were zero.
• The problem is that the bandwidth choices are developed for point estimation purposes,
and as such they pay no attention to their effects in terms of distributional properties of
typical t-tests or related statistics.
• There are two general approaches that can be used to address this key problem. One
approach is to use the bandwidth hMSE for both estimation and inference, but modify the
usual t-statistic to account for the effects of misspecification due to the large bandwidth,
as well as for the additional sampling error introduced by such modification.
• The other is to use hMSE only for point estimation, and then choose a different
bandwidth for inference purposes.
Using the MSE-Optimal Bandwidth for Inference

• The local polynomial RD point estimator $\hat{\tau}_{SRD}$ has an approximate large-sample distribution
• $\frac{\hat{\tau}_{SRD} - \tau_{SRD} - \mathcal{B}}{\sqrt{\mathcal{V}}} \overset{a}{\sim} \mathcal{N}(0,1)$

• where ℬ and 𝒱 are, respectively, the asymptotic bias and variance of the RD local polynomial
estimator of order p, discussed previously in the context of MSE expansions and bandwidth
selection.

• This distributional result is similar to those encountered, for example, in standard linear regression
problems, with the important distinction that now the bias term ℬ features explicitly; this term
highlights the trade-off between bandwidth choice and misspecification bias locally to the cutoff.

• The variance term 𝒱 can be calculated as in (weighted) least-squares problems, for instance
accounting for heteroskedasticity and/or clustered data.
Using the MSE-Optimal Bandwidth for Inference

• Given the distributional approximation for the RD local polynomial estimator, an asymptotic
95% confidence interval for $\tau_{SRD}$ is approximately given by

• $CI = \left[ (\hat{\tau}_{SRD} - \mathcal{B}) \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$
• This confidence interval depends on the unknown bias or misspecification error ℬ, and any
practical procedure that ignores it will lead to incorrect inferences unless this term is
negligible.

• The bias term arises because the local polynomial approach is a non-parametric
approximation: instead of assuming that the underlying regression functions are pth order
polynomials (as would occur in OLS estimation), this approach uses the polynomial to
approximate the unknown regression functions.
Conventional Inference

• The conventional approach to statistical inference is naive: it treats the local polynomial approach as
parametric within the neighborhood around the cutoff and de facto ignores the bias term,
a procedure that leads to invalid inferences in all cases except when the approximation
error is so small that it can be ignored.
• When the bias term is zero, the approximate distribution of the RD estimator is
• $\frac{\hat{\tau}_{SRD} - \tau_{SRD}}{\sqrt{\mathcal{V}}} \overset{a}{\sim} \mathcal{N}(0,1)$

• and the CI is

• $CI_{us} = \left[ \hat{\tau}_{SRD} \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$


Standard Bias Correction

• As an alternative to undersmoothing (i.e., to choosing a bandwidth smaller than the MSE-optimal
bandwidth), inference could be based on the MSE-optimal bandwidth so long as
the induced misspecification error is manually estimated and removed from the
distributional approximation.
• This approach, known as bias correction, first estimates the bias term ℬ with the
estimator $\hat{\mathcal{B}}$ (which in fact is already estimated for implementation of MSE-optimal
bandwidth selection), and then constructs confidence intervals that are centered at the
bias-corrected point estimate:

• $CI_{bc} = \left[ (\hat{\tau}_{SRD} - \hat{\mathcal{B}}) \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$
Standard Bias Correction

• The bias term depends on the “curvature” of the unknown regression functions captured via their derivative of
order p+1 at the cutoff. These unknown derivatives can be estimated with a local polynomial of order q = p + 1
or higher, which requires another choice of bandwidth, denoted b.

• Therefore, the RD point estimate $\hat{\tau}_{SRD}$ employs the bandwidth h, while the bias estimate
$\hat{\mathcal{B}}$ employs the additional bandwidth b.

• The ratio ρ = h/b is important, as it relates to the variability of the bias correction estimate relative to the RD
point estimator. Standard bias correction methods require ρ = h/b → 0, that is, a small ρ. In particular, note this
rules out ρ = h/b = 1, that is, standard bias correction does not allow h = b.

• The bias-corrected confidence intervals CIbc allow for a wider range of bandwidths h and, in particular, result in
valid inferences when the MSE-optimal bandwidth is used. However, they typically have poor performance in
applications because the variability introduced in the bias estimation step is not incorporated in the variance
term used.
Robust Bias Correction

• The robust bias correction approach delivers valid inferences even when the MSE-optimal
bandwidth for point estimation is used (no undersmoothing is necessary) and remains valid even
when ρ = h/b = 1 (h = b), which implies that exactly the same data can be used for both point
estimation and inference.
• Robust bias-corrected confidence intervals are based on the bias correction procedure described
above, by which the estimated bias term $\hat{\mathcal{B}}$ is removed from the RD point estimator.

• However, in contrast to CIbc, the derivation allows the estimated bias term to converge in
distribution to a random variable and thus contribute to the distributional approximation of the RD
point estimator.
• This results in a new asymptotic variance 𝒱bc that, unlike the variance 𝒱 used in CIus and CIbc,
incorporates the contribution of the bias correction step to the variability of the bias-corrected
point estimator.
Robust Bias Correction

• Because the new variance 𝒱bc incorporates the extra variability introduced in the bias estimation step, it is
larger than the conventional OLS variance 𝒱 when the same bandwidth is used.

• This approach leads to the robust bias-corrected confidence interval:

• $CI_{rbc} = \left[ (\hat{\tau}_{SRD} - \hat{\mathcal{B}}) \pm 1.96 \cdot \sqrt{\mathcal{V}_{bc}} \right]$

• which is constructed by subtracting the bias estimate from the local polynomial estimator and using the new
variance formula for Studentization.

• This robust confidence interval results in valid inferences when the MSE-optimal bandwidth is used, because it
has smaller coverage errors and is therefore less sensitive to tuning parameter choices.

• In practice, the confidence interval can be implemented by setting ρ = h/b = 1 (h = b) and choosing h = hMSE, or
by selecting both h and b to be MSE-optimal for the corresponding estimators, in which case ρ is set to
hMSE/bMSE or their respective data-driven implementations.
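
The three confidence intervals discussed in this section differ only in where they are centered and which variance they use. The sketch below makes that explicit; the function names and all numerical inputs are hypothetical, and in practice the bias estimate and both variances come from the estimation procedure itself.

```python
from math import sqrt

def ci_conventional(tau_hat, var, z=1.96):
    """CI_us: centered at the point estimate, conventional variance, bias ignored."""
    half = z * sqrt(var)
    return (tau_hat - half, tau_hat + half)

def ci_bias_corrected(tau_hat, bias_hat, var, z=1.96):
    """CI_bc: centered at the bias-corrected estimate, but still the conventional variance."""
    center = tau_hat - bias_hat
    half = z * sqrt(var)
    return (center - half, center + half)

def ci_robust(tau_hat, bias_hat, var_bc, z=1.96):
    """CI_rbc: bias-corrected center and the variance that accounts for bias estimation."""
    center = tau_hat - bias_hat
    half = z * sqrt(var_bc)
    return (center - half, center + half)

# Hypothetical inputs, for illustration only.
tau_hat, bias_hat, var, var_bc = 2.0, 0.1, 0.04, 0.0625
ci_us = ci_conventional(tau_hat, var)
ci_bc = ci_bias_corrected(tau_hat, bias_hat, var)
ci_rbc = ci_robust(tau_hat, bias_hat, var_bc)
```

With these inputs, CI_bc and CI_rbc share the same center, and CI_rbc is wider because its variance is larger. Note that neither is centered at the point estimate itself.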
Robust Bias Correction

• The most important feature of the robust bias-corrected confidence
interval CIrbc is that it can be used with the MSE-optimal point estimator
$\hat{\tau}_{SRD}$ when this estimator is constructed using the MSE-optimal bandwidth
choice hMSE.
• In other words, using the robust bias-corrected confidence interval allows
researchers to use the same observations with score $X_i \in [c - h_{MSE}, c + h_{MSE}]$
for both optimal point estimation and valid statistical inference.
Summary of inference
Validation and falsification
Validation tests

• There are five empirical validation tests, based on:
• (i) the null treatment effect on predetermined covariates or placebo
outcomes,
• (ii) the continuity of the score density around the cutoff,
• (iii) the treatment effect at artificial cutoff values,
• (iv) the exclusion of observations near the cutoff, and
• (v) the sensitivity to bandwidth choices.
Predetermined covariates and placebo outcomes

• One of the most important RD falsification tests involves examining whether, near the
cutoff, treated units are similar to control units in terms of observable characteristics.
• Thus, except for their treatment status, units just above and just below the cutoff should
be similar in all variables that could not have been affected by the treatment.
• These variables can be divided into two groups: variables that are determined before the
treatment is assigned, which we call predetermined covariates;
• and variables that are determined after the treatment is assigned but, according to
substantive knowledge about the treatment's causal mechanism, could not possibly have
been affected by the treatment, which we call placebo outcomes.
Predetermined covariates and placebo outcomes

• All predetermined covariates and placebo outcomes should be analyzed in the same way as
the outcome of interest: for each predetermined covariate or placebo outcome, researchers
should first choose an optimal bandwidth, and then use local polynomial techniques within
that bandwidth to estimate the “treatment effect” and employ valid inference procedures
such as the robust bias-corrected methods discussed previously.

• The fundamental idea behind this test is that, since the predetermined covariate (or placebo
outcome) could not have been affected by the treatment, the null hypothesis of no treatment
effect should not be rejected if the RD design is valid.

• The reasoning is that if covariates or placebo outcomes that are known to correlate strongly
with the outcome of interest are discontinuous at the cutoff, the continuity of the potential
outcome functions is unlikely to hold, and thus the validity of the design is called into question.
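As a rough illustration of this test, the sketch below estimates the jump in a covariate at the cutoff with a triangular-kernel local linear fit on each side. It assumes a hand-picked bandwidth `h` and simulated data, and it returns only a point estimate: the optimal bandwidth choice and the robust bias-corrected inference recommended above are not implemented. The names `local_linear_at_cutoff` and `rd_jump` are hypothetical.

```python
def local_linear_at_cutoff(x, y, cutoff, h, side):
    """Fitted value at the cutoff from a triangular-kernel weighted linear fit on one side."""
    pts = [(xi - cutoff, yi) for xi, yi in zip(x, y)
           if abs(xi - cutoff) < h and (xi >= cutoff) == (side == "right")]
    w = [1 - abs(u) / h for u, _ in pts]              # triangular kernel weights
    sw = sum(w)
    mx = sum(wi * u for wi, (u, _) in zip(w, pts)) / sw
    my = sum(wi * v for wi, (_, v) in zip(w, pts)) / sw
    sxx = sum(wi * (u - mx) ** 2 for wi, (u, _) in zip(w, pts))
    sxy = sum(wi * (u - mx) * (v - my) for wi, (u, v) in zip(w, pts))
    return my - (sxy / sxx) * mx                      # intercept = fit at the cutoff


def rd_jump(x, y, cutoff, h):
    """Estimated discontinuity at the cutoff: right limit minus left limit."""
    return (local_linear_at_cutoff(x, y, cutoff, h, "right")
            - local_linear_at_cutoff(x, y, cutoff, h, "left"))


# Simulated (hypothetical) data: one covariate that is smooth in the score and
# one that jumps at the cutoff, which would cast doubt on the design.
score = [i / 50 for i in range(-50, 50)]
smooth_cov = [2 + 0.5 * s for s in score]
jumpy_cov = [2 + 0.5 * s + (1.0 if s >= 0 else 0.0) for s in score]

print(rd_jump(score, smooth_cov, cutoff=0.0, h=0.5))  # close to 0
print(rd_jump(score, jumpy_cov, cutoff=0.0, h=0.5))   # close to 1
```

In an application the same function would be applied to each predetermined covariate and placebo outcome in turn, with a bandwidth chosen separately for each one.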
Density of the running variable

• The second type of falsification test examines whether, in a local neighborhood near the
cutoff, the number of observations below the cutoff is surprisingly different from the
number of observations above it.
• The underlying assumption is that, if units do not have the ability to precisely manipulate
the value of the score that they receive, the number of treated observations just above
the cutoff should be approximately similar to the number of control observations below
it.
• 1. Histogram
• 2. Density test. McCrary test.
McCrary
• McCrary test: -.0035 (.0027)
• H0 is not rejected: no evidence of manipulation.
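A simple complement to the histogram is an exact binomial test on the number of observations just below and just above the cutoff. The sketch below is not McCrary's test (which compares local density estimates); it assumes a hand-picked window and a probability of 1/2 of falling on either side, and the function name `binomial_density_test` is hypothetical.

```python
from math import comb


def binomial_density_test(scores, cutoff, window, p=0.5):
    """Two-sided exact binomial test for the count of observations above the cutoff."""
    left = sum(1 for x in scores if cutoff - window <= x < cutoff)
    right = sum(1 for x in scores if cutoff <= x <= cutoff + window)
    n = left + right
    probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[right]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return left, right, min(1.0, p_value)


# Hypothetical scores: three observations on each side of a cutoff at 0.
print(binomial_density_test([-0.9, -0.5, -0.1, 0.1, 0.5, 0.9], cutoff=0.0, window=1.0))
```

A small p-value indicates that the split of observations around the cutoff is unlikely under the assumed probability, which is the pattern one would expect if units sorted themselves onto one side.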
Placebo cutoffs

• Another useful falsification analysis examines treatment effects at artificial or placebo cutoff values.
• This test replaces the true cutoff value by another value at which the
treatment status does not really change, and performs estimation and
inference using this artificial cutoff point.
• The expectation is that no significant treatment effect will occur at placebo
cutoff values.
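The placebo cutoff exercise can be sketched as follows. To keep the true discontinuity from contaminating the placebo estimates, the sketch uses only treated observations for artificial cutoffs above the true cutoff and only control observations for those below it. The bandwidth, the simulated data and the function names are assumptions made for illustration, and no inference is performed.

```python
def local_linear_at_cutoff(x, y, cutoff, h, side):
    """Fitted value at the cutoff from a triangular-kernel weighted linear fit on one side."""
    pts = [(xi - cutoff, yi) for xi, yi in zip(x, y)
           if abs(xi - cutoff) < h and (xi >= cutoff) == (side == "right")]
    w = [1 - abs(u) / h for u, _ in pts]
    sw = sum(w)
    mx = sum(wi * u for wi, (u, _) in zip(w, pts)) / sw
    my = sum(wi * v for wi, (_, v) in zip(w, pts)) / sw
    sxx = sum(wi * (u - mx) ** 2 for wi, (u, _) in zip(w, pts))
    sxy = sum(wi * (u - mx) * (v - my) for wi, (u, v) in zip(w, pts))
    return my - (sxy / sxx) * mx


def rd_jump(x, y, cutoff, h):
    return (local_linear_at_cutoff(x, y, cutoff, h, "right")
            - local_linear_at_cutoff(x, y, cutoff, h, "left"))


def placebo_jump(x, y, true_cutoff, placebo_cutoff, h):
    """RD estimate at an artificial cutoff, using one side of the true cutoff only."""
    if placebo_cutoff == true_cutoff:
        raise ValueError("placebo cutoff must differ from the true cutoff")
    use_treated = placebo_cutoff > true_cutoff
    kept = [(xi, yi) for xi, yi in zip(x, y) if (xi >= true_cutoff) == use_treated]
    xs, ys = zip(*kept)
    return rd_jump(xs, ys, placebo_cutoff, h)


# Simulated (hypothetical) data with a jump of 2 at the true cutoff 0 only.
score = [i / 50 for i in range(-100, 100)]
outcome = [1 + 0.5 * s + (2.0 if s >= 0 else 0.0) for s in score]

for c in (-1.0, 1.0):
    print(c, placebo_jump(score, outcome, true_cutoff=0.0, placebo_cutoff=c, h=0.5))
```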
Sensitivity to observations near the cutoff

• Another falsification approach seeks to investigate how sensitive the results are to the response of
units who are located very close to the cutoff.
• If systematic manipulation of score values has occurred, it is natural to assume that the units closest
to the cutoff are those most likely to have engaged in manipulation.
• The idea behind this approach is to exclude such units and then repeat the estimation and
inference analysis using the remaining sample.
• This idea is sometimes referred to as a “donut hole” approach.
• Even when manipulation of the score is not suspected, this strategy is also useful to assess the
sensitivity of the results to the unavoidable extrapolation involved in local polynomial estimation, as
the few observations closest to the cutoff are likely to be the most influential when fitting the local
polynomials.
Sensitivity to observations near the cutoff

• Note that, although the total number of observations will always decrease
when observations closest to the cutoff are excluded, the effective number
of observations used in the analysis may increase or decrease, depending
on how the bandwidth changes.
• In practice, it is natural to repeat this exercise a few times to assess the
actual sensitivity for different amounts of excluded units.
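A minimal sketch of the donut hole exercise, assuming simulated data in which the units just above the cutoff have inflated outcomes. For simplicity the bandwidth `h` is held fixed across donut sizes, whereas in practice it would be re-selected after each exclusion. All names and values are hypothetical.

```python
def local_linear_at_cutoff(x, y, cutoff, h, side):
    """Fitted value at the cutoff from a triangular-kernel weighted linear fit on one side."""
    pts = [(xi - cutoff, yi) for xi, yi in zip(x, y)
           if abs(xi - cutoff) < h and (xi >= cutoff) == (side == "right")]
    w = [1 - abs(u) / h for u, _ in pts]
    sw = sum(w)
    mx = sum(wi * u for wi, (u, _) in zip(w, pts)) / sw
    my = sum(wi * v for wi, (_, v) in zip(w, pts)) / sw
    sxx = sum(wi * (u - mx) ** 2 for wi, (u, _) in zip(w, pts))
    sxy = sum(wi * (u - mx) * (v - my) for wi, (u, v) in zip(w, pts))
    return my - (sxy / sxx) * mx


def rd_jump(x, y, cutoff, h):
    return (local_linear_at_cutoff(x, y, cutoff, h, "right")
            - local_linear_at_cutoff(x, y, cutoff, h, "left"))


def donut_jump(x, y, cutoff, h, radius):
    """RD estimate after dropping observations closer than `radius` to the cutoff."""
    kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi - cutoff) >= radius]
    xs, ys = zip(*kept)
    return rd_jump(xs, ys, cutoff, h)


# True jump of 2, plus hypothetical manipulation by units with score in [0, 0.05).
score = [i / 50 for i in range(-50, 50)]
outcome = [1 + 0.5 * s + (2.0 if s >= 0 else 0.0) for s in score]
outcome = [y + (1.0 if 0 <= s < 0.05 else 0.0) for s, y in zip(score, outcome)]

for radius in (0.0, 0.05, 0.1):
    print(radius, donut_jump(score, outcome, cutoff=0.0, h=0.5, radius=radius))
```

With `radius=0.0` nothing is excluded, so the first line is the full-sample estimate; the other lines show how the estimate moves once the closest units are dropped.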
Sensitivity to Bandwidth Choice

• In contrast to the donut hole approach, which investigates sensitivity as units from the center of the neighborhood around the cutoff are removed, the method we discuss now investigates sensitivity as units are added or removed at the end points of the neighborhood.
• It requires employing local polynomial methods with different bandwidth
choices.
• It is well understood how the bandwidth will affect the results: as the
bandwidth increases, the bias of the local polynomial estimator increases
and its variance decreases.
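The bias side of this trade-off can be seen in a small noise-free sketch. It assumes a hypothetical outcome that is curved above the cutoff and flat below it, with a true jump of 1, and re-estimates the jump with a local linear fit for several bandwidths. Because there is no noise, the variance side of the trade-off is not shown. Function names are hypothetical.

```python
def local_linear_at_cutoff(x, y, cutoff, h, side):
    """Fitted value at the cutoff from a triangular-kernel weighted linear fit on one side."""
    pts = [(xi - cutoff, yi) for xi, yi in zip(x, y)
           if abs(xi - cutoff) < h and (xi >= cutoff) == (side == "right")]
    w = [1 - abs(u) / h for u, _ in pts]
    sw = sum(w)
    mx = sum(wi * u for wi, (u, _) in zip(w, pts)) / sw
    my = sum(wi * v for wi, (_, v) in zip(w, pts)) / sw
    sxx = sum(wi * (u - mx) ** 2 for wi, (u, _) in zip(w, pts))
    sxy = sum(wi * (u - mx) * (v - my) for wi, (u, v) in zip(w, pts))
    return my - (sxy / sxx) * mx


def rd_jump(x, y, cutoff, h):
    return (local_linear_at_cutoff(x, y, cutoff, h, "right")
            - local_linear_at_cutoff(x, y, cutoff, h, "left"))


# Hypothetical data: E[Y|X=x] = 1 + x^2 above the cutoff and 0 below it.
score = [i / 100 for i in range(-100, 100)]
outcome = [(1 + s ** 2) if s >= 0 else 0.0 for s in score]

for h in (0.2, 0.5, 1.0):
    estimate = rd_jump(score, outcome, cutoff=0.0, h=h)
    print(h, estimate, estimate - 1.0)   # bandwidth, estimate, bias
```

The linear fit misses more of the curvature as the bandwidth grows, so the bias column moves further from zero.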
Local randomization approach
Introduction

• Under the local randomization framework for RD designs, the score values are assumed to be as-if
randomly assigned in a small window around the cutoff, so that placement above or below the
cutoff and hence treatment assignment can be interpreted to be as-if experimental.

• This contrasts with the continuity based approach, where extrapolation to the cutoff plays a
predominant role. Once the local randomization assumption is invoked, the analysis can proceed by
using tools from the analysis of experiments.

• This alternative framework can be used as a complement and robustness check to the continuity
based analysis when the running variable is continuous (under appropriate assumptions), and is a
natural framework for analysis when the running variable is discrete and has few mass points.

Introduction

• The local randomization approach to RD analysis formalizes the idea that the RD design
behaves like a randomized experiment near the cutoff by imposing explicit
randomization-type assumptions that are stronger than the continuity-based conditions.
• When the running variable is continuous, the local randomization approach typically
requires stronger assumptions than the continuity-based approach; in these cases, it is
natural to use the continuity-based approach for the main RD analysis, and to use the
local randomization approach as a robustness check.
• But in settings where the running variable is discrete or other departures from the
canonical RD framework occur, the local randomization approach no longer imposes the
strongest assumptions and can be a natural and useful method for estimation and
inference in practice.
Introduction

• When the RD design is based on a local randomization assumption, instead of assuming that the unknown regression functions $\mu_1(x) = E[Y_i(1) \mid X_i = x]$ and $\mu_0(x) = E[Y_i(0) \mid X_i = x]$ are continuous at the cutoff, the researcher assumes that there is a small window around the cutoff, defined as $W = [c - w; c + w]$, such that for all units whose scores fall in that window their placement above or below the cutoff is assigned as it would have been assigned in a randomized experiment (an assumption that is sometimes called as-if random assignment).
• There are important differences between the RD design and an actual
randomized experiment.
RD versus experiment

• The crucial feature of a randomized experiment recast as an RD design is that the running variable, by virtue of being a randomly generated number, is unrelated to the average potential outcomes.
• This is the reason why, in Figure 1(a), $\mu_1(x) = E[Y_i(1) \mid X_i = x]$ and $\mu_0(x) = E[Y_i(0) \mid X_i = x]$ are constant for all values of x.
• In contrast, in the standard continuity-based RD design there is no requirement that the potential outcomes be unrelated to the running variable over its support. Figure 1(b) illustrates a continuity-based RD design where the average treatment effect at the cutoff, $\tau_{SRD}$, is the same as in the experimental setting in Figure 1(a) but where the average potential outcomes are non-constant functions of the score.
RD versus experiment. Figure 1
RD versus experiment

• In a continuity-based approach, the RD treatment effect in Figure 1(b) can be estimated by calculating the limit of the conditional average of the observed outcomes given the score as the score approaches the cutoff for the treatment and control groups separately, $\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$.

• The estimation of these limits requires that the researcher approximate the regression functions,
and this approximation will typically contain an error that may directly affect estimation and
inference.

• This is in stark contrast to the experiment depicted in Figure 1(a), where estimation does not
require functional form assumptions: by construction, the regression functions are constant in the
entire region where the score is randomly assigned.

• This shows that RD designs are not canonical randomized experiments but rather natural
experiments, and thus belong to the toolkit of observational studies methods.
Local randomization approach

• A point often overlooked is that the known functional form of the regression functions in
a true experiment does not follow from the random assignment of the score per se, but
rather from the lack of relationship between the score and the potential outcomes that is
assumed to be a consequence of the randomization. If the value of the score were
randomly assigned but had a direct effect on the average outcomes, the regression
functions in Figure 1(a) would not necessarily be flat.
• A local randomization approach to RD analysis must thus be based not only on the
assumption that placement above or below the cutoff is randomly assigned within a
window of the cutoff, but also on the assumption that the value of the score within this
window is unrelated to the potential outcomes -a condition that is not guaranteed by the
random assignment of the score Xi (nor by the random assignment of the treatment Ti).
Local randomization approach

• To formalize, let $W = [c - w; c + w]$ for some window length w > 0, and let $X_W$ be the vector of scores for all i such that $X_i \in W$. The basic local randomization framework can be summarized by the following two conditions:
• (LR1) The joint probability distribution of scores within W is known.
• (LR2) The potential outcomes are not affected by the score within W.
• The first condition requires that, inside the window, the assignment
mechanism of the score is known, as would happen in a randomized
experiment.
Local randomization approach

• LR1 requires that $P_W[X_W \le x] = F(x)$ for some known joint c.d.f. $F(x)$. For example, this condition holds when all units have the same probability of receiving all possible score values in W, and therefore equal probability of being assigned to control ($X_i < c$) or treatment ($X_i \ge c$) when the window W is symmetric around the cutoff c.
• The second condition, LR2, is the exclusion restriction ensuring that the
potential outcomes are not a function of the score for those units with
score inside W, as would be expected in a true double-blind randomized
experiment.
Local randomization approach

• Under LR1 and LR2, for all units with $X_i \in W = [c - w; c + w]$, placement above or below the cutoff is unrelated to the potential outcomes, and the potential outcomes are unrelated to the running variable; therefore, the regression functions are flat inside W.
• This is illustrated in Figure 2, where, for the case of random potential outcomes, $\mu_1(x) = E[Y_i(1) \mid X_i = x]$ and $\mu_0(x) = E[Y_i(0) \mid X_i = x]$ are constant for all values of x in W.
• The contrast between Figures 1(a), 1(b), and 2 illustrates the differences between
an actual experiment where the score is a randomly generated number, a
continuity-based RD design, and a local randomization RD design.
Figure 2.
Differences: experiment, continuity-based RD and local randomization RD

• In the actual experiment, the potential outcomes are unrelated to the score for all possible score
values. In this case, the functional forms of E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are known.
• In the continuity-based RD design, the potential outcomes can be related to the score everywhere;
the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are unknown but assumed to be smooth, and
estimation and inference is based on approximating them near the cutoff.
• Finally, in the local randomization RD design, the potential outcomes can be related to the running
variable far from the cutoff, but there is a window around the cutoff where this relationship ceases.
In this case, the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are unknown over the entire support of
the running variable, but inside the window W they are assumed to be constant functions of x -and
are therefore known.
• A key advantage of the local randomization approach is that it enables finite sample inference
methods, which remain valid and can be used even when only a handful of observations very close
to the cutoff are included in the analysis.
Local Randomization. Estimation and
Inference
Estimation

• The implementation of experimental methods to analyze RD designs thus requires knowledge or estimation of two important ingredients:
• (i) the window W where the local randomization assumption is invoked;
and
• (ii) the randomization mechanism that is needed to approximate the
assignment of units within W to the treatment and control conditions (i.e.,
to placement above or below the cutoff).
• Given a choice of W and assignment mechanism, under a local
randomization RD approach, we can analyze the data as we would analyze
an experiment.
Randomization mechanisms: comparing the
three approaches
Parameters of interest

• Regardless of whether a Fisher, Neyman or super-population approach is adopted, we can now define parameters of interest.
• Let $E_W[\cdot]$ denote the expectation computed with respect to the probability $P_W$, that is, the expectation computed conditionally for those units with $X_i \in W$.
• The local randomization sharp RD treatment effect is the average treatment effect inside W:
• $\theta_{SRD} = \frac{1}{N_W} \sum_{i: X_i \in W} E_W[Y_i(1) - Y_i(0)]$.

• The definition of 𝜃𝑆𝑅𝐷 is designed to cover both random and non-random potential
outcomes under different sampling schemes.
Estimation

• In a Neyman framework, it reduces to $\theta_{SRD} = \frac{1}{N_W} \sum_{i: X_i \in W} [Y_i(1) - Y_i(0)]$ because the potential outcomes are fixed, so the (conditional) expectation leaves them unchanged.
• In the super-population framework under i.i.d. sampling, we have
• $\theta_{SRD} = E[Y_i(1) - Y_i(0) \mid X_i \in W]$
• The parameter 𝜃𝑆𝑅𝐷 is different from the more conventional continuity-based RD
parameter 𝜏𝑆𝑅𝐷 .
• While 𝜃𝑆𝑅𝐷 is an average effect inside an interval (the window W), 𝜏𝑆𝑅𝐷 is an average at a
single point (the cutoff c) where the number of observations is zero whenever the score is
continuously distributed.
Estimation

• Naturally, the smaller the window W is, the more conceptually similar 𝜃𝑆𝑅𝐷
and 𝜏𝑆𝑅𝐷 become.
• Under the local randomization assumptions invoked within W, we have
• $\theta_{SRD} = \frac{1}{N_W} \sum_{i: X_i \in W} E_W\left[\frac{T_i Y_i}{P_W[T_i = 1]}\right] - \frac{1}{N_W} \sum_{i: X_i \in W} E_W\left[\frac{(1 - T_i) Y_i}{1 - P_W[T_i = 1]}\right]$
• regardless of whether the potential outcomes are fixed or random.
Estimation

• This identification result expresses the counterfactual RD effect $\theta_{SRD}$ as a function of observed random variables, and suggests the weighted difference-in-means estimator
• $\hat{\theta}_{SRD} = \bar{Y}_{W,+} - \bar{Y}_{W,-}$
• $\bar{Y}_{W,+} = \frac{1}{N_{W,+}} \sum_{i: X_i \in W} \omega_i T_i Y_i$; $\bar{Y}_{W,-} = \frac{1}{N_{W,-}} \sum_{i: X_i \in W} \omega_i (1 - T_i) Y_i$
• where $\omega_i$ denotes an appropriately defined weighting scheme for unit i.
• The standard difference-in-means estimator is a particular case of $\hat{\theta}_{SRD}$ with $\omega_i = 1$ for all units.
Inference

• When the assignment mechanism follows a fixed-margins randomization (or complete randomization), this choice of weighting scheme makes $\hat{\theta}_{SRD}$ unbiased for $\theta_{SRD}$, that is, $E_W[\hat{\theta}_{SRD}] = \theta_{SRD}$.
• By implication, whenever the assignment mechanism does not follow a fixed-
margins randomization, the unweighted difference-in-means estimator is not
unbiased for 𝜃𝑆𝑅𝐷 , although it is consistent under standard large sample
arguments.
• Thus, whenever the randomization mechanism is assumed to be different from a
fixed-margins randomization, the use of the unweighted difference-in-means
estimator must be justified based on large sample approximations.
Inference

• For inference, both the Neyman and the super-population approaches rely on a Gaussian
approximation justified by appropriate central limit theorems.

• A possibly conservative estimator of the variance of $\hat{\theta}_{SRD}$ can be constructed using standard least squares results. A 100(1 − α)% confidence interval can be constructed in the usual way by relying on a Gaussian large sample approximation to the statistic of interest. For example, an approximate two-sided 95% confidence interval is
• $CI_{LS} = \left[\hat{\theta}_{SRD} \pm 1.96 \cdot \sqrt{\hat{V}}\right]$
• where $\hat{V}$ denotes an appropriate choice of variance estimator, which can depend on the specific framework considered.
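A minimal sketch of this large sample interval, assuming the unweighted difference-in-means estimator inside a given window and the usual conservative variance estimator (the sum of the sample variances of each group divided by its size). The function name `diff_in_means_ci` and the data are hypothetical, and each side of the cutoff needs at least two observations.

```python
from math import sqrt
from statistics import mean, variance


def diff_in_means_ci(scores, outcomes, cutoff, w, z=1.96):
    """Difference in means inside [cutoff - w, cutoff + w] with a Gaussian interval."""
    treated = [y for x, y in zip(scores, outcomes) if cutoff <= x <= cutoff + w]
    control = [y for x, y in zip(scores, outcomes) if cutoff - w <= x < cutoff]
    estimate = mean(treated) - mean(control)
    # Conservative variance estimator: s1^2 / n1 + s0^2 / n0.
    v_hat = variance(treated) / len(treated) + variance(control) / len(control)
    half_width = z * sqrt(v_hat)
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical data; the unit with score 5.0 lies outside the window and is ignored.
scores = [-0.2, -0.1, 0.1, 0.2, 5.0]
outcomes = [1.0, 3.0, 3.0, 5.0, 100.0]
print(diff_in_means_ci(scores, outcomes, cutoff=0.0, w=0.5))
```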
Inference

• Hypothesis testing is based on Gaussian approximations as well.
• The Neyman or super-population null hypothesis is
• $H_0: \frac{1}{N_W} \sum_{i: X_i \in W} E_W[Y_i(1)] = \frac{1}{N_W} \sum_{i: X_i \in W} E_W[Y_i(0)]$
• In contrast to Fisher's sharp null hypothesis $H_0^F$, this null hypothesis does not allow us to calculate the full profile of potential outcomes for every possible realization of the treatment assignment vector.
• Thus, unlike the Fisherian approach, the large sample approach to hypothesis testing relies on an approximation and is therefore not exact but, when valid, it allows us to rely on well-known methods for estimation and inference based on least squares and related approaches.
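For contrast, the Fisherian test can be sketched as follows. Under the sharp null of no effect for any unit, the observed outcomes are the outcomes under every assignment, so the null distribution of a statistic can be enumerated exactly. The sketch assumes a fixed-margins (complete) randomization inside the window and the absolute difference in means as the test statistic; the function name `fisher_pvalue` and the data are hypothetical. Full enumeration is only practical for small windows.

```python
from itertools import combinations
from statistics import mean


def fisher_pvalue(outcomes, treated):
    """Exact randomization p-value for the sharp null of no effect for any unit."""
    n, n1 = len(outcomes), sum(treated)

    def stat(treated_idx):
        chosen = set(treated_idx)
        y1 = [outcomes[i] for i in range(n) if i in chosen]
        y0 = [outcomes[i] for i in range(n) if i not in chosen]
        return abs(mean(y1) - mean(y0))

    observed = stat([i for i in range(n) if treated[i] == 1])
    # Every way of assigning n1 of the n units to treatment is equally likely.
    draws = [stat(idx) for idx in combinations(range(n), n1)]
    return sum(s >= observed - 1e-12 for s in draws) / len(draws)


# Hypothetical outcomes for six units inside the window, three of them treated.
print(fisher_pvalue([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # 0.1
```

Only 2 of the 20 possible assignments give a difference at least as large as the observed one, hence the p-value of 0.1.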
The Window
How to choose the window

• One option is to choose W in an ad hoc way. The disadvantage of an ad hoc method is that it lacks transparency and objectivity.
• A preferred alternative is to use a principled data-driven procedure. A leading
example is based on predetermined covariates -variables that capture important
characteristics of the units and whose values are determined before the
treatment is assigned and received.
• This approach requires assuming that there exists at least one predetermined
covariate, Zi, that is associated with the running variable only outside the window
W = W0 where the local randomization assumptions hold.
How to choose the window

• Specifically, the requirement is that Zi be associated with the score in windows larger
than W0, possibly due to correlation between the score and another characteristic that
also affects Zi, but independent of the score in W0 and all smaller windows. Moreover,
because Zi is a predetermined covariate, the effect of the treatment on Zi is zero by
construction.
• We define a generic null hypothesis H0 stating that the treatment is unrelated to Zi (or
that Zi is “balanced" between the groups).

• This hypothesis could be the Fisherian hypothesis $H_0^F$ or the large sample hypothesis H0.
• The procedure starts with the smallest possible window (W1 in Figure 1) and tests H0.
How to choose the window

• Since there is no treatment effect inside W1, H0 will fail to be rejected.


• A larger window W2 is selected, and the null hypothesis is tested again inside W2. The procedure keeps increasing the length of the window and re-testing H0 in each larger window, until a window is reached where H0 is rejected at the chosen significance level $\alpha^* \in (0,1)$.
• In the figure, assuming the test has perfect power, H0 will not be rejected in W0, nor will
it be rejected in W2 or W1.
• The chosen window is the largest window such that H0 fails to be rejected inside that
window and in all windows contained in it.
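The selection rule can be sketched as follows, assuming the balance p-value for each candidate window has already been computed (for instance with a randomization test on a predetermined covariate). The function name `select_window`, the candidate half-widths and the p-values are hypothetical; the default level of 0.15 follows the recommendation given in these slides.

```python
def select_window(candidates, alpha=0.15):
    """Largest half-width such that balance is not rejected in it or in any smaller window.

    candidates: iterable of (half_width, p_value) pairs for nested windows.
    Returns None if balance is already rejected in the smallest window.
    """
    chosen = None
    for half_width, p_value in sorted(candidates):
        if p_value < alpha:      # balance rejected: stop enlarging the window
            break
        chosen = half_width
    return chosen


# Hypothetical p-values: balance is first rejected at half-width 0.3.
candidates = [(0.1, 0.80), (0.2, 0.50), (0.3, 0.10), (0.4, 0.30)]
print(select_window(candidates))  # 0.2
```

The window of half-width 0.4 is not selected even though its own p-value exceeds the threshold, because a window contained in it already failed the test.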
Figure 1.
Choosing the window

• The practical implementation of the procedure requires several choices:


• 1. Null hypothesis. Since the procedure will typically involve some windows with very few observations, we recommend using the Fisherian methods for the sharp null hypothesis,
• $H_0^F: Z_i(1) = Z_i(0)$ for all i.


• 2. Relevant covariates. The covariates employed should be related to both the outcome
and the treatment assignment. If multiple covariates are chosen, the procedure can be
applied using either the p-value of an omnibus test statistic, or by testing H0 for each
covariate separately and using the minimum p-value across all covariates.
Choosing the window

• 3. Test statistic. Typical choices of the statistic used to test H0 include the
difference-in-means, the Kolmogorov-Smirnov statistic, and the Wilcoxon
rank-sum statistic.
• 4. Randomization mechanism. If Fisherian inference is used, typical choices
of randomization mechanisms are complete randomization and Bernoulli
assignment. Large sample methods do not require the full specification of
the assignment mechanism, but do require general assumptions about the
type of assignment such as no stratification.
Choosing the window

• 5. Minimum number of observations in the smallest window. If the smallest window where H0 is tested is too small, it will contain too few observations and the power to reject the null hypothesis when it is false will be too low. Thus, the smallest window should contain a minimum number of observations to ensure acceptable power; we recommend at least roughly ten observations on either side of the cutoff.
• 6. Level $\alpha^*$. Because the main concern is failing to reject a false H0, the threshold significance level that determines when H0 is rejected should be higher than the usual 0.05. When we test H0 at a higher level, we tolerate a higher probability of Type I error and a lower probability of concluding that the covariate is unrelated to the treatment assignment when in fact it is related. We recommend setting $\alpha^* = 0.15$ if possible, and ideally no smaller than 0.10.
Validation and falsification analysis
Validation and falsification analysis

• Falsification and validation analyses are as important in the local randomization framework as they are in the continuity-based framework; the difference resides in their implementation. Instead of providing empirical evidence in favor of continuity assumptions as in the continuity-based approach, the main goal in a local randomization approach is to provide evidence consistent with the local randomization assumptions.
• There are four types of empirical falsification tests for a local randomization RD design, all of which were discussed in the context of the continuity-based approach: (i) tests of a null treatment effect on pre-treatment covariates or placebo outcomes, (ii) tests to assess the density of the score around the cutoff, (iii) treatment effect estimation at artificial cutoff values, and (iv) sensitivity to neighborhood choices.
When to use local randomization approach

• The continuity-based approach is more convenient when there are enough observations
near the cutoff to approximate the regression functions with reasonable accuracy -but
possibly inadequate when the number of observations is small.
• In applications with few observations, the local randomization approach has the
advantage of requiring minimal extrapolation and avoiding the use of smoothing
methods.
• Another situation in which a local randomization approach may be preferable to a continuity-based approach is when the running variable is discrete, i.e., when multiple units share the same value of the score. When the score is discrete, the continuity-based approach is not directly applicable, and the local randomization approach is often a natural and useful alternative.
The Fuzzy RD design
Fuzzy design

• Some units fail to comply with the treatment condition that is assigned to
them.
• The phenomenon of units receiving a treatment condition different from
the condition that is originally assigned to them is generally known as
imperfect compliance or non-compliance.
• The RD design with imperfect compliance is usually referred to as the
Fuzzy RD design, to distinguish it from the Sharp RD design where
compliance is perfect.
Fuzzy design

• The Fuzzy RD treatment assignment rule is still 𝑇𝑖 = 1(𝑋𝑖 ≥ 𝑐) but compliance with this
assignment is imperfect.
• As a consequence, although the probability of receiving treatment still jumps abruptly at
the cutoff, it no longer changes from 0 to 1 as in the Sharp RD case.
• We use Di to denote whether the treatment was actually received by unit i. Our notation now distinguishes between the treatment assigned, Ti, and the treatment received, Di.
• We can thus say that the key characteristic of the Fuzzy RD design is that there are some
units for which 𝑇𝑖 ≠ 𝐷𝑖 .
Figure 1. Sharp and fuzzy design
Estimation and inference
Estimates using OLS

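• The estimation using OLS is
• $Y_i^{RD} = C_i^{RD\prime} \beta_{rd} + f(X_i^{RD}) + \tau_{rd} W_i^{RD} + \varepsilon_i$
• From this point on, $W_i$ denotes the treatment indicator, not the window W of the local randomization approach.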


The Fuzzy Regression Discontinuity (FRD) design

• The probability of receiving the treatment does not change from 0 to 1 at the threshold.
• Instead, the design allows for a smaller jump in the probability of assignment to the treatment at the threshold:
• $\lim_{x \downarrow c} \Pr(W_i = 1 \mid X_i = x) \neq \lim_{x \uparrow c} \Pr(W_i = 1 \mid X_i = x)$
• without requiring the jump to equal 1.


Example

[Figure: Relation between treatment and poverty index. Vertical axis: probability of treatment (0 to .8); horizontal axis: poverty index (48 to 54).]
FRD as an IV

• The estimand is the ratio of the jump in the regression of the outcome on the covariate to the jump in the regression of the treatment indicator on the covariate.
• Formally:
• $\tau_{FRD} = \frac{\lim_{x \downarrow c} E[Y \mid X = x] - \lim_{x \uparrow c} E[Y \mid X = x]}{\lim_{x \downarrow c} E[W \mid X = x] - \lim_{x \uparrow c} E[W \mid X = x]}$
• The FRD design follows an instrumental variables logic, so we use the same structure as IV.
Compliers

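• A complier is a unit such that
• $\lim_{x \downarrow X_i} W_i(x) = 0$ and $\lim_{x \uparrow X_i} W_i(x) = 1$
• where $W_i(x)$ is the treatment status unit i would have if the cutoff were at x.
• Compliers are units that would get the treatment if the cutoff were at Xi or below, but that would not get the treatment if the cutoff were higher than Xi.
• Never-takers are units with $\lim_{x \downarrow X_i} W_i(x) = 0$ and $\lim_{x \uparrow X_i} W_i(x) = 0$
• Always-takers are units with $\lim_{x \downarrow X_i} W_i(x) = 1$ and $\lim_{x \uparrow X_i} W_i(x) = 1$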
Local average treatment effect (LATE) estimate

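• $\tau_{FRD} = \frac{\lim_{x \downarrow c} E[Y \mid X = x] - \lim_{x \uparrow c} E[Y \mid X = x]}{\lim_{x \downarrow c} E[W \mid X = x] - \lim_{x \uparrow c} E[W \mid X = x]}$
• $= E[Y_i(1) - Y_i(0) \mid \text{unit } i \text{ is a complier and } X_i = c]$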

• The estimand is an average effect of the treatment, but only averaged for units with Xi=c
(by RD), and only for compliers (people who are affected by the threshold).
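The ratio above can be sketched numerically by estimating the jump in the outcome and the jump in treatment take-up with the same local linear fit and dividing one by the other. The sketch assumes a hand-picked bandwidth, a deterministic take-up pattern (about 20% below the cutoff and 70% above) and a constant treatment effect of 2; all names are hypothetical and no inference is performed.

```python
def local_linear_at_cutoff(x, y, cutoff, h, side):
    """Fitted value at the cutoff from a triangular-kernel weighted linear fit on one side."""
    pts = [(xi - cutoff, yi) for xi, yi in zip(x, y)
           if abs(xi - cutoff) < h and (xi >= cutoff) == (side == "right")]
    w = [1 - abs(u) / h for u, _ in pts]
    sw = sum(w)
    mx = sum(wi * u for wi, (u, _) in zip(w, pts)) / sw
    my = sum(wi * v for wi, (_, v) in zip(w, pts)) / sw
    sxx = sum(wi * (u - mx) ** 2 for wi, (u, _) in zip(w, pts))
    sxy = sum(wi * (u - mx) * (v - my) for wi, (u, v) in zip(w, pts))
    return my - (sxy / sxx) * mx


def rd_jump(x, y, cutoff, h):
    return (local_linear_at_cutoff(x, y, cutoff, h, "right")
            - local_linear_at_cutoff(x, y, cutoff, h, "left"))


def fuzzy_rd(x, y, d, cutoff, h):
    """Jump in the outcome divided by the jump in the probability of treatment."""
    return rd_jump(x, y, cutoff, h) / rd_jump(x, d, cutoff, h)


def takes_treatment(i, s):
    """Hypothetical take-up rule: 7 in 10 units above the cutoff, 1 in 5 below."""
    return (i % 10 < 7) if s >= 0 else (i % 5 == 0)


score = [(2 * i + 1) / 200 - 1 for i in range(200)]           # grid on (-1, 1)
take_up = [1.0 if takes_treatment(i, s) else 0.0 for i, s in enumerate(score)]
outcome = [1 + 0.5 * s + 2.0 * d for s, d in zip(score, take_up)]

print(fuzzy_rd(score, outcome, take_up, cutoff=0.0, h=0.5))   # close to 2
```

Because the simulated effect is the same for every unit, the ratio recovers it; with heterogeneous effects the ratio would instead reflect the compliers at the cutoff.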
Estimation using IV logic

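• 2SLS estimation
• $Y_i^{RD} = C_i^{RD\prime} \beta_{rd2} + f(X_i^{RD}) + \tau_{rd2} \hat{W}_i^{RD} + \epsilon_i$
• where $\hat{W}_i^{RD}$ comes from the first stage:
• $W_i^{RD} = C_i^{RD\prime} \alpha_{1rd} + f(X_i^{RD}) + \alpha_{2rd} Z_i^{RD} + \mu_i$
• The reduced form:
• $Y_i^{RD} = C_i^{RD\prime} \gamma_{1rd} + f(X_i^{RD}) + \gamma_{2rd} Z_i^{RD} + \omega_i$
• Here $Z_i^{RD}$ is the instrument, that is, the indicator for the score crossing the cutoff (the treatment assigned, $T_i$ above).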
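The three equations are linked: with a single instrument, the 2SLS coefficient equals the reduced-form coefficient divided by the first-stage coefficient. The sketch below checks this on simulated data, assuming a linear f(X), no additional covariates and a small hand-written least squares solver (`ols`, a hypothetical helper). The standard errors from a manual second stage would not be valid, so only point estimates are shown.

```python
import random


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y via Gauss-Jordan elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


rng = random.Random(0)
score = [(2 * i + 1) / 200 - 1 for i in range(200)]
z = [1.0 if s >= 0 else 0.0 for s in score]                    # instrument
# Hypothetical take-up: 7 in 10 units above the cutoff, 1 in 5 below.
d = [1.0 if ((i % 10 < 7) if s >= 0 else (i % 5 == 0)) else 0.0
     for i, s in enumerate(score)]
y = [1 + 0.5 * s + 2.0 * di + rng.gauss(0, 1) for s, di in zip(score, d)]

exog = [[1.0, s, zi] for s, zi in zip(score, z)]
first = ols(exog, d)                                           # first stage
reduced = ols(exog, y)                                         # reduced form
d_hat = [first[0] + first[1] * s + first[2] * zi for s, zi in zip(score, z)]
second = ols([[1.0, s, dh] for s, dh in zip(score, d_hat)], y)  # second stage

tau_2sls = second[2]
ratio = reduced[2] / first[2]
print(tau_2sls, ratio)   # the two numbers coincide up to rounding
```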
External validity

• Both SRD and FRD provide estimates of the average effect for a subpopulation, namely
the subpopulation with covariate value equal to Xi=c.
• The FRD restricts to the compliers.
• So the design has a limited degree of external validity. But it could be of policy interest in
cases where the policy question concerns changing the location of the threshold.
• But the RD has a relatively high degree of internal validity.
Examples of papers using RD

• This quasi-experimental design was first introduced by Thistlethwaite and Campbell (1960), who evaluate the impact of a National Merit Award on obtaining additional college scholarships and on career aspirations.
• Angrist and Lavy (1999). The impact of class size on test scores using Maimonides' rule.
• Hahn et al. (1999). Impact of an anti-discrimination law that only applies to firms with at least 15 employees.
• Matsudaira (2007) studies the effect of a remedial summer school program that is mandatory for students who score less than some cutoff level on a test.
Examples

• Chay, McEwan and Urquiola (2005) evaluate the impact of the Chilean P-900 program using a cutoff in test scores at the school level.
• Card et al. (2004) evaluate the impact of Medicare using age.
• Van der Klaauw (2002) evaluates the impact of financial aid on college attendance using SAT scores.
• Black (1999). The paper estimates the value that parents place on school quality by calculating how much more people pay for houses located in areas with better schools. The paper compares houses on opposite sides of attendance district boundaries (the geographic lines that determine which school a child attends within a school district).
Examples

• Pitt and Khandker (1998). This paper estimates the impact of participation, by gender, in the Grameen Bank and two other group-based credit programs in Bangladesh on labor supply, schooling, household expenditure, and assets.
• Using an RD design, it finds that the credit programs have a larger effect when women are the program participants. After controlling for village fixed effects, the paper uses a sample that includes households in villages with treatment choice (program villages) but that are excluded from making a treatment choice by an exogenous rule: households owning more than one-half acre of land are precluded from joining any of the credit programs.
Examples

• Jacob and Lefgren (2004). The program: in 1996, the Chicago Public School System (CPS) placed 71 of its 489 elementary schools on academic probation.
• These probation schools received special funding for staff development as well as technical assistance and enhanced monitoring.
• Eligibility for probation was determined on the basis of standardized reading scores (schools in which fewer than 15 percent of students scored at or above national norms in reading were subject to probation). This creates a highly non-linear relationship between a school’s reading achievement in 1996 and the likelihood that the school was on probation in subsequent years.
Examples

• Ludwig and Miller (2007) evaluate the impact of the Head Start program using an RD design.
• The program was established in 1965 to provide preschool, health, and other social services to poor children aged three to five and their families.
• In 1965 the Office of Economic Opportunity (OEO) provided technical assistance to the 300 poorest counties to develop Head Start proposals. The result was a large and lasting discontinuity in Head Start funding rates at the OEO cutoff for grant-writing assistance.
• The paper finds favorable effects on child mortality and educational attainment.
