Understanding Regression Discontinuity Design
Juan Ponce J.
Maestría en Economía del Desarrollo
Flacso-Ecuador
Introduction to RD
The logic of RD
• Social programs often use an index to decide who is eligible to enroll in the program and
who is not.
• In these cases, program authorities typically determine a threshold or cutoff score, below
which households are deemed poor and are eligible for the program.
• Examples. Selben and RS in Ecuador. SISBEN in Colombia. CASEN in Chile.
• Other types of programs select beneficiaries using age. Only people above a certain age
are eligible to receive the program. Other examples use test scores.
RD conditions
• For programs that have a continuous eligibility index with a clearly defined cutoff score to
determine who is eligible and who is not, we can use a regression discontinuity design.
Two main conditions are needed:
• A continuous eligibility index. A continuous measure on which the population of
interest can be ranked, such as a poverty index, a test score, age, etc.
• A clearly defined cutoff score, that is, a point on the index above or below which the
population is classified as eligible for the program.
Identification
• Units just above and just below the cutoff are likely to be very similar, except that
one group receives the program and the other does not.
• The RD strategy exploits the discontinuity around the cutoff score to estimate the
counterfactual.
• As we move further away from the eligibility cutoff, eligible and ineligible units will
become more different by construction, but we have a measure of how different they are
based on the eligibility criteria and therefore we can control for those differences.
LATE
• RD estimates a local average treatment effect (LATE) around the eligibility cutoff, at the point where treatment and
comparison units are most similar.
• As we get closer to the cutoff, the units just below it (to the left) and just above it (to the right)
look increasingly similar, so the comparison is as good as an experiment.
• RD is useful to evaluate whether the program should be expanded or reduced at the
margin.
• Relatively large evaluation samples are required to obtain sufficient statistical power
when applying RD.
• Tradeoff between bias and efficiency.
Frameworks for RD analysis
• For each type of RD, we have two main frameworks for RD analysis.
• 1. Based on continuity assumptions. Comparability is conceptualized as continuity of
average potential outcomes near the cutoff.
• 2. Based on local randomization assumptions. Comparability is conceptualized as
conditions that mimic a randomized experiment in a neighborhood around the cutoff.
Continuity-based framework for RD analysis
The RD design
• In the RD design, all units in the study receive a score (also known as running variable,
forcing variable, or index ), and a treatment is assigned to those units whose score is
above a known cutoff and not assigned to those units whose score is below the cutoff.
• In the RD design, unlike in other nonexperimental studies, the assignment of the
treatment follows a rule that is known (at least to the researcher) and hence empirically
verifiable.
• We call Sharp RD design any RD design where the treatment condition assigned is
identical to the treatment condition actually received for all units. Any RD design where
compliance with treatment assignment is imperfect is referred to as Fuzzy RD design.
Sharp design
• To formalize, assume there are n units, indexed by i = 1, 2, …, n. Each unit has a
score or running variable Xi, and c is a known cutoff. Units with Xi ≥ c are assigned to the
treatment condition, and units with Xi < c are assigned to the control condition.
• This treatment assignment, denoted Ti, is defined as Ti = 𝟙(Xi ≥ c), where 𝟙(·) is the
indicator function. It implies that the probability of treatment assignment as a
function of the score changes discontinuously at the cutoff.
• A defining feature of all RD designs is that the conditional probability of actually receiving
treatment given the score changes discontinuously at the cutoff.
• In a Sharp RD design, this probability changes exactly from zero to one at the cutoff, since
treatment assigned and treatment received are identical.
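• The assignment rule can be illustrated with a minimal Python sketch. The scores, the cutoff of 50, and the function names are hypothetical and chosen only for illustration; the point is that the share of treated units jumps from zero to one at the cutoff.

```python
def assign_treatment(scores, cutoff):
    """Sharp RD assignment rule: T_i = 1 if X_i >= c, else 0."""
    return [1 if x >= cutoff else 0 for x in scores]


def treatment_share(scores, treatments, lower, upper):
    """Share of treated units among those with lower <= score < upper."""
    in_range = [t for x, t in zip(scores, treatments) if lower <= x < upper]
    return sum(in_range) / len(in_range)


# Hypothetical scores on both sides of a cutoff at c = 50.
scores = [42.0, 47.5, 49.9, 50.0, 50.1, 53.0, 61.2]
treatments = assign_treatment(scores, cutoff=50.0)

print(treatments)
print(treatment_share(scores, treatments, 40.0, 50.0))  # share treated below c
print(treatment_share(scores, treatments, 50.0, 70.0))  # share treated at or above c
```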
Potential outcomes
• Assume that each unit has two potential outcomes, Yi(1) and Yi(0), corresponding, respectively, to
the outcomes that would be observed under the treatment or control conditions.
• Treatment effects are defined in terms of comparisons between features of (the distribution of)
both potential outcomes, such as their means, variances or quantiles.
• If unit i receives the treatment, we will observe Yi(1), the unit's outcome under treatment, and Yi(0)
will remain latent or unobserved.
• If i receives the control condition, we will observe Yi(0) but not Yi(1).
• This results in the so-called fundamental problem of causal inference, and implies that the
treatment effect at the individual level is fundamentally unknowable.
Observed outcomes
• In the specific context of the Sharp RD design, the fundamental problem of causal
inference occurs because we only observe the outcome under control, Yi(0), for those
units whose score is below the cutoff, and we only observe the outcome under
treatment, Yi(1), for those units whose score is above the cutoff.
• The next figure plots the average potential outcomes given the score, E[Yi(1)|Xi = x] and
E[Yi(0)|Xi = x], against the score. In statistics, conditional expectation functions such as
these are usually called regression functions.
• The regression function E[Yi(1)|Xi] is observed for values of the score to the right of the
cutoff -because when Xi ≥ c, the observed outcome Yi is equal to the potential outcome
under treatment, Yi(1), for every i. This is represented with the solid red line. However, to
the left of the cutoff, all units are untreated, and therefore E[Yi(1)|Xi] is not observed
(represented by a dashed red line).
• A similar phenomenon occurs for E[Yi(0)|Xi], which is observed for values of the score to
the left of the cutoff (solid blue line), Xi < c, but unobserved for Xi ≥ c (dashed blue line).
• Thus, the observed average outcome given the score is
• $E[Y_i \mid X_i] = \begin{cases} E[Y_i(0) \mid X_i] & \text{if } X_i < c \\ E[Y_i(1) \mid X_i] & \text{if } X_i \ge c \end{cases}$
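• The same relationship can also be written unit by unit. The following is a short sketch in the notation already introduced, where Ti = 𝟙(Xi ≥ c) is the sharp assignment indicator:

```latex
Y_i = (1 - T_i)\, Y_i(0) + T_i\, Y_i(1), \qquad T_i = \mathbb{1}(X_i \ge c)
```

Taking expectations conditional on Xi on each side of the cutoff gives back the two branches of the observed regression function.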
Lack of common support and extrapolation
• The Sharp RD design exhibits an extreme case of lack of common support, as units in the
control and treatment groups cannot have the same value of the running variable (Xi).
• RD analysis fundamentally relies on extrapolation towards the cutoff point.
• A central goal of empirical RD analysis is to adequately perform (local) extrapolation in
order to compare control and treatment units.
• The average treatment effect at a given value of the score, E[Yi(1)|Xi =x] - E[Yi(0)|Xi = x],
is the vertical distance between the two regression curves at that value.
The sharp RD impact
• This distance cannot be directly estimated because we never observe both curves for the
same value of x.
• However, a special situation occurs at the cutoff c: this is the only point at which we
“almost” observe both curves, because units with score c are treated while units with score c-𝜀, for a small and positive 𝜀, are not.
• In the figure, the vertical distance at c is
• 𝐸[𝑌𝑖(1)|𝑋𝑖 = 𝑐] − 𝐸[𝑌𝑖(0)|𝑋𝑖 = 𝑐] ≡ 𝜇+ − 𝜇− .
• This is precisely the treatment effect that can be estimated with a Sharp RD design.
The Sharp RD impact
• The assumption of comparability between units with very similar values of the score but on
opposite sides of the cutoff is the fundamental concept on which all RD designs are based.
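• If the regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x], seen as functions of x, are
continuous at x = c, then in a Sharp RD design we have
• $E[Y_i(1) - Y_i(0) \mid X_i = c] = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$   (1)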
• The result in Equation (1) says that, if the average potential outcomes are continuous
functions of the score at c, the difference between the limits of the treated and control
average observed outcomes as the score converges to the cutoff is equal to the average
treatment effect at the cutoff.
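• The reasoning behind this result can be sketched in three lines. In each of the first two lines, the first equality uses the sharp assignment rule (only treated outcomes are observed above c, only control outcomes below c) and the second uses continuity at c:

```latex
\begin{aligned}
\lim_{x \downarrow c} E[Y_i \mid X_i = x] &= \lim_{x \downarrow c} E[Y_i(1) \mid X_i = x] = E[Y_i(1) \mid X_i = c], \\
\lim_{x \uparrow c} E[Y_i \mid X_i = x] &= \lim_{x \uparrow c} E[Y_i(0) \mid X_i = x] = E[Y_i(0) \mid X_i = c], \\
\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x] &= E[Y_i(1) - Y_i(0) \mid X_i = c].
\end{aligned}
```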
Continuity
• Continuity means that as the score x gets closer and closer to the cutoff c, the average
potential outcome function E[Yi(0)|Xi = x] gets closer and closer to its value at the cutoff,
E[Yi(0)|Xi = c] (and analogously for E[Yi(1)|Xi = x]).
• Thus, continuity gives a formal justification for estimating the Sharp RD effect by focusing
on observations above and below the cutoff in a very small neighborhood around it.
• Continuity offers one justification for using observations just below the cutoff to
approximate the average outcome that units just above the cutoff would have had if they
had received the control condition instead of the treatment.
Local nature of RD effect
• The Sharp RD parameter presented can be interpreted as causal in the sense that it
captures the average difference in potential outcomes under treatment versus control.
• This average difference is calculated at a single point on the support of a continuous
random variable (the score Xi), and as a result captures a causal effect that is local in
nature.
• The RD treatment effect tends to have limited external validity, that is, the RD effect is
generally not representative of the treatment effects that would occur for units with
scores away from the cutoff.
RD Plots
• The global polynomial fit is simply a smooth approximation to the unknown regression
functions based on a fourth- or fifth-order polynomial regression fit of the outcome on the
score, fitted separately above and below the cutoff, and using the original raw data.
• The local sample means are created by first choosing disjoint (i.e., non-overlapping) intervals
or “bins” of the score, calculating the mean of the outcome for the observations falling within
each bin, and then plotting the average outcome in each bin against the midpoint of the bin.
Example
Choosing the bins
• There are two different types of bins that can be used in the construction of RD plots: bins that
have equal length, or bins that contain (roughly) the same number of observations but whose
length may differ.
• We refer to these two types as evenly-spaced (ES) and quantile-spaced (QS) bins, respectively.
• In an RD plot with ES bins, each of the local means represented by a dot may be computed using a
different number of observations and thus may be more or less precisely calculated than the other
local means in the plot, affecting comparability.
• Moreover, a quantile-spaced RD plot has the advantage of providing a quick visual representation
of the density of observations over the support of the running variable.
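• The two binning schemes can be compared with a small Python sketch. The function names and the toy data are hypothetical, and the code handles one side of the cutoff at a time; it is meant only to show why ES bins can hold very different numbers of observations while QS bins do not.

```python
def evenly_spaced_bin_means(xs, ys, lower, upper, n_bins):
    """Mean outcome per equal-length bin of the score on [lower, upper).

    Returns (bin midpoint, mean outcome, count) for each non-empty bin.
    """
    width = (upper - lower) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        if lower <= x < upper:
            j = min(int((x - lower) / width), n_bins - 1)
            sums[j] += y
            counts[j] += 1
    return [
        (lower + (j + 0.5) * width, sums[j] / counts[j], counts[j])
        for j in range(n_bins)
        if counts[j] > 0
    ]


def quantile_spaced_bin_means(xs, ys, n_bins):
    """Mean outcome per bin holding (roughly) the same number of observations.

    Returns (bin midpoint, mean outcome, count), where the midpoint is
    halfway between the smallest and largest score in the bin.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    result = []
    for j in range(n_bins):
        chunk = pairs[(j * n) // n_bins:((j + 1) * n) // n_bins]
        if chunk:
            mid = (chunk[0][0] + chunk[-1][0]) / 2
            mean_y = sum(y for _, y in chunk) / len(chunk)
            result.append((mid, mean_y, len(chunk)))
    return result


# Hypothetical control-side data (cutoff at 0), bunched near the cutoff.
xs = [-0.9, -0.5, -0.4, -0.3, -0.2, -0.1]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
es = evenly_spaced_bin_means(xs, ys, lower=-1.0, upper=0.0, n_bins=2)
qs = quantile_spaced_bin_means(xs, ys, n_bins=2)

print([count for _, _, count in es])  # observations per ES bin
print([count for _, _, count in qs])  # observations per QS bin
```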
Number of bins
• 1. Integrated mean square error (IMSE) method. Minimize an asymptotic approximation to the
integrated mean-squared error (IMSE) of the local means estimator, that is, the sum of the
expansions of the (integrated) variance and squared bias.
• If we choose a large number of bins, we have a small bias because the bins are smaller and the local
constant fit is better; but this reduction in bias comes at a cost, as increasing the number of bins
leads to fewer observations per bin and thus more variability within bin. The IMSE-optimal J- and J+
are the numbers of bins that balance squared-bias and variance so that the IMSE is (approximately)
minimized.
• However, the IMSE-optimal method often results in a very smooth plot where the local means
nearly overlap with the global polynomial fit, and may not be appropriate to capture the local
variability of the data near the cutoff.
• 2. Mimicking Variance Method (MV). The second method to select the number of bins
chooses the values of J- and J+ so that the binned means have an asymptotic (integrated)
variability that is approximately equal to the variability of the raw data. In other words,
the number of bins is chosen so that the overall variability of the binned means “mimics”
the overall variability in the raw scatter plot of the data.
• This method produces a much higher number of bins than we obtained with the IMSE
criterion for both ES and QS bins.
• It's recommended to start with MV bins to better illustrate the variability of the outcome
as a function of the score, ideally comparing ES to QS bins to highlight the distributional
features of the score. Then, if needed, the researcher can select the number of bins to be
IMSE-optimal in order to explore the global features of the regression function.
Estimation
• Two related but distinct goals: point estimation of RD treatment effect (i.e., give a scalar estimate of
the vertical distance between the regression functions at the cutoff), and statistical inference about
the RD treatment effect (i.e., construct valid statistical hypothesis tests and confidence intervals).
• Estimation typically proceeds by using (local to the cutoff) polynomial methods to approximate the
regression function E[Yi|Xi = x] on each side of the cutoff separately. In practical terms, this involves
using least-squares methods to fit a polynomial of the observed outcome on the score.
• When all the observations are used for estimation, these polynomial fits are global or parametric in
nature. In contrast, when estimation employs only observations with scores near the cutoff, the
polynomial fits are local, “flexible,” or “non-parametric.” Our upcoming discussion focuses
exclusively on local polynomial methods, which are by now the standard framework for RD
empirical analysis.
Local Polynomial approach
• Because the running variable is assumed continuous, there are no (or sometimes in
practice very few) observations whose score is c or very nearly so.
• Thus, local extrapolation in RD designs is unavoidable in general.
• In other words, in order to form estimates of the average control response at the cutoff,
E[Yi(0)|Xi =c], and of the average treatment response at the cutoff, E[Yi(1)|Xi = c], we
must rely on observations further away from the cutoff.
• In the Sharp RD design, for example, the treatment effect 𝜏𝑆𝑅𝐷 is the vertical distance
between the E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] at x = c.
• Estimation and inference proceed by first approximating these unknown regression
functions, and then computing the estimated treatment effect and/or the statistical
inference procedure of interest.
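• A classical result in calculus is that any sufficiently smooth function can be well approximated by a polynomial. Applied to the RD point
estimation problem, this principle suggests that the unknown regression functions
E[Yi(0)|Xi = x] and E[Yi(1)|Xi = x] can be approximated by a polynomial function of the score.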
• Early empirical work employed the idea of polynomial approximation globally, that is, tried to approximate these
functions using flexible higher-order polynomials, usually of fourth or fifth order, over the entire support of the
data.
• It is now widely recognized that a global polynomial approach does not deliver point estimators and inference
procedures with good properties for the RD treatment effect, the main object of interest. The reason is that
global polynomial approximations tend to deliver a good approximation overall, but a poor approximation at
boundary points.
• Modern RD empirical work employs local polynomial methods, which focus on approximating the regression
functions only near the cutoff. Because this approach localizes the polynomial fit to the cutoff (discarding
observations sufficiently far away) and employs a low-order polynomial approximation (usually linear or
quadratic), it is substantially more robust and less sensitive to boundary and overfitting problems.
Local Polynomial point estimation
• Local polynomial methods implement linear regression fits using only observations near the cutoff point,
separately for control and treatment units.
• Specifically, this approach uses only observations that are between c - h and c + h, where h > 0 is a
so-called bandwidth that determines the size of the neighborhood around the cutoff where the empirical RD
analysis is conducted. Within this bandwidth, it is common to adopt a weighting scheme to ensure that
the observations closer to c receive more weight than those further away; the weights are determined by
a kernel function K(.).
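• 1. Choose a polynomial order p and a kernel function K(·).
• 2. Choose a bandwidth h.
• 3. For observations above the cutoff (i.e., observations with Xi ≥ c), fit a weighted least
squares regression of the outcome Yi on a constant and $(X_i - c), (X_i - c)^2, \ldots, (X_i - c)^p$,
where p is the chosen polynomial order, with weight $K\left(\frac{X_i - c}{h}\right)$ for each
observation.
• The estimated intercept from this local weighted regression, $\hat{\mu}_+$, is an estimate of
$\mu_+ = E[Y_i(1) \mid X_i = c]$.
• 4. For observations below the cutoff (Xi < c), fit the same weighted regression. Its estimated
intercept, $\hat{\mu}_-$, is an estimate of $\mu_- = E[Y_i(0) \mid X_i = c]$.
• 5. The Sharp RD point estimate is the difference between the two intercepts,
$\hat{\tau}_{SRD} = \hat{\mu}_+ - \hat{\mu}_-$.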
• The kernel function K(·) assigns non-negative weights to each transformed observation
$\frac{X_i - c}{h}$, based on the distance between the observation's score Xi and the cutoff c.
• The triangular kernel, $K(u) = (1 - |u|)\,\mathbb{1}(|u| \le 1)$, gives zero weight to observations with score
outside [c-h, c+h] and a linearly decaying weight to observations inside this interval.
• The uniform kernel, $K(u) = \mathbb{1}(|u| \le 1)$, also gives zero weight to observations with score
outside [c-h, c+h], but equal weight to all observations whose scores are within this
interval. Employing a local linear estimation with bandwidth h and uniform kernel is
therefore equivalent to estimating a simple linear regression without weights using only
observations whose distance from the cutoff is at most h.
• A third weighting scheme sometimes encountered in practice is the Epanechnikov kernel,
$K(u) = (1 - u^2)\,\mathbb{1}(|u| \le 1)$, which gives a quadratically decaying weight to observations
with $X_i \in [c - h, c + h]$ and zero weight to the rest.
• In practice, estimation and inference results are typically not very sensitive to the
particular choice of kernel used.
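• The kernel weights can be compared with a short Python sketch. The function names and the example scores are hypothetical; the triangular kernel, K(u) = (1 − |u|)𝟙(|u| ≤ 1), is included here for comparison alongside the uniform and Epanechnikov kernels.

```python
def uniform_kernel(u):
    """K(u) = 1(|u| <= 1): equal weight inside the bandwidth."""
    return 1.0 if abs(u) <= 1 else 0.0


def triangular_kernel(u):
    """K(u) = (1 - |u|) 1(|u| <= 1): linearly decaying weight."""
    return max(0.0, 1.0 - abs(u))


def epanechnikov_kernel(u):
    """K(u) = (1 - u^2) 1(|u| <= 1): quadratically decaying weight."""
    return 1.0 - u * u if abs(u) <= 1 else 0.0


def kernel_weights(scores, cutoff, bandwidth, kernel):
    """Weight K((X_i - c) / h) for each score."""
    return [kernel((x - cutoff) / bandwidth) for x in scores]


# Hypothetical scores around a cutoff at c = 0 with bandwidth h = 1.
scores = [-1.5, -0.5, 0.0, 0.5, 1.0]
for kernel in (uniform_kernel, triangular_kernel, epanechnikov_kernel):
    print(kernel.__name__, kernel_weights(scores, 0.0, 1.0, kernel))
```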
Different Kernel weights
Choosing the polynomial order
• A more consequential decision is the choice of the local polynomial order, which must consider
various factors.
• First, a polynomial of order zero -a constant fit - has undesirable theoretical properties at boundary
points, which is precisely where RD estimation must occur.
• Second, for a given bandwidth, increasing the order of the polynomial generally improves the
accuracy of the approximation but also increases the variability of the treatment effect estimator.
• Third, as mentioned above, higher-order polynomials tend to produce overfitting of the data and
lead to unreliable results near boundary points.
• Combined, these factors have led researchers to prefer the local linear RD estimator, which by now
is the default point estimator in most applications.
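• A minimal sketch of the local linear RD point estimator follows, written in plain Python under stated assumptions: a triangular kernel by default, a bandwidth supplied by the user rather than chosen from the data, and at least two distinct scores on each side within the bandwidth. The function names and the noiseless toy data are hypothetical; this is an illustration, not a substitute for dedicated RD software.

```python
def triangular_kernel(u):
    """K(u) = (1 - |u|) 1(|u| <= 1)."""
    return max(0.0, 1.0 - abs(u))


def weighted_linear_intercept(xs, ys, ws):
    """Intercept of a weighted least squares fit of y on a constant and x."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx  # requires at least two distinct scores on this side
    return y_bar - slope * x_bar


def sharp_rd_local_linear(scores, outcomes, cutoff, bandwidth,
                          kernel=triangular_kernel):
    """Local linear sharp RD estimate: intercept above minus intercept below."""
    intercepts = {}
    for side in ("below", "above"):
        xs, ys, ws = [], [], []
        for x, y in zip(scores, outcomes):
            on_side = x >= cutoff if side == "above" else x < cutoff
            w = kernel((x - cutoff) / bandwidth)
            if on_side and w > 0:
                xs.append(x - cutoff)  # center the score at the cutoff
                ys.append(y)
                ws.append(w)
        intercepts[side] = weighted_linear_intercept(xs, ys, ws)
    return intercepts["above"] - intercepts["below"]


# Noiseless toy data: y = 1 + 2x below the cutoff, with a jump of 3 at c = 0.
scores = [-0.8, -0.6, -0.4, -0.2, 0.2, 0.4, 0.6, 0.8]
outcomes = [1 + 2 * x + (3 if x >= 0 else 0) for x in scores]
estimate = sharp_rd_local_linear(scores, outcomes, cutoff=0.0, bandwidth=1.0)
print(round(estimate, 6))
```

Because the toy data are exactly linear on each side, the estimate recovers the jump built into the data; with real data the result depends on the bandwidth and kernel chosen.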
Bandwidth selection
• The bandwidth h controls the width of the neighborhood around the cutoff that is used to fit the
local polynomial that approximates the unknown regression functions.
• h directly affects the properties of local polynomial estimation and inference procedures, and
empirical findings are often sensitive to its particular value.
• Choosing a smaller h will reduce the misspecification error (also known as “smoothing bias”) of the
local polynomial approximation, but will simultaneously tend to increase the variance of the
estimated coefficients because fewer observations will be available for estimation. On the other
hand, a larger h will result in more smoothing bias if the unknown function differs considerably
from the polynomial model used for approximation, but will reduce the variance because the
number of observations in the interval [c – h, c + h] will be larger.
• For this reason, the choice of bandwidth is said to involve a “bias-variance trade-off.”
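• The general form of the approximate (conditional) MSE for the RD treatment effect is
• $\mathrm{MSE}(\hat{\tau}_{SRD}) = \mathrm{Bias}^2(\hat{\tau}_{SRD}) + \mathrm{Var}(\hat{\tau}_{SRD}) = \mathcal{B}^2 + \mathcal{V}$
• where the approximate (conditional) squared bias and variance of the estimator are
• $\mathcal{B}^2 = h^{2(p+1)} B^2$ and $\mathcal{V} = \frac{1}{nh} V$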
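• To see how the bias-variance trade-off pins down a bandwidth, the approximate MSE can be minimized over h. The sketch below assumes the squared bias is h^{2(p+1)}B² and the variance is V/(nh), with B and V treated as constants that do not depend on h:

```latex
\begin{aligned}
\mathrm{MSE}(h) &\approx h^{2(p+1)} B^2 + \frac{V}{nh}, \\
\frac{d\,\mathrm{MSE}}{dh} &= 2(p+1)\, h^{2p+1} B^2 - \frac{V}{n h^2} = 0, \\
h_{MSE} &= \left( \frac{V}{2(p+1)\, B^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}.
\end{aligned}
```

For the local linear case (p = 1), this bandwidth shrinks at rate n^{-1/5} as the sample grows.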
• The local polynomial RD point estimator $\hat{\tau}_{SRD}$ has an approximate large-sample distribution
• $\frac{\hat{\tau}_{SRD} - \tau_{SRD} - \mathcal{B}}{\sqrt{\mathcal{V}}} \sim \mathcal{N}(0,1)$
• Where ℬ and 𝒱 are, respectively, the asymptotic bias and variance of the RD local polynomial
estimator of order p, discussed previously in the context of MSE expansions and bandwidth
selection.
• This distributional result is similar to those encountered, for example, in standard linear regression
problems -with the important distinction that now the bias term ℬ features explicitly; this term
highlights the trade-off between bandwidth choice and misspecification bias locally to the cutoff.
• The variance term 𝒱 can be calculated as in (weighted) least-squares problems, for instance
accounting for heteroskedasticity and/or clustered data.
Using the MSE-Optimal Bandwidth for Inference
• Given the distributional approximation for the RD local polynomial estimator, an asymptotic
95% confidence interval for 𝜏𝑆𝑅𝐷 is approximately given by
• $CI = \left[ (\hat{\tau}_{SRD} - \mathcal{B}) \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$
• This confidence interval depends on the unknown bias or misspecification error ℬ, and any
practical procedure that ignores it will lead to incorrect inferences unless this term is
negligible.
• The bias term arises because the local polynomial approach is a non-parametric
approximation: instead of assuming that the underlying regression functions are pth order
polynomials (as would occur in OLS estimation), this approach uses the polynomial to
approximate the unknown regression functions.
Conventional Inference
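• The conventional (naive) approach to statistical inference treats the local polynomial approach as
parametric within the neighborhood around the cutoff and de facto ignores the bias term,
a procedure that leads to invalid inferences in all cases except when the approximation
error is so small that it can be ignored.
• When the bias term is zero, the approximate distribution of the RD estimator is
• $\frac{\hat{\tau}_{SRD} - \tau_{SRD}}{\sqrt{\mathcal{V}}} \sim \mathcal{N}(0,1)$
• and the conventional CI is
• $CI_{us} = \left[ \hat{\tau}_{SRD} \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$
• When the bias is instead estimated and subtracted from the point estimator, the bias-corrected CI is
• $CI_{bc} = \left[ (\hat{\tau}_{SRD} - \hat{\mathcal{B}}) \pm 1.96 \cdot \sqrt{\mathcal{V}} \right]$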
Standard Bias Correction
• The bias term depends on the “curvature” of the unknown regression functions captured via their derivative of
order p+1 at the cutoff. These unknown derivatives can be estimated with a local polynomial of order q = p + 1
or higher, which requires another choice of bandwidth, denoted b.
• The ratio ρ = h/b is important, as it relates to the variability of the bias correction estimate relative to the RD
point estimator. Standard bias correction methods require ρ = h/b → 0, that is, a small ρ. In particular, note this
rules out ρ = h/b = 1, that is, standard bias correction does not allow h = b.
• The bias-corrected confidence intervals CIbc allow for a wider range of bandwidths h and, in particular, result in
valid inferences when the MSE-optimal bandwidth is used. However, they typically have poor performance in
applications because the variability introduced in the bias estimation step is not incorporated in the variance
term used.
Robust Bias Correction
• The robust bias correction approach delivers valid inferences even when the MSE-optimal
bandwidth for point estimation is used (no undersmoothing is necessary) and remains valid even
when ρ = h/b = 1 (h = b), which implies that exactly the same data can be used for both point
estimation and inference.
• Robust bias-corrected confidence intervals are based on the bias correction procedure described
above, by which the estimated bias term $\hat{\mathcal{B}}$ is removed from the RD point estimator.
• However, in contrast to CIbc, the derivation allows the estimated bias term to converge in
distribution to a random variable and thus contribute to the distributional approximation of the RD
point estimator.
• This results in a new asymptotic variance 𝒱bc that, unlike the variance 𝒱 used in CIus and CIbc,
incorporates the contribution of the bias correction step to the variability of the bias-corrected
point estimator.
• Because the new variance 𝒱bc incorporates the extra variability introduced in the bias estimation step, it is
larger than the conventional OLS variance 𝒱 when the same bandwidth is used.
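• The robust bias-corrected confidence interval is
• $CI_{rbc} = \left[ (\hat{\tau}_{SRD} - \hat{\mathcal{B}}) \pm 1.96 \cdot \sqrt{\mathcal{V}_{bc}} \right]$
• which is constructed by subtracting the bias estimate from the local polynomial estimator and using the new
variance formula for Studentization.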
• This robust confidence interval results in valid inferences when the MSE-optimal bandwidth is used, because it
has smaller coverage errors and is therefore less sensitive to tuning parameter choices.
• In practice, the confidence interval can be implemented by setting ρ = h/b = 1 (h = b) and choosing h = hMSE, or
by selecting both h and b to be MSE-optimal for the corresponding estimators, in which case ρ is set to
hMSE/bMSE or their respective data-driven implementations.
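• The three confidence intervals discussed so far differ only in where they are centered and which variance they use. The Python sketch below makes that explicit; it takes the point estimate, the bias estimate, and the two variances as given inputs (all numbers are hypothetical) and does not show how those quantities are estimated.

```python
import math

Z_95 = 1.96  # standard normal critical value for a 95% interval


def conventional_ci(tau_hat, variance):
    """Centered at the point estimate; ignores the bias term."""
    half = Z_95 * math.sqrt(variance)
    return (tau_hat - half, tau_hat + half)


def bias_corrected_ci(tau_hat, bias_hat, variance):
    """Recentered at tau_hat - bias_hat, but still uses the conventional variance."""
    center = tau_hat - bias_hat
    half = Z_95 * math.sqrt(variance)
    return (center - half, center + half)


def robust_bias_corrected_ci(tau_hat, bias_hat, variance_bc):
    """Recentered and Studentized with the variance that accounts for bias estimation."""
    center = tau_hat - bias_hat
    half = Z_95 * math.sqrt(variance_bc)
    return (center - half, center + half)


# Hypothetical inputs: the robust variance is larger than the conventional one.
tau_hat, bias_hat, variance, variance_bc = 2.0, 0.3, 0.25, 0.36
ci_us = conventional_ci(tau_hat, variance)
ci_bc = bias_corrected_ci(tau_hat, bias_hat, variance)
ci_rbc = robust_bias_corrected_ci(tau_hat, bias_hat, variance_bc)
print(ci_us, ci_bc, ci_rbc)
```

With these inputs the robust interval has the same center as the bias-corrected one but is wider, reflecting the extra variability from estimating the bias.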
Validation and falsification of the RD design
• One of the most important RD falsification tests involves examining whether, near the
cutoff, treated units are similar to control units in terms of observable characteristics.
• Thus, except for their treatment status, units just above and just below the cutoff should
be similar in all variables that could not have been affected by the treatment.
• These variables can be divided into two groups. The first group consists of variables that are determined before the
treatment is assigned, which we call predetermined covariates.
• The second group consists of variables that are determined after the treatment is assigned but that, according to
substantive knowledge about the treatment's causal mechanism, could not possibly have
been affected by the treatment, which we call placebo outcomes.
Predetermined covariates and placebo outcomes
• All predetermined covariates and placebo outcomes should be analyzed in the same way as
the outcome of interest: for each predetermined covariate or placebo outcome, researchers
should first choose an optimal bandwidth, and then use local polynomial techniques within
that bandwidth to estimate the “treatment effect” and employ valid inference procedures
such as the robust bias-corrected methods discussed previously.
• The fundamental idea behind this test is that, since the predetermined covariate (or placebo
outcome) could not have been affected by the treatment, the null hypothesis of no treatment
effect should not be rejected if the RD design is valid.
• The reasoning is that if covariates or placebo outcomes that are known to correlate strongly
with the outcome of interest are discontinuous at the cutoff, the continuity of the potential
outcome functions is unlikely to hold, and thus the validity of the design is called into question.
Density of the running variable
• The second type of falsification test examines whether, in a local neighborhood near the
cutoff, the number of observations below the cutoff is surprisingly different from the
number of observations above it.
• The underlying assumption is that, if units do not have the ability to precisely manipulate
the value of the score that they receive, the number of treated observations just above
the cutoff should be approximately similar to the number of control observations below
it.
• 1. Histogram
• 2. Density test. McCrary test.
McCrary
• McCrary test: -0.0035 (0.0027)
• H0 is not rejected: no evidence of manipulation.
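• As a complement to the histogram, the idea of comparing counts on each side of the cutoff can be sketched with an exact binomial test. This is not the McCrary density test; it is a simpler check, under the assumption that, absent manipulation, an observation inside a small window falls above the cutoff with probability p = 0.5. The function name, window, and data are hypothetical.

```python
from math import comb


def binomial_density_check(scores, cutoff, window, p=0.5):
    """Counts below/above the cutoff within the window and a two-sided
    exact binomial p-value for H0: P(above) = p."""
    n_below = sum(1 for x in scores if cutoff - window <= x < cutoff)
    n_above = sum(1 for x in scores if cutoff <= x <= cutoff + window)
    n = n_below + n_above
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[n_above]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return n_below, n_above, min(1.0, p_value)


# Hypothetical scores: 2 just below and 8 just above a cutoff at 0.
scores = [-2.0, -0.9, -0.4, 0.0, 0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.5]
n_below, n_above, p_value = binomial_density_check(scores, cutoff=0.0, window=1.0)
print(n_below, n_above, round(p_value, 4))
```

A small p-value would suggest that the imbalance in counts is larger than chance alone would produce, which is the pattern one would expect under sorting around the cutoff.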
Sensitivity to observations near the cutoff
• Another falsification approach seeks to investigate how sensitive the results are to the response of
units who are located very close to the cutoff.
• If systematic manipulation of score values has occurred, it is natural to assume that the units closest
to the cutoff are those most likely to have engaged in manipulation.
• The idea behind this approach is to exclude such units and then repeat the estimation and
inference analysis using the remaining sample.
• This idea is sometimes referred to as a “donut hole” approach.
• Even when manipulation of the score is not suspected, this strategy is also useful to assess the
sensitivity of the results to the unavoidable extrapolation involved in local polynomial estimation, as
the few observations closest to the cutoff are likely to be the most influential when fitting the local
polynomials.
Sensitivity to observations near the cutoff
• Note that, although the total number of observations will always decrease
when observations closest to the cutoff are excluded, the effective number
of observations used in the analysis may increase or decrease, depending
on how the bandwidth changes.
• In practice, it is natural to repeat this exercise a few times to assess the
actual sensitivity for different amounts of excluded units.
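• The exclusion step of the donut hole approach is simple to write down. The Python sketch below (hypothetical function name and data) drops observations within a chosen radius of the cutoff; the remaining sample would then be passed to whatever estimation and inference routine is used for the main analysis, and the exercise repeated for a few radii.

```python
def donut_sample(scores, outcomes, cutoff, radius):
    """Drop observations with |X_i - c| < radius; return the remaining scores and outcomes."""
    kept = [(x, y) for x, y in zip(scores, outcomes) if abs(x - cutoff) >= radius]
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical data around a cutoff at c = 0.
scores = [-0.5, -0.05, 0.02, 0.3, 0.15, -0.2]
outcomes = [1.0, 1.4, 2.6, 3.0, 2.8, 1.2]

for radius in (0.0, 0.1, 0.2):
    xs, ys = donut_sample(scores, outcomes, cutoff=0.0, radius=radius)
    print(radius, len(xs))  # number of observations left after exclusion
```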
Local randomization framework for RD analysis
• Under the local randomization framework for RD designs, the score values are assumed to be as-if
randomly assigned in a small window around the cutoff, so that placement above or below the
cutoff and hence treatment assignment can be interpreted to be as-if experimental.
• This contrasts with the continuity based approach, where extrapolation to the cutoff plays a
predominant role. Once the local randomization assumption is invoked, the analysis can proceed by
using tools from the analysis of experiments.
• This alternative framework can be used as a complement and robustness check to the continuity
based analysis when the running variable is continuous (under appropriate assumptions), and is a
natural framework for analysis when the running variable is discrete and has few mass points.
Introduction
• The local randomization approach to RD analysis formalizes the idea that the RD design
behaves like a randomized experiment near the cutoff by imposing explicit
randomization-type assumptions that are stronger than the continuity-based conditions.
• When the running variable is continuous, the local randomization approach typically
requires stronger assumptions than the continuity-based approach; in these cases, it is
natural to use the continuity-based approach for the main RD analysis, and to use the
local randomization approach as a robustness check.
• But in settings where the running variable is discrete or other departures from the
canonical RD framework occur, the local randomization approach no longer imposes the
strongest assumptions and can be a natural and useful method for estimation and
inference in practice.
• In a continuity-based approach, the RD treatment effect in Figure 1(b) can be estimated by calculating the
limit of the conditional average of the observed outcomes given the score as the score approaches
the cutoff for the treatment and control groups separately,
$\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$.
• The estimation of these limits requires that the researcher approximate the regression functions,
and this approximation will typically contain an error that may directly affect estimation and
inference.
• This is in stark contrast to the experiment depicted in Figure 1(a), where estimation does not
require functional form assumptions: by construction, the regression functions are constant in the
entire region where the score is randomly assigned.
• This shows that RD designs are not canonical randomized experiments but rather natural
experiments, and thus belong to the toolkit of observational studies methods.
Local randomization approach
• A point often overlooked is that the known functional form of the regression functions in
a true experiment does not follow from the random assignment of the score per se, but
rather from the lack of relationship between the score and the potential outcomes that is
assumed to be a consequence of the randomization. If the value of the score were
randomly assigned but had a direct effect on the average outcomes, the regression
functions in Figure 1(a) would not necessarily be flat.
• A local randomization approach to RD analysis must thus be based not only on the
assumption that placement above or below the cutoff is randomly assigned within a
window of the cutoff, but also on the assumption that the value of the score within this
window is unrelated to the potential outcomes -a condition that is not guaranteed by the
random assignment of the score Xi (nor by the random assignment of the treatment Ti).
• LR1 requires that $P_W[X_W \le x] = F(x)$ for some known joint c.d.f. F(x).
For example, this condition holds when all units have the same probability
of receiving all possible score values in W, and therefore equal probability
of being assigned to control (Xi < c) or treatment (Xi ≥ c) when the window
W is symmetric around the cutoff c.
• The second condition, LR2, is the exclusion restriction ensuring that the
potential outcomes are not a function of the score for those units with
score inside W, as would be expected in a true double-blind randomized
experiment.
Local randomization approach
• Under LR1 and LR2, for all units with 𝑋𝑖 ∈ 𝑊 = [𝑐 − 𝑤; 𝑐 + 𝑤], placement
above or below the cutoff is unrelated to the potential outcomes, and the
potential outcomes are unrelated to the running variable; therefore, the
regression functions are flat inside W.
• This is illustrated in Figure 2, where, for the case of random potential outcomes,
$\mu_1(x) = E[Y_i(1) \mid X_i = x]$ and $\mu_0(x) = E[Y_i(0) \mid X_i = x]$ are constant for all values
of x in W.
• The contrast between Figures 1(a), 1(b), and 2 illustrates the differences between
an actual experiment where the score is a randomly generated number, a
continuity-based RD design, and a local randomization RD design.
Figure 2.
Differences: experiment, continuity-based RD, and local randomization RD
• In the actual experiment, the potential outcomes are unrelated to the score for all possible score
values. In this case, the functional forms of E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are known.
• In the continuity-based RD design, the potential outcomes can be related to the score everywhere;
the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are unknown but assumed to be smooth, and
estimation and inference are based on approximating them near the cutoff.
• Finally, in the local randomization RD design, the potential outcomes can be related to the running
variable far from the cutoff, but there is a window around the cutoff where this relationship ceases.
In this case, the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are unknown over the entire support of
the running variable, but inside the window W they are assumed to be constant functions of x, and
are therefore known.
• A key advantage of the local randomization approach is that it enables finite sample inference
methods, which remain valid and can be used even when only a handful of observations very close
to the cutoff are included in the analysis.
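A minimal sketch of finite sample (Fisherian) inference inside the window, assuming complete randomization (the number of treated units is held fixed) and the sharp null hypothesis of no effect for any unit. The ten observations are hypothetical.

```python
import random

def diff_in_means(y, t):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for yi, ti in zip(y, t) if ti]
    control = [yi for yi, ti in zip(y, t) if not ti]
    return sum(treated) / len(treated) - sum(control) / len(control)

def fisher_pvalue(y, t, draws=2000, seed=0):
    """Randomization p-value for the sharp null of no treatment effect.
    Treatment labels are reshuffled; outcomes stay fixed under the null."""
    rng = random.Random(seed)
    observed = abs(diff_in_means(y, t))
    labels = list(t)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(labels)  # complete randomization: number treated is unchanged
        # small tolerance guards against floating-point ties
        if abs(diff_in_means(y, labels)) >= observed - 1e-12:
            extreme += 1
    return extreme / draws

# Hypothetical window with only ten observations very close to the cutoff.
y_w = [5.1, 4.8, 5.6, 5.3, 5.9, 3.2, 2.9, 3.8, 3.1, 3.5]
t_w = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value = fisher_pvalue(y_w, t_w)
print(p_value)
```

No large sample approximation is used: the reference distribution comes entirely from the assumed assignment mechanism, which is why the method remains usable with a handful of observations.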
Local Randomization. Estimation and Inference
Estimation
• The definition of 𝜃𝑆𝑅𝐷 is designed to cover both random and non-random potential
outcomes under different sampling schemes.
Estimation
• In a Neyman framework, it reduces to $\theta_{SRD} = \frac{1}{N_W} \sum_{i: X_i \in W} [Y_i(1) - Y_i(0)]$ because the
potential outcomes are fixed and the (conditional) expectation integrates to one.
• In the super-population framework under i.i.d. sampling, we have
$\theta_{SRD} = E[Y_i(1) - Y_i(0) \mid X_i \in W]$.
• The parameter 𝜃𝑆𝑅𝐷 is different from the more conventional continuity-based RD
parameter 𝜏𝑆𝑅𝐷 .
• While 𝜃𝑆𝑅𝐷 is an average effect inside an interval (the window W), 𝜏𝑆𝑅𝐷 is an average at a
single point (the cutoff c) where the number of observations is zero whenever the score is
continuously distributed.
Estimation
• Naturally, the smaller the window W is, the more conceptually similar 𝜃𝑆𝑅𝐷
and 𝜏𝑆𝑅𝐷 become.
• Under the local randomization assumptions invoked within W, we have
$$\theta_{SRD} = \frac{1}{N_W} \sum_{i: X_i \in W} E_W\left[\frac{T_i Y_i}{P_W[T_i = 1]}\right] - \frac{1}{N_W} \sum_{i: X_i \in W} E_W\left[\frac{(1 - T_i) Y_i}{1 - P_W[T_i = 1]}\right]$$
regardless of whether the potential outcomes are fixed or random.
Estimation
• The standard difference-in-means estimator is a particular case of $\hat{\theta}_{SRD}$ with $\omega_i = 1$ for
all units.
Inference
• For inference, both the Neyman and the super-population approaches rely on a Gaussian
approximation justified by appropriate central limit theorems.
• where 𝑉 denotes an appropriate choice of variance estimator, which can depend on the
specific framework considered.
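As an illustration of the Gaussian approximation, the sketch below computes the difference in means inside the window together with a normal confidence interval. The variance estimator used here (sum of the two within-group sample variances, each divided by its group size) is one common choice and is an assumption of this sketch, not necessarily the V intended on the slide. The data are hypothetical.

```python
import math
import statistics

def neyman_ci(y, t, z=1.96):
    """Difference in means, its standard error, and a normal-approximation CI."""
    treated = [yi for yi, ti in zip(y, t) if ti]
    control = [yi for yi, ti in zip(y, t) if not ti]
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    # Assumed variance estimator: s1^2 / n1 + s0^2 / n0
    variance = (statistics.variance(treated) / len(treated)
                + statistics.variance(control) / len(control))
    se = math.sqrt(variance)
    return estimate, se, (estimate - z * se, estimate + z * se)

# Hypothetical outcomes for units with score inside the window W.
y_w = [5.1, 4.8, 5.6, 5.3, 5.9, 3.2, 2.9, 3.8, 3.1, 3.5]
t_w = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
estimate, se, (ci_low, ci_high) = neyman_ci(y_w, t_w)
print(round(estimate, 2), round(se, 3), (round(ci_low, 2), round(ci_high, 2)))
```

With so few observations the normal approximation is doubtful; the example is only meant to show the mechanics.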
Inference
• Specifically, the requirement is that Zi be associated with the score in windows larger
than W0, possibly due to correlation between the score and another characteristic that
also affects Zi, but independent of the score in W0 and all smaller windows. Moreover,
because Zi is a predetermined covariate, the effect of the treatment on Zi is zero by
construction.
• We define a generic null hypothesis H0 stating that the treatment is unrelated to Zi (or
that Zi is "balanced" between the groups).
• This hypothesis could be the Fisherian hypothesis $H_0^F$ or the large sample hypothesis H0.
• The procedure starts with the smallest possible window (W1 in Figure 1) and tests H0.
How to choose the window
• 3. Test statistic. Typical choices of the statistic used to test H0 include the
difference-in-means, the Kolmogorov-Smirnov statistic, and the Wilcoxon
rank-sum statistic.
• 4. Randomization mechanism. If Fisherian inference is used, typical choices
of randomization mechanisms are complete randomization and Bernoulli
assignment. Large sample methods do not require the full specification of
the assignment mechanism, but do require general assumptions about the
type of assignment such as no stratification.
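The window selection steps can be sketched as follows. The sketch assumes nested symmetric windows, the difference in means of a predetermined covariate Z as the test statistic, and complete randomization as the assignment mechanism. The stopping rule (enlarge the window until the first rejection and keep the last window that passed), the threshold `alpha = 0.15`, and the simulated data are assumptions of this illustration.

```python
import random

def diff_in_means(v, t):
    """Mean of v among treated units minus mean of v among control units."""
    treated = [vi for vi, ti in zip(v, t) if ti]
    control = [vi for vi, ti in zip(v, t) if not ti]
    return sum(treated) / len(treated) - sum(control) / len(control)

def balance_pvalue(v, t, rng, draws=500):
    """Permutation p-value for H0: covariate v is balanced between groups."""
    observed = abs(diff_in_means(v, t))
    labels = list(t)
    hits = 0
    for _ in range(draws):
        rng.shuffle(labels)  # complete randomization within the window
        if abs(diff_in_means(v, labels)) >= observed - 1e-12:
            hits += 1
    return hits / draws

def select_window(x, z, c, half_widths, alpha=0.15, seed=0):
    """Largest half-width such that H0 is not rejected in it or any smaller one."""
    rng = random.Random(seed)
    chosen = None
    for w in sorted(half_widths):          # start from the smallest window
        inside = [(xi, zi) for xi, zi in zip(x, z) if abs(xi - c) <= w]
        t = [xi >= c for xi, _ in inside]
        if not any(t) or all(t):
            continue                       # need units on both sides of the cutoff
        v = [zi for _, zi in inside]
        if balance_pvalue(v, t, rng) < alpha:
            break                          # first rejection: stop enlarging
        chosen = w
    return chosen

# Simulated data (hypothetical): Z is unrelated to the score only when |X| <= 0.2.
rng = random.Random(42)
x = [rng.uniform(-1, 1) for _ in range(2000)]

def covariate(xi):
    drift = 5.0 * (abs(xi) - 0.2) if abs(xi) > 0.2 else 0.0
    return rng.gauss(0, 1) + (drift if xi > 0 else -drift)

z = [covariate(xi) for xi in x]
w_star = select_window(x, z, c=0.0, half_widths=[0.05, 0.1, 0.2, 0.4, 0.8])
print(w_star)
```

In applied work several covariates would typically be tested in each window; a single covariate is used here to keep the sketch short.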
Choosing the window
• The continuity-based approach is more convenient when there are enough observations
near the cutoff to approximate the regression functions with reasonable accuracy, but
possibly inadequate when the number of observations is small.
• In applications with few observations, the local randomization approach has the
advantage of requiring minimal extrapolation and avoiding the use of smoothing
methods.
• Another situation in which a local randomization approach may be preferable to a
continuity-based approach is when the running variable is discrete, i.e., when multiple
units share the same value of the score. When the score is discrete, the continuity-based
approach is not directly applicable, and the local randomization approach is often a natural and
useful alternative.
The Fuzzy RD design
Fuzzy design
• Some units fail to comply with the treatment condition that is assigned to
them.
• The phenomenon of units receiving a treatment condition different from
the condition that is originally assigned to them is generally known as
imperfect compliance or non-compliance.
• The RD design with imperfect compliance is usually referred to as the
Fuzzy RD design, to distinguish it from the Sharp RD design where
compliance is perfect.
Fuzzy design
• The Fuzzy RD treatment assignment rule is still 𝑇𝑖 = 1(𝑋𝑖 ≥ 𝑐) but compliance with this
assignment is imperfect.
• As a consequence, although the probability of receiving treatment still jumps abruptly at
the cutoff, it no longer changes from 0 to 1 as in the Sharp RD case.
• We use Di to denote whether the treatment was actually received by unit i. Our notation now
distinguishes between the treatment assigned, Ti, and the treatment received, Di.
• We can thus say that the key characteristic of the Fuzzy RD design is that there are some
units for which 𝑇𝑖 ≠ 𝐷𝑖 .
Figure 1. Sharp and fuzzy design
Estimation and inference
Estimates using OLS
[Figure: plot with vertical axis "prob treatment", scaled from 0 to .8]
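• The estimand is the ratio of the jump in the regression of the outcome on the covariate to
the jump in the regression of the treatment indicator on the covariate.
• Formally, with W here denoting the treatment indicator (the treatment received, Di above), not the window W of the local randomization approach:
$$\tau_{FRD} = \frac{\lim_{x \downarrow c} E[Y \mid X = x] - \lim_{x \uparrow c} E[Y \mid X = x]}{\lim_{x \downarrow c} E[W \mid X = x] - \lim_{x \uparrow c} E[W \mid X = x]}$$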
• The estimand is an average effect of the treatment, but only averaged for units with Xi=c
(by RD), and only for compliers (people who are affected by the threshold).
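A minimal sketch of the ratio estimand, assuming each of the four limits is approximated by a local linear fit within a bandwidth h. The simulated compliance rates (0.2 below the cutoff, 0.7 above), the constant effect of 2, and the function names are illustrative assumptions; `d` plays the role of the treatment received.

```python
import random

def intercept_at_zero(x, v):
    """Fitted value at x = 0 from a simple least-squares line of v on x."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return mv - slope * mx

def jump_at_cutoff(x, v, c, h):
    """Local linear estimate of the jump in E[v | X = x] at the cutoff c."""
    right = [(xi - c, vi) for xi, vi in zip(x, v) if c <= xi <= c + h]
    left = [(xi - c, vi) for xi, vi in zip(x, v) if c - h <= xi < c]
    return intercept_at_zero(*zip(*right)) - intercept_at_zero(*zip(*left))

def fuzzy_rd(x, y, d, c, h):
    """Jump in the outcome divided by the jump in the treatment probability."""
    return jump_at_cutoff(x, y, c, h) / jump_at_cutoff(x, d, c, h)

# Simulated data (hypothetical): imperfect compliance, constant effect of 2.
rng = random.Random(7)
x = [rng.uniform(-1, 1) for _ in range(20000)]
d = [1.0 if rng.random() < (0.7 if xi >= 0 else 0.2) else 0.0 for xi in x]
y = [1.0 + 0.5 * xi + 2.0 * di + rng.gauss(0, 0.5) for xi, di in zip(x, d)]
first_stage = jump_at_cutoff(x, d, 0.0, 0.5)
tau_frd = fuzzy_rd(x, y, d, 0.0, 0.5)
print(round(first_stage, 2), round(tau_frd, 2))
```

The jump in the outcome alone would be about half of the true effect here, because only about half of the units change treatment status at the cutoff; dividing by the first-stage jump rescales it.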
Estimation using IV logic
• 2SLS estimation
• $Y_{iRD} = C_{iRD}' \beta_{rd2} + f(X_{iRD}) + \tau_{rd2} W_{iRD} + \epsilon_i$
• Both SRD and FRD provide estimates of the average effect for a subpopulation, namely
the subpopulation with covariate value Xi = c.
• The FRD restricts to the compliers.
• So the design has a limited degree of external validity. But it could be of policy interest in
cases where the policy question concerns changing the location of the threshold.
• But the RD has a relatively high degree of internal validity.
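The 2SLS logic can be sketched under simplifying assumptions: f(X) is taken to be linear with a common slope on both sides of the cutoff, the additional covariates C are omitted, and the instrument is the assignment indicator 1(X ≥ c). With a single instrument, the 2SLS coefficient equals the reduced-form coefficient divided by the first-stage coefficient, which is what the code computes. The data and names are hypothetical; `w` is the treatment received.

```python
import random

def residualize(v, x):
    """Residuals of v after a simple regression on x (with intercept)."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return [vi - mv - slope * (xi - mx) for xi, vi in zip(x, v)]

def fuzzy_rd_2sls(x, y, w, c, h):
    """Just-identified 2SLS of y on w, instrumenting w with 1(x >= c) and
    controlling for a linear f(x), using observations with |x - c| <= h."""
    keep = [i for i, xi in enumerate(x) if abs(xi - c) <= h]
    xc = [x[i] - c for i in keep]
    t = [1.0 if x[i] >= c else 0.0 for i in keep]
    # Partial the control (xc and the intercept) out of the instrument.
    t_res = residualize(t, xc)
    denom = sum(tr * tr for tr in t_res)
    # Coefficient on the instrument in the first stage (w on t and xc).
    first_stage = sum(tr * w[i] for tr, i in zip(t_res, keep)) / denom
    # Coefficient on the instrument in the reduced form (y on t and xc).
    reduced_form = sum(tr * y[i] for tr, i in zip(t_res, keep)) / denom
    return reduced_form / first_stage, first_stage

# Simulated data (hypothetical): compliance jumps from 0.2 to 0.7, true effect 2.
rng = random.Random(11)
x = [rng.uniform(-1, 1) for _ in range(20000)]
w = [1.0 if rng.random() < (0.7 if xi >= 0 else 0.2) else 0.0 for xi in x]
y = [1.0 + 0.5 * xi + 2.0 * wi + rng.gauss(0, 0.5) for xi, wi in zip(x, w)]
tau_2sls, fs = fuzzy_rd_2sls(x, y, w, c=0.0, h=0.5)
print(round(tau_2sls, 2), round(fs, 2))
```

Standard errors are not computed here; they would need to account for the first stage, as 2SLS routines in standard statistical packages do.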
Examples of papers using RD
• Chay, McEwan and Urquiola (2005) evaluate the impact of Chile's P-900 program using
a cutoff in school-level test scores.
• Card et al. (2004) evaluate the impact of Medicare using age.
• Van der Klaauw (2002) evaluates the impact of financial aid on college attendance using
SAT scores.
• Black (1999). The paper estimates the value that parents place on school quality by
calculating how much more people pay for houses located in areas with better schools.
The paper compares houses on opposite sides of attendance district boundaries (the
geographic lines that determine which school a child attends within a school district).
Examples
• Mark Pitt and Shahidur Khandker (1998). This paper estimates the impact of participation,
by gender, in the Grameen Bank and two other group-based credit programs in
Bangladesh on labor supply, schooling, household expenditure, and assets.
• Using an RD design, it finds that the credit programs have a larger effect when women are the
program participants. After controlling for village fixed effects, the paper uses a sample
that includes households in villages with treatment choice (program villages) but that are
excluded from making a treatment choice by an exogenous rule: households owning
more than one-half acre of land are precluded from joining any of the credit programs.
Examples
• Jacob, B. A., and Lefgren, L. (2004). The program: in 1996, the Chicago Public School
System (CPS) placed 71 of its 489 elementary schools on academic probation.
• These probation schools received special funding for staff development as well as
technical assistance and enhanced monitoring.
• Eligibility for probation was determined on the basis of standardized reading scores
(schools in which fewer than 15 percent of students scored at or above national norms in
reading were subject to probation). This creates a highly non-linear relationship between
a school’s reading achievement in 1996 and the likelihood that the school was on
probation in subsequent years.
Examples
• Ludwig and Miller (2007) evaluate the impact of the Head Start program using an RD design.
• The program was established in 1965 to provide preschool, health, and other social
services to poor children aged three to five and their families.
• In 1965 the Office of Economic Opportunity (OEO) provided technical assistance to the
300 poorest counties to develop Head Start proposals. The result was a large and lasting
discontinuity in Head Start funding rates at the OEO cutoff for grant-writing assistance.
• The study finds favorable impacts: lower child mortality and higher educational attainment.