bengtzzon: 2026

Sunday, 3 May 2026

The differences-in-differences approach

Since the 2000s, differences-in-differences has been one of the most widely used methods in empirical economics. In this post I give an overview of how the debates around the method have developed over the past 20 years.

As early as 2004, Marianne Bertrand (U Chicago), Esther Duflo (MIT) and Sendhil Mullainathan (MIT) published an article with the provocative title "How much should we trust differences-in-differences estimates?" This is how they explain the estimator's popularity: "The great appeal of DD estimation comes from its simplicity as well as its potential to circumvent many of the endogeneity problems that typically arise when making comparisons between heterogeneous individuals (see Meyer [1995] for an overview)." (p. 250) But they also point out that there was already (in 2004) a large critical literature on the method. "Treatment" in DiD is often a law passed in some US states, they say, with states that have the law compared to states that do not, and a common objection is then: is the introduction of the treatment really exogenous? [1]

That is not the problem Bertrand et al. focus on, however; their concern is standard errors that are far too small. They present the DiD estimator as the following equation:

Y_ist = A_s + B_t + cX_ist + βI_st + ε_ist,

where Y is the outcome for individual i in group s (e.g. a state) in year t, A_s and B_t are fixed effects for state and year, X_ist are individual-level control variables, and I_st is a dummy for whether the treatment is in place. β is thus the coefficient that captures the effect. The standard errors for this coefficient are usually OLS standard errors, sometimes corrected for correlated shocks within state-year cells (that is, some form of clustering).

The argument of Bertrand et al.'s article is that estimation of this equation suffers from a large and underestimated problem of serial correlation. Three things make the problem especially severe in a DiD context. One, DiD studies use fairly long time series: the studies they survey have 16.5 periods on average. Two, the most common outcome variables are strongly serially correlated. Three, the treatment variable I varies little, if at all, over time.

To gauge the size of the problem they run a series of simulations in which they introduce placebo laws at the state level in the US. When the effects of these fictitious laws are estimated, they should come out significant at the 5 percent level in about 5 percent of cases, but when they use, for example, women's wages as the outcome variable with 21 years of data, they find a "significant effect" of the fake law 45 percent of the time. They also replicate this with Monte Carlo methods, which they further use to test which fixes for the serial correlation actually work. A parametric correction for a specific DGP, such as an AR(1), is not enough. A non-parametric technique, the "block bootstrap", works when the number of states/groups is large enough. Simpler fixes can also work. One is to remove the time-series dimension by simply dividing the data into a pre-period and a post-period. The other is that "one can allow for an unrestricted covariance structure over time within states, with or without making the assumption that the error terms in all states follow the same process. This technique works well when the number of groups is large (e.g., 50 states) but fares more poorly as the number of groups gets small." (p. 252)
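
The placebo logic can be illustrated with a small simulation. This is only a sketch under my own assumptions, not a replication of Bertrand et al.: the outcomes are pure AR(1) noise at the state-year level, the placebo law hits a random half of the states from a random year onward, and the function name and all parameter values (50 states, 20 years, rho = 0.8) are made up for the illustration. The state and year fixed effects are handled by two-way demeaning, which is valid for a balanced panel.

```python
import numpy as np


def two_way_demean(a):
    """Remove state (row) and year (column) means from a balanced panel."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def placebo_rejection_rate(n_states=50, n_years=20, rho=0.8, n_sims=200, seed=0):
    """Share of simulations where a placebo law looks 'significant' at 5%.

    Uses conventional OLS standard errors that ignore serial correlation.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # AR(1) outcomes within each state; there is no true effect anywhere.
        shocks = rng.normal(size=(n_states, n_years))
        y = np.zeros((n_states, n_years))
        y[:, 0] = shocks[:, 0] / np.sqrt(1.0 - rho**2)  # stationary start
        for t in range(1, n_years):
            y[:, t] = rho * y[:, t - 1] + shocks[:, t]

        # Placebo law: half the states, switched on from a random year onward.
        treated = rng.permutation(n_states) < n_states // 2
        start = rng.integers(5, 15)
        d = np.zeros((n_states, n_years))
        d[treated, start:] = 1.0

        # Fixed-effects regression of y on d via two-way demeaning.
        y_t, d_t = two_way_demean(y), two_way_demean(d)
        beta = (d_t * y_t).sum() / (d_t**2).sum()
        resid = y_t - beta * d_t
        dof = n_states * n_years - n_states - n_years
        se = np.sqrt((resid**2).sum() / dof / (d_t**2).sum())
        rejections += abs(beta / se) > 1.96
    return rejections / n_sims
```

Calling `placebo_rejection_rate(rho=0.8)` and `placebo_rejection_rate(rho=0.0)` lets one compare the serially correlated case with the independent case; the nominal rate is 5 percent in both.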

Their survey of DiD articles collects all articles using the method in six journals between 1990 and 2000. "We classified a paper as “DD” if it focuses on specific interventions and uses units unaffected by the law as a control group." By this criterion they find 92 DiD articles in the six journals. Of these, 18 used employment as the outcome, 13 wages, 8 health or medical expenditures, 6 unemployment, 4 fertility, 4 insurance, 3 poverty, and 3 consumption or savings. The average number of periods is 16.5, but only 5 of the articles explicitly discuss autocorrelation; of those 5, 4 use an autoregressive specification (AR(k)). [2]

In their diagnostic tests of where the problems come from, they start with actual wage data for the states (from the Current Population Survey) but with invented laws. They then also experiment with simulated wage data that follow an AR(1) structure. They also test how well a block bootstrap handles the problem. Wikipedia defines the block bootstrap as follows:

"The block bootstrap is used when the data, or the errors in a model, are correlated. In this case, a simple case or residual resampling will fail, as it is not able to replicate the correlation in the data. The block bootstrap tries to replicate the correlation by resampling inside blocks of data (see Blocking (statistics)). The block bootstrap has been used mainly with data correlated in time (i.e. time series) but can also be used with data correlated in space, or among groups (so-called cluster data)."

With 50 states the block bootstrap does well at eliminating the excess Type I errors (that is, believing there is a significant effect when there is none). But with fewer states (20, 10) it works less well. The next approach is the simpler one of reducing the time-series dimension to just two periods: pre and post. This works directly only if all treated states receive the treatment at the same time, but they say the regression can be modified so that it also works when treatment timing differs across states. (p. 267)
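
The pre/post idea can be sketched in a few lines for the simple case where all treated states adopt at the same time. The function name, the toy numbers, and the choice of an unequal-variance two-sample standard error are my own assumptions for illustration, not the exact procedure in the paper.

```python
import numpy as np


def collapsed_did(y, treated, start):
    """DiD after collapsing each state's series into one pre and one post mean.

    y: array (n_states, n_years); treated: boolean array (n_states,);
    start: index of the first treated year (same for all treated states).
    """
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    # One number per state: post-period mean minus pre-period mean.
    change = y[:, start:].mean(axis=1) - y[:, :start].mean(axis=1)
    estimate = change[treated].mean() - change[~treated].mean()
    # Standard error from cross-state variation in the collapsed changes only.
    se = np.sqrt(
        change[treated].var(ddof=1) / treated.sum()
        + change[~treated].var(ddof=1) / (~treated).sum()
    )
    return float(estimate), float(se)


# Hypothetical toy panel: 4 states, 4 years, treatment starts in year index 2.
y_toy = [[1, 1, 3, 3], [2, 2, 5, 5], [1, 1, 1, 1], [2, 2, 3, 3]]
estimate, se = collapsed_did(y_toy, [True, True, False, False], start=2)
print(estimate, se)
```

Because each state contributes a single change, serial correlation within a state no longer enters the standard error; the price is that inference rests on the number of states alone.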

The last correction method is what they call the "empirical variance-covariance matrix", which they introduce as follows:

"Specifically, suppose that the autocorrelation process is the same across all states and that there is no cross-sectional heteroskedasticity. In this case, if the data are sorted by states and (by decreasing order of) years, the variance-covariance matrix of the error term is block diagonal, with 50 identical blocks of size T by T (where T is the number of time periods). Each of these blocks is symmetric, and the element (i, i j) is the correlation between i and i j. We can therefore use the variation across the 50 states to estimate each element of this matrix, and use this estimated matrix to compute standard errors. Under the assumption that there is no heteroskedasticity, this method will produce consistent estimates of the standard error as N (the number of groups) goes to infinity [Kiefer 1980]." (p. 250)

The method works well with 50 states but less well with a smaller number of states. They also present a variant, the "arbitrary variance-covariance matrix".

In their conclusions they stress that autocorrelation in the outcomes means that standard errors in many DiD studies are severely underestimated. Given that the t-values in many of the 92 DiD studies they examined are around 2, this means that many of the estimated "effects" may not be statistically significant at all.

The following year, 2005, Alberto Abadie (then at Harvard, at MIT since 2016) also published an article on methodological problems in the DiD literature. His article is very different, however. In the abstract he introduces the problem: "the conventional DID estimator requires that, in the absence of the treatment, the average outcomes for the treated and control groups would have followed parallel paths over time. This assumption may be implausible if pre-treatment characteristics that are thought to be associated with the dynamics of the outcome variable are unbalanced between the treated and the untreated." The identification procedure he uses comes from Heckman et al. (“Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme”, REStud, 1997; “Characterizing Selection Bias using Experimental Data”, Econometrica, 1998), he says, but he presents three new things. One, the estimation procedure does not require repeated observations of the same individuals. Two, "it allows the estimation of parsimonious parametric approximations to the average effect of the treatment on the treated conditional on selected covariates of interest." Three, his framework can handle different intensities of treatment.

Abadie presents an outcome variable Y(i, t) generated by a particular data-generating process, a components-of-variance process:

Y (i, t) = δ(t) + α · D(i, t) + η(i) + υ(i, t),

where α is the effect of treatment, δ(t) is the time component, η(i) is the individual component, and υ(i, t) is an "individual-transitory shock that has mean zero at each period, t = 0, 1, and is possibly correlated in time". Only Y and D are observed; the rest is estimated. After some manipulation he obtains:

Y (i, t) = µ + τ · D(i, 1) + δ · t + α · D(i, t) + ε(i, t).

This is a difference-in-differences model, since:

α = {E[Y (i, 1) | D(i, 1) = 1] − E[Y (i, 1) | D(i, 1) = 0]}
− {E[Y (i, 0) | D(i, 1) = 1] − E[Y (i, 0) | D(i, 1) = 0]},
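
A small numerical check makes the equivalence concrete. The eight observations below are invented for the illustration; the sketch computes α once from the four cell means and once as the coefficient on D(i, t) = D(i, 1) · t in the regression, using NumPy's least-squares solver.

```python
import numpy as np

# Hypothetical data: d1 is D(i, 1) (ever treated), t is the period, y the outcome.
d1 = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
t = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
y = np.array([1, 3, 6, 8, 0, 2, 2, 4], dtype=float)


def cell_mean(group, period):
    """Mean outcome for one group in one period."""
    return y[(d1 == group) & (t == period)].mean()


# Difference of the post-period gap and the pre-period gap.
alpha_means = (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

# Regression Y = mu + tau * D(i,1) + delta * t + alpha * D(i,t) + eps,
# where treatment is only received in period 1, so D(i,t) = D(i,1) * t.
X = np.column_stack([np.ones_like(y), d1, t, d1 * t])
mu, tau, delta, alpha_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(alpha_means, alpha_ols)
```

With two groups and two periods the regression is saturated, so the two numbers agree up to floating-point error.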

What we want to compute is α, the effect of treatment. But how does this work if there is selection into treatment, or simply a correlation in the outcome variable before treatment? Abadie refers here to "Ashenfelter's dip", after Ashenfelter (1978), who showed that individuals selected for training programs tended to experience negative earnings shocks just before they entered training. Ashenfelter and Card proposed a way to handle this problem as early as 1985.

Unlike Bertrand et al.'s article, Abadie's is very technical and I will not go into the details; suffice it to say that he reasons from econometric first principles about how to handle "Ashenfelter's dip" and related selection into treatment, and about whether control variables can be used to do so. Meyer (1995) points to a problem with including covariates in the regression when the treatment has different effects on different groups in the population. Abadie puts forward a new method for including control variables in a DiD model. This is how he introduces his own approach to the problem compared with the proposal of Heckman et al.:

"A related way to accommodate covariates in a DID estimator has been explored by Heckman et al. (1997, 1998) who propose a DID estimator of the average treatment effect on the treated based also on conditional identiﬁcation restrictions. Their estimator is constructed by matching differences in pre-treatment and post-treatment outcomes for the treated to weighted averages of differences in pre-treatment and post-treatment outcomes for the untreated. The differences are matched on the probability of treatment exposure conditional on the covariates (the propensity score) and the weights are determined non-parametrically using local linear regression. This article, however, proposes a direct weighting scheme on the propensity score that can be used to estimate the effect of the treatment on the treated without estimating weights non-parametrically in a previous step." (s. 4-5)

If the contribution of Heckman et al. (1997) was to propose a variant of propensity score matching, a technique for controlling for which units are selected into treatment and which are not, then Abadie can be said to propose a different solution:

"This article proposes simple weighting schemes to produce estimators of the average effect on the treated E[Y 1(1) − Y 0(1) | D = 1] and parsimonious parametric approximations to its conditional version E[Y 1(1) − Y 0(1) | Xk , D = 1], where Xk is a function of X (for example, a subset of the variables in X). The weighting scheme is directly based on the propensity score, P(D = 1 | X), which is the only function which needs to be estimated in a ﬁrst step. As a result, the proposed method reduces the ﬁrst step estimation burden and allows the researcher to use four or two times more observations for ﬁrst step estimation, relative to direct estimation of equation (9). In practice, this feature may be an important advantage if non-parametric estimation is carried out in the ﬁrst step. When the number of observations is too small for non-parametric estimation in the ﬁrst step, the proposed method allows the researcher to circumvent the curse of dimensionality by placing parametric restrictions on the propensity score, which leaves E[Y 1(1) − Y 0(1) | Xk , D = 1] unrestricted, rather than on each one of the conditional means of equation (9), which may impose unwanted restrictions on E[Y 1(1) − Y 0(1) | Xk , D = 1]." (s. 7)

A central difference between the method of Heckman et al. (1997) and that of Abadie (2005) is that the former was about matching units based on their characteristics (the vector of control variables X) and their probabilities of being "treated" or not, whereas Abadie's requires a longer time horizon and builds on how treated and non-treated units changed before treatment began, given their characteristics (the vector of control variables X). In his conclusions he says: "In this article, I have introduced a family of semiparametric difference-in-differences estimators of treatment effects based on conditional identification restrictions. These estimators may be particularly appropriate when the distribution of observed characteristics that are thought to be related to the dynamics of the outcome variable differs between treated and untreated." (p. 13) "Semiparametric" here refers to the fact that Abadie's estimator does not require linear effects of X on Y: the weighting lets one distinguish between groups (by their values of X) without requiring that the β on X be the same everywhere.
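
As I read the paper, the weighting scheme for the panel case amounts to averaging the outcome change Y(1) − Y(0) with weights (D − π(X)) / ((1 − π(X)) · P(D = 1)), where π(X) is the propensity score. The sketch below is my own minimal illustration of that formula: it takes the propensity score as given (in practice it has to be estimated in a first step), and the function name and the eight data points are hypothetical.

```python
import numpy as np


def abadie_weighted_att(delta_y, d, pscore):
    """Weighted DiD estimate of the average effect on the treated.

    delta_y: outcome change Y(1) - Y(0) per unit (panel data assumed);
    d: treatment indicator; pscore: P(D = 1 | X), taken as known here.
    """
    delta_y = np.asarray(delta_y, dtype=float)
    d = np.asarray(d, dtype=float)
    pscore = np.asarray(pscore, dtype=float)
    # Treated units get weight 1 / P(D=1); controls get negative weights that
    # grow with their propensity score, so controls that resemble the treated
    # count more.
    weights = (d - pscore) / ((1.0 - pscore) * d.mean())
    return float(np.mean(weights * delta_y))


# Hypothetical data with one binary covariate x; the propensity score is the
# treated share within each x cell (0.25 and 0.75).
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
d = np.array([1, 0, 0, 0, 1, 1, 1, 0])
delta_y = np.array([3, 1, 1, 1, 5, 5, 5, 2])
pscore = np.where(x == 0, 0.25, 0.75)

att = abadie_weighted_att(delta_y, d, pscore)
print(att)
```

In this toy example the result equals the within-cell DiD estimates (2 and 3) averaged with the treated units' distribution over the cells, which is what an effect "on the treated" should do.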

By 2008 the DiD literature had matured enough to get major surveys and syntheses. That year Guido M. Imbens and Jeffrey M. Wooldridge published "Recent Developments in the Econometrics of Program Evaluation" as an NBER working paper; the paper appeared a year later in the Journal of Economic Literature, but I have read the WP version. In their introduction they said that the econometric literature on the causal effects of "programs or policies" had, over the two preceding decades, reached such a "level of maturity" that it was time for a survey article.

The fundamental methodological problem in evaluating the effect of a treatment or program is that only one outcome is ever observed for each unit: "The problem is that we can at most observe one of these outcomes because the unit can be exposed to only one level of the treatment." The discussion of how to handle this problem goes back to Ashenfelter (1978) and the subsequent studies by Ashenfelter and Card (1985), Heckman and Robb (1985), Lalonde (1986), Fraker and Maynard (1987), Card and Sullivan (1988), and Manski (1990). Methodologically, this literature focused above all on problems of endogeneity: whether there was something systematic about which units were treated and which were not. In parallel, Rubin (1973, 1974, 1977, 1978) worked on roughly the same problem within statistics, and his solution was named the Rubin Causal Model by Holland (1986). (In the concluding section Imbens and Wooldridge say that one of the important aspects of "the modern literature" is that statisticians and econometricians have now converged on "the Rubin potential outcomes framework" as "the dominant framework", p. 75.) The statistical ideal for evaluating the causal effect of an intervention is of course an intervention that is assigned completely at random (randomized); experiments of this kind are unusual in economics but have been used during the 2000s in development economics (Duflo 2001; Miguel and Kremer 2004; Angrist, Bettinger and Kremer 2005; Banerjee, Duflo, Cole and Linden 2007) and in behavioral economics (Bertrand and Mullainathan 2004). Observational data are more common, however, and establishing causality then becomes a matter of creating comparable treated and untreated groups:

"All these labels refer to some form of the assumption that adjusting treatment and control groups for differences in observed covariates, or pretreatment variables, remove all biases in comparisons between treated and control units. This case is of great practical relevance, with many studies relying on some form of this assumption. The semiparametric efficiency bound has been calculated for this case (Hahn, 1998) and various semi-parametric estimators have been proposed (Hahn, 1998; Heckman, Ichimura, and Todd, 1998; Hirano, Imbens and Ridder, 2003; Chen, Hong, and Tarozzi, 2005; Imbens, Newey and Ridder, 2005; Abadie and Imbens, 2006)." (pp. 2-3)

There are a number of strategies for handling this, they say, and they rattle off the major methods of the 2010s and 2020s one by one:

"Without unconfoundedness there is no general approach to estimating treatment effects. Various methods have been proposed for special cases, and in this review we will discuss several of them. One approach (Rosenbaum and Rubin, 1983; Rosenbaum, 1995) consists of sensitivity analyses, where robustness of estimates to specific limited departures from unconfoundedness are investigated. A second approach, developed by Manski (1990, 2003, 2007), consists of bounds analyses, where ranges of estimands consistent with the data and the limited assumptions the researcher is willing to make, are derived and estimated. A third approach, instrumental variables, relies on the presence of additional treatments, the so-called instruments, that satisfy specific exogeneity and exclusion restrictions. The formulation of this method in the context of the potential outcomes framework is presented in Imbens and Angrist (1994) and Angrist, Imbens and Rubin (1996). A fourth approach applies to settings where, in its pure form, overlap is completely absent because the assignment is a deterministic function of covariates, but comparisons can be made exploiting continuity of average outcomes as a function of covariates. This setting, known as the regression discontinuity design, has a long tradition in statistics (see Shadish, Campbell, and Cook, (2002), Cook (2007) for a historical perspective), but has recently been revived in the economics literature through work by VanderKlaauw (2002), Hahn, Todd, and VanderKlaauw (2000), Lee (2001), and Porter (2003). Finally, a fifth approach, referred to as difference-in-differences, relies on the presence of additional data in the form of samples of treated and control units before and after the treatment. An early application is Ashenfelter and Card (1985). Recent theoretical work includes Abadie (2005), Bertrand, Duflo and Mullainathan (2004), Donald and Lang (2008), and Athey and Imbens (2006)."

The archetypal example is a job market training program, they say: the classic object of study since Ashenfelter (1978) and Lalonde (1986).

Section 2 of the survey deals with the Rubin Causal Model. Individual i (among i = 1, ..., N) has two potential outcomes, Y_i(0) and Y_i(1), where the first is the outcome if he or she does not take part in the program (W_i = 0) and the second is the outcome if he or she does (W_i = 1). "This distinction between the pair of potential outcomes (Yi(0), Yi(1)) and the realized outcome Yi is the hallmark of modern statistical and econometric analyses of treatment effects." (p. 5) The framework originally comes from Neyman (1923) and was also developed by Haavelmo (1943) in his work on simultaneous equations models, SEMs (Haavelmo wanted to study the relationship between supply and demand). This is how Imbens and Wooldridge discuss the advantages of a potential outcomes framework over one based on realized outcomes:

"The potential outcomes framework has a number of advantages over a framework based directly on realized outcomes. The first advantage of the potential outcome framework is that it allows us to define causal effects before specifying the assignment mechanism, and without making functional form or distributional assumptions. The most common definition of the causal effect at the unit level is as the difference Yi(1) − Yi(0), but we may wish to look at ratios Yi(1)/Yi(0), or other functions. Such definitions do not require us to take a stand on whether the effect is constant or varies across the population. Further, defining individual-specific treatment effects using potential outcomes does not require us to assume endogeneity or exogeneity of the assignment mechanism. By contrast, the causal effects are more difficult to define in terms of the realized outcomes. Often, researchers write down a regression function Y_i = α + τ · W_i + ε_i. This regression function is then interpreted as a structural equation, with τ as the causal effect. Left unclear is whether the causal effect is constant or not, and what the properties of the unobserved component, ε_i, are. The potential outcomes approach separates these issues, and allows the researcher to first define the causal effect of interest without considering probabilistic properties of the outcomes or assignment." (pp. 5-6)

The second advantage of the potential outcomes approach (POA) is that it "links the analysis of causal effects to explicit manipulations." (p. 6) Starting from the question of which outcomes could be observed pushes one to think about which processes can affect which outcomes arise. (I really like this; it reminds me of the much more specific point about interaction models that one has to consider whether a particular combination assumed by the model is even possible.) A third advantage, say Imbens and Wooldridge, is that it "separates the modelling of the potential outcomes from that of the assignment mechanism. Modelling the realized outcome is complicated by the fact that it combines the potential outcomes and the assignment mechanism." A fourth advantage is that it "allows us to formulate probabilistic assumptions in terms of potentially observable variables, rather than in terms of unobserved components." A fifth advantage is that it clarifies where the uncertainty in the estimators comes from.

The second component of the RCM, after potential outcomes, is the assignment mechanism: "This is defined as the conditional probability of receiving the treatment, as a function of potential outcomes and observed covariates." They distinguish three variants, from simplest to hardest. The first is a randomized experiment, where assignment to treatment is uncorrelated with the potential outcomes. Methods for this type of assignment mechanism are discussed in section 4. The second type of assignment mechanism "maintains the restriction that the assignment probabilities do not depend on the potential outcomes": W_i ⊥ (Y_i(0), Y_i(1)) | X_i, that is, the probability of ending up in the treatment group is independent of the potential outcomes given what we know about the control variables X_i. "The precise form of this critical assumption, not tied to functional form or distributional assumptions, was first presented in Rosenbaum and Rubin (1983a). Following Rubin (1990) we refer to this assignment mechanism as unconfounded assignment. Somewhat confusingly, this assumption, or variations on it, are in the literature also referred to by various other labels. These include selection on observables, exogeneity, and conditional independence." Methods for this type of assignment mechanism are discussed in section 5. The third type is everything else, and methods for it are discussed in section 6: instrumental variables, regression discontinuity, and DiD.

One limitation of this type of method is that one essentially always assumes that one unit's treatment does not affect the outcomes of another unit. That is probably a large part of why these methods are so big in applied micro but not in macro.

"In most of the literature it is assumed that treatments received by one unit do not affect outcomes for another unit. Only the level of the treatment applied to the specific individual is assumed to potentially affect outcomes for that particular individual. In the statistics literature this assumption is referred to as the Stable-Unit-Treatment-Value-Assumption (SUTVA, Rubin, 1978). In this paper we mainly focus on settings where this assumption is maintained." (p. 9)

This assumption is well motivated in medical studies, say Imbens and Wooldridge: if one individual receives a new treatment for stroke, this will not affect the health outcomes of an entirely different patient. In economic applications the assumption is more problematic: "It is clear that a labor market program that affects the labor market outcomes for one individual potentially has an effect on the labor market outcomes for others. In a world with a fixed number of jobs, a training program could only redistribute the jobs, and ignoring this constraint on the number of jobs by using a partial, instead of a general, equilibrium analysis could lead one to erroneously conclude that extending the program to the entire population would raise aggregate employment. Such concerns have rarely been addressed in the recent program evaluation literature. Exceptions include Heckman, Lochner, and Taber (1999) who provide some simulation evidence for the potential biases that may result from ignoring these issues." (p. 9) [3]

After the walk-through of the components of the RCM comes the section "What are We Interested in? Estimands and Hypotheses". In the early studies in this literature, say Imbens and Wooldridge, effects were assumed to be homogeneous and linear. Today's literature uses more flexible specifications. After these preliminary points the discussion turns to average treatment effects. They start with the estimand PATE, the Population Average Treatment Effect: τ_pate = E[Y_i(1) − Y_i(0)]. This is the average effect that would arise if every individual in the population were treated. Next comes the Population Average Treatment Effect on the Treated, PATT: τ_patt = E[Y_i(1) − Y_i(0) | W_i = 1]. This is the average effect on the individuals who were actually treated. In practice it will often be much more relevant than the PATE. The next variants are CATT and CATE, the conditional ATT and ATE, that is, conditional on X. A further variant is to compute effects for subgroups, following Crump, Hotz, Imbens and Mitnik (2008).

"In settings with selection on unobservables the enumeration of the estimands of interest becomes more complicated. A leading case is instrumental variables. In the presence of heterogeneity in the effect of the treatment one can typically not identify the average effect of the treatment even in the presence of valid instruments. There are two new approaches in the recent literature. One is to focus on bounds for well-defined estimands such as the average effect τ_pate or τ_cate. Manski (1990, 2003) developed this approach in a series of papers. An alternative is to focus on estimands that can be identified under weaker conditions than those required for the average treatment effect. Imbens and Angrist (1994) show that one can, under much weaker conditions than required for identification of τ_pate, identify the average effect for the subpopulation of units whose treatment status is affected by the instrument. They refer to this subpopulation as the compliers. This does not directly fit into the classification above since the subpopulation is not defined solely in terms of covariates. We discuss this estimand in more detail in Section 6.3." (pp. 12-13)
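
To make the two average-effect estimands defined above concrete, here is a toy example in which, unlike in real data, both potential outcomes are known for every unit. All numbers are invented, and the quantities computed are sample analogues of PATE and PATT, not population values.

```python
import numpy as np

# Hypothetical potential outcomes for four units and their treatment status.
y0 = np.array([1.0, 2.0, 3.0, 4.0])  # outcome without treatment, Y_i(0)
y1 = np.array([2.0, 4.0, 4.0, 8.0])  # outcome with treatment, Y_i(1)
w = np.array([0, 1, 0, 1])           # W_i: who actually got treated

unit_effect = y1 - y0

# Average effect over everyone, and over the treated only.
pate = unit_effect.mean()
patt = unit_effect[w == 1].mean()

# What we would actually observe: only one potential outcome per unit.
y_obs = np.where(w == 1, y1, y0)
naive = y_obs[w == 1].mean() - y_obs[w == 0].mean()

print(pate, patt, naive)
```

The naive comparison of observed means differs from both estimands here because the treated units also have higher untreated outcomes, which is exactly the selection problem the assignment mechanism is meant to describe.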

Section 3.2 deals with quantile estimands. These were introduced in the statistical literature in the 1970s but only recently caught on in economics, say Imbens and Wooldridge in 2008. Doksum (1974) and Lehmann (1974) define τ_q = F_Y(1)^−1(q) − F_Y(0)^−1(q) as the q-th quantile treatment effect. The quantile effect is defined as the difference between "quantiles of the two marginal potential outcome distributions, rather than as quantiles of the unit level effect".[4] Methods for estimating τ_q have been developed by Bitler, Gelbach and Hoynes (2002), Firpo (2006) and Abadie, Angrist and Imbens (2002).
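
The distinction between the two kinds of quantiles is easy to see in a constructed example. The potential outcomes below are hypothetical and chosen so that treatment reverses everyone's rank while leaving the overall distribution unchanged; the function name is my own.

```python
import numpy as np


def quantile_treatment_effect(y1, y0, q):
    """Difference between the q-quantiles of the two marginal distributions."""
    return float(np.quantile(y1, q) - np.quantile(y0, q))


# Hypothetical potential outcomes: same marginal distribution, reversed ranks.
y0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y1 = np.array([5.0, 4.0, 3.0, 2.0, 1.0])

qte_75 = quantile_treatment_effect(y1, y0, 0.75)
# Quantile of the unit-level effects, which needs both outcomes per unit
# and is therefore not observable in practice.
unit_effect_q75 = float(np.quantile(y1 - y0, 0.75))

print(qte_75, unit_effect_q75)
```

The quantile treatment effect is zero at every q because the two marginal distributions coincide, even though individual effects range from −4 to 4.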

Section 3.3 deals with hypothesis testing, and section 3.4 with "Decision-theoretic questions".

After these comes section 4, on randomized experiments. A series of papers on labor market policies in the late 1980s questioned the ability of existing methods to estimate causal effects, and in the 1990s and 2000s a large body of research in development economics has run experiments. They discuss Fisher's (1925) exact p-values for hypothesis testing. (pp. 16-18)

Section 5 is titled "Estimation and Inference under Unconfoundedness". These methods are more common than pure experiments, say Imbens and Wooldridge.

"Methods for estimation of average treatment eﬀects under unconfoundedness are the most widely used in this literature. Often this assumption, which requires that conditional on observed covariates there are no unobserved factors that are associated both with the assignment and with the potential outcomes, is controversial. Nevertheless, in practice, where often data have been collected in order to make this assumption more plausible, there are many cases where there is no clearly superior alternative, and the only alternative is to abandon the attempt to get precise inferences. In this section we discuss some of these methods and the issues related to them. A general theme of this literature is that the concern is more with biases than with eﬃciency." (s. 19)

They go on to discuss this setting as follows:

"This setting is closely related to that underlying standard multiple regression analysis with a rich set of controls. Unconfoundedness implies that we have a sufficiently rich set of predictors for the treatment indicator, contained in the vector of covariates X_i, such that adjusting for differences in these covariates leads to valid estimates of causal effects. Combined with linearity assumptions of the conditional expectations of the potential outcomes given covariates, the unconfoundedness assumption justifies linear regression. But in the last fifteen years the literature has moved away from the earlier emphasis on regression methods. The main reason is that, although locally linearity of the regression functions may be a reasonable approximation, in many cases the estimated average treatment effects based on regression methods can be severely biased if the linear approximation is not accurate globally. To assess the potential problems with (global) regression methods, it is useful to report summary statistics of the covariates by treatment status." (p. 19)

If we have all relevant variables in the vector X, the effect of treatment can be estimated without bias, but that is a strong assumption. They propose probing this with, for example, the equation ΔX = (X̄_1 − X̄_0) / √(S_0^2 + S_1^2), that is, the difference in the covariates X between the treated group (1) and the control group (0), scaled by the variances. They go through various methods for adjusting for covariates. Propensity score matching computes the probability of ending up in the treatment group rather than the control group; another variant is pairwise matching. They then turn to the basic assumption of unconfoundedness, introduced by Rosenbaum and Rubin (1983): W_i ⊥ (Y_i(0), Y_i(1)) | X_i. If the treatment effect τ is constant and ε_i is uncorrelated with W_i, the estimated effect in the regression is causal. Assumption 2 is overlap, 0 < pr(W_i = 1 | X_i = x) < 1. It says that for every possible value of X there are both treated and untreated units. Rosenbaum and Rubin (1983) summed up the two assumptions of unconfoundedness and overlap as "strong ignorability". [5]
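
The scaled covariate difference is simple to compute. A minimal sketch, assuming a single covariate and sample variances with the usual n − 1 denominator; the function name and the numbers are made up.

```python
import numpy as np


def normalized_difference(x_treated, x_control):
    """Difference in covariate means, scaled by sqrt(S_0^2 + S_1^2)."""
    x1 = np.asarray(x_treated, dtype=float)
    x0 = np.asarray(x_control, dtype=float)
    # Unlike a t-statistic, this does not grow mechanically with sample size.
    return float((x1.mean() - x0.mean()) / np.sqrt(x0.var(ddof=1) + x1.var(ddof=1)))


# Hypothetical covariate values for two treated and two control units.
delta_x = normalized_difference([2.0, 4.0], [1.0, 3.0])
print(delta_x)
```

A value far from zero signals that the two groups differ so much on this covariate that a regression adjustment has to extrapolate heavily.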

They present a general approach to regression for estimating the ATE and discuss various aspects of what makes the estimator efficient. What variation is there in the data, and how does it affect the properties of the estimator? Here, via Heckman, Ichimura and Todd (1997) and the same authors plus Smith (1998), they introduce kernel regression as one method for handling non-linear relationships, with polynomials as another alternative. Kernel regression gives higher weight to observations closer to x, with a bandwidth h: a larger h means more smoothing, a smaller h less. Imbens and Wooldridge say that the bandwidth is often set somewhat arbitrarily, and note that there are also versions with continuous smoothing rather than the more rigid smoothing of kernel regression. The sieve estimator is an example of this approach.
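
The role of the bandwidth can be seen in a bare-bones kernel regression. This sketch uses a Gaussian kernel and the Nadaraya-Watson form (a locally weighted average), which is one common variant and not necessarily the one used in the papers cited; the data are invented.

```python
import numpy as np


def kernel_regression(x0, x, y, h):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Observations close to x0 get weights near 1, distant ones near 0;
    # the bandwidth h sets what counts as "close".
    weights = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return float(np.sum(weights * y) / np.sum(weights))


# Hypothetical data on a straight line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

narrow = kernel_regression(0.0, x, y, h=0.05)  # uses almost only the point at x = 0
wide = kernel_regression(0.0, x, y, h=100.0)   # close to the overall mean of y
print(narrow, wide)
```

With a tiny bandwidth the estimate at x = 0 reproduces the nearest observation, while with a huge bandwidth it collapses toward the global mean, which shows why the choice of h matters.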

They continue with methods based on propensity scores, which also go back to Rosenbaum and Rubin (1983). If unconfoundedness holds, the potential outcomes and treatment are independent of each other, conditional on the propensity score. They discuss three practical ways of using this. The first is to include the units' propensity scores as an explanatory variable in a regression. Imbens and Wooldridge recommend against this method: "Because the propensity score does not have a substantive meaning, it is difficult to motivate a low order polynomial as a good approximation to the conditional expectation." (p. 28-29) They say that individuals with propensity scores of 0.45 and 0.50 are probably much more similar to each other than individuals with scores of 0.01 and 0.06. The second method, called blocking, subclassification or stratification, also adjusts for the propensity score in a way reminiscent of regression, but more flexibly. You divide the sample into strata based on discretized values of the propensity score and then treat the distribution of treated and untreated units within each stratum as a randomized experiment. The third method is to reweight the observations. One variant of this is the inverse probability weighting estimator. (p. 30-31)
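The second and third methods can be sketched on a toy dataset. Everything here is hypothetical: a single binary covariate x, a propensity score estimated as the share treated within each value of x, and outcomes constructed so that the true effect is 2 for every unit. It is a sketch of the idea, not of any particular implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows (x, w, y): covariate, treatment dummy, outcome.
# Units with x = 1 have higher outcomes AND are treated more often.
data = ([(0, 1, 12.0)] * 2 + [(0, 0, 10.0)] * 6 +
        [(1, 1, 22.0)] * 6 + [(1, 0, 20.0)] * 2)

# Naive comparison of treated and untreated units (confounded by x).
naive = (mean(y for _, w, y in data if w == 1)
         - mean(y for _, w, y in data if w == 0))

# Estimated propensity score: share treated in each covariate cell.
cells = defaultdict(list)
for x, w, y in data:
    cells[x].append((w, y))
pscore = {x: mean(w for w, _ in rows) for x, rows in cells.items()}

# Blocking: treated-control difference within each stratum,
# averaged with the strata's sample shares as weights.
blocking = sum(
    len(rows) / len(data)
    * (mean(y for w, y in rows if w == 1) - mean(y for w, y in rows if w == 0))
    for rows in cells.values()
)

# Inverse probability weighting estimate of the ATE.
ipw = mean(w * y / pscore[x] - (1 - w) * y / (1 - pscore[x])
           for x, w, y in data)

print(naive, blocking, ipw)
```

The naive difference is far too large because the treated are concentrated in the high-outcome stratum, while both blocking and reweighting recover the effect built into the data.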

From the propensity score methods they move on to matching methods (section 5.5). "Matching estimators impute the missing potential outcomes using only the outcomes of a few nearest neighbors of the opposite treatment group. In that sense, matching is similar to non-parametric kernel regression, with the number of neighbors playing the role of the bandwidth in the kernel regression." (p. 31)

The next section is "Combining Regression and Propensity Score Weighting" (5.6), the one after that is about combining subclassification and regression (5.7), and then comes matching and regression (5.8). After these more practical parts they take a step back with "A General Method for Estimating Variances" (5.9). Section 5.10 is of more direct interest to me: "Overlap in Covariate Distributions". (p. 39-43) Here they begin by discussing the method proposed by Rubin (2006) for dropping control units that are too different from the treated units for the comparison to be reasonable. The precondition is that you have a large number of control units in your sample, so that you can pick and choose among them to some extent. Rubin proposes ordering treated and control units separately by propensity score, that is, the probability of being selected into treatment, and then matching units on the propensity score. In the process you drop the control units that deviate the most from the treated units in terms of background variables. Crump, Hotz, Imbens and Mitnik (2008) propose a different approach, intended for contexts where you want to estimate the average treatment effect, as opposed to the average effect for the treated as in Rubin. Towards the end of chapter 5 comes "Assessing the Unconfoundedness Assumption". (p. 43-46) I skip this one, the section "Testing", and section 5.13, "Selection of Covariates".

Chapter 6 is called "Selection on Unobservables" and discusses different methods that have in common that they "relax the pair of assumptions made in Section 5", above all unconfoundedness. The first method they discuss is Manski's (1990, 1995, 2003, 2005, 2007) "bounds" approach, which in complicated settings aims not at a precise point estimate of an effect but at a lower bound and an upper bound for the effect.
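The simplest version of the bounds idea can be written out in a few lines. This is a sketch of worst-case bounds under the assumption that the outcome is known to lie between 0 and 1 and nothing else is assumed; the data are hypothetical. The missing potential outcomes are simply replaced by the smallest and the largest value they could take.

```python
from statistics import mean

# Hypothetical rows (w, y) with a binary outcome y in {0, 1}.
data = [(1, 1), (1, 1), (1, 1), (1, 0),
        (0, 1), (0, 1), (0, 1), (0, 0), (0, 0), (0, 0)]

y_min, y_max = 0.0, 1.0
p_treated = mean(w for w, _ in data)
mean_y_treated = mean(y for w, y in data if w == 1)
mean_y_control = mean(y for w, y in data if w == 0)

# E[Y(1)]: observed for the treated, unknown (but bounded) for the controls.
ey1_low = mean_y_treated * p_treated + y_min * (1 - p_treated)
ey1_high = mean_y_treated * p_treated + y_max * (1 - p_treated)

# E[Y(0)]: observed for the controls, unknown (but bounded) for the treated.
ey0_low = mean_y_control * (1 - p_treated) + y_min * p_treated
ey0_high = mean_y_control * (1 - p_treated) + y_max * p_treated

ate_low = ey1_low - ey0_high
ate_high = ey1_high - ey0_low
print(ate_low, ate_high)
```

Without further assumptions the interval always has width y_max − y_min, so it necessarily contains zero; the bounds only become informative once more structure is added.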

The second method is sensitivity analysis. Here the unconfoundedness assumption is relaxed somewhat: you assume that there are unobserved variables correlated with both the outcomes and the treatment, and you gauge how big a problem this is, how large the bias is, by relating treatment status to the available control variables. The question is: how much difference does it make for your point estimate (of the effect of treatment on Y), or for the p-value of the treatment effect, if you include covariates?

The third method is instrumental variables. Bloom (1984, “Accounting for No–shows in Experimental Evaluation Designs,” Evaluation Review) used eligibility for a program as an instrument for participation in the program. This type of design works when eligibility is assigned randomly. Imbens and Angrist proposed, in their classic article “Identification and Estimation of Local Average Treatment Effects” in Econometrica, 1994, a much broader approach to instruments. The key assumption is that the instrument is exogenous: (Y_i(0), Y_i(1), W_i(0), W_i(1)) ⊥ Z_i, that is, the potential outcomes Y_i(0) and Y_i(1) without and with treatment (W), and the person's decision whether or not to take up treatment under each value of the instrument, do not depend on the realized value of Z. [6] Imbens and Angrist introduced the concept of "compliance type", which captures what treatment an individual receives depending on their value of the instrument. When both treatment and instrument are binary there are four types: never-taker, complier, defier, and always-taker. They also introduce the monotonicity assumption, W_i(1) >= W_i(0) for all individuals, so that a higher value of Z does not give a lower level of W. This assumption thus rules out the type "defier", and is sometimes called the "no-defiance" assumption. Based on these two assumptions they explore how to identify the average effect of the treatment in the subpopulation of compliers. Imbens and Wooldridge explain in some detail Imbens and Angrist's analysis of the relationship between never-takers, compliers and always-takers (once defiers have been ruled out) and how the Local Average Treatment Effect (LATE) can be calculated from the behavior of the three groups. Imbens and Angrist shared the Nobel Prize in economics in 2021, and I quote (via Wikipedia) the Nobel committee's motivation for the prize: Imbens and Angrist and their LATE framework

"significantly altered how researchers approach empirical questions using data generated from either natural experiments or randomized experiments with incomplete compliance to the assigned treatment. At the core, the LATE interpretation clarifies what can and cannot be learned from such experiments."

Imbens and Wooldridge define the LATE like this:

τ_late = E[Y_i(1) − Y_i(0)|W_i(0) = 0, W_i(1) = 1] = E[Y_i(1) − Y_i(0)|T_i = complier].

Here you cannot calculate the average causal effect for never-takers or always-takers, but you can use Manski's bounds approach to at least put upper and lower limits on the average effect in the whole population.
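A toy calculation shows how the LATE falls out of the data. The population below is hypothetical: half are compliers with an effect of 4, and there are never-takers and always-takers but no defiers. The estimator is the usual ratio of the effect of the instrument on the outcome to its effect on treatment take-up (the Wald estimator).

```python
from statistics import mean

# Hypothetical rows (z, w, y): instrument, treatment received, outcome.
rows = (
    # Z = 1: compliers take treatment, never-takers do not, always-takers do.
    [(1, 1, 14.0)] * 5 + [(1, 0, 5.0)] * 3 + [(1, 1, 30.0)] * 2 +
    # Z = 0: compliers stay untreated, always-takers are treated anyway.
    [(0, 0, 10.0)] * 5 + [(0, 0, 5.0)] * 3 + [(0, 1, 30.0)] * 2
)


def mean_by_z(z, column):
    """Mean of treatment (column 1) or outcome (column 2) for a given Z."""
    return mean(row[column] for row in rows if row[0] == z)


itt_outcome = mean_by_z(1, 2) - mean_by_z(0, 2)   # effect of Z on Y
itt_takeup = mean_by_z(1, 1) - mean_by_z(0, 1)    # share of compliers
late = itt_outcome / itt_takeup

# For comparison: the naive treated-versus-untreated difference.
naive = (mean(y for _, w, y in rows if w == 1)
         - mean(y for _, w, y in rows if w == 0))
print(late, naive)
```

The ratio recovers the effect for compliers, while the naive comparison is inflated because the always-takers have high outcomes regardless of treatment.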

The method that follows IV is Regression Discontinuity Design, RDD. This method has been around in psychology and applied statistics since the 1960s, they say, but only broke through in economics in the 1990s and 2000s: DiNardo and Lee (2004), Angrist and Lavy (1999), Van der Klaauw (2002), Lee, Moretti and Butler (2004), and so on. RDD is at heart a very simple method: "The basic idea behind the RD design is that assignment to the treatment is determined, either completely or partly, by the value of a predictor (the forcing variable X_i) being on either side of a common threshold. This generates a discontinuity, sometimes of size one, in the conditional probability of receiving the treatment as a function of this particular predictor." (p. 58) That is, it is a context where a unit's position above or below a threshold (on the variable X) wholly or partly determines whether the unit is treated or untreated. One distinguishes between sharp and fuzzy RDD. In a sharp RDD all units with a value of X above the threshold c are treated (it is not voluntary), and all units below c are untreated (they have no access to treatment). What is estimated, say Imbens and Wooldridge, is:

τ_srd = E[Y_i(1) − Y_i(0) | X_i = c].

In a fuzzy RDD the threshold does not mean that the probability of treatment shifts from zero to one; there only has to be a shift at that point. In practice, they say, the discontinuity has to be large enough to be visible in simple graphical descriptions. (p. 60) Since assignment to treatment and non-treatment is not 100 percent on either side of the threshold, the reasoning about compliers, defiers etcetera comes back here. Hahn, Todd and Van der Klaauw (2001) define a complier in the fuzzy RDD context as a unit whose behavior is affected by the threshold, and from this together with the monotonicity assumption they define:

τ_frd = E[Y_i(1) − Y_i(0) | unit i is a complier and X_i = c].

The estimand τ_frd is the average effect of treatment, say Imbens and Wooldridge, but only for units around the threshold c, and only for compliers. To generalize to the broader population, more ingredients are needed in the model. If unconfoundedness holds, it becomes much easier to estimate average effects for the population at large. An important discussion in RDD is how wide the windows/bandwidths around the threshold should be: which group just below and which group just above the threshold are the most reasonable to compare in order to draw causal conclusions about the effect of treatment? I and W discuss different approaches to choosing the right bandwidth, among them Ludwig and Miller (2005) and Imbens and Lemieux (2007). (p. 60-61) The final discussion of RDD concerns two possible problems with the method. One is whether the threshold also implies shifts in other variables/covariates. The other is whether the units can manipulate their value of X, nudging themselves above or below the cutoff.
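For the sharp case, the estimate can be sketched as the gap between two local linear fits at the threshold. This is a minimal illustration with a uniform kernel, a bandwidth picked by hand and noise-free hypothetical data with a jump of 3 at c = 0; it is not one of the bandwidth selection procedures cited above.

```python
def fit_line(xs, ys):
    """Simple OLS of y on x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def sharp_rd(xs, ys, cutoff, h):
    """Jump in the fitted outcome at the cutoff, using data within h of it."""
    left = [(x - cutoff, y) for x, y in zip(xs, ys) if cutoff - h <= x < cutoff]
    right = [(x - cutoff, y) for x, y in zip(xs, ys) if cutoff <= x <= cutoff + h]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Hypothetical data: y = 1 + 0.5 x, plus a jump of 3 for units at or above 0.
xs = [i / 10 for i in range(-20, 21)]
ys = [1 + 0.5 * x + (3.0 if x >= 0 else 0.0) for x in xs]

tau_srd = sharp_rd(xs, ys, cutoff=0.0, h=1.0)
print(tau_srd)
```

Because the forcing variable is centered at the cutoff, the two intercepts are the fitted outcomes just below and just above the threshold, and their difference is the estimated effect.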

From RDD they move on to the actual subject of this blog post, Difference-in-Differences. Imbens and Wooldridge frame the discussion like this:

"Since the seminal work by Ashenfelter (1978) and Ashenfelter and Card (1985), the use of Difference-In-Differences (DID) methods has become widespread in empirical economics. Influential applications include Card (1990), Meyer, Viscusi and Durbin (1995), Card and Krueger (1993), Eissa and Liebman (1996), Blundell, Duncan and Meghir (1998), and many others. The simplest setting is one where outcomes are observed for units observed in one of two groups, in one of two time periods. Only units in one of the two groups, in the second time period, are exposed to a treatment." (p. 64)

This 2x2 design (two periods, two groups) is basically simple: in period 0 there is no treatment, in period 1 one group is treated and the other group remains untreated. The change in the control group's value of the outcome of interest is subtracted from the change in the treatment group to calculate the effect of treatment. This "double differencing", the difference from period 0 to period 1 and the difference between group T and group C, removes bias from how the groups differ in their starting conditions already in period 0. The outcome for individual i in the absence of treatment (written Y_i(0)) is:

Y_i(0) = α + β · T_i + γ · G_i + ε_i,

where β captures the development over time and γ the time-invariant difference in levels between the groups. This equation for the outcome without treatment is then combined with an expression for the treatment effect:

τ_did = E[Y_i(1)] − E[Y_i(0)]
= (E[Y_i|G_i = 1, T_i = 1] − E[Y_i|G_i = 1, T_i = 0])
− (E[Y_i|G_i = 0, T_i = 1] − E[Y_i|G_i = 0, T_i = 0]).

That is, as said above, we take the change in the treatment group (the first term after the equals sign in the equation) minus the change in the control group (the second term). Combined, the two equations become a single equation that can be estimated by OLS, with W_i = T_i · G_i as the treatment indicator:

Y_i = α + β1 · T_i + γ1 · G_i + τ_did · W_i + ε_i
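The double difference is easy to compute by hand from the four cell means. The numbers below are hypothetical. Since the regression with T_i, G_i and their interaction is saturated in the 2x2 case, its OLS coefficient on W_i is numerically the same as this difference of cell means.

```python
from statistics import mean

# Hypothetical rows (g, t, y): group, period, outcome.
rows = [
    (0, 0, 9.0), (0, 0, 11.0),    # control group, period 0
    (0, 1, 11.0), (0, 1, 13.0),   # control group, period 1
    (1, 0, 14.0), (1, 0, 16.0),   # treatment group, period 0
    (1, 1, 19.0), (1, 1, 21.0),   # treatment group, period 1 (treated)
]


def cell_mean(g, t):
    """Mean outcome in group g and period t."""
    return mean(y for gi, ti, y in rows if gi == g and ti == t)


change_treated = cell_mean(1, 1) - cell_mean(1, 0)
change_control = cell_mean(0, 1) - cell_mean(0, 0)
tau_did = change_treated - change_control
print(change_treated, change_control, tau_did)
```

The treatment group starts at a higher level than the control group, but that level difference drops out in the double difference.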

So far everything about DID has been very simple, with only two groups and two time periods. When you start adding groups and periods, things get a bit more complicated. In the equation corresponding to the one for the untreated outcome above, you need several parameters for groups and time periods, not just one β and one γ. It then also becomes interesting to calculate a diff-in-diff between different untreated groups: the DID coefficient should of course be zero, and if it is not, that says something interesting about heterogeneity in your data. (p. 66) From here I and W move on to the problem discussed by Bertrand, Duflo and Mullainathan (2004) among others: there can be correlation in the error term within groups over time, which means that a naive OLS overstates the precision of the estimates. The starting point is the following structure for the error term ε_i:

ε_i = η_(G_i,T_i) + ν_i,

where η captures the group-specific component that varies over time. If there are such effects in a 2x2 setting, the conventional DID estimator is not consistent, say I and W, and it is hard to gauge how large the cluster problem is. Bertrand et al, by contrast, focus on a setting with more than two time periods, and show a method for estimating how strong the autoregressive process is within groups. Hansen (2007a, b) also discusses methods for dealing with these problems.

The next discussion (section 6.5.4) concerns DID with panel data and what difference it makes whether the units within the groups are the same over time or not. From this they move on to Athey and Imbens' (2006) changes-in-changes (CIC) model, which corresponds to DID but without the linearity assumption. For units without treatment they posit a generic function h_0: Y_i(0) = h_0(U_i, T_i), where U stands for the characteristics that drive the outcome at the individual level. The distribution of U can differ between groups but not within groups over time. The standard DID model is then what you get by adding three further assumptions to the CIC model:

U_i − E[U_i|G_i] ⊥ G_i (additivity)
h_0(u, t) = φ(u + δ · t) for a strictly increasing function φ(·) (single index model)
φ(·) is the identity function (identity transformation).

The average treatment effect τ_cic is calculated as τ_cic = E[Y_i(1)−Y_i(0) | G_i = 1, T_i = 1]. The point of CIC compared with DID, however, is that you can calculate not only average effects but also non-linear effects, effects that vary by quantile and so on (see the variable U above). A helpful econometrics professor and blogger, Daniel Millimet at Southern Methodist University in Texas (linked above), explains that in theory you could calculate quantile-specific DID in a 2x2 context with the same method as ordinary DID: if we are interested in, say, the effect of treatment at the 25th, 50th (the median) and 75th percentiles, we calculate how these evolve in the treatment and control groups, and calculate the treatment effect as the difference between the change in the treatment group (for, say, the 25th percentile) and the change in the control group for the same percentile. But Athey and Imbens argue that this would require misleading assumptions. This is how the blogger explains the difference between a hypothetical quantile DID approach (QDID) and the approach that Athey and Imbens propose, QCIC:

"QDID posits that quantile q of the Y(0) distribution for the treatment units would have evolved over time in an identical manner to quantile q of the Y(0) distribution for the control units. If you will, the parallel trends assumption holds at quantile q. Instead, QCIC is based on the assumption that quantile q of the Y(0) distribution for the treatment units would have evolved over time in an identical manner to quantile q' of the Y(0) distribution for the control units, where q' may not equal q. In other words, quantile q for the treatment units would have followed a parallel trend to quantile q' for the control units."

The blogger gives a very pedagogical explanation of Athey and Imbens' argument here, of why they are not content with matching the 25th percentile of the treatment group to the 25th percentile of the control group. Parenthetically, this goes back to the assumption above that abilities and the like, captured by the variable U, can vary between groups. That is why the 25th percentile in the treatment group is not necessarily best matched with the 25th percentile in the control group: the persons or units at p25 in the two groups may in fact be very different from each other.

"An illustration will make this clear. Returning to the job training example from above, suppose we are interested in estimating the QTT at the median. The sample median wage for the treatment units in period 0 is, say, $10/hr. So, we then turn to the wage distribution for the control units in period 0 and we see to which quantile $10/hr corresponds. If the treatment group is positively selected, $10/hr might represent, say, the 70th quantile of the wage distribution in period 0 for the control units. We then examine how the 70th quantile of the wage distribution changes over time for the control units and assume the median wage for the treatment units would have evolved similarly. If the 70th quantile for the control units increases to, say $12/hr in period 1, then the counterfactual median wage for the treatment units in period 1 is $12/hr. The QCIC estimate of the QTT at the median is then given by the realized median wage of the treatment units in period 1 minus $12/hr.
A bit strange, indeed, but in hindsight it seems obvious. While we are assuming parallel trends between the treatment and control units across different quantiles, we are assuming parallel trends between treatment and control units with the same value of the outcome in the pretreatment period."

Imbens and Wooldridge explain the approach more formally. The basic problem is the usual one in causal inference from statistics: you observe the outcomes of the treatment group in period 1 with treatment, but never without treatment. That distribution is entirely counterfactual, and the actually realized distribution has to be compared with it. Perhaps one can say that what is unique about Athey and Imbens' approach is precisely how it lets you calculate the counterfactual distribution. Athey and Imbens demonstrate, say Imbens and Wooldridge, that under the assumptions that h_0 is monotone in u and that T_i and U_i are conditionally independent given G_i, the distribution F of Y(0) for the treatment group in period 1 can be identified as:

F_Y11(y) = F_Y10(F^(−1)_Y00 (F_Y01(y))),

where F_Ygt is the distribution of Y_i in group g and period t, and the left-hand side is the counterfactual distribution of Y(0) for the treatment group in period 1. Reading the right-hand side from the inside out: the innermost element, F_Y01(y), is the rank that a unit with the value y has in the distribution of the control group in the treatment period. The next, F^(−1)_Y00, is the value that the same rank corresponded to in the control group in period 0. And the outermost element, F_Y10, asks: what rank does that value correspond to in the treatment group in period 0? Taken together, this dribbling between ranks and actual values, between treatment group and control group, and between period 0 and period 1 yields a counterfactual distribution for the treatment group in the treatment period (1). The central assumption is that the ranking is not changed by treatment.

The expected counterfactual outcome for the treatment group in the second period had it not been exposed to treatment, E[Y_i(0) | G_i = 1, T_i = 1], they calculate like this:

E[Y_i(0) | G_i = 1, T_i = 1] = E[F^(−1)_01(F_00(Y_i10))].

The counterfactual distribution is thus the distribution of the treatment group in the pre-period (Y_i10), rank-mapped through the percentile this corresponds to in the control group in the pre-period (F_00), followed by a new, inverse transformation from rank to value for the control group in the post-period (F^(−1)_01). This reflects the different counterfactual assumption that CIC makes compared with DID, which I discussed above. It also allows effects to be calculated across the distribution: perhaps individuals with low U respond quite differently to treatment than individuals with high U do?
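The rank mapping can be sketched with empirical distribution functions. The data are hypothetical, and the helper names ecdf and quantile are mine: each pre-period outcome in the treatment group is located in the control group's pre-period distribution and then moved to the value with the same rank in the control group's post-period distribution. The empirical quantile is taken as the smallest sample value whose empirical CDF reaches q.

```python
from statistics import mean


def ecdf(sample, y):
    """Share of the sample with values at or below y."""
    return sum(1 for v in sample if v <= y) / len(sample)


def quantile(sample, q):
    """Smallest sample value whose empirical CDF is at least q."""
    ordered = sorted(sample)
    for k, value in enumerate(ordered, start=1):
        if k / len(ordered) >= q - 1e-12:   # tolerance for float rounding
            return value
    return ordered[-1]


# Hypothetical outcomes: control group pre/post, treatment group pre/post.
y00 = [8.0, 9.0, 10.0, 11.0, 12.0]
y01 = [9.0, 10.5, 12.0, 13.5, 15.0]   # higher ranks grew more
y10 = [10.0, 11.0, 12.0]              # treated sit high in the control distribution
y11 = [14.0, 15.5, 17.0]

# Counterfactual untreated period-1 outcomes for the treatment group.
counterfactual = [quantile(y01, ecdf(y00, y)) for y in y10]

tau_cic = mean(y11) - mean(counterfactual)
tau_did = (mean(y11) - mean(y10)) - (mean(y01) - mean(y00))
print(counterfactual, tau_cic, tau_did)
```

In this example the treated units are compared with the upper part of the control distribution, where outcomes grew faster than the control average, so the CIC estimate is smaller than the ordinary DID estimate.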

From DID and its variant CIC, Imbens and Wooldridge move on to "The Abadie-Diamond-Hainmueller Artificial Control Group Approach", a method for contexts with several possible control groups. "Applications in Abadie, Diamond and Hainmueller (2007) to estimation of the effect of smoking legislation in California, and the effect of reunification on West Germany are very promising.", say Imbens and Wooldridge; the economic historian Joe Francis would not necessarily agree.

Chapter 7 is titled "Multi-valued and Continuous Treatments". Most program evaluation methods that have been developed have focused on binary treatments, say Imbens and Wooldridge, but recently there has also been some work on more varied treatments. I skip this chapter. (p. 71-74)

The concluding chapter, chapter 8, is very short and I quote it in full:

"Over the last two decades there has been a proliferation of the literature on program evaluation. This includes theoretical econometrics work, as well as empirical work. Important features of the modern literature are the convergence of the statistical and econometric literatures, with the Rubin potential outcomes framework now the dominant framework. The modern literature has stressed the importance of relaxing functional form and distributional assumptions, and has allowed for general heterogeneity in the effects of the treatment. This has led to renewed interest in identification questions, leading to unusual and controversial estimands such as the local average treatment effect (Imbens and Angrist, 1994), as well as to the literature on partial identification (Manski, 1990). It has also borrowed heavily from the semiparametric literature, using both efficiency bound results (Hahn, 1998) and methods for inference based on series and kernel estimation (Newey, 1994ab). It has by now matured to the point that is is of great use for practitioners." (p. 75)

The most interesting things here from my perspective are, first, how central Rubin's "potential outcomes" framework is and, second, the strong emphasis from Imbens and Angrist onwards on distinguishing effects at different levels: effects on compliers only versus effects in the whole population, and from the 1990s and 2000s onwards also different effects at different points in the distribution, as in Athey and Imbens' changes-in-changes framework.

I jump ten years ahead and turn to Andrew Goodman-Bacon's (then Vanderbilt University, now Federal Reserve Bank of Minneapolis) paper "Difference-in-Differences with Variation in Treatment Timing", which was published in the Journal of Econometrics in 2021 but which I read in the working paper version from 2018. His abstract is as pedagogical as it gets, so I quote it at length:

"The canonical difference-in-differences (DD) model contains two time periods, “pre” and “post”, and two groups, “treatment” and “control”. Most DD applications, however, exploit variation across groups of units that receive treatment at different times. This paper derives an expression for this general DD estimator, and shows that it is a weighted average of all possible two-group/two-period DD estimators in the data. This result provides detailed guidance about how to use regression DD in practice. I define the DD estimand and show how it averages treatment effect heterogeneity and that it is biased when effects change over time."

As the title of the paper already says, it is about DID in contexts where not all treatment arrives at the same time, and from the abstract I gather that the basic problem in the paper is what happens if the treatment effect is not constant over time, even though the basic 2x2 DID of course treats it as constant. Goodman-Bacon points out that the 2x2 case is central to the whole DID discussion: with "a common trends assumption, a two-group/two-period (2x2) DD identifies the average treatment effect on the treated. All econometrics textbooks and survey articles describe this structure,² and recent methodological extensions build on it.³" Footnote 3, on recent methodological developments, lists: "Inverse propensity score reweighting: Abadie (2005), synthetic control: Abadie, Diamond, and Hainmueller (2010), changes-in-changes: Athey and Imbens (2006), quantile treatment effects: Callaway, Li, and Oka (forthcoming)." And it feels reassuring that Athey and Imbens' by then 12-year-old paper still counts as "recent": today, in the 2020s, people joke freely that, on the contrary, a new DID estimator appears every year or more often. The important thing here, in any case, is how Goodman-Bacon continues: many (he even says most) applications of DID deviate in practice from the 2x2 setup by having treatment that occurs at different times; in a footnote he reports that half of the 93 DID papers published in 5 top journals in 2014-15 had variation in timing. (p. 1)

Calculating the hypothetical average effect on the treated in a 2x2 DID is simple, since you only compare the four cells: (treatment group post - treatment group pre) - (control group post - control group pre). For settings with more time periods we know less, says Goodman-Bacon:

"In contrast to our substantial understanding of the canonical 2x2 DD model, we know relatively little about the two-way fixed effects DD model when treatment timing varies. We do not know precisely how it compares mean outcomes across groups. We typically rely on general descriptions of the identifying assumption like “interventions must be as good as random, conditional on time and group fixed effects” (Bertrand, Duflo, and Mullainathan 2004, p. 250), and consequently lack well-defined strategies to test the validity of the DD design with timing. We have limited understanding of the treatment effect parameter that regression DD identifies. Finally, we often cannot evaluate when alternative specifications will work or why they change estimates." (p. 2) [7]

The paper shows that the two-way fixed effects DID estimator with several periods gives a weighted average of all possible 2x2 estimators that compare different timing groups with each other. Sometimes the comparison group will be treatment groups that have not yet been treated (or have already been treated), sometimes it will be pure control groups. "As in any least squares estimator, the weights on the 2x2 DD’s are proportional to group sizes and the variance of the treatment dummy within each pair. Treatment variance is highest for groups treated in the middle of the panel and lowest for groups treated at the extremes." (p. 2) This is how he explains the value of his approach and how it relates to the new diff-in-diff literature:

"By decomposing the DD estimator into its sources of variation (the 2x2 DD’s) and providing an explicit interpretation of the weights in terms of treatment variances, my results extend recent research on DD models with heterogeneous effects. Assuming equal counterfactual trends, Abraham and Sun (2018), Borusyak and Jaravel (2017), and de Chaisemartin and D’Haultfœuille (2018b) show that two-way fixed effects DD yields an average of treatment effects across all groups and times, some of which may have negative weights. My results show how these weights arise from differences in timing and thus treatment variances, facilitating a connection between models of treatment allocation and the interpretation of DD estimates. I also explain why the negative weights occur: when already-treated units act as controls, changes in their treatment effects over time get subtracted from the DD estimate. This negative weighting only arises when treatment effects vary over time, in which case it typically biases regression DD estimates away from the sign of the true treatment effect. This does not imply a failure of the underlying design, but it does caution against the use of a single-coefficient two-way fixed effects specification to summarize time-varying effects." (p. 2-3)

I had never thought about different weights for different groups and different DID comparisons in a setting like this, so I find that a fascinating point! It pays to think systematically about what weight different groups get, and Goodman-Bacon presents several methods for doing so. First, plotting the group-wise 2x2 DID results that sit under the hood of the single estimate against their weights. Second, doing an Oaxaca-Blinder-Kitagawa-style decomposition of how much of the difference between coefficients is due to the underlying 2x2 DID comparisons themselves versus their weights. "The source of instability matters because changes due to different weighting reflect changes in the estimand (not bias), while changes in the 2x2 DD’s suggest that covariates address confounding." (p. 4) To demonstrate the importance of the method, Goodman-Bacon replicates Stevenson and Wolfers' (2006) study of the effects of legal changes that make divorce easier on women's suicide rates. Stevenson and Wolfers found that making divorce easier (so that it is enough for one party to want a divorce) reduces women's suicide rate by 3 per 1 million women. Goodman-Bacon argues that the true effect is closer to -5 suicides per 1 million women.

The starting point of Goodman-Bacon's analysis is the simplest 2x2 DID model:

y_it = γ + γ_i TREAT_i + γ_t POST_t + β^(2x2) TREAT_i × POST_t + u_it (1)

And the adaptation to a two-way fixed effects regression when treatment occurs in different periods, so that there is not just one pre-period and one post-period, with D_it as the treatment dummy:

y_it = α_i + α_t + β^(DD) D_it + e_it (2)

With different treatment periods you cannot use (1), so people tend to use (2). Goodman-Bacon says that "Researchers clearly recognize that differences in when units received treatment contribute to identification, but have not been able to describe how these comparisons are made." and goes on to build up how it actually works. Imagine a balanced panel with T periods (t) and N units (i), each of which belongs either to an untreated group U, an early treatment group k that receives a binary treatment at t^*_k, or a late treatment group l that receives its treatment at t^*_l > t^*_k. Figure 1 (pasted above) plots this structure. He abandons the language of a "control group" here to make clear that with several time periods, treatment groups also serve as "controls", depending on when treatment occurs. Figure 2 plots the 2x2 comparisons for a case with three groups. Panels A and B show that with only one treatment group we are back on classic 2x2 ground, whereas panels C and D show that with only treatment groups and no untreated group, identification hinges on comparing the early treated (k) with the not yet treated (l). Goodman-Bacon says that: "My central result is that any two-way fixed effects DD estimator is a weighted average of well-understood 2x2 DD estimators, like those plotted in figure 2." (p. 6)
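The early-versus-late logic can be checked on the smallest possible example: one early unit k, one late unit l, three periods and no untreated group. This is my own toy setup with invented outcomes and zero untreated outcomes throughout, not Goodman-Bacon's data. The two-way fixed effects coefficient is computed by double demeaning, which in a balanced panel with a single regressor gives the same number as the regression with unit and period dummies.

```python
def twfe_coefficient(d, y):
    """TWFE DD coefficient in a balanced panel via double demeaning."""
    n, t = len(d), len(d[0])

    def double_demean(m):
        row = [sum(r) / t for r in m]
        col = [sum(m[i][j] for i in range(n)) / n for j in range(t)]
        grand = sum(row) / n
        return [[m[i][j] - row[i] - col[j] + grand for j in range(t)]
                for i in range(n)]

    d_tilde, y_tilde = double_demean(d), double_demean(y)
    num = sum(d_tilde[i][j] * y_tilde[i][j] for i in range(n) for j in range(t))
    den = sum(d_tilde[i][j] ** 2 for i in range(n) for j in range(t))
    return num / den


def dd_2x2(y, treated, control, pre, post):
    """Simple 2x2 DD: change for 'treated' minus change for 'control'."""
    return ((y[treated][post] - y[treated][pre])
            - (y[control][post] - y[control][pre]))


# Rows: early group k, late group l. Columns: periods 1, 2, 3.
d = [[0, 1, 1],
     [0, 0, 1]]

# Case 1: a constant effect of 2. Case 2: the early group's effect
# grows from 1 to 4, the late group's effect is 1 (all effects positive).
y_constant = [[0.0, 2.0, 2.0], [0.0, 0.0, 2.0]]
y_growing = [[0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]

beta_constant = twfe_coefficient(d, y_constant)
beta_growing = twfe_coefficient(d, y_growing)

# The two 2x2 comparisons hidden inside the growing-effects estimate.
dd_early_vs_late = dd_2x2(y_growing, treated=0, control=1, pre=0, post=1)
dd_late_vs_early = dd_2x2(y_growing, treated=1, control=0, pre=1, post=2)
print(beta_constant, beta_growing, dd_early_vs_late, dd_late_vs_early)
```

With a constant effect the coefficient recovers it. With a growing effect the second comparison, where the already-treated early unit serves as the control, subtracts the growth in that unit's effect, and the coefficient turns negative even though every individual effect is positive. In this symmetric toy design the two comparisons happen to get equal weight; in general the weights depend on group sizes and treatment timing.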

He derives this mathematically, and I will skip the more technical parts, including a DID Decomposition Theorem. (p. 7-8) It becomes more important to me when he returns to the more fundamental question of what parameter DID actually estimates, and under which assumptions. Using Callaway and Sant'Anna's (2018) definition of the "group-time average treatment effect", the ATT for group k at time t, he decomposes the DID coefficient:

β^(DD) = VWATT + VWCT − ΔATT

Here VWATT, which Goodman-Bacon calls the “variance-weighted average treatment effect on the treated”, is the parameter that a standard TWFE DID regression targets. The second term, “variance-weighted common trends” (VWCT), captures deviations from common trends in the setting with several time periods. And the last term, ΔATT, is how the ATT changes over time. ΔATT can thus be seen as a measure of the bias in the single coefficient β^(DD) that one hopes equals the ATT. "Note that this does not mean that the DD research design is invalid. In this case other specifications, such as an event-study model (Jacobson, LaLonde, and Sullivan 1993) or “stacked DD” (Abraham and Sun 2018, Deshpande and Li 2017, Fadlon and Nielsen 2015), or other estimators such as reweighting strategies (Callaway and Sant'Anna 2018, de Chaisemartin and D’Haultfœuille 2018b) may be more appropriate." (p. 12)

In the conclusions, Goodman-Bacon emphasizes that:

"My central result, the DD decomposition theorem, shows that a two-way fixed effects DD coefficient equals a weighted average of all possible simple 2x2 DD’s that compare one group that changes treatment status to another group that does not. Many ways in which the theoretical interpretation of regression DD differs from the canonical model stem from the fact that these simple components are weighted together based both on sample sizes and the variance of their treatment dummy. This defines the DD estimand, the variance-weighted average treatment effect on the treated (VWATT), and generalizes the identifying assumption on counterfactual outcomes to variance-weighted common trends (VWCT). Moreover, I show that because already-treated units act as controls in some 2x2 DD’s, the two-way fixed effects model requires an additional identifying assumption of time-invariant treatment effects.
The DD decomposition theorem also leads to several new tools for practitioners. Graphing the 2x2 DD’s against their weight displays all the identifying variation in any DD application, and summing weights across types of comparisons quantifies “how much” of a given estimate comes from different sources of variation. I use the DD decomposition theorem to propose a reweighted balance test that reflects this identifying variation, is easy to implement, has higher power than tests of joint balance across groups, and shows how large and in what direction any imbalance occurs. I suggest several simple methods to learn why estimates differ across alternative specifications. The weighted average representation leads to a Oaxaca-Blinder-Kitagawa-style decomposition that quantifies how much of the difference in estimates comes from changes in the 2x2 DD’s, the weights, or both. Plots of the components or the weights across specifications show clearly where differences come from and can help researchers understand why their estimates changes and whether or not it is a problem." (p. 29-30)

As someone who likes decomposition methods and describing data closely and in detail, I simply have to love this! Technical, yes, but also deeply intuitive: you should know what variation in the data you are exploiting when you run your estimates. And plot it!
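The decomposition idea can be sketched in a few lines of Python for the simplest possible case. This is my own toy illustration under stated assumptions: two equally sized timing groups, no never-treated group, noiseless data, and hypothetical adoption dates. The weight formula is the two-group special case as I read the theorem, so treat it as an assumption; the final check confirms numerically that it reproduces the TWFE coefficient in this design.

```python
import numpy as np

periods = np.arange(10)
t_early, t_late = 3, 7  # hypothetical adoption dates
d_early = (periods >= t_early).astype(float)
d_late = (periods >= t_late).astype(float)

# Common trend plus an effect that grows by 1 per period since adoption
y_early = 1.0 + 0.5 * periods + (periods - t_early + 1) * d_early
y_late = 5.0 + 0.5 * periods + (periods - t_late + 1) * d_late

def dd_2x2(y_treat, y_ctrl, pre, post):
    """Simple 2x2 DD: change for the switching group minus change for the control."""
    return ((y_treat[post].mean() - y_treat[pre].mean())
            - (y_ctrl[post].mean() - y_ctrl[pre].mean()))

pre = periods < t_early
mid = (periods >= t_early) & (periods < t_late)
post = periods >= t_late

# Early group switches, late group is a not-yet-treated control
dd_early_vs_late = dd_2x2(y_early, y_late, pre, mid)
# Late group switches, early group is an already-treated control
dd_late_vs_early = dd_2x2(y_late, y_early, mid, post)

# Two-group weights (assumed form): driven by each group's share of treated periods
share_early, share_late = d_early.mean(), d_late.mean()
raw_early = (share_early - share_late) * (1 - share_early)
raw_late = share_late * (share_early - share_late)
w_early = raw_early / (raw_early + raw_late)
w_late = raw_late / (raw_early + raw_late)
beta_decomp = w_early * dd_early_vs_late + w_late * dd_late_vs_early

# Direct TWFE on the stacked balanced panel, via double demeaning
y = np.vstack([y_early, y_late])
d = np.vstack([d_early, d_late])
def within(m):
    return m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True) + m.mean()
beta_twfe = float((within(d) * within(y)).sum() / (within(d) ** 2).sum())

print(f"2x2 early vs late: {dd_early_vs_late:.2f} (weight {w_early:.2f})")
print(f"2x2 late vs early: {dd_late_vs_early:.2f} (weight {w_late:.2f})")
print(f"weighted average = {beta_decomp:.2f}, TWFE = {beta_twfe:.2f}")
```

The second comparison is the problematic one: the already-treated early group serves as the control, and because its effect keeps growing, that 2x2 DD comes out negative even though every true effect is positive.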

The article by Clément de Chaisemartin (then at the University of California, Santa Barbara, now at Sciences Po in Paris) and Xavier D'Haultfoeuille (CREST-ENSAE in France) in the American Economic Review, 2020, is closely related to Goodman-Bacon's. Here too one can see it as an exploration of what actually drives the results in two-way fixed effects/DID designs. They roughly summarize their starting point in the quote below.

The article is mathematical and theoretical, and I will not go into the details, but the interesting point is that the estimator β_fe, which is standard in diff-in-diff, is a weighted sum of a number of diff-in-diff comparisons, and that the comparisons in one's model are not all weighted in the same way. The problem of negative weights, which can flip the sign of the coefficient, is central here, in a context with staggered treatment and potentially heterogeneous effects over time. CdC and XD'H propose a new estimator, which they call DID_M, that can handle these problems and that is implemented in two new Stata packages. These also make it possible to compute the weights in the regression (the package twowayfeweights, later also available for R). This is how they explain how their approach relates to other ongoing methodological studies within diff-in-diff:

"More recently, Borusyak and Jaravel (2017), Abraham and Sun (2018), Athey and Imbens (2018), Callaway and Sant’Anna (2018), and Goodman-Bacon (2018) study the special case of staggered adoption designs, where the treatment of a group is weakly increasing over time. Those papers derive some important results specific to that design that we do not consider here. Still, some of the results in those papers are related to ours, and we describe precisely those connections later in the paper. The most important dimension on which our paper differs from those is that our results apply to any two-way fixed effects regressions, not only to those with staggered adoption. In our survey of the AER papers estimating two-way fixed effects regressions, less than 10 percent have a staggered adoption design. This suggests that while staggered adoptions are an important research design, they may account for a relatively small minority of the applications where two-way fixed effects regressions have been used." (p. 2966)

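The negative-weights problem can be illustrated with a small Python sketch. It is my own toy example and does not call the twowayfeweights package: I assume a balanced panel with equally sized cells and two hypothetical groups adopting in periods 3 and 7. As I understand the result, each treated cell's weight is proportional to the residual from regressing the treatment dummy on group and period fixed effects, which in a balanced panel is just the double-demeaned dummy.

```python
import numpy as np

periods = np.arange(10)
starts = np.array([3, 7])  # hypothetical adoption dates (early, late)
d = (periods[None, :] >= starts[:, None]).astype(float)

# Residual of the treatment dummy after removing group and period fixed effects
eps = (d - d.mean(axis=1, keepdims=True)
         - d.mean(axis=0, keepdims=True) + d.mean())

# Weights on treated cells only, normalized to sum to one
treated = d == 1
weights = np.where(treated, eps, 0.0) / eps[treated].sum()
n_negative = int((weights[treated] < 0).sum())

# Cell-level effects that grow by 1 per period since adoption (made up)
effects = (periods[None, :] - starts[:, None] + 1) * d
beta_from_weights = float((weights * effects).sum())

print("weights, early group:", np.round(weights[0], 3))
print("weights, late group: ", np.round(weights[1], 3))
print(f"{n_negative} of {int(treated.sum())} treated cells have negative weight")
print(f"weighted sum of cell effects = {beta_from_weights:.2f}")
```

The early group's last three periods get negative weight: they are exactly the cells where the early group has been treated longest and therefore has its largest effects in this setup.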
Liyang Sun (then MIT, now UCL) and Sarah Abraham (Cornerstone Research) discuss, in their article "Estimating dynamic treatment effects in event studies with heterogeneous treatment effects", published in the Journal of Econometrics in 2021, the use of leads and lags to explore effects over time. This is their abstract:

"To estimate the dynamic effects of an absorbing treatment, researchers often use two-way fixed effects regressions that include leads and lags of the treatment. We show that in settings with variation in treatment timing across units, the coefficient on a given lead or lag can be contaminated by effects from other periods, and apparent pretrends can arise solely from treatment effects heterogeneity. We propose an alternative estimator that is free of contamination, and illustrate the relative shortcomings of two-way fixed effects regressions with leads and lags through an empirical application."

And this is how they summarize in the introduction what the article does:

Here too, then, the issue is staggered treatment and how one should compare treated, never-treated, and not-yet-treated groups. If Goodman-Bacon's article is very diagnostic (how should one identify the problems in one's model and measure them?), Sun and Abraham's is more forward-looking: here is an estimator that solves the identified problems with heterogeneous effects and problematic comparisons.
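The logic of estimating effects cohort by cohort and only then averaging can be sketched as follows. This is a stripped-down Python illustration of the idea, not Sun and Abraham's estimator as they implement it. I assume noiseless data, two hypothetical cohorts of equal size, a never-treated comparison group, and the period just before adoption as the reference period.

```python
import numpy as np

periods = np.arange(10)
cohort_start = {"early": 3, "late": 6}   # hypothetical adoption dates
slope = {"early": 1.0, "late": 3.0}      # effects differ across cohorts

trend = 0.5 * periods                    # common trend
y_never = 2.0 + trend                    # never-treated comparison group
y = {}
for name, start in cohort_start.items():
    exposure = np.maximum(periods - start + 1, 0)  # periods since adoption
    y[name] = 1.0 + trend + slope[name] * exposure

def cohort_effect(name, rel):
    """Effect for one cohort at relative period `rel`, using the never-treated
    group as control and the period just before adoption as the baseline."""
    start = cohort_start[name]
    base, t = start - 1, start + rel
    return (y[name][t] - y[name][base]) - (y_never[t] - y_never[base])

# Average the cohort-specific effects with cohort shares as weights
# (cohorts are assumed equally sized here, so a simple mean)
event_study = {
    rel: float(np.mean([cohort_effect(c, rel) for c in cohort_start]))
    for rel in range(-2, 4)
}

for rel, est in event_study.items():
    print(f"relative period {rel:+d}: {est:.2f}")
```

Because every comparison uses a never-treated control, the leads come out as exactly zero here, and each lag is a proper average of the two cohorts' effects at that relative period.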

Kirill Borusyak, Xavier Jaravel and Jann Spiess continue, in their article "Revisiting Event-Study Designs: Robust and Efficient Estimation" in the Review of Economic Studies 2024, with the problem of diff-in-diff with staggered treatment and heterogeneous causal effects.

references

Alberto Abadie (2005) "Semiparametric Difference-in-Differences Estimators", Review of Economic Studies.

Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon and Pedro H. C. Sant’Anna (2025) "Difference-in-Differences Designs: A Practitioner’s Guide", arxiv.org working paper, June 2025.

Marianne Bertrand, Esther Duflo and Sendhil Mullainathan (2004) "How much should we trust differences-in-differences estimates?", Quarterly Journal of Economics.

Kirill Borusyak, Xavier Jaravel and Jann Spiess (2024) "Revisiting Event-Study Designs: Robust and Efficient Estimation", Review of Economic Studies. -- over 4,000 citations on Google Scholar.

Clément de Chaisemartin and Xavier D'Haultfoeuille (2019) "Two-way Fixed Effects Estimators with Heterogeneous Treatment Effects", NBER Working Paper No. 25904. -- published 2020 in American Economic Review. -- about 7,000 citations on Google Scholar.

Andrew Goodman-Bacon (2018) "Difference-in-differences with variation in treatment timing", NBER Working Paper. -- published 2021 in Journal of Econometrics. -- over 11,000 citations on Google Scholar.

Guido M. Imbens and Jeffrey M. Wooldridge (2008) "Recent Developments in the Econometrics of Program Evaluation", NBER Working Paper No. 14251. -- published 2009 in Journal of Economic Literature.

Jonathan Roth, Pedro HC Sant'Anna, Alyssa Bilinski and John Poe (2023) "What's trending in difference-in-differences? A synthesis of the recent econometrics literature", Journal of Econometrics, 235: 2218-2244.

Liyang Sun and Sarah Abraham (2021) "Estimating dynamic treatment effects in event studies with heterogeneous treatment effects", Journal of Econometrics 225: 171-199. -- about 7,400 citations on Google Scholar.

footnotes

[1] Their footnote here reads: "See Besley and Case [2000]. Another prominent concern has been whether DD estimation ever isolates a specific behavioral parameter. See Heckman [2000] and Blundell and MaCurdy [1999]. Abadie [2000] discusses how well the comparison groups used in nonexperimental studies approximate appropriate control groups. Athey and Imbens [2002] critique the linearity assumptions used in DD estimation and provide a general estimator that does not require such assumptions." (p. 250)

[2] Bertrand et al also make an interesting remark about the method here: "Two additional points are worth noting. First, 80 of the original 92 DD papers have a potential problem with grouped error terms as the unit of observation is more detailed than the level of variation (a point discussed by Donald and Lang [2001]). Only 36 of these papers address this problem, either by clustering standard errors or by aggregating the data. Second, several techniques are used (more or less informally) for dealing with the possible endogeneity of the intervention variable. For example, three papers include a lagged dependent variable in equation (1), seven include a time trend specific to the treated states, fifteen plot some graphs to examine the dynamics of the treatment effect, three examine whether there is an “effect” before the law, two test whether the effect is persistent, and eleven formally attempt to do triple-differences (DDD) by finding another control group. In Bertrand, Duflo, and Mullainathan [2002] we show that most of these techniques do not alleviate the serial correlation issues." (p. 254)

[3] "The most interesting literature in this area views the interactions not as a nuisance but as the primary object of interest. This literature, which includes models of social interactions and peer effects, has been growing rapidly in the last decade, following the early work by Manski (1993). See Manski (2000) and Brock and Durlauf (2000) for recent surveys. Empirical work includes Kling, Liebman and Katz (2007), who look at the effect of households moving to neighborhoods with higher average socio-economic status; Sacerdote (2001), who studies the effect of college roommate behavior on a student’s grades; Glaeser, Sacerdote and Scheinkman (1996), who study social interactions in criminal behavior; Case and Katz (1991), who look at neighbourhood effects on disadvantaged youths, Graham (2006), who infer interactions from the effect of class size on the variation in grades; and Angrist and Lang (2004), who study the effect of desegregation programs on students’ grades. Many identification and inferential questions remain unanswered in this literature." (p. 10)

[4] "In general the quantile of the difference, τ˜_q, differs from the difference in the quantiles, τ_q, unless there is perfect rank correlation between the potential outcomes Yi(0) and Yi(1) (the leading case of this is the constant additive treatment effect). The quantiles of the treatment effect, τ˜_q, have received much less attention than the quantile treatment effects, τ_q. The main reason is that the τ˜_q are generally not identified without assumptions on the rank correlation between the potential outcomes, even with data from a randomized experiment." (p. 13)

[5] They also discuss efficiency bounds here. Which estimator of the ATE has the lowest variance, that is, is the most efficient?

[6] They explain that: "Formulating exogeneity in this way is attractive compared to conventional residual-based definitions, as it does not require the researcher to specify a regression function in order to define the residuals. This assumption captures two properties of the instrument. First, it captures random assignment of the instrument so that causal effects of the instrument on the outcome and treatment received can be estimated consistently. This part of the assumption, which is implied by explicitly randomization of the instrument, as for example in the seminal draft lottery study by Angrist (1990), is not sufficient for causal interpretations of instrumental variables methods. The second part of the assumption captures an exclusion restriction that there is no direct effect of the instrument on the outcome. This second part is captured by the absence of z in the definition of the potential outcome Yi(w). This part of the assumption is not implied by randomization of the instrument and it has to be argued on a case by case basis. See Angrist, Imbens and Rubin (1996) for more discussion on the distinction between these two assumptions, and for a formulation that separates them." (p. 55)

[7] Goodman-Bacon has an interesting footnote here: "This often leads to sharp disagreements. See Neumark, Salas, and Wascher (2014) on unit-specific linear trends, Lee and Solon (2011) on weighting and outcome transformations, and Shore-Sheppard (2009) on age time fixed effects."

Thursday 16 April 2026

The identification revolution across different fields of economics

"Empirical microeconomics has experienced a credibility revolution, with a consequent increase in policy relevance and scientific impact. Sensitivity analysis played a role in this, but as we see it, the primary engine driving improvement has been a focus on the quality of empirical research designs."

Angrist and Pischke, "The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics" (2010)

Paul Goldsmith-Pinkham, an economist at Yale, has an interesting new survey paper in the NBER WP series. It is about the empirical methods that broke through in microeconomics in the 1990s and 2000s, and how they have spread across different fields of economics. He begins his paper like this:

"How far has the credibility revolution spread? Angrist and Pischke (2010) documented a sea change in how economists approach empirical work—a shift toward transparent research designs, explicit identification strategies, and credible causal inference. Currie, Kleven, and Zwiers (2020b) showed that this shift was accelerating through the late 2010s, at least in applied microeconomics. But that analysis left open a basic question: are finance, macroeconomics, and other fields keeping pace, or has the revolution been narrower than it appears?"

Janet Currie, Henrik Kleven and Esmée Zwiers (all three then at Princeton, Zwiers now in Amsterdam) examined two types of output: NBER Working Papers between 1980 and 2018, and articles in the top 5 economics journals between 2004 and 2019. They restricted their sample to microeconomics and found that this field increased its share of articles in the top journals over the period, perhaps driven by the field's expertise in the flashy new methods summarized under the heading "the identification revolution" or even "the credibility revolution" (Angrist and Pischke 2010): the share of applied micro in the top 5 rose from around 55-60 percent in the mid-2000s to 70-75 percent in 2013-2019. Figure 2 shows that the content of the microeconomic articles also changed over the period: the share mentioning "identification" rose from 4 percent to 50 percent, and more of them used experimental or quasi-experimental methods or administrative data. One aspect I find very interesting is that, as part of all this, the number of figures has increased relative to the number of tables; this is shown in panel D of the figure.

Figure 3 (not shown here) shows that the share of lab experiments also increased, as did discussion of external validity. Figure A.IV shows that the share of papers in applied micro that discuss (causal) mechanisms has risen from 20 percent to 60 percent in the NBER WP series, and to 70 percent in the top 5 articles. Figure 4 covers the use of four specific methods: diff-in-diff, regression discontinuity, event studies, and bunching. Their historiography here is interesting:

"Figure IV drills down on specific quasi-experimental methods: difference-in-differences, regression discontinuity, event studies, and bunching. These methods have all become more popular over time, in roughly the order named. The use of difference-in-differences was virtually non-existent until 1990 and then starts growing. The first papers that mention difference-in-differences estimators in our data are Ashenfelter and Card (1985) and Card and Sullivan (1988), which appeared as NBER working papers in 1984 and 1987, respectively. As far as we are aware, the very first paper to use a difference-in-differences approach is Ashenfelter (1978), although that paper did not use the difference-in-differences language. It is quite striking that, today, almost 25 percent of all NBER working papers in applied micro make references to difference-in-differences.
Regression discontinuity approaches start gaining popularity around 2000, following the early contributions by authors such as Angrist and Lavy (1999) and Hahn, Todd, and Van der Klaauw (2001), which were circulated as NBER working papers a couple of years prior.
Event studies and bunching approaches are more recent, having taken off during the last decade. Both of these approaches are closely linked to the increased use of administrative data sources, which are critical to the effective implementation of these data-demanding approaches. Over time, event studies have become almost synonymous with difference-in-differences: It is now rare to use difference-in-differences without showing an event study graph, and conversely it is rare to show event studies without a control group. As a result, the sharp rise in the use of event studies over the last ten years goes hand in hand with the increased slope of the difference-in-differences series during this time period. The modern bunching approach starts with Saez (2010), although the NBER working paper version of that paper appeared more than ten years prior."

The new methods have not replaced older causal methods such as instrumental variables or fixed effects: "The fact that old and new methods appear to be complements rather than substitutes suggests that another outgrowth of the credibility revolution is the rise of the “collage” approach to empirical work. Authors no longer hang their hats on a single method or dataset, but attempt to make a case based on a more multi-pronged approach." Figure 6 shows the spread of four phenomena of a rather mixed character: binscatter plots, which have become popular since they were used in Chetty et al (2011, “How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR”, QJE); pre-analysis plans; machine learning; and text analysis.

So much for Currie, Kleven and Zwiers. Goldsmith-Pinkham picks up where they left off, with a corresponding method but a much larger sample: 44,000 NBER WPs from 1982 to 2025 and 12,300 articles from eleven top journals in economics and finance [1] from 2011 to 2024. While Currie et al restricted their analysis to microeconomics, Goldsmith-Pinkham widens the focus to economics as a whole, precisely in order to study whether the methods that dominate applied micro have also become popular in, for example, macro and finance.

Goldsmith-Pinkham summarizes his results in three steps. One, finance and macro are still methodologically different from applied micro. Two, outside applied micro it is differences-in-differences that dominates the credibility revolution; here G-P remarks, a little tartly, that "This reliance on a single method is striking given the recent econometrics literature highlighting sensitivities in DiD designs (Roth 2022; De Chaisemartin and d’Haultfoeuille 2020; Callaway, Goodman-Bacon, and Sant’Anna 2024)." (p. 1) And three, there is a large gap between the methods discussed in the econometricians' own Journal of Econometrics (where nonparametric estimation, bootstrap methods and asymptotic theory dominate) and the methods that dominate among practitioners, namely diff-in-diff and identification strategies. PGP: "The tools powering the credibility revolution and the theoretical literature developing new estimators occupy largely separate methodological spaces."

The main results for the NBER WP series are shown in Figure 3, which I pasted in above: the frequency of discussion of "identification", the use of experimental and quasi-experimental methods, and the use of administrative data in working papers since 2000, broken down by three types of economics: applied micro, finance, and macro/other.

Figure 4 continues with a more detailed breakdown by method: differences-in-differences including event studies in panel A; synthetic control methods in panel B; Bartik and shift-share instruments in panel C; instrumental variables in panel D; experimental methods in panel E; and regression discontinuity in panel F. Diff-in-diff is the most common, at between 20 and 35 percent of all WPs today depending on field. Instrumental variables are strongest within one particular field, at around 30 percent in applied micro since 2010, but "only" around 15-20 percent in the other fields; diff-in-diff, by contrast, has grown very quickly, from "only" 10-15 percent in micro around 2010. Synthetic controls are much rarer, around 3 percent; Bartik/shift-share fall somewhere in between, as does RD at around 8-9 percent; and experiments are very common in micro (25 percent) but not so common in the other fields (around 10 percent).

The results for the journals are broadly similar, with a large increase over time, higher levels in micro than in other fields, and a very strong position for diff-in-diff designs. [2] The examination of the Journal of Econometrics gives more contrasting results:

"Most credibility revolution methods—DiD, event studies, RD, RCTs, administrative data, synthetic control, Bartik instruments, binscatter, and heterogeneous treatment effects—appear far less frequently in the Journal of Econometrics than in applied journals. DiD appears in approximately 19% of applied journal papers but under 4% of Journal of Econometrics papers; event studies show a similar gap. The exceptions are identification language and instrumental variables, where the Journal of Econometrics matches or exceeds applied journals—reflecting the theoretical literature on identification and IV estimation that is a core focus of the journal." (s. 16)

Goldsmith-Pinkham has examined which methods are discussed in JoE instead:

"Asymptotic theory and Monte Carlo simulation top the list—appearing in 86% and 65% of papers respectively—but these reflect the standard toolkit for deriving and validating estimators; applied papers rely on asymptotic theory implicitly even when they do not use the term. The more informative contrasts involve substantive methods: nonparametric estimation (58%), time series models (54%), structural/GMM/MLE methods (54%), and Bayesian methods all appear at far higher rates in the Journal of Econometrics than in applied journals. These are the estimation and inference techniques that form the theoretical infrastructure of econometrics—important in their own right, but distant from the day-to-day practice of most applied economists." (s. 17)

From this discussion of heterogeneity within economics, Goldsmith-Pinkham moves on to the question of whether there ought, so to speak, to be a convergence. That is not his argument, he says:

"Many questions in macroeconomics are fundamentally about general equilibrium, and the applied micro toolkit—built around partial equilibrium and local treatment effects—may not be the right tool for every setting. The same is true in asset pricing, where the object of interest is often an equilibrium price rather than a treatment effect. The more relevant distinction is between fields where quasi-experimental methods are feasible but underused—corporate finance, for example, has abundant natural experiments—and fields where the questions themselves call for different approaches. Nakamura and Steinsson (2018, “Identification in macroeconomics”, JEP) offer a thoughtful example of how credibility revolution thinking can be adapted to macroeconomic settings without simply importing the applied micro playbook."

In the concluding section, Goldsmith-Pinkham has an interesting reflection on whether the results are driven by his method. The method, after all, is to have a machine learning model read a large amount of text and count mentions of different methods and designs. Would the results have been different if one had instead looked, for example, at citations to classic papers from the identification revolution?

"One limitation of this analysis is that keyword mentions measure the diffusion of methodological language but not the quality of adoption or influence of methods. Validation against LLM classification (Appendix A) shows that keyword precision varies across categories—exceeding 90% for regression discontinuity and lab experiments, but falling below 50% for DiD and event studies, where many mentions reflect discussion rather than use as a primary research design. Cross-field comparisons should therefore be interpreted with caution for categories where precision is lowest, as some of the measured gap may reflect differences in vocabulary rather than uptake. A complementary approach would track citations to foundational credibility revolution papers—Angrist and Krueger ( “Does compulsory school attendance affect schooling and earnings?”, QJE, 1991), Angrist and Pischke (Mostly Harmless Econometrics: An Empiricist’s Companion, 2009), Imbens and Lemieux (“Regression discontinuity designs: A guide to practice”, J of Econometrics, 2008)—across fields. If finance and macro cite these works at comparable rates but describe methods differently, the measured gap would partly reflect writing conventions rather than substantive methodological differences. If citation rates also differ, this would reinforce the keyword evidence." (s. 21)

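The distinction between mentioning a method and using it can be made concrete with a small Python sketch. Everything in it is made up for illustration: the keyword patterns, the four toy "papers" and the hand labels are hypothetical and are not Goldsmith-Pinkham's actual dictionary or data. It only shows how a mention share and a keyword precision of the kind discussed in the quote could be computed.

```python
import re

# Hypothetical keyword patterns, one per method
KEYWORDS = {
    "did": re.compile(r"difference-in-differences?|diff-in-diff", re.IGNORECASE),
    "rd": re.compile(r"regression discontinuity", re.IGNORECASE),
}

# Toy corpus: "uses" is a made-up hand label for the methods a paper really uses
papers = [
    {"text": "We use a difference-in-differences design around the reform.",
     "uses": {"did"}},
    {"text": "Unlike difference-in-differences, our structural model fits "
             "the aggregate moments.",
     "uses": set()},
    {"text": "A regression discontinuity at the eligibility cutoff "
             "identifies the effect.",
     "uses": {"rd"}},
    {"text": "We calibrate a DSGE model to quarterly data.",
     "uses": set()},
]

def flagged(method):
    """Papers whose text matches the keyword pattern for `method`."""
    return [p for p in papers if KEYWORDS[method].search(p["text"])]

def mention_share(method):
    """Share of all papers that mention the method at all."""
    return len(flagged(method)) / len(papers)

def precision(method):
    """Among papers that mention the method, the share that actually use it."""
    hits = flagged(method)
    if not hits:
        return float("nan")
    return sum(method in p["uses"] for p in hits) / len(hits)

for m in KEYWORDS:
    print(f"{m}: mentioned in {mention_share(m):.0%} of papers, "
          f"precision {precision(m):.0%}")
```

In the toy corpus the second paper mentions diff-in-diff only to contrast it with its own approach, so the keyword count overstates actual use, which is the kind of gap the validation exercise is meant to catch.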
The Appendix contains a large number of additional results. I am particularly taken by the figure that captures "the graphical revolution", corresponding to Currie, Kleven and Zwiers's Figure 2D: that is, the ratio of figures to tables, here in NBER WPs. The figure below shows that in macro, in the early and mid-2000s, about 50-75 percent more figures than tables were used, and that surplus has now grown to about 200 percent! In micro the trend is actually less dramatic, from around 25 percent to around 125 percent, and in finance the trend is similar. The overall trend (black dashed line) shows that in the early 2000s there were about 1.5 figures per table, in the mid-2010s about 1.75 figures per table, and today about 2.4 figures per table.

references

Janet Currie, Henrik Kleven, and Esmée Zwiers (2020) "Technology and Big Data Are Changing Economics: Mining Text to Track Methods", American Economic Review Papers and Proceedings.

Paul Goldsmith-Pinkham (2026) "Tracking the Credibility Revolution across Fields", NBER Working Paper 35051, April 2026.

footnotes

[1] These journals: "three general-interest economics journals (AER, QJE, JPE), the four American Economic Journals (Applied, Policy, Macro, Micro), three top finance journals (Journal of Finance, Review of Financial Studies, Journal of Financial Economics), and the Journal of Econometrics." (p. 2)

[2] "AEJ Applied Economics and AEJ Economic Policy show the highest rates of credibility revolution methods—unsurprising given their explicit focus on applied empirical work. Among the general-interest journals, AER and QJE show higher rates than JPE, reflecting differences in paper composition. The finance journals show moderate adoption of DiD and identification language but lower rates of RD and experimental methods—echoing the NBER findings at the journal level." (s. 14-15)