3  Diff-in-Diff

3.1 Introduction

This week we will start talking about Causal Inference in R. You will have already seen the theoretical and mathematical parts with Brenda in the lecture. This script is supposed to initiate you to the practical part of Causal Inference.

This session will show you how to :

  1. Prepare data for Causal Inference
  2. Perform a simple Diff-in-Diff analysis in R
  3. Interpret the results of a Diff-in-Diff analysis in R
`modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
  backend. Learn more at: https://vincentarelbundock.github.io/tinytable/

Revert to `kableExtra` for one session:

  options(modelsummary_factory_default = 'kableExtra')
  options(modelsummary_factory_latex = 'kableExtra')
  options(modelsummary_factory_html = 'kableExtra')

Silence this message forever:

  config_modelsummary(startup_message = FALSE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

3.2 Diff-in-Diff

For the first example on Diff-in-Diff I am going to use the code written by the brilliant Rohan Alexander who has written one of the (if not the) best introduction to Data Analysis in R which you can find here. Furthermore, the example he uses is a paper written by Charles Angelucci and Julia Cagé who you probably know already.

set.seed(853)

simulated_diff_in_diff <-
  tibble(
    person = rep(c(1:1000), times = 2),
    time = c(rep(0, times = 1000), rep(1, times = 1000)),
    treat_group = rep(sample(x = 0:1, size = 1000, replace = TRUE ), times = 2)
    ) |>
  mutate(
    treat_group = as.factor(treat_group),
    time = as.factor(time)
  )

simulated_diff_in_diff <-
  simulated_diff_in_diff |>
  rowwise() |>
  mutate(
    serve_speed = case_when(
      time == 0 & treat_group == 0 ~ rnorm(n = 1, mean = 5, sd = 1),
      time == 1 & treat_group == 0 ~ rnorm(n = 1, mean = 6, sd = 1),
      time == 0 & treat_group == 1 ~ rnorm(n = 1, mean = 8, sd = 1),
      time == 1 & treat_group == 1 ~ rnorm(n = 1, mean = 14, sd = 1)
    )
  )

simulated_diff_in_diff
# A tibble: 2,000 × 4
# Rowwise: 
   person time  treat_group serve_speed
    <int> <fct> <fct>             <dbl>
 1      1 0     0                  4.43
 2      2 0     1                  6.96
 3      3 0     1                  7.77
 4      4 0     0                  5.31
 5      5 0     0                  4.09
 6      6 0     0                  4.85
 7      7 0     0                  6.43
 8      8 0     0                  5.77
 9      9 0     1                  6.13
10     10 0     1                  7.32
# ℹ 1,990 more rows
simulated_diff_in_diff |>
  ggplot(aes(x = time, y = serve_speed, color = treat_group)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(group = person), alpha = 0.1) +
  theme_minimal() +
  labs(x = "Time period", y = "Serve speed", color = "Person got a new racket") +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

The paper is called “Newspapers in Times of Low Advertising Revenues” (Angelucci and Cagé 2019) and follows a difference-in-difference analysis and it also comes with replication material which means that we can try to emulate their results.

In 1967, the French government introduced advertisements on French TV programs. This led to a decrease in advertising revenues for newspapers all over France. The idea is therefore to understand if and how the introduction of ads on TV affected the ad revenues for newspapers, both local and national ones, in France. They use a difference-in-difference approach to estimate the effect of the introduction of advertising on the advertising revenues of newspapers. Thus, the treatment is…? Exactly, the introduction of ads on TV. Angelucci and Cagé argue that national newspapers were more affected by this change than local newspapers. They have many more hypotheses which they test in this paper, but for now we will focus on this one. As mentioned above, their data is available and we can use it to replicate their results. You need to sign up, to get it.

newspapers <- read_dta("data/Angelucci_Cage_AEJMicro_dataset.dta")

Next we will have to do some minor data management. Fortunately enough, this replication material is already pretty clean and we do not have to mutate() or filter() our way around too much. The only thing we actually have to do, is to convert some variables into factors and to create a new variable which is the ratio of advertising revenue over circulation. Here I specify the argument across() again of the mutate() function and within it, I give a vector c() containing all the variables that should be transformed to factors.

Here is a brief explanation of the variables in the dataset:

  • year: The year of the observation
  • id_news: A unique identifier for each newspaper
  • local: A binary indicator (0 or 1) representing whether a newspaper is local or national (1 = local, 0 = national)
  • national: A binary indicator (0 or 1) representing whether a newspaper is national or local (1 = national, 0 = local)
  • ra_cst: The advertising revenue of the newspaper (in constant (2014) euros)
  • ps_cst: The price of subscriptions for the newspaper. This variable measures how much the newspaper charges for its subscriptions, reflecting its pricing strategy and revenue from readers
  • qtotal: The circulation of the newspaper. This variable measures the number of copies sold, reflecting the newspaper’s readership and revenue from readers
  • after_national: A binary indicator (0 or 1) representing the interaction between a newspaper’s national status and the post-television advertising era. Specifically, it marks the period after a newspaper has started national circulation and also corresponds to times after the introduction or significant increase of television advertising. This variable is designed to capture the combined effects of a newspaper reaching a national audience and the competitive or complementary impacts of television advertising on newspaper revenues or strategies
  • ra_cst_div_qtotal: A derived variable representing the advertising revenue per unit of circulation (ra_cst / qtotal). This variable is calculated to assess the efficiency or effectiveness of advertising revenue in relation to the newspaper’s circulation size
newspapers <-
  newspapers |>
  select(
    year, id_news, after_national, local, national, ra_cst, ps_cst, qtotal
    ) |> 
  mutate(ra_cst_div_qtotal = ra_cst / qtotal, 
         after_national =  if_else(year >= 1967, 1, 0),
         after_national_placebo = if_else(year >= 1966, 1, 0),
         across(c(id_news, local, national, after_national), as.factor),
         year = as.factor(year)) |> 
  rename()

newspapers
# A tibble: 1,196 × 10
   year  id_news after_national local national    ra_cst ps_cst  qtotal
   <fct> <fct>   <fct>          <fct> <fct>        <dbl>  <dbl>   <dbl>
 1 1960  1       0              1     0         52890272   2.29  94478.
 2 1961  1       0              1     0         56601060   2.20  96289.
 3 1962  1       0              1     0         64840752   2.13  97313.
 4 1963  1       0              1     0         70582944   2.43 101068.
 5 1964  1       0              1     0         74977888   2.35 102103.
 6 1965  1       0              1     0         74438248   2.29 105169.
 7 1966  1       0              1     0         81383000   2.31 126235.
 8 1967  1       1              1     0         80263152   2.88 128667.
 9 1968  1       1              1     0         87165704   3.45 131824.
10 1969  1       1              1     0        102596384   3.28 132417.
# ℹ 1,186 more rows
# ℹ 2 more variables: ra_cst_div_qtotal <dbl>, after_national_placebo <dbl>

3.2.1 Inspecting your data

One of the first things you can do, is to plot your data points and see if something stands out. This might serve as a first indicator of anything that concerns the parallel trends assumption for example. The code below plots the development of advertising revenue for French newspapers in a given year. The panels are divided into local newspaper or national newspaper. Remember that our control group are local newspapers and the treatment group are the national ones. Essentially, we would expect that before the intervention, both groups should have parallel trajectories in their outcomes. This is the parallel trends assumption. If the parallel trends assumption holds, the difference between the treatment and control group should be constant over time. If the parallel trends assumption is violated, the DiD estimator will be biased.

newspapers |>
  mutate(type = if_else(local == 1, "Local", "National")) |>
  ggplot(aes(x = year, y = ra_cst)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    labels = dollar_format(
      prefix = "$",
      suffix = "M",
      scale = 0.000001)) +
  labs(x = "Year", y = "Advertising revenue") +
  facet_wrap(vars(type), nrow = 2) +
  theme_minimal() +
  geom_vline(xintercept = 1966.5, linetype = "dashed")
Warning: Removed 144 rows containing missing values or values outside the scale range
(`geom_point()`).

In the top panel/the local newspapers, the revenue data points are quite dense and cluster at regular intervals, suggesting that local newspapers had relatively stable advertising revenues year-to-year. There’s no obvious trend or shift in revenue around the dashed vertical line, which likely represents the year 1967, the year when television advertising was introduced in France.

The distribution of data points in the lower panel showing the observations of national newspapers is less dense compared to local newspapers, which could suggest greater variability in the advertising revenues of national newspapers. There appears to be a change around the dashed vertical line at 1967. After this year, there seems to be a wider spread of data points, including some years with significantly lower advertising revenues compared to previous years. This could indicate that national newspapers were more impacted by the introduction of television advertising. Overall, the graph suggests that the introduction of television advertising in 1967 may have had a differential impact on local and national newspapers, with national newspapers possibly experiencing greater negative effects on their advertising revenue.

However, just because some face validity seems to indicate that there is a difference in the treatment and control group, does not mean that it actually is in the data. We need to test this with a model that we will construct in the next section.

3.2.2 Building the DiD model

This here is the regression formula for the DiD analysis, including the interaction term in which we specify the treatment. The treatment is the introduction of ads on TV and the interaction term is the product of the treatment and the national status of the newspaper

\[ \ln(\mathrm{ra\_cst}) = \beta_0 + \beta_1 \mathrm{national} + \beta_2 \mathrm{after\_national} + \beta_3(\mathrm{national} \times \mathrm{after\_national}) + \beta_4 \mathrm{year} + \alpha_i + \epsilon \]

  • ln(ra_cst): The natural logarithm of advertising revenue for a newspaper. Log transformation is often used in economic data to help normalize the distribution of skewed variables and to interpret the coefficients in terms of percentage changes

  • national: This is a dummy variable indicating whether a newspaper is national (1) or local (0). This variable distinguishes the treatment group from the control group.

  • afternational: A dummy variable indicating the time period after the introduction of television advertising in 1967 (1 for years 1967 and later, 0 for earlier years). It captures the before-and-after comparison.

  • national*after_national: The interaction term between national and after_national. This term is crucial for DiD analysis as it estimates the differential effect of the introduction of television advertising on national newspapers compared to local newspapers over time

  • year: A continuous variable representing the year of observation. Including this allows controlling for linear time trends that affect all newspapers

  • alpha: Newspaper fixed effects that control for all unobserved, time-invariant differences between newspapers

  • epsilon: The error term

This is the code to specify exactly this. Note that in R national * after_national includes three variables due to the * operator. It specifies the main effect of national, the main effect of after_national and the interaction effect national:after_national, representing the differential impact of the post-television advertising period specifically on national newspapers relative to local ones. If we had only put in national:after_national, we would have only included the interaction term and would have had to add the main effects manually. This means that the formula log(ra_cst) ~ national*after_national + ... is shorthand for log(ra_cst) ~ national + after_national + national:after_national + ....

newspaper_did_model <-
  lm(log(ra_cst) ~ national*after_national + year + id_news,
     data = newspapers)

You can see that this is as straightforward as what we have already done in Session 1. It is a simple lm() and an interaction effect; nothing more…

3.2.3 Interpreting the DiD model results

Here, I am showing you a different way of displaying models in R/Quarto. I am making use of the fact that my quartobook is rendered to html. I am using the modelsummary package to display the results of the model in a nice table. Similar to the stargazer package, you can change almost every aspect of the table. I am using the coef_omit argument to exclude the fixed effects of the individual newspapers from the table. This is because we are not interested in them and it would have given us a lot of coefficients. I am also using the coef_map argument to rename the coefficients in the table. This package is maybe a bit more advanced but also more versatile.

model_coefs <- c(
  `national1` = "National Newspapers",
  `after_national1` = "Period After TV Ads",
  `national1:after_national1` = "Interaction Effect (National * After TV Ads)",
  `year` = "Year"
)

modelsummary(newspaper_did_model,
             title = "Difference-in-Differences Model Summary",
             # omit the fixed effects using a regular expression
             stars = TRUE
             )
Difference-in-Differences Model Summary
(1)
+ p
(Intercept) 17.862***
(0.052)
national1 -1.054***
(0.077)
after_national1 0.652***
(0.034)
year1961 0.103**
(0.035)
year1962 0.173***
(0.034)
year1963 0.224***
(0.034)
year1964 0.303***
(0.034)
year1965 0.315***
(0.033)
year1966 0.377***
(0.033)
year1967 -0.215***
(0.030)
year1968 -0.230***
(0.030)
year1969 -0.121***
(0.030)
year1970 -0.113***
(0.030)
year1971 -0.086**
(0.030)
year1972 -0.004
(0.030)
year1973 0.045
(0.030)
id_news3 -1.672***
(0.065)
id_news6 -1.506***
(0.073)
id_news7 -0.895***
(0.065)
id_news13 -1.018***
(0.067)
id_news16 -1.669***
(0.065)
id_news25 -4.670***
(0.114)
id_news28 -0.719***
(0.067)
id_news34 -0.455***
(0.065)
id_news38 0.973***
(0.069)
id_news44 0.727***
(0.073)
id_news48 -0.724***
(0.076)
id_news51 0.721***
(0.065)
id_news53 -3.416***
(0.065)
id_news54 -4.977***
(0.082)
id_news57 -2.305***
(0.065)
id_news60 -1.788***
(0.071)
id_news62 -2.186***
(0.065)
id_news66 -1.645***
(0.073)
id_news67 -1.037***
(0.079)
id_news70 -1.762***
(0.065)
id_news71 0.687***
(0.065)
id_news72 -3.010***
(0.065)
id_news80 -1.433***
(0.065)
id_news82 -1.423***
(0.065)
id_news88 -0.989***
(0.065)
id_news95 -2.132***
(0.068)
id_news97 -1.731***
(0.065)
id_news98 -1.474***
(0.065)
id_news103 -1.591***
(0.069)
id_news105 -1.808***
(0.065)
id_news106 -2.095***
(0.065)
id_news118 -1.249***
(0.065)
id_news119 -0.577***
(0.065)
id_news127 -0.095
(0.065)
id_news136 0.303***
(0.067)
id_news138 0.443***
(0.073)
id_news148 0.611***
(0.065)
id_news151 -0.116
(0.079)
id_news153 -2.574***
(0.065)
id_news154 -0.127+
(0.065)
id_news157 -2.118***
(0.065)
id_news158 -4.073***
(0.073)
id_news161 -2.200***
(0.065)
id_news163 0.632***
(0.067)
id_news167 1.144***
(0.065)
id_news169 0.306***
(0.065)
id_news179 -2.528***
(0.065)
id_news184 -1.247***
(0.065)
id_news185 -2.181***
(0.065)
id_news187 -0.615***
(0.067)
id_news196 0.645***
(0.065)
id_news206 0.463***
(0.065)
id_news210 -0.692***
(0.067)
id_news212 -1.913***
(0.073)
id_news213 -0.867***
(0.065)
id_news224 0.841***
(0.065)
id_news225 -0.042
(0.073)
id_news234 -0.691***
(0.069)
id_news236 0.206**
(0.065)
id_news245 1.285***
(0.065)
id_news247 -1.248***
(0.065)
id_news310 -2.983***
(0.079)
id_news452 -5.146***
(0.076)
id_news467 -1.267***
(0.067)
id_news469 0.893***
(0.067)
id_news480 -1.681***
(0.069)
id_news20040 2.023***
(0.077)
id_news20345 -0.997***
(0.083)
id_news20346 3.350***
(0.076)
id_news20347 1.269***
(0.076)
id_news20352 -1.182***
(0.077)
id_news20354 0.706***
(0.077)
id_news21006 1.195***
(0.083)
id_news21025 2.126***
(0.076)
id_news21173 2.437***
(0.076)
id_news21176 3.216***
(0.076)
id_news33718 0.454***
(0.076)
national1 × after_national1 -0.231***
(0.031)
Num.Obs. 1052
R2 0.986
R2 Adj. 0.984
AIC 36154.6
BIC 36625.7
Log.Lik. 364.840
RMSE 0.17

This table reports the results from a difference-in-differences (DiD) regression model, which examines the impact of television advertising on the advertising revenues of national newspapers compared to local newspapers over time. The intercept we can disregard in this case. As a reminder, it is when all IVs are set to 0. It does not really make sense in our case, especially due to the year variable. I did drop the coefficients of all the fixed effects for the individual newspapers (which were in the factor variable of id_news) because we are not very interested in them and it would have given us an individual coefficient for each newspaper which would have been a lot of displayed coefficients. I used a regular expression (regex) for that. For now, you do not have to know what this is, but it is a very powerful tool to manipulate character strings. At some point, I will add this to the text-as-data tutorial or to the already existing webscraping script.

  • National Newspapers: The coefficient for national newspapers is -1.039 and is highly statistically significant (p < 0.001). This suggests that, holding other factors constant, the log of advertising revenue for national newspapers is, on average, 1.039 units lower than for local newspapers. This could imply that national newspapers, on their own, tend to have lower advertising revenue in this model or that they were already at a disadvantage before the treatment period.

  • Period After TV Ads: The coefficient for the period after the introduction of television advertisements is very small and not statistically significant (-0.001, p > 0.1). This indicates that the introduction of television advertising, when not considering whether a newspaper is national or local, does not have a significant overall effect on advertising revenues.

  • Year: The coefficient for the year variable is positive and significant (0.046, p < 0.001), indicating that there is a general positive trend in advertising revenue over time across all newspapers in the sample.

  • Interaction Effect (National * After TV Ads): This is the center piece of our model since we try to estimate the causal effect through this interaction term. Fortunately for us, it is significant and negative (-0.228, p < 0.001). In DiD analyses, this interaction term captures the differential impact of the treatment effect (here, the introduction of television ads) on the treated group (national newspapers). The negative sign suggests that after television advertising started, national newspapers experienced a significant decrease in advertising revenue compared to local newspapers. This term essentially captures the essence of the DiD strategy by showing the relative effect post-treatment for the group of interest!

Let’s test a placebo at \(t-1\). The parallel trends assumption is the main assumption. We can never really test whether this assumption would be met in the absence of a treatment but we can approximate this by seeing whether this assumption holds true where we can measure it, i.e. before the treatment. That’s why I have created a placebo dummy variable after_national_placebo earlier, which takes on the value 1 in 1966 and not in 1967. If our parallel trends assumption is correct, we should not get a significant effect for this model below:

newspaper_did_model_placebo <-
  lm(log(ra_cst) ~ national * after_national_placebo + year + id_news,
     data = newspapers)

Let’s look at the regression table:

modelsummary(newspaper_did_model_placebo,
             title = "Difference-in-Differences Model Summary",
             # omit the fixed effects using a regular expression
             coef_omit = "id_news\\d+",
             stars = TRUE
             )
Difference-in-Differences Model Summary
(1)
+ p
(Intercept) 17.859***
(0.053)
national1 -1.031***
(0.078)
after_national_placebo 0.653***
(0.034)
year1961 0.103**
(0.035)
year1962 0.173***
(0.035)
year1963 0.224***
(0.034)
year1964 0.302***
(0.034)
year1965 0.316***
(0.033)
year1966 -0.240***
(0.030)
year1967 -0.215***
(0.030)
year1968 -0.231***
(0.030)
year1969 -0.122***
(0.030)
year1970 -0.114***
(0.030)
year1971 -0.086**
(0.030)
year1972 -0.005
(0.030)
year1973 0.045
(0.030)
national1 × after_national_placebo -0.229***
(0.032)
Num.Obs. 1052
R2 0.986
R2 Adj. 0.984
AIC 36158.5
BIC 36629.5
Log.Lik. 362.914
RMSE 0.17

These are odd findings as we find a similar coefficient to the one in the first model…

So what do we conclude now? Concerning our Difference-in-Difference causal design, the model supports the hypothesis that the introduction of television advertising had a different negative impact on national newspapers’ advertising revenues compared to local newspapers! The DiD model effectively isolates this effect by comparing the changes in revenues over time between the two groups (of course assuming parallel trends before the treatment).

3.3 References