4  Diff-in-Diff

4.1 Introduction

This week we will start talking about Causal Inference in R. You will have already seen the theoretical and mathematical parts with Brenda in the lecture. This script is supposed to initiate you to the practical part of Causal Inference.

This session will show you how to :

  1. Prepare data for Causal Inference
  2. Perform a simple Diff-in-Diff analysis in R
  3. Interpret the results of a Diff-in-Diff analysis in R
Version 2.0.0 of `modelsummary`, to be released soon, will introduce a
  breaking change: The default table-drawing package will be `tinytable`
  instead of `kableExtra`. All currently supported table-drawing packages
  will continue to be supported for the foreseeable future, including
  `kableExtra`, `gt`, `huxtable`, `flextable, and `DT`.
  
  You can always call the `config_modelsummary()` function to change the
  default table-drawing package in persistent fashion. To try `tinytable`
  now:
  
  config_modelsummary(factory_default = 'tinytable')
  
  To set the default back to `kableExtra`:
  
  config_modelsummary(factory_default = 'kableExtra')
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

4.2 Diff-in-Diff

For the first example on Diff-in-Diff I am going to use the code written by the brilliant Rohan Alexander who has written one of the (if not the) best introduction to Data Analysis in R which you can find here. Furthermore, the example he uses is a paper written by Charles Angelucci and Julia Cagé who you probably know already.

The paper is called “Newspapers in Times of Low Advertising Revenues” (Angelucci and Cagé 2019) and follows a difference-in-difference analysis and it also comes with replication material which means that we can try to emulate their results.

In 1967, the French government introduced advertisements on French TV programs. This led to a decrease in advertising revenues for newspapers all over France. The idea is therefore to understand if and how the introduction of ads on TV affected the ad revenues for newspapers, both local and national ones, in France. They use a difference-in-difference approach to estimate the effect of the introduction of advertising on the advertising revenues of newspapers. Thus, the treatment is…? Exactly, the introduction of ads on TV. Angelucci and Cagé argue that national newspapers were more affected by this change than local newspapers. They have many more hypotheses which they test in this paper, but for now we will focus on this one. As mentioned above, their data is available and we can use it to replicate their results. You need to sign up, to get it.

newspapers <- read_dta("data/Angelucci_Cage_AEJMicro_dataset.dta")

Next we will have to do some minor data management. Fortunately enough, this replication material is already pretty clean and we do not have to mutate() or filter() our way around too much. The only thing we actually have to do, is to convert some variables into factors and to create a new variable which is the ratio of advertising revenue over circulation. Here I specify the argument across() again of the mutate() function and within it, I give a vector c() containing all the variables that should be transformed to factors.

Here is a brief explanation of the variables in the dataset:

  • year: The year of the observation
  • id_news: A unique identifier for each newspaper
  • local: A binary indicator (0 or 1) representing whether a newspaper is local or national (1 = local, 0 = national)
  • national: A binary indicator (0 or 1) representing whether a newspaper is national or local (1 = national, 0 = local)
  • ra_cst: The advertising revenue of the newspaper (in constant (2014) euros)
  • ps_cst: The price of subscriptions for the newspaper. This variable measures how much the newspaper charges for its subscriptions, reflecting its pricing strategy and revenue from readers
  • qtotal: The circulation of the newspaper. This variable measures the number of copies sold, reflecting the newspaper’s readership and revenue from readers
  • after_national: A binary indicator (0 or 1) representing the interaction between a newspaper’s national status and the post-television advertising era. Specifically, it marks the period after a newspaper has started national circulation and also corresponds to times after the introduction or significant increase of television advertising. This variable is designed to capture the combined effects of a newspaper reaching a national audience and the competitive or complementary impacts of television advertising on newspaper revenues or strategies
  • ra_cst_div_qtotal: A derived variable representing the advertising revenue per unit of circulation (ra_cst / qtotal). This variable is calculated to assess the efficiency or effectiveness of advertising revenue in relation to the newspaper’s circulation size
newspapers <-
  newspapers |>
  select(
    year, id_news, after_national, local, national, ra_cst, ps_cst, qtotal
    ) |> 
  mutate(ra_cst_div_qtotal = ra_cst / qtotal, 
         after_national =  if_else(year >= 1967, 1, 0),
         across(c(id_news, local, national, after_national), as.factor),
         year = as.integer(year)) |> 
  rename()

newspapers
# A tibble: 1,196 × 9
    year id_news after_national local national    ra_cst ps_cst  qtotal
   <int> <fct>   <fct>          <fct> <fct>        <dbl>  <dbl>   <dbl>
 1  1960 1       0              1     0         52890272   2.29  94478.
 2  1961 1       0              1     0         56601060   2.20  96289.
 3  1962 1       0              1     0         64840752   2.13  97313.
 4  1963 1       0              1     0         70582944   2.43 101068.
 5  1964 1       0              1     0         74977888   2.35 102103.
 6  1965 1       0              1     0         74438248   2.29 105169.
 7  1966 1       0              1     0         81383000   2.31 126235.
 8  1967 1       1              1     0         80263152   2.88 128667.
 9  1968 1       1              1     0         87165704   3.45 131824.
10  1969 1       1              1     0        102596384   3.28 132417.
# ℹ 1,186 more rows
# ℹ 1 more variable: ra_cst_div_qtotal <dbl>

4.2.1 Inspecting your data

One of the first things you can do, is to plot your data points and see if something stands out. This might serve as a first indicator of anything that concerns the parallel trends assumption for example. The code below plots the development of advertising revenue for French newspapers in a given year. The panels are divided into local newspaper or national newspaper. Remember that our control group are local newspapers and the treatment group are the national ones. Essentially, we would expect that before the intervention, both groups should have parallel trajectories in their outcomes. This is the parallel trends assumption. If the parallel trends assumption holds, the difference between the treatment and control group should be constant over time. If the parallel trends assumption is violated, the DiD estimator will be biased.

newspapers |>
  mutate(type = if_else(local == 1, "Local", "National")) |>
  ggplot(aes(x = year, y = ra_cst)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    labels = dollar_format(
      prefix = "$",
      suffix = "M",
      scale = 0.000001)) +
  labs(x = "Year", y = "Advertising revenue") +
  facet_wrap(vars(type), nrow = 2) +
  theme_minimal() +
  geom_vline(xintercept = 1966.5, linetype = "dashed")
Warning: Removed 144 rows containing missing values or values outside the scale range
(`geom_point()`).

In the top panel/the local newspapers, the revenue data points are quite dense and cluster at regular intervals, suggesting that local newspapers had relatively stable advertising revenues year-to-year. There’s no obvious trend or shift in revenue around the dashed vertical line, which likely represents the year 1967, the year when television advertising was introduced in France.

The distribution of data points in the lower panel showing the observations of national newspapers is less dense compared to local newspapers, which could suggest greater variability in the advertising revenues of national newspapers. There appears to be a change around the dashed vertical line at 1967. 1 After this year, there seems to be a wider spread of data points, including some years with significantly lower advertising revenues compared to previous years. This could indicate that national newspapers were more impacted by the introduction of television advertising. Overall, the graph suggests that the introduction of television advertising in 1967 may have had a differential impact on local and national newspapers, with national newspapers possibly experiencing greater negative effects on their advertising revenue.

However, just because some face validity seems to indicate that there is a difference in the treatment and control group, does not mean that it actually is in the data. We need to test this with a model that we will construct in the next section.

4.2.2 Building the DiD model

This here is the regression formula for the DiD analysis, including the interaction term in which we specify the treatment. The treatment is the introduction of ads on TV and the interaction term is the product of the treatment and the national status of the newspaper

\[ \ln(\mathrm{ra\_cst}) = \beta_0 + \beta_1 \mathrm{national} + \beta_2 \mathrm{after\_national} + \beta_3(\mathrm{national} \times \mathrm{after\_national}) + \beta_4 \mathrm{year} + \alpha_i + \epsilon \] - ln(ra_cst): The natural logarithm of advertising revenue for a newspaper. Log transformation is often used in economic data to help normalize the distribution of skewed variables and to interpret the coefficients in terms of percentage changes

  • national: This is a dummy variable indicating whether a newspaper is national (1) or local (0). This variable distinguishes the treatment group from the control group.

  • afternational: A dummy variable indicating the time period after the introduction of television advertising in 1967 (1 for years 1967 and later, 0 for earlier years). It captures the before-and-after comparison.

  • national*after_national: The interaction term between national and after_national. This term is crucial for DiD analysis as it estimates the differential effect of the introduction of television advertising on national newspapers compared to local newspapers over time

  • year: A continuous variable representing the year of observation. Including this allows controlling for linear time trends that affect all newspapers

  • alpha: Newspaper fixed effects that control for all unobserved, time-invariant differences between newspapers

  • epsilon: The error term

This is the code to specify exactly this. Note that in R national * after_national includes three variables due to the * operator. It specifies the main effect of national, the main effect of after_national and the interaction effect national:after_national, representing the differential impact of the post-television advertising period specifically on national newspapers relative to local ones. If we had only put in national:after_national, we would have only included the interaction term and would have had to add the main effects manually. This means that the formula log(ra_cst) ~ national*after_national + ... is shorthand for log(ra_cst) ~ national + after_national + national:after_national + ....

newspaper_did_model <-
  lm(log(ra_cst) ~ national * after_national + year + id_news,
     data = newspapers)

You can see that this is as straightforward as what we have already done in Session 1. It is a simple lm() and an interaction effect; nothing more…

4.2.3 Interpreting the DiD model results

Here, I am showing you a different way of displaying models in R/Quarto. I am making use of the fact that my quartobook is rendered to html. I am using the modelsummary package to display the results of the model in a nice table. Similar to the stargazer package, you can change almost every aspect of the table. I am using the coef_omit argument to exclude the fixed effects of the individual newspapers from the table. This is because we are not interested in them and it would have given us a lot of coefficients. I am also using the coef_map argument to rename the coefficients in the table. This package is maybe a bit more advanced but also more versatile.

model_coefs <- c(
  `national1` = "National Newspapers",
  `after_national1` = "Period After TV Ads",
  `national1:after_national1` = "Interaction Effect (National * After TV Ads)",
  `year` = "Year"
)

modelsummary(newspaper_did_model,
             title = "Difference-in-Differences Model Summary",
             # omit the fixed effects using a regular expression
             coef_omit = "id_news\\d+",
             stars = TRUE,
             coef_map = model_coefs
             )
Difference-in-Differences Model Summary
 (1)
National Newspapers −1.039***
(0.078)
Period After TV Ads −0.001
(0.022)
Interaction Effect (National * After TV Ads) −0.228***
(0.032)
Year 0.046***
(0.003)
Num.Obs. 1052
R2 0.985
R2 Adj. 0.984
AIC 36169.6
BIC 36581.2
Log.Lik. 345.343
RMSE 0.17
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

This table reports the results from a difference-in-differences (DiD) regression model, which examines the impact of television advertising on the advertising revenues of national newspapers compared to local newspapers over time. The intercept we can disregard in this case. As a reminder, it is when all IVs are set to 0. It does not really make sense in our case, especially due to the year variable. I did drop the coefficients of all the fixed effects for the individual newspapers (which were in the factor variable of id_news) because we are not very interested in them and it would have given us an individual coefficient for each newspaper which would have been a lot of displayed coefficients. I used a regular expression (regex) for that. For now, you do not have to know what this is, but it is a very powerful tool to manipulate character strings. At some point, I will add this to the text-as-data tutorial or to the already existing webscraping script.

  • National Newspapers: The coefficient for national newspapers is -1.039 and is highly statistically significant (p < 0.001). This suggests that, holding other factors constant, the log of advertising revenue for national newspapers is, on average, 1.039 units lower than for local newspapers. This could imply that national newspapers, on their own, tend to have lower advertising revenue in this model or that they were already at a disadvantage before the treatment period.

  • Period After TV Ads: The coefficient for the period after the introduction of television advertisements is very small and not statistically significant (-0.001, p > 0.1). This indicates that the introduction of television advertising, when not considering whether a newspaper is national or local, does not have a significant overall effect on advertising revenues.

  • Year: The coefficient for the year variable is positive and significant (0.046, p < 0.001), indicating that there is a general positive trend in advertising revenue over time across all newspapers in the sample.

  • Interaction Effect (National * After TV Ads): This is the center piece of our model since we try to estimate the causal effect through this interaction term. Fortunately for us, it is significant and negative (-0.228, p < 0.001). In DiD analyses, this interaction term captures the differential impact of the treatment effect (here, the introduction of television ads) on the treated group (national newspapers). The negative sign suggests that after television advertising started, national newspapers experienced a significant decrease in advertising revenue compared to local newspapers. This term essentially captures the essence of the DiD strategy by showing the relative effect post-treatment for the group of interest!

So what do we conclude now? Concerning our Difference-in-Difference causal design, the model supports the hypothesis that the introduction of television advertising had a different negative impact on national newspapers’ advertising revenues compared to local newspapers! The DiD model effectively isolates this effect by comparing the changes in revenues over time between the two groups (of course assuming parallel trends before the treatment).

4.3 References

Abou-Chadi, Tarik. 2016. “Niche Party Success and Mainstream Party Policy Shifts – How Green and Radical Right Parties Differ in Their Impact.” British Journal of Political Science 46 (2): 417–36. https://doi.org/10.1017/S0007123414000155.
Adams, James, and Zeynep Somer-Topcu. 2009. “Policy Adjustment by Parties in Response to Rival Parties’ Policy Shifts: Spatial Theory and the Dynamics of Party Competition in Twenty-Five Post-War Democracies.” British Journal of Political Science 39 (4): 825–46. https://doi.org/10.1017/S0007123409000635.
Angelucci, Charles, and Julia Cagé. 2019. “Newspapers in Times of Low Advertising Revenues.” American Economic Journal: Microeconomics 11 (3): 319–64. https://doi.org/10.1257/mic.20170306.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What to Do (and Not to Do) with Time-Series Cross-Section Data.” The American Political Science Review 89 (3): 634–47. https://doi.org/10.2307/2082979.
Budge, Ian, and Dennis Farlie. 1983. Explaining and Predicting Elections: Issue Effects and Party Strategies in Twenty-Three Democracies. London ; Boston: Allen & Unwin.
Carpenter, Christopher, and Carlos Dobkin. 2015. “The Minimum Legal Drinking Age and Crime.” The Review of Economics and Statistics 97 (2): 521–24. https://doi.org/10.1162/REST_a_00489.
Cunningham, Scott. 2021. Causal Inference: The Mixtape. New Haven ; London: Yale University Press.
Dinas, Elias, and Kostas Gemenis. 2010. “Measuring Parties’ Ideological Positions with Manifesto Data: A Critical Evaluation of the Competing Methods.” Party Politics 16 (4): 427–50. https://doi.org/10.1177/1354068809343107.
Gemenis, Kostas. 2013. “What to Do (and Not to Do) with the Comparative Manifestos Project Data.” Political Studies 61 (April): 23–43. https://doi.org/10.1111/1467-9248.12015.
Green, Jane, and Sara Hobolt. 2008. “Owning the Issue Agenda: Party Strategies and Vote Choices in British Elections.” Electoral Studies 27 (3): 460–76. https://doi.org/10.1016/j.electstud.2008.02.003.
Huntington-Klein, Nick. n.d. The Effect: An Introduction to Research Design and Causality the Effect. Accessed April 10, 2024. https://theeffectbook.net/.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. 1st ed. New York: Basic Books.
Ruedin, Didier, and Laura Morales. 2019. “Estimating Party Positions on Immigration: Assessing the Reliability and Validity of Different Methods.” Party Politics 25 (3): 303–14. https://doi.org/10.1177/1354068817713122.

  1. I am not going to lie, I was hoping for a but of a clearer shift here… As you can see the causal effect is there mathematically but visually this could have been a bit more evident. This however emphasizes the importance of not only relying on visual inspection but rather your estimation strategy and being clear about the assumptions you are making.↩︎