5  Multinomial Regressions in R

Author

Luis Sattelmayer

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

5.1 Introduction

In this script, I will show you how to construct a multinomial logistic regression in R. For this, we will work on the European Social Survey (ESS) again. These are the main points that are covered in this script:

  1. The logic of multinomial (logistic) regressions
  2. Advanced Data Management
  3. Interpretation of a multinomial Model
  4. Model Diagnostics
  5. Goodness of Fit
  6. APIs and Data Visualization (OPTIONAL!)

5.2 The Logic of Multinomial (Logistic) Regressions

I have chosen four countries, out of which you will be able to pick one later on when I ask you to work on some exercises. For now, I will mainly work on Germany. One of the classic applications of multinomial models in political science is the question of voting behavior, more precisely vote choice. Last week, we saw models of a logistic regression (logit). It is used in cases where our dependent variable (DV) is binary (0 or 1; true or false; yes or no), which means that we are not allowed to use OLS. The idea of logit can be extended to unordered categorical or nominal variables with more than two categories, e.g. vote choice, religion, brands…

Instead of one equation modelling the log-odds of \(P(Y=1)\), we do the same thing for each of the categories that we have. In fact, this means that a multinomial model runs several single logistic regressions against something we call a baseline. R will choose this baseline, to which the other categories of our DV are then compared. But we can also change it (this is called releveling). This allows us to make very interesting inferences with categorical (or ordinal) variables. If this sounds confusing, trust me when I tell you that this will become more straightforward in a second!
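Written out, with \(K\) categories and (say) the \(K\)-th one chosen as the baseline, the model estimates one coefficient vector \(\beta_k\) for every non-baseline category:

\[
\log \frac{P(Y_i = k)}{P(Y_i = K)} = x_i'\beta_k, \qquad k = 1, \dots, K-1,
\]

which implies the predicted probabilities

\[
P(Y_i = k) = \frac{\exp(x_i'\beta_k)}{1 + \sum_{j=1}^{K-1} \exp(x_i'\beta_j)}.
\]

Each of the \(K-1\) equations is exactly the binary logit from last week, only that the "0" category is always the baseline.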

However, this also makes the interpretation of these models a bit intricate and opaque at times. Nevertheless, you will see that once you have understood the basic idea of a multinomial regression and how to interpret the values relative to the baseline, it is not much different from logistic regressions on binary variables (and in my eyes even a bit simpler…). If the logic of logit is not 100% clear at this point, I recommend you go back to last session’s script on logit and work through my explanations. And if that does not help, try to follow this lecture attentively. As I said, the logic is the same, so I will repeat myself :) And if it is still unclear, you can always ask in class or come see me after the session!

But enough small talk, let’s first do some data wrangling which you all probably dread at this point…

5.3 Data Management for Multinomial Regression

The data which we will use for this session is the 9th round of the ESS, published in 2018. The goal of this session is to understand predictors that tell us more about why people vote for Populist Radical-Right Parties, henceforth called PRRP (Mudde 2007). For this I have two main hypotheses in mind, as well as some predictors which I know are important based on the literature. Finally, we also need some control variables, which are needed in almost any regression analysis using survey data.

My two hypotheses (H1 and H2) are as follows:

H1: Thinking that immigrants enrich a country’s culture decreases the likelihood of voting for PRRPs.

H2: Having less trust in politicians increases the likelihood of voting for PRRPs rather than for other parties.

Now you might notice two things. First, my hypotheses are relatively self-explanatory, and you are absolutely right; they are perhaps even self-evident. To this I would just reply that this is supposed to be an easy exercise that exposes you to a multinomial regression and its logic. Second, you might see that my hypotheses are formulated relatively broadly. This is because I would like you, later in class, to choose one of the countries of the 9th wave of the ESS and build a model yourselves. Broad hypotheses let you do this ;)

# read_csv from the tidyverse package
ess <- read_csv("ESS9e03_1.csv") |> 
  # dplyr allows me to select only those variables I want to use later
  select(cntry, 
         prtvtdfr, 
         prtvede1, 
         prtvtddk, 
         prtvtdpl, 
         imueclt, 
         yrbrn, 
         eduyrs, 
         hinctnta, 
         stflife, 
         trstplt, 
         blgetmg, 
         gndr) |> 
  # based on the selected variables, I filter the dataframe so that I am only
  # left with the data for Germany and Denmark
  filter(cntry %in% c("DE", "DK"))
Rows: 49519 Columns: 572
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): name, proddate, cntry, ctzshipd, cntbrthd, lnghom1, lnghom2, fbrn...
dbl (562): essround, edition, idno, dweight, pspwght, pweight, anweight, pro...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Again, every transformation and mutation of variables which you see below is done based on my knowledge of the dataset which I solely gained from looking at the code book. The code book can be found on the Moodle page (or the ESS’ website). It is highly important that you get used to reading a code book in general but especially to familiarize yourselves with the data which you will use by looking at the way that the variables are coded in the code book. There, for example, you will find information on the numeric values which are stored in the variables prtvtdfr, prtvede1, prtvtddk and prtvtdpl. They all stand for a category or, in our case, a party name which you can only identify if you open the code book. You will see that I only selected some parties in the mutate() function below. This is more or less to get rid of those parties that did not make it into the national parliament at the last national election of each country.

You have seen a similar chunk of code in the last script. See how, once you have a code that works for one dataset, you can use it again?

ess_clean <- ess |>
  mutate(across(
    where(is.numeric),  # recode the ESS missing-value codes to NA in all numeric columns
    ~ case_when(
      . %in% c(66, 77, 88, 99, 7777, 8888, 9999) ~ NA_integer_,
      TRUE ~ .
    )
  ),
  gndr = recode(gndr, `1` = 0, `2` = 1),
  blgetmg = case_when(
    blgetmg == 2 ~ 0, 
    blgetmg == 1 ~ 1
  ))
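A quick sanity check (a sketch, assuming the ess_clean object from the chunk above) confirms that the recoded dummies now contain only 0, 1, or NA:

```r
# The recoded dummies should only contain 0, 1, or NA after the mutate() above
table(ess_clean$gndr, useNA = "ifany")     # 0 = male, 1 = female
table(ess_clean$blgetmg, useNA = "ifany")  # 0 = majority, 1 = ethnic minority
```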

In fact, you could already build the model now and start the multinomial regression. However, I add an additional data-management step by placing the numeric values of the election variable in a new variable called vote_de, converting the numeric values to character values and at the same time giving them the names of the parties. This will automatically produce NAs in all rows in which the respondent did not vote in that country.

But more importantly, once I run the regression, it will display the parties’ names instead of the numbers. This means that I won’t have to go back to the code book every time to check what the 1s or 2s correspond to.

# this is simple base R creating a new column/variable with character
# values corresponding to the parties' names behind the numeric values
ess_clean$vote_de[ess_clean$prtvede1==1]<-"CDU/CSU"
ess_clean$vote_de[ess_clean$prtvede1==2]<-"SPD"
ess_clean$vote_de[ess_clean$prtvede1==3]<-"Die Linke"
ess_clean$vote_de[ess_clean$prtvede1==4]<-"Grüne"
ess_clean$vote_de[ess_clean$prtvede1==5]<-"FDP"
ess_clean$vote_de[ess_clean$prtvede1==6]<-"AFD"

Here is a way to mutate all the variables at once. However, this somehow creates conflicts with a package used further below.

ess_clean <- ess_clean |> 
  mutate(
    vote_dk = case_when(prtvtddk == 1 ~ "Socialdemokratiet",
                        prtvtddk == 2 ~ "Det Radikale Venstre",
                        prtvtddk == 3 ~ "Det Konservative Folkeparti",
                        prtvtddk == 4 ~ "SF Socialistisk Folkeparti",
                        prtvtddk == 5 ~ "Dansk Folkeparti",
                        prtvtddk == 6 ~ "Kristendemokraterne",
                        prtvtddk == 7 ~ "Venstre",
                        prtvtddk == 8 ~ "Liberal Alliance",
                        prtvtddk == 9 ~ "Enhedslisten",
                        prtvtddk == 10 ~ "Alternativet",
                        TRUE ~ NA_character_),
    vote_de = case_when(prtvede1 == 1 ~ "CDU/CSU",
                        prtvede1 == 2 ~ "SPD",
                        prtvede1 == 3 ~ "Die Linke",
                        prtvede1 == 4 ~ "Grüne",
                        prtvede1 == 5 ~ "FDP",
                        prtvede1 == 6 ~ "AFD"))

5.4 Constructing the Model

Now that the data management process is finally over, we can specify our model. For this, you need to install the nnet package and load it with library(). Once this is done, we take the exact same steps as for an OLS or logit model: you specify your DV, followed by a ~, and then you only need to add all your IVs. Lastly, you specify the data source. Hess = TRUE will provide us with a Hessian matrix that we need for a package later. If you don’t know what that is… that is absolutely fine!

library(nnet)
model_de <- multinom(vote_de ~ imueclt  + stflife + trstplt + blgetmg + 
                    gndr + yrbrn + eduyrs + hinctnta,
                     data = ess_clean,
                     Hess = TRUE)
# weights:  60 (45 variable)
initial  value 2512.046776 
iter  10 value 2043.040197
iter  20 value 2023.943362
iter  30 value 1978.403539
iter  40 value 1937.802919
iter  50 value 1926.723964
iter  60 value 1925.989813
iter  70 value 1925.918081
iter  80 value 1925.834275
final  value 1925.814901 
converged
model_dk <- multinom(vote_dk ~ imueclt  + stflife + trstplt + blgetmg + gndr +
                     yrbrn + eduyrs + hinctnta,
                     data = ess_clean,
                     Hess = TRUE)
# weights:  100 (81 variable)
initial  value 2525.935847 
iter  10 value 2265.702751
iter  20 value 2136.160544
iter  30 value 2057.010064
iter  40 value 1999.928035
iter  50 value 1952.439681
iter  60 value 1940.731241
iter  70 value 1934.989529
iter  80 value 1915.364568
iter  90 value 1908.641777
iter 100 value 1907.030232
final  value 1907.030232 
stopped after 100 iterations

5.4.1 Re-leveling your DV

In my case, the German PRRP is called Alternative für Deutschland, meaning it starts with an “A”. R takes the alphabetical order as the criterion for the baseline, meaning that the baseline of your multinomial model is the category that comes first in alphabetical order. Depending on what you want to show, you might want to change the baseline, which we can do with the fct_relevel() function. If, say, we were interested in conservative parties rather than the PRRP, we could put the German Christian-conservative party, the CDU/CSU, as the baseline in exactly the same way. Below, I explicitly set the PRRPs (AFD and Dansk Folkeparti) as the baseline, since that is what my hypotheses are about:

# you need to specify your DV as a factor for this; further, the ref must 
# contain the exact character label of the party

ess_clean$vote_de |> levels()
NULL
ess_clean$vote_dk |> as_factor() |> levels()
 [1] "Socialdemokratiet"           "Det Konservative Folkeparti"
 [3] "SF Socialistisk Folkeparti"  "Dansk Folkeparti"           
 [5] "Alternativet"                "Liberal Alliance"           
 [7] "Venstre"                     "Kristendemokraterne"        
 [9] "Det Radikale Venstre"        "Enhedslisten"               
ess_clean <- ess_clean |> 
  mutate(vote_de = fct_relevel(as.factor(vote_de), "AFD"),
         vote_dk = fct_relevel(as.factor(vote_dk), "Dansk Folkeparti"))

ess_clean$vote_dk |> levels()
 [1] "Dansk Folkeparti"            "Alternativet"               
 [3] "Det Konservative Folkeparti" "Det Radikale Venstre"       
 [5] "Enhedslisten"                "Kristendemokraterne"        
 [7] "Liberal Alliance"            "SF Socialistisk Folkeparti" 
 [9] "Socialdemokratiet"           "Venstre"                    

5.5 Interpreting a Multinomial Model

You already know the stargazer package for displaying regression tables. This time, we will use another package called modelsummary, which is developed by Vincent Arel-Bundock, who also develops the marginaleffects package that we will see later on. The documentation of the package can be found here. You might wonder why I am showing you this package when you already know stargazer: modelsummary is much more flexible and compatible with many more model classes than stargazer, which is why I mostly use modelsummary at this point.

modelsummary::modelsummary(
  model_de,
  stars = TRUE,
  # gof_omit = c("AIC|BIC|RMSE"),
  stars_method = "simple",
  coef_map = c(
    "imueclt" = "Positivity Immigration",
    "stflife" = "Satisfaction w/ Life",
    "trstplt" = "Trust in Politicians",
    "blgetmg" = "Ethnic Minority",
    "gndr" = "Gender",
    "yrbrn" = "Age",
    "eduyrs" = "Education",
    "hinctnta" = "Income"
  ),
  title = "Multinomial Regression Results Germany",
  notes = "Vote Choice is the dependent variable.",
  output = "html",
  shape = model + term ~ response,
  gof_map = tibble::tribble(
    ~raw, ~clean, ~fmt,
    "AIC", "Akaike Inf. Crit.", 2,
    "BIC", "Bayesian Inf. Crit.", 2,
    "logLik", "Log-Likelihood", 2,
    "deviance", "Deviance", 2
  ),
  exponentiate = TRUE
)
Multinomial Regression Results Germany
(exponentiated coefficients, i.e. relative risk ratios vs. the baseline AFD; standard errors in parentheses)

                         CDU/CSU    Die Linke  FDP        Grüne      SPD
Positivity Immigration   1.419***   1.853***   1.385***   2.191***   1.637***
                         (0.089)    (0.148)    (0.105)    (0.165)    (0.106)
Satisfaction w/ Life     1.029      0.842*     1.099      0.911      0.897
                         (0.069)    (0.070)    (0.102)    (0.072)    (0.060)
Trust in Politicians     1.671***   1.325**    1.527***   1.422***   1.489***
                         (0.129)    (0.122)    (0.138)    (0.120)    (0.116)
Ethnic Minority          1.056***   1.558***   2.962***   1.466***   1.469***
                         (0.002)    (0.001)    (0.003)    (0.003)    (0.003)
Gender                   2.362***   1.272*     1.773***   2.528***   1.484***
                         (0.242)    (0.121)    (0.169)    (0.347)    (0.166)
Age                      0.987***   1.007***   0.990***   0.998***   0.984***
                         (0.000)    (0.000)    (0.001)    (0.000)    (0.000)
Education                1.014      1.084      1.037      1.095+     1.041
                         (0.052)    (0.064)    (0.062)    (0.060)    (0.054)
Income                   1.123*     0.969      1.120+     1.145*     1.048
                         (0.062)    (0.064)    (0.075)    (0.071)    (0.059)

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Vote Choice is the dependent variable.
modelsummary::get_gof(model_de)
      aic      bic r.squared adj.r.squared      rmse nobs
1 3941.63 4177.684 0.1785191     0.1780926 0.3385528 1402
modelsummary::modelsummary(
  model_dk,
  stars = TRUE,
  stars_method = "simple",
  coef_map = c(
    "imueclt" = "Positivity Immigration",
    "stflife" = "Satisfaction w/ Life",
    "trstplt" = "Trust in Politicians",
    "blgetmg" = "Ethnic Minority",
    "gndr" = "Gender",
    "yrbrn" = "Age",
    "eduyrs" = "Education",
    "hinctnta" = "Income"
  ),
  shape = model + term ~ response
)

modelsummary::get_gof(model_dk)

The format of the regression table for our Danish model is not ideal, since the names of the parties are quite long and overlap. Blame this on my lack of knowledge of abbreviations for Danish parties… Also note in the fitting output above that the Danish model stopped after 100 iterations without converging; in practice you would pass a higher maxit to multinom() and refit.

5.6 Interpreting a Multinomial Regression Table

We can see that many many things are going on in this regression table. Let us try to analyze our results step by step.

First of all, we can see that we have many variables that are statistically significant (lots of stars, yay!). This is always a good sign. Note also that the baseline is the party AFD. You can see this from the fact that AFD, one of the categories our DV can take on, does not appear in our table. This means that whenever we see the results for one of the other parties, R has calculated the coefficients under the logic that the respondent voted either for the party in that column or for the baseline party, which in our case is the AFD. In more mathematical terms, these are several single logistic regressions, always with respect to the baseline AFD, which are then aggregated into a multinomial regression. And to be slightly more mathematical still, this means our DV is technically \(1 = \text{DV category}\) and \(0 = \text{AFD}\).

Therefore, we can interpret the results exactly like we would for a logistic regression. Last week it was about the likelihood of vote abstention; this week it is the likelihood of voting for the CDU/CSU instead of the AFD, or for the SPD instead of the AFD, or for Die Linke instead of the AFD, and so on. You get the idea, hopefully.

Remember that these are the coefficients of logistic regressions. We cannot interpret them linearly like in OLS. For now, the regression table tells us something about the statistical significance of our predictors and the direction of association: whether a statistically significant predictor increases or decreases the likelihood of voting for a given party rather than the baseline.

5.6.1 The Hypotheses

As a reminder, these were my initial (frankly also bad) hypotheses:

H1: Thinking that immigrants enrich a country’s culture decreases the likelihood of voting for PRRPs.

H2: Having less trust in politicians increases the likelihood of voting for PRRPs rather than for other parties.

I am now interested in seeing the effect of positivity toward migration and trust in politicians on the vote choice for each party instead of the AFD. What we can see is that a one-unit increase in positive attitudes toward migration (thinking that immigrants culturally enrich the respondents’ country) raises the likelihood of voting for all other parties rather than for the AFD. In the case of the first column, in which the vote was either for the CDU/CSU or the AFD, a one-unit increase in positive stances on immigration results in a higher likelihood of voting for the CDU/CSU than for the AFD.

If we now turn to trust in politicians and this variable’s effect on vote choice for the different German parties, we can see that overall there is a statistically significant positive association between having more trust in politicians and voting for parties other than the AFD. In turn, this also means that low trust in politicians raises the likelihood of voting for the AFD.

Note that with exponentiate = TRUE, the table above already shows the exponentiated coefficients, i.e. odds ratios (strictly speaking, relative risk ratios). But I have tortured you enough with ORs, and predicted probabilities are much more intuitive to interpret. Therefore, we will calculate them in the next section.
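If you ever need the relative risk ratios straight from the model object rather than from a table, exponentiating the coefficient matrix is a one-liner (a sketch, assuming model_de from above):

```r
# coef() returns one row of log-odds coefficients per non-baseline party;
# exponentiating turns them into relative risk ratios vs. the baseline (AFD)
round(exp(coef(model_de)), 3)
```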

5.7 Predicted Probabilities

You all hopefully still remember the idea of predicted probabilities, which we already saw last time for a simple logistic regression. You hold all but one predictor (IV) constant at its mean or another sensible value. The one predictor which you do not hold constant you let vary, in order to estimate the predicted probabilities of your DV across the different values this variable of interest can take on. Predicted probabilities can be tricky to code manually, and we are not going to do that again; instead, we will use a package that can do it for us.
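Just to show the "hold everything constant, vary one predictor" logic once, here is a bare-bones version using base predict() (a sketch, assuming model_de and ess_clean from above; the chosen constants for the dummies are illustrative):

```r
# Vary trstplt from 0 to 10, hold all other IVs at their mean (or at 0 for dummies)
grid <- data.frame(
  trstplt  = 0:10,
  imueclt  = mean(ess_clean$imueclt,  na.rm = TRUE),
  stflife  = mean(ess_clean$stflife,  na.rm = TRUE),
  blgetmg  = 0,  # majority-group respondent
  gndr     = 0,  # male, given the recoding above
  yrbrn    = mean(ess_clean$yrbrn,    na.rm = TRUE),
  eduyrs   = mean(ess_clean$eduyrs,   na.rm = TRUE),
  hinctnta = mean(ess_clean$hinctnta, na.rm = TRUE)
)
# one row of party probabilities per value of trstplt
predict(model_de, newdata = grid, type = "probs")
```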

The text below is the outdated version of how I calculate predicted probabilities for multinomial models. I leave it in here because it still works, but I recommend working with the marginaleffects package instead.

The package is called MNLpred and allows us to specify the variable of interest. This package makes draws from our posterior distribution (hello, Bayesian statistics), simulates our coefficients n times (we tell it how many times to run the simulation), and then takes the mean value of all of our simulations. This way, we end up with more or less the same predicted probabilities that we saw last week. These are much more easily interpreted than relative risk ratios (the odds ratios of multinomial regressions) and can be plotted.

library(MNLpred)
pred1 <- mnl_pred_ova(
  model = model_de,
  # specify data source
  data = ess_clean,
  # specify predictor of interest
  x = "imueclt",
  # the steps which should be used for the simulated prediction
  by = 1,
  # this would be for replicability, we do not care about it
  # here
  seed = "random",
  # number of simulations
  nsim = 100,
  # confidence intervals
  probs = c(0.025, 0.975)
)
Multiplying values with simulated estimates:
================================================================================
Applying link function:
================================================================================
Done!

The pred1 object now contains the simulated mean predicted probability for each party at each step of our predictor of interest: one mean (averaged over the 100 simulations), for each party, for each value that imueclt can take on:

pred1$plotdata |> head()
  imueclt vote_de       mean      lower      upper
1       0     AFD 0.24688322 0.16713163 0.33759867
2       1     AFD 0.19126946 0.13624094 0.25634557
3       2     AFD 0.14330486 0.10755951 0.18625379
4       3     AFD 0.10373299 0.08224574 0.13448033
5       4     AFD 0.07254962 0.05882374 0.09445876
6       5     AFD 0.04907084 0.03672051 0.06375989

Let’s simulate the exact same thing for our second hypothesis regarding the trust in politicians:

pred2 <- mnl_pred_ova(
  model = model_de,
  data = ess_clean,
  x = "trstplt",
  by = 1,
  seed = "random",
  nsim = 100,
  probs = c(0.025, 0.975)
)
Multiplying values with simulated estimates:
================================================================================
Applying link function:
================================================================================
Done!

The results, which we have stored in the objects pred1 and pred2 respectively, can be used for a visualization with ggplot().

library(ggplot2)
ggplot(data = pred2$plotdata, aes(
  x = trstplt,
  y = mean,
  ymin = lower,
  ymax = upper
)) +
  # this gives us the confidence intervals
  geom_ribbon(alpha = 0.1) +
  # taking the mean of the values
  geom_line() +
  # here we display the predicted probabilities for all parties in one plot
  facet_wrap(. ~ vote_de, ncol = 2) +
  # putting the values of the y-axis in percentages
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  # the x-axis follows the 0-10 scale of the predictor
  scale_x_continuous(breaks = c(0:10)) +
  # specifying the ggplot theme
  theme_minimal() +
  # lastly you only need to label your axes; Always label your axes ;)
  labs(y = "Predicted probabilities",
       x = "Trust in Politicians") 

Here we can see very well by how many percentage points the predicted probability of voting for each party increases or decreases as our independent variable, our predictor, of trust in politicians increases (increasing values mean more trust in politicians).

We can also visualize our predicted probabilities in one single plot. I made the effort of coordinating the colors so that the lines are displayed in the parties’ colors. If you want a color selector to get the HEX color codes, you can click on this link: (it will say Google Farbwähler, which is not a scam but German…). As of recently, RStudio will also display the color you have selected.

ggplot(data = pred2$plotdata, aes(
  x = trstplt,
  y = mean,
  color = as.factor(vote_de)
)) +
  geom_smooth(aes(ymin = lower, ymax = upper), stat = "identity") +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = c(0:10)) +
  scale_color_manual(
    values = c(
      "#03c2fc",
      "#000000",
      "#f26dd5",
      "#FFFF00",
      "#00e81b",
      "#fa0000"
    ),
    name = "Vote",
    labels = c("AFD", "CDU", "DIE LINKE", "FDP",
               "GRUENE", "SPD")
  ) +
  ylab("Predicted Probability Vote") +
  xlab("Trust in Politicians") +
  theme_minimal()

This here is the plot for our first hypothesis for which we have stored the predicted probabilities in the object pred1:

library(ggplot2)
ggplot(data = pred1$plotdata, aes(
  x = imueclt,
  y = mean,
  ymin = lower,
  ymax = upper
)) +
  geom_ribbon(alpha = 0.1) + # Confidence intervals
  geom_line() + # Mean
  facet_wrap(. ~ vote_de, ncol = 2) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # % labels
  scale_x_continuous(breaks = c(0:10)) +
  theme_minimal() +
  labs(y = "Predicted probabilities",
       x = "Positivity towards Immigrants") # Always label your axes ;)

And here the code which puts all the predicted probabilities in one plot:

ggplot(data = pred1$plotdata, aes(
  x = imueclt,
  y = mean,
  color = as.factor(vote_de)
)) +
  geom_smooth(aes(ymin = lower,
                  ymax = upper),
              stat = "identity") +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = c(0:10)) +
  scale_color_manual(
    values = c(
      "#03c2fc",
      "#000000",
      "#f26dd5",
      "#FFFF00",
      "#00e81b",
      "#fa0000"
    ),
    name = "Vote",
    labels = c("AFD", "CDU", "DIE LINKE", "FDP",
               "GRUENE", "SPD")
  ) +
  ylab("Predicted Probability Vote") +
  xlab("Positivity towards Immigration")

5.8 marginaleffects

The marginaleffects package offers the modern route to the same predicted probabilities: plot_predictions() computes them from the model, and with draw = FALSE it returns the underlying data frame instead of a plot, which we can then pipe into ggplot() ourselves.

library(marginaleffects)

multinom_pred_de <- plot_predictions(
  model_de,
  type = "probs",
  condition = "trstplt",
  draw = FALSE) |> 
  as_tibble()

multinom_pred_de
# A tibble: 300 × 18
   rowid group estimate std.error statistic   p.value s.value conf.low conf.high
   <int> <fct>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1     1 AFD     0.131     0.0330      3.96   7.35e-5    13.7   0.0662    0.196 
 2     2 AFD     0.122     0.0299      4.07   4.63e-5    14.4   0.0631    0.180 
 3     3 AFD     0.113     0.0270      4.19   2.83e-5    15.1   0.0601    0.166 
 4     4 AFD     0.105     0.0244      4.30   1.68e-5    15.9   0.0571    0.153 
 5     5 AFD     0.0972    0.0220      4.42   9.72e-6    16.7   0.0541    0.140 
 6     6 AFD     0.0900    0.0198      4.54   5.55e-6    17.5   0.0512    0.129 
 7     7 AFD     0.0833    0.0179      4.66   3.16e-6    18.3   0.0483    0.118 
 8     8 AFD     0.0770    0.0161      4.77   1.83e-6    19.1   0.0454    0.109 
 9     9 AFD     0.0712    0.0146      4.87   1.09e-6    19.8   0.0426    0.0998
10    10 AFD     0.0657    0.0132      4.96   6.92e-7    20.5   0.0398    0.0917
# ℹ 290 more rows
# ℹ 9 more variables: imueclt <dbl>, stflife <dbl>, blgetmg <dbl>, gndr <dbl>,
#   yrbrn <dbl>, eduyrs <dbl>, hinctnta <dbl>, trstplt <dbl>, vote_de <fct>
multinom_pred_de |>
  ggplot(aes(trstplt, estimate, group = group, color = group)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  facet_wrap(~group) +
  scale_color_manual(
    values = c(
      "#03c2fc",  # AFD
      "#000000",  # CDU
      "#f26dd5",  # DIE LINKE
      "#FFFF00",  # FDP
      "#00e81b",  # GRUENE
      "#fa0000"   # SPD
    )
  ) +
  theme_light()

multinom_pred_de_gender <- plot_predictions(
  model_de,
  type = "probs",
  condition = "gndr",
  draw = FALSE) |> 
  as_tibble()

multinom_pred_de_gender |> 
  ggplot(aes(gndr, estimate, group = group, color = group)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  facet_wrap(~group, scales = "free_y") +
  scale_color_manual(
    values = c(
      "#03c2fc",  # AFD
      "#000000",  # CDU
      "#f26dd5",  # DIE LINKE
      "#FFFF00",  # FDP
      "#00e81b",  # GRUENE
      "#fa0000"   # SPD
    )
  ) +
  labs(
    x = "",
    y = "Predicted Probabilities by Gender",
    color = ""
  ) +
  theme(
    legend.position = "none"
  ) +
  theme_light()

5.9 Diagnostics of Multinomial Models

I have talked about diagnostics of models before. This will be the first time that we really touch upon them in models that are not linear like OLS. Usually, this is a step which you should take between building your model and its final interpretation.

The estimates of your model change depending on several influences: the number of predictors, the scaling of your predictors, and the scaling or coding of your dependent variable. All of these things (and many more) have an effect on your model’s results. We need to include enough variables to account for enough variance, but we also need to make sure that we do not overfit our model by putting in too many predictors. And we need to make sure that our model is not biased by the scaling of our variables. This is why we check for multicollinearity, heteroskedasticity, and other issues.

We are first concerned with the goodness of fit of our model. In a linear model estimated by OLS, we looked at the \(R^2\) and adjusted \(R^2\) of the models. This tells us something about how much variance of the DV is explained by our IVs. Unfortunately, this measure does not exist for logistic or multinomial models. But the good news is that we can calculate something called McFadden’s pseudo \(R^2\). It is interpreted in a similar way to a normal \(R^2\), with the caveat that its values run lower: anything ranging between 0.2 and 0.4 is a result that should make us happy.

This is how you do this in R:

# you obviously need to install the package first
library(pscl)
Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.
pR2(model_de)
fitting null model for pseudo-r2
# weights:  12 (5 variable)
initial  value 2512.046776 
iter  10 value 2159.652469
final  value 2158.240953 
converged
          llh       llhNull            G2      McFadden          r2ML 
-1925.8149013 -2158.2409526   464.8521025     0.1076924     0.2821995 
         r2CU 
    0.2958110 
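The McFadden value in the pR2() output is nothing mysterious: it is one minus the ratio of the log-likelihoods of the full model and an intercept-only (null) model. A sketch, assuming model_de from above (note that update() refits the null model, which only reproduces the number exactly if both models use the same estimation sample):

```r
# McFadden's pseudo R^2 = 1 - logLik(full) / logLik(null)
null_de <- update(model_de, . ~ 1, trace = FALSE)  # intercept-only model
1 - as.numeric(logLik(model_de)) / as.numeric(logLik(null_de))
```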

5.9.1 Heteroskedasticity and Multicollinearity

Then there are issues with scary names like multicollinearity or heteroskedasticity (oftentimes also referred to as “heteroske-something”). These two phenomena can skew our estimates and, in the worst-case scenario, lead to wrong inferences. Therefore, we must check for them in all kinds of models, be it a simple model using the OLS method, a logistic regression, or a multinomial regression. There are ways to test for the potential problems that might arise, and also ways to work around them if ever we encounter them.

5.9.2 Multicollinearity and how to eliminate it

For now, we will only look at the potential issue of multicollinearity. It occurs when your independent variables are correlated with each other. This means that they vary very similarly in their values and measure either similar things or measure things in the same way. The higher the multicollinearity within your model, the less reliable your statistical inferences are.

We can detect the degree of (multi)collinearity by calculating the Variance Inflation Factor (VIF). It measures the amount of correlation between our predictors. The VIF should be below 10. (Note that the VIF is always at least 1; the thresholds of 0.2 and 0.1 that you sometimes see refer to the tolerance, which is 1/VIF: a tolerance below 0.2 is a potential problem, and anything below 0.1 should have us really worried.) To do this in R, we use the vif() function of the car package. However, it does not work on the object of a multinomial model. Thus, we cheat our way around it and build a GLM (glm()) in which we set our DV as a factor and pretend it is binomially distributed. This way, R sort of manually calculates an individual logistic regression against a baseline, and we can calculate the VIF for the IVs individually.

model_vif <-
  glm(
    as.factor(vote_de) ~ imueclt  + stflife + trstplt + blgetmg +
      gndr + yrbrn + eduyrs + hinctnta,
    data = ess_clean,
    family = binomial()
  )

car::vif(model_vif)
 imueclt  stflife  trstplt  blgetmg     gndr    yrbrn   eduyrs hinctnta 
1.168846 1.253236 1.142468 1.041481 1.038761 1.108050 1.245434 1.311167 

Based on the results, we can see that our variance is not inflated, since all values are below 10. That is great news! A VIF of 1 means that there is no correlation among our predictors, a VIF between 1 and 5 (which is quite normal) indicates moderate correlation, and a VIF between 5 and 10 indicates strong correlation.

If, however, you encounter issues of multicollinearity, you should test the VIFs for different versions of your model: start by dropping the IV with the highest VIF and see how that affects the remaining VIFs. Alternatively, check whether the variables with high values measure similar things, theoretically speaking, and combine them into a single measure.
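The drop-the-highest-VIF workflow can be sketched as follows. This is a toy illustration on R's built-in mtcars data, not on the ESS model above; the variables were chosen only because some of them (e.g. disp and cyl) are known to be strongly correlated:

```r
# Sketch: iteratively inspecting VIFs after dropping the worst predictor.
# Toy example on mtcars, where disp and cyl correlate strongly with wt.
library(car)  # for vif(); install.packages("car") if necessary

full <- lm(mpg ~ wt + disp + cyl + hp, data = mtcars)
car::vif(full)     # disp and cyl typically show elevated VIFs here

# Drop the predictor with the highest VIF and re-check:
reduced <- lm(mpg ~ wt + cyl + hp, data = mtcars)
car::vif(reduced)  # the remaining VIFs should shrink
```

You would repeat this until all VIFs are in an acceptable range, while keeping an eye on whether the model still makes theoretical sense.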

It is important to do this step in order to test the validity and reliability of our models!

5.10 Goodness of Fit and Its Measures

You have seen me use the term goodness of fit before, and it becomes very important in quantitative research when you try to model statistical relationships. Until now, we have always estimated a single model and then interpreted its coefficients and model statistics. We have seen the \(R^2\) and adjusted \(R^2\), and we have mostly seen bad OLS models with very low values on both of these measures. However, this measure does not always exist for generalized linear models. Statisticians have therefore come up with other ways to compare models and their goodness of fit. As a rule of thumb, we should favor models that explain as much as possible without making too many (strong) assumptions or overfitting, e.g. by cramming too many predictors into one regression. In one of your introduction to (political) science classes, you might have heard of Ockham's razor; this is the same idea applied to statistical models.

Goodness of fit in our case refers to how well the model we have constructed fits our observed data. It thus measures the discrepancy between the observed values and the values expected under our model. If we do not have an adjusted \(R^2\), we need other information criteria to determine which model best fits our data. There is quite an abundance of such criteria. Some are tied to specific kinds of statistical models, while others are more general. The two I would like to mention here are the AIC and the BIC.

5.10.1 AIC (Akaike Information Criterion)

Don’t be like me and think for years that AIC was a bad abbreviation of Akaike. It actually stands for Akaike Information Criterion. It is calculated from the number of predictors in our model and how well the model reproduces our data (its likelihood). If you go back to our multinomial regressions above, you can see that the last line of our table shows the AIC for this model. On its own, this information criterion is meaningless. It becomes informative when we compare it to the AIC of a similar model and check which one indicates a better fit.

What would a similar model look like? Well, if we dropped one of our IVs, for example, we would alter the model a bit but keep its global structure. In that case, we would obtain a second, different AIC. Comparing the AICs then tells us which model (meaning which model specification) fits our data better.

What is a better AIC? The lower AIC indicates the model that fits our data better. This is simply a mathematical measure: stand-alone AIC values do not tell us much and need to be considered in comparison to other values.

5.10.2 BIC (Bayesian Information Criterion)

Bayesian statistics1 are super fascinating and I will include them wherever I can. Luckily, the BIC is very common and is to the AIC what the adjusted \(R^2\) is to the \(R^2\). This means it is a “stricter” measure of goodness of fit: it is quite similar to the AIC but penalizes you more heavily for adding “less useful” variables to your model (potentially overfitting or overcomplicating it). As with the AIC, we should favor the lower BIC!


  1. The second school of doing statistics. What we are doing is called frequentist statistics. Bayesian statistics require a different way of thinking about statistics, but they are arguably the more intuitive way of conducting statistics (without p-values, but with something you could call certainty). They are also more computationally expensive and thus have only gained traction in recent decades, and they keep growing in popularity. If you are interested in learning more about them, I recommend the book by Richard McElreath (2016), Statistical Rethinking: A Bayesian Course with Examples in R and Stan. He also has lectures on YouTube that follow the book. Definitely worth a try!↩︎