1  RStudio Recap & OLS

1.1 Introduction

This is a short recap of things you saw last year and will need again this year. It will refresh your understanding of the linear regression method called ordinary least squares (OLS). This script is meant to serve as a cheat sheet that you can always come back to.

These are the main points of today’s session and script:

  1. A refresher of Ordinary Least Squares (OLS)
  2. What is Base R and what are packages?
  3. A recap of basic coding in R
  4. Building & Interpreting a simple linear model
  5. Visualizing Residuals
  6. The broom package (OPTIONAL!)

These are the packages we will be using in this session:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rio)
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

1.2 Ordinary Least Squares (OLS)

OLS regressions are the powerhouse of statistics. The world must have been a dark place without them. They are the most basic form of linear regression and are used to predict the value of a dependent variable (DV) based on the value of independent variables (IVs). It is important to note that the relationship between the DV and the IVs is assumed to be linear.

As a quick reminder, this is the formula for a basic linear model: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\).

OLS is a method of fitting a linear model in which we choose the line with the smallest prediction errors. More precisely, it is the line that minimizes the sum of the squared prediction errors, \(\text{SSE} = \sum_{i=1}^{n} \widehat{\epsilon}_i^2\), where \(\widehat{\epsilon}_i = y_i - \widehat{y}_i\) is the residual of observation \(i\).
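To make this concrete, here is a minimal sketch with simulated data (the names x, y, sse, and fit are just for illustration): any line we could draw has a sum of squared errors, and the line that lm() finds is the one with the smallest such sum.

set.seed(42)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)      # simulated data: true intercept 2, true slope 0.5

# sum of squared errors for any candidate intercept (a) and slope (b)
sse <- function(a, b) sum((y - (a + b * x))^2)

sse(0, 0)                          # a deliberately bad line
fit <- lm(y ~ x)
sse(coef(fit)[1], coef(fit)[2])    # the OLS line: no other line has a smaller SSE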

Five main assumptions have to be met for an OLS model to give us reliable estimates:

  1. Linearity: Linear relationship between the IVs and the DV
  2. No endogeneity between \(y\) and \(x\)
  3. Errors are normally distributed
  4. Homoscedasticity (variance of errors is constant)
  5. No multicollinearity (no linear relationship between the independent variables)

For this example, I will be working with data from the European Social Survey (ESS). We are trying to see whether there is a relationship between respondents' socio-demographic characteristics and their attitudes towards income redistribution. Theoretically speaking, we might, for instance, expect respondents with higher incomes to be less supportive of redistribution. If our model indicates a statistically significant association between an independent variable and the dependent variable, and the coefficient points in the expected direction, this theoretical idea holds up in the data.

1.3 Coding Recap

Before we start, let's refresh our coding basics. R works with packages and libraries. There is something called Base R, which is the basic infrastructure that R always comes with when you install it. On top of that, the R language has a vibrant community of contributors who have written their own packages and libraries which you can install and use. Like Malo, I am of the tidyverse school and mostly code with these packages, or in their style, when I am wrangling data, changing its format or values, and so on. Here and there, I will, however, try to provide you with code that uses Base R or other packages. In coding, there are many ways to achieve the same goal – and I will probably be repeating this throughout the semester – and we always strive for the fastest or most automated way. I will not force the tidyverse environment on you, but I do think that it is one of the most elegant and fastest ways of doing statistics in R. It is sort of my RStudio dialect, but you can obviously stick to yours if you have found it. As long as you find a way that works for you, that is fine with me!

To load the packages we are going to need, we use library() calls, as shown at the top of this script.

Next, we will import the dataset: round 10 of the European Social Survey (ESS).

ess <- read_csv("data/ess_10.csv")
New names:
• `dweight` -> `dweight...38`
Rows: 37611 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): cntry, cntbrthd, lnghom1, mbrncntc, fbrncntc, region
dbl (32): gndr, yrbrn, feethngr, rlgatnd, mbtru, hinctnta, edulvlb, gincdif,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The path I specify in the read_csv() call is short because this Quarto document's working directory contains the data/ folder in which the dataset is saved. If you, for example, have your dataset on your computer's desktop, you can access it via code like this:

data <- read_csv("~/Desktop/ess_10.csv")

Or if it is within a folder on your desktop:

data <- read_csv("~/Desktop/folder/ess_10.csv")

I will only be working within .Rproj files and so should you. 1 This is the only way to ensure that your working directory is always the same and that you do not have to change the path to your dataset every time you open a new RStudio session. Further, it is the only way to make sure that other collaborators can easily open your project and work with it as well: simply zip the folder in which you have your code and data, and send it to them.
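If you work inside an .Rproj, the here package (optional, and only a suggestion on my part) is a convenient way to build paths relative to the project root, so the same code works on every collaborator's machine. A minimal sketch, assuming here is installed and the dataset sits in a data/ folder at the project root:

# install.packages("here")   # only needed once
library(here)

# here() builds the path from the .Rproj root, wherever this script lives
ess <- read_csv(here("data", "ess_10.csv"))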

You can also import a dataset directly from the internet. Several ways are possible that all lead to the same end result:

dataset_from_internet_1 <- read_csv("https://www.chesdata.eu/s/1999-2019_CHES_dataset_meansv3.csv")
  
# this method uses the rio package
library(rio)
dataset_from_internet_2 <- import("https://jan-rovny.squarespace.com/s/ESS_FR.dta")

Let’s take a first look at the data which we just imported:

glimpse(ess)
Rows: 37,611
Columns: 38
$ gndr         <dbl> 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, …
$ yrbrn        <dbl> 2006, 1998, 1964, 1987, 1961, 1976, 1951, 1968, 2000, 193…
$ cntry        <chr> "BE", "BE", "BE", "BE", "BE", "BE", "BE", "BE", "BE", "BE…
$ feethngr     <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, …
$ rlgatnd      <dbl> 3, 7, 7, 7, 7, 7, 7, 6, 7, 7, 2, 7, 7, 7, 6, 3, 7, 5, 7, …
$ mbtru        <dbl> 3, 3, 2, 3, 1, 1, 3, 3, 3, 2, 1, 2, 3, 1, 3, 3, 3, 3, 1, …
$ hinctnta     <dbl> NA, NA, 7, 7, 6, 6, 9, 9, 5, 5, NA, 9, 5, 3, 7, NA, NA, 4…
$ edulvlb      <dbl> 113, 720, 610, 323, 800, 510, 321, 710, 313, 323, 113, 61…
$ cntbrthd     <chr> "6666", "SA", "FR", "IQ", "NL", "6666", "6666", "6666", "…
$ gincdif      <dbl> 2, 3, 3, NA, 2, 3, 2, 2, 2, 2, 4, 1, 2, 3, 2, 3, 5, 2, 2,…
$ fairelc      <dbl> 7, 7, 10, 10, 10, 6, 2, 9, 10, 8, 5, 10, 8, 7, 10, 8, 10,…
$ dfprtal      <dbl> 8, 6, 10, 10, 10, 6, 2, 7, 8, 8, 4, 9, 8, 7, 6, 9, 0, 5, …
$ gptpelc      <dbl> 8, 6, 10, 10, 9, 6, 7, 9, 8, 7, 8, 10, 8, 7, 5, 7, 2, 6, …
$ medcrgv      <dbl> 7, 9, 10, 10, 10, 6, 5, 6, 9, 9, 5, 9, 10, 9, 6, 9, 8, 5,…
$ rghmgpr      <dbl> 10, 5, 10, 10, 10, 6, 5, 9, 8, 8, 7, 9, 10, 7, 5, 8, 7, 6…
$ cttresa      <dbl> 10, 8, 10, 10, 10, 5, 6, 9, 10, 5, 6, 10, 8, 10, 9, 10, 2…
$ gvctzpv      <dbl> 10, 7, 10, 10, 9, 7, 9, 9, 8, 9, 6, 10, 10, 7, 5, 9, 8, 8…
$ grdfinc      <dbl> 10, 6, 5, NA, 9, 7, 9, 9, 8, 8, 6, 10, 9, 7, 6, 8, 2, 8, …
$ viepol       <dbl> 9, 7, 5, NA, 3, 7, 8, 9, 7, 9, 7, 6, 8, 5, 3, 6, 9, 6, 5,…
$ wpestop      <dbl> 7, 8, 5, NA, 3, 5, 7, 9, 7, 5, 7, 6, 10, 7, 4, 4, 5, 7, 5…
$ votedir      <dbl> 7, 7, 0, 10, 5, 6, 5, 9, 10, 8, 4, 5, 9, 7, 2, 7, 7, 8, 7…
$ lnghom1      <chr> "ENG", "DUT", "FRE", "ARA", "DUT", "DUT", "FRE", "DUT", "…
$ brncntr      <dbl> 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
$ mbrncntc     <chr> "NG", "6666", "FR", "IQ", "NL", "6666", "6666", "6666", "…
$ fbrncntc     <chr> "NG", "6666", "FR", "IQ", "NL", "6666", "6666", "6666", "…
$ freehms      <dbl> 2, 1, 2, 3, 1, 2, 1, 2, 1, 1, 4, 1, 1, 2, 1, 2, 1, 2, 1, …
$ hmsacld      <dbl> 1, 3, 2, 3, 1, 2, 2, 3, 1, 3, 5, 1, 1, 3, 1, 2, 1, 2, 1, …
$ lrnobed      <dbl> 2, 1, 2, 3, 3, 2, 1, 1, 2, 1, 1, 3, 2, 2, 3, 1, 1, 2, 4, …
$ loylead      <dbl> 2, NA, 3, NA, 4, 3, NA, 2, 3, 1, 2, 3, 3, 3, 4, 3, 1, 2, …
$ imdfetn      <dbl> 2, 1, 2, 1, 2, 2, 4, 2, 2, 2, 3, 1, 2, 2, 2, 2, 2, 3, 1, …
$ accalaw      <dbl> 5, 5, 0, NA, 1, 2, 0, 4, 0, 6, 6, 0, 6, 4, 2, 4, 8, 6, 0,…
$ dweight...32 <dbl> 0.8822203, 1.0476425, 1.0877413, 0.9099100, 0.9189489, 0.…
$ rlgdnm       <dbl> 4, NA, NA, 6, NA, NA, NA, 1, NA, 1, 6, NA, 1, NA, 2, 6, 1…
$ imueclt      <dbl> 8, 5, 5, 8, 7, 6, NA, 4, 10, 5, 7, 10, 8, 6, 9, 4, 8, 4, …
$ region       <chr> "BE24", "BE24", "BE33", "BE21", "BE24", "BE23", "BE32", "…
$ lrscale      <dbl> 3, 5, 5, 5, 4, 5, NA, 6, 6, 2, 1, 2, 4, 6, 6, 6, 7, 5, 1,…
$ dweight...37 <dbl> 0.8822203, 1.0476425, 1.0877413, 0.9099100, 0.9189489, 0.…
$ dweight...38 <dbl> 0.8822203, 1.0476425, 1.0877413, 0.9099100, 0.9189489, 0.…

Although I have already filtered this dataset beforehand, I can see that there are still some variables that I am not interested in. Let’s get rid of them using the select() function. If you put a - in front of a variable name, it will unselect it:

ess <- ess |> 
  select(-c(region, accalaw, mbrncntc, fbrncntc))
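For comparison, here is a base R sketch that drops the same four columns (run either this or the select() call above, not both):

# base R equivalent of select(-c(...)): keep every column except the four listed
ess <- ess[, !(names(ess) %in% c("region", "accalaw", "mbrncntc", "fbrncntc"))]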

For the sake of the exercise, we will filter our dataset and create a subset which only contains the participating countries in eastern Europe. Let's first look at the countries that are in this dataset (what is displayed here is only about half of the countries that are included in this wave):

ess$cntry |> table()

  BE   BG   CH   CZ   EE   FI   FR   GB   GR   HR   HU   IE   IS   IT   LT   ME 
1341 2718 1523 2476 1542 1577 1977 1149 2799 1592 1849 1770  903 2640 1659 1278 
  MK   NL   NO   PT   SI   SK 
1429 1470 1411 1838 1252 1418 

We will therefore filter out countries which are in western Europe:

ess <- ess |> 
  filter(cntry %in% c("BG", "CZ", "EE", "HR", "HU", "LT", "ME", "MK", "SI", "SK"))

ess$cntry |> table()

  BG   CZ   EE   HR   HU   LT   ME   MK   SI   SK 
2718 2476 1542 1592 1849 1659 1278 1429 1252 1418 

For this exercise, I would like to predict respondents' attitudes towards income redistribution, measured by the variable gincdif (more on this variable below).

Let’s inspect our data for now:

ess$gincdif |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   2.000   2.063   3.000   5.000     344 

There are 344 missing values on our dependent variable. For simplicity, we will replace them with the mean of the non-missing values:

ess <- ess |> 
  mutate(gincdif = case_when(
    is.na(gincdif) ~ mean(gincdif, na.rm = TRUE),
    TRUE ~ gincdif
  ))

ess$gincdif |> summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.064   3.000   5.000 
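For those who prefer base R, this sketch does the same mean imputation without the tidyverse (the result is identical):

# base R equivalent: overwrite the NAs in gincdif with the mean of the non-missing values
ess$gincdif[is.na(ess$gincdif)] <- mean(ess$gincdif, na.rm = TRUE)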

1.4 Running a simple OLS regression

Now that we have set up our data frame, we can build our OLS model. For that, we can simply use the lm() function that comes with Base R, it is built into R so to speak. In this function, we specify the data and then construct the model by using the tilde (~) between the dependent variable and the independent variable(s). Store your model in an object which can later be subject to further treatment and analysis.

In this example, I will predict views on redistribution and income inequalities measured through the variable gincdif. This will be our dependent variable (DV). The exact phrasing of the question is the following:

Using this card, please say to what extent you agree or disagree with the following statement: The government should take measures to reduce differences in income levels.

  1. Agree strongly
  2. Agree
  3. Neither agree nor disagree
  4. Disagree
  5. Disagree strongly

I will predict this variable with a set of predictors, also called independent variables (IVs). While you can only have one dependent variable per regression, you can have many independent variables. Here, the predictors will be socio-demographic variables: the age, gender, household income, and religiosity of respondents.

Let’s make sure that everything is properly coded:

ess <- ess |> 
  mutate(age = 2022 - yrbrn,
         gndr = as.factor(gndr))

We can run an OLS model because we treat this variable as continuous. If the dependent variable were categorical or binary, we would have to change our modeling approach; this is covered in Chapters 2 and 3.

ess_model <- lm(gincdif ~ age + hinctnta + gndr + rlgatnd, data = ess)

summary(ess_model)

Call:
lm(formula = gincdif ~ age + hinctnta + gndr + rlgatnd, data = ess)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3976 -0.9006 -0.0767  0.7001  3.3375 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.7820127  0.0520506  34.236  < 2e-16 ***
age         -0.0018906  0.0005089  -3.715 0.000204 ***
hinctnta     0.0443685  0.0034183  12.980  < 2e-16 ***
gndr2       -0.0483081  0.0170449  -2.834 0.004601 ** 
rlgatnd      0.0299664  0.0060143   4.983 6.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9758 on 13525 degrees of freedom
  (3683 observations deleted due to missingness)
Multiple R-squared:  0.02321,   Adjusted R-squared:  0.02292 
F-statistic: 80.36 on 4 and 13525 DF,  p-value: < 2.2e-16

Since the summary() function only prints its output to the console and that output is not very pretty, I encourage you to use the broom package for a nicer regression table.

broom::tidy(ess_model)
# A tibble: 5 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  1.78     0.0521       34.2  1.97e-246
2 age         -0.00189  0.000509     -3.72 2.04e-  4
3 hinctnta     0.0444   0.00342      13.0  2.70e- 38
4 gndr2       -0.0483   0.0170       -2.83 4.60e-  3
5 rlgatnd      0.0300   0.00601       4.98 6.35e-  7

You can also use the stargazer package in order to export your tables to text or LaTeX format which you can then copy to your documents.

library(stargazer)
stargazer(ess_model, 
          # you can change this to latex or html depending on your needs
          type = "text",
          # changing the threshold for the stars
          star.cutoffs = c(0.05, 0.01, 0.001),
          # changing the label of the DV
          dep.var.labels = "Attitudes towards economic redistribution",
          # renaming the IVs: labels are matched by position, in the order the
          # predictors appear in the model; the intercept is labelled "Constant"
          # and placed at the bottom of the table by default
          covariate.labels = c(
            "Age", 
            "Household Income",
            "Women", 
            "Religiosity (Attendance)"
            )
          )

==================================================================
                                    Dependent variable:           
                         -----------------------------------------
                         Attitudes towards economic redistribution
------------------------------------------------------------------
Age                                      -0.002***                
                                          (0.001)                 
                                                                  
Household Income                          0.044***                
                                          (0.003)                 
                                                                  
Women                                    -0.048**                 
                                          (0.017)                 
                                                                  
Religiosity (Attendance)                  0.030***                
                                          (0.006)                 
                                                                  
Constant                                  1.782***                
                                          (0.052)                 
                                                                  
------------------------------------------------------------------
Observations                              13,530                  
R2                                         0.023                  
Adjusted R2                                0.023                  
Residual Std. Error                 0.976 (df = 13525)            
F Statistic                      80.356*** (df = 4; 13525)        
==================================================================
Note:                                *p<0.05; **p<0.01; ***p<0.001
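If you would rather write the table straight to a file than copy it from the console, stargazer's out argument does that for you. A sketch (the file name is just an example; use type = "latex" and a .tex file for LaTeX documents):

stargazer(ess_model,
          type = "html",
          out = "ess_model_table.html",   # example file name; pick your own
          star.cutoffs = c(0.05, 0.01, 0.001))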

1.4.1 Interpretation of OLS Results

How do we interpret this?

The first thing we want to identify is the structure of the table. In the left column, we have our independent variables—our predictors. On the right-hand side, we have the corresponding coefficients, some of which have stars. In parentheses and underneath each coefficient, we find the standard errors. Smaller standard errors suggest more precise estimates.

The stars tell us something about statistical significance. If a variable has a p-value below a certain threshold, it will be marked with one or more stars. A single star * denotes significance at the 5% level (p < 0.05), two stars ** indicate significance at the 1% level (p < 0.01), and three stars *** show significance at the 0.1% level (p < 0.001). This means that a specific variable has a statistically significant relationship with the dependent variable and contributes to explaining its variance. However, the absence of statistical significance does not necessarily mean that the variable has no effect—it may suggest that there is insufficient evidence to detect a meaningful relationship given the data.

The \(\beta\)-coefficients tell us about the direction and strength of the association between a statistically significant predictor and our dependent variable. A positive coefficient indicates that an increase in the independent variable is associated with an increase in the dependent variable, while a negative coefficient suggests that an increase in the independent variable corresponds to a decrease in the dependent variable. Since this is an ordinary least squares (OLS) regression, we can interpret the coefficients directly in their original scale. This direct interpretation holds for linear regression but does not apply to logistic regression, which we will cover in the next session.
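To see what interpreting coefficients on their original scale means in practice, here is a small sketch that computes a predicted value by hand for a hypothetical respondent (a 40-year-old man in income decile 5 with religious attendance 4; values I made up purely for illustration) and checks it against predict():

b <- coef(ess_model)

# prediction "by hand": intercept plus each coefficient times the chosen value
# (the gndr2 coefficient is switched off because our hypothetical respondent is a man)
b["(Intercept)"] + b["age"] * 40 + b["hinctnta"] * 5 + b["gndr2"] * 0 + b["rlgatnd"] * 4

# the same prediction via predict()
predict(ess_model,
        newdata = data.frame(age = 40, hinctnta = 5,
                             gndr = factor("1", levels = c("1", "2")),
                             rlgatnd = 4))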

Let’s interpret the above regression table. Always bear in mind what a change of values would correspond to on the measurement scale of your dependent variable. In our case, higher values correspond to opposition towards redistribution, while lower values correspond to being in favor of redistribution:

  1. Age is negatively and significantly associated with attitudes towards economic redistribution. A one-year increase in age corresponds to a 0.002 decrease in the dependent variable, meaning older individuals tend to be slightly more supportive of redistribution than younger ones.

  2. Household income is positively and significantly associated with attitudes towards redistribution. A one-unit (decile) increase in income corresponds to a 0.044 increase in opposition to redistribution: richer respondents tend to be less in favor of redistribution.

  3. The coefficient for the gender variable suggests that women hold significantly more favorable attitudes towards redistribution than men. The coefficient is -0.048, meaning women are, on average, slightly less opposed to redistribution.

Bear in mind that this is not a good model! It only contains four independent variables and we are likely omitting other important predictors that would change the model outcomes substantively. It is very likely that we have a case of Omitted Variable Bias in our model. This is probably the worst bias to have, because the omission or inclusion of certain variables changes not only your standard errors but potentially also the size and even the direction of your coefficients. So treat the estimates above as descriptive associations, not causal effects.

There is also some additional information in the table that tells us more about the model fit, its performance so to speak:

  • R2: Imagine you're trying to draw a line that best fits a bunch of dots (data points) on a graph. The R-squared value measures how well that line fits the dots. It is a number between 0 and 1, where 0 means the line does not fit the dots at all and 1 means the line fits the dots perfectly. R-squared tells us how much of the variation in the dependent variable is explained by the predictor variables.
  • Adjusted R2: Adjusted R-squared is the same idea as R-squared, but it adjusts for how many predictor variables you have. It penalizes the model for every additional predictor, so it is always a bit lower than the plain R-squared, and it is a better indicator of fit when comparing models with different numbers of predictors.
  • Residual Std. Error: The residual standard error measures the average distance between the line you've drawn (your model's predictions) and the actual data points. Think of it like a test where you want to get as many answers right as possible: if your answers are off by a lot, the residual standard error is high; if they are off by only a little, it is low. In short, a lower residual standard error is better, as it means the model's predictions are closer to the true values in the data.
  • F Statistic: The F-statistic is like a test score that tells you how well your model is doing compared to a really simple model with no predictors. It is a way to check whether the model you have built is any better than just guessing the mean. A large F-statistic (with a small p-value) means that your model does much better than that.
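If you ever need these fit statistics programmatically rather than reading them off a table, they are stored inside the summary object (a quick sketch; glance() from the broom package, shown further below, returns the same kind of information as a tibble):

model_summary <- summary(ess_model)

model_summary$r.squared       # R-squared
model_summary$adj.r.squared   # adjusted R-squared
model_summary$sigma           # residual standard error
model_summary$fstatistic      # F statistic with its degrees of freedom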

1.4.2 Visualizing the Regression Line

Just for fun and to refresh your ggplot knowledge, let's visualize the regression results. Here, we plot each coefficient estimate together with its confidence interval (a so-called coefficient plot) and draw a vertical line at zero:

tidy_ess_model <- broom::tidy(ess_model, conf.int = TRUE) |> 
  filter(!str_detect(term, "Intercept"))


tidy_ess_model |> 
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_point(position = position_dodge(width = 0.5)) +
  geom_errorbar(
    aes(xmin = conf.low, xmax = conf.high),
    width = 0.2,
    position = position_dodge(width = 0.5)
  ) +
  scale_color_brewer(palette = "Set1") +
  geom_vline(xintercept = 0,
             color = "black")

Remember how I told you above that OLS is about finding the smallest sum of squared errors! We can visualize this idea as well. The residuals are the vertical distances from the observed data points to the fitted line (drawn as red lines below), and the OLS line is the line that minimizes the sum of their squares.
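Here is a minimal sketch of that idea. Because a single fitted line can only be drawn in two dimensions, I fit a deliberately simple one-predictor model (gincdif on household income) rather than our full model, and I plot a random subset of observations so the picture stays readable. It uses augment() from the broom package, which we will look at properly in the next section:

# a simple bivariate model so the fitted line can be drawn in two dimensions
simple_model <- lm(gincdif ~ hinctnta, data = ess)

library(broom)
set.seed(1)

augment(simple_model) |> 
  slice_sample(n = 200) |>                                   # random subset for readability
  ggplot(aes(x = hinctnta, y = gincdif)) +
  # the red segments are the residuals: observed value minus fitted value
  geom_segment(aes(xend = hinctnta, yend = .fitted), color = "red", alpha = 0.5) +
  geom_point(alpha = 0.5) +
  # the fitted values all lie on the OLS line
  geom_line(aes(y = .fitted), color = "black")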

1.5 The broom package (OPTIONAL!)

The broom package in R is designed to bridge the gap between R’s statistical output and tidy data. 2 It takes the output of various R statistical functions and turns them into tidy data frames. This is particularly useful because many of R’s modeling functions return outputs that are not immediately suitable for further data analysis or visualization within the tidyverse framework.

1.5.1 Nesting with nest()

Nesting in the context of the broom package usually refers to the idea of creating a list-column in a data frame (or even better a tibble) where each element of this list-column is itself a data frame (again, even better a tibble) or a model object. In a complex analysis, you might fit separate models to different subsets of data. Nesting allows you to store each of these models (or their broom-tidied summaries that we will see in the next three sub-sections) within a single, larger data frame (for the third time, the best is to do this with a tibble) for easier manipulation and analysis. For questions on what tibbles are see the below section on tibbles.

Let’s go back to the ESS which I have used before. The data was collected within countries, and depends on the country context. In other – more statistical terms – this means that the data is nested within countries.

I can group my observations by country using group_by() and then nest them within these groups. The output will only contain as many rows as we have countries, and each row will contain a tibble with the observations of the respective country.

ess |> 
  group_by(cntry) |>
  nest()
# A tibble: 10 × 2
# Groups:   cntry [10]
   cntry data                 
   <chr> <list>               
 1 BG    <tibble [2,718 × 34]>
 2 CZ    <tibble [2,476 × 34]>
 3 EE    <tibble [1,542 × 34]>
 4 HR    <tibble [1,592 × 34]>
 5 HU    <tibble [1,849 × 34]>
 6 LT    <tibble [1,659 × 34]>
 7 ME    <tibble [1,278 × 34]>
 8 MK    <tibble [1,429 × 34]>
 9 SI    <tibble [1,252 × 34]>
10 SK    <tibble [1,418 × 34]>

That is a very simple application of a nesting process. You can group_by() and nest() by many different variables: you might want to nest your observations per year, within countries, or by multiple variables at the same time. We will see this idea again in the next session, when we talk about the purrr package and how to automatically run regressions for several countries at once; a small preview follows below.
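As a small preview (a sketch only; we will do this properly with the purrr package next session), nest() combines nicely with map() to fit one regression per country and collect all the tidied coefficients in a single tibble:

ess |> 
  group_by(cntry) |> 
  nest() |> 
  # fit the same simple model within every country, then tidy each fit
  mutate(model = map(data, \(d) lm(gincdif ~ age + hinctnta, data = d)),
         coefs = map(model, broom::tidy)) |> 
  select(cntry, coefs) |> 
  unnest(coefs)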

1.5.2 Model estimates with tidy()

The tidy() function takes the statistical output of a model and turns it into a tidy tibble. This means each row is an observation (e.g., a coefficient in a regression output) and each column is a variable (e.g., estimate, standard error, statistic). For instance, after fitting a linear model, you can use tidy() to create a data frame where each row represents a coefficient, with columns for estimates, standard errors, t-values, and p-values.

1.5.3 Key metrics with glance()

glance() provides a one-row summary of a model’s information. It captures key metrics that describe the overall quality or performance of a model, like \(R^2\), AIC, BIC in the context of linear models. This is useful for getting a quick overview of a model’s performance metrics.

1.5.4 Residuals with augment()

The augment() function adds information about individual observations to the original data, such as fitted values or residuals in a regression model. You want to do this when you are evaluating a model fit at the observation level, checking for outliers, or understanding the influence of individual data points. We will talk more about model diagnostics in the next session!

Let's take a look at how this works in practice. For simplicity's sake, we will set the nest() logic aside for a second and only look at the tidy(), glance(), and augment() functions.

Here I simply reuse the model that we built above:

tidy(ess_model)
# A tibble: 5 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  1.78     0.0521       34.2  1.97e-246
2 age         -0.00189  0.000509     -3.72 2.04e-  4
3 hinctnta     0.0444   0.00342      13.0  2.70e- 38
4 gndr2       -0.0483   0.0170       -2.83 4.60e-  3
5 rlgatnd      0.0300   0.00601       4.98 6.35e-  7

Now let’s glance the heck out of the model:

glance(ess_model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1    0.0232        0.0229 0.976      80.4 1.65e-67     4 -18864. 37740. 37785.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Don’t worry about what these things might mean for now, AIC and BIC will for example come up again next session.

Or augment it…

augment(ess_model)
# A tibble: 13,530 × 12
   .rownames gincdif   age hinctnta gndr  rlgatnd .fitted  .resid    .hat .sigma
   <chr>       <dbl> <dbl>    <dbl> <fct>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 1               4    77        7 2           6    2.08  1.92   4.26e-4  0.976
 2 3               4    51        7 2           5    2.10  1.90   1.82e-4  0.976
 3 4               2    52        6 2           5    2.05 -0.0514 1.47e-4  0.976
 4 5               1    71        4 1           5    1.98 -0.975  2.61e-4  0.976
 5 8               2    49        4 1           7    2.08 -0.0766 2.88e-4  0.976
 6 9               1    72        3 1           5    1.93 -0.929  2.87e-4  0.976
 7 10              1    72        3 1           5    1.93 -0.929  2.87e-4  0.976
 8 11              1    72        4 2           7    1.98 -0.985  3.43e-4  0.976
 9 15              1    84        2 2           7    1.87 -0.873  5.11e-4  0.976
10 17              1    48        9 1           5    2.24 -1.24   3.26e-4  0.976
# ℹ 13,520 more rows
# ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>

1.5.5 What is a tibble?
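A tibble is the tidyverse's modern take on the data frame. It behaves like a regular data frame in almost every respect, but it prints more politely (only the first ten rows, together with every column's type), it does not convert strings to factors, and it does not use row names. A quick way to see the difference, using the built-in mtcars data:

# a classic data frame printed with head(); note it does not show column types
head(mtcars)

# the same data as a tibble: prints only the first rows and shows each column's type
library(tibble)
as_tibble(mtcars)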


  1. Malo’s explanation and way of introducing you to RStudio projects can be found here.↩︎

  2. As a quick reminder these three principles are guiding when we speak of tidy data: 1) Every column is a variable, 2) Every row is an observation, 3) Every cell is a single value.↩︎