# Regression
Regression is a common statistical technique used to make generalizations about, and predictions from, data.
::: callout-note
The purpose of this chapter is to give readers a high-level overview of how to apply regression techniques in R rather than to give a full introduction to regression itself. However, there are multiple comprehensive resources in the resources section for interested readers.
:::
## Linear Regression
The first kind of regression we'll cover is linear regression. Linear regression uses your data to fit a linear model that describes its general trend. Generally speaking, a linear model consists of a dependent variable (y), at least one independent variable (x), a coefficient for each independent variable, and an intercept. Here's one common linear model you may remember:
$$
y = mx + b
$$
This is the simple linear model many people begin with, where x and y are the independent and dependent variables, respectively, m is the slope (the coefficient of x), and b is the intercept.
To perform linear regression in R, you'll use the "lm" function. Let's try it out on the "faithful" dataset.
```{r}
#| output: false
head(faithful)
```
```{r}
#| echo: false
knitr::kable(head(faithful), format="markdown")
```
The "lm" function will accept at least two parameters which represent "y" and "x" in this format:
```{r}
#| eval: false
lm(y ~ x)
```
Let's try this out by setting the y variable to eruptions and the x variable to waiting. We can then view the output of our linear model by using the "summary" function.
```{r}
lm <- lm(faithful$eruptions ~ faithful$waiting)
summary(lm)
```
This summary shows the statistical significance of our model along with the statistics needed to interpret it. Additionally, we now have our model coefficients. From this summary, we can see that our model looks something like this:
$$
\text{eruptions} = 0.075628 \times \text{waiting} - 1.874016
$$
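As an aside, the fitted coefficients can be pulled straight from the model object, which lets you plug values into this equation without retyping the numbers. Here's a minimal sketch using the model fitted above; the 80-minute waiting time is just an illustrative value.
```{r}
# Extract the fitted intercept and slope from the model object
coef(lm)

# Sketch: predict eruption duration for a hypothetical 80-minute wait by hand
coef(lm)[1] + coef(lm)[2] * 80
```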
Let's break down everything that the model summary returns.
- The "Call" section calls the model that you created
- The "Residuals" section gives you a summary of all of your model residuals. Simply put, a residual denotes how far away any given point falls from the predicted value.
- The "Coefficients" section gives us our model coefficients, our intercept, and statistical values to determine their significance.
- For each coefficient, we are given the respective standard error. The standard error measures the precision of the coefficient's estimate.
- Next, we have a t value for each coefficient. The t value is calculated by dividing the coefficient estimate by its standard error.
- Finally, you have the p value accompanied by symbols to denote the corresponding significance level.
- The residual standard error measures the standard deviation of the residuals. It is calculated by dividing the residual sum of squares by the residual degrees of freedom and taking the square root, where the residual degrees of freedom equals the number of observations minus the number of predictor variables minus one.
- R-squared gives you the proportion of variance in the dependent variable that is explained by your model. The adjusted R-squared statistic tells you the same thing but adjusts for the number of variables you've included in your model.
- Your F-statistic tests the null hypothesis that all of your model coefficients (other than the intercept) are equal to zero; a small p value suggests that at least one predictor is useful. A sketch of how to pull these statistics out of the summary object follows this list.
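The pieces described above can also be accessed programmatically from the summary object. Here's a minimal sketch; the component names below come from the list that "summary" returns for an "lm" object.
```{r}
model_summary <- summary(lm)

model_summary$coefficients   # coefficient table: estimate, std. error, t value, p value
model_summary$sigma          # residual standard error
model_summary$r.squared      # R-squared
model_summary$adj.r.squared  # adjusted R-squared
model_summary$fstatistic     # F-statistic and its degrees of freedom
```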
## Multiple Regression
If you had more x variables you wanted to add to your linear model, you could add them just like you would in any other math equation. Here's an example:
```{r}
#| eval: false
lm(data$y ~ data$x1 + data$x2 + data$x3 + data$x4)
```
Additionally, you can use the "data" parameter rather than putting the name of the dataset before every variable.
```{r}
#| eval: false
lm(y ~ x1 + x2 + x3 + x4, data = data)
```
Let's try a real example with the mtcars dataset.
```{r}
#| output: false
head(mtcars)
```
```{r}
#| echo: false
knitr::kable(head(mtcars), format="markdown")
```
Now, let's try to predict mpg using every other column as a predictor and see what the results look like.
```{r}
lm <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
         data = mtcars)
summary(lm)
```
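As a side note, the same model can be written more compactly with the "." shorthand, which stands in for every other column in the data. Here's a sketch that refits the model this way and compares the first few fitted values against the observed mpg values:
```{r}
# "." expands to every column in mtcars other than mpg
lm_all <- lm(mpg ~ ., data = mtcars)

# Compare the first few fitted values with the observed mpg values
head(predict(lm_all))
head(mtcars$mpg)
```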
From here, you would likely refine your model based on these significance statistics; however, that's outside the scope of what we're doing in this book. Take a look at the resources section at the end of this chapter to dive deeper into developing regression models.
## Logistic Regression
Logistic regression is commonly used when your dependent variable (y) is binary (0 or 1). Instead of using the "lm" function, you will use the "glm" function with the binomial family. Let's try this out on the mtcars dataset again, this time with "am" as the dependent variable.
```{r}
glm <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)
summary(glm)
```
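The coefficients returned by "glm" are on the log-odds scale, so it's often more intuitive to look at predicted probabilities. Here's a minimal sketch using the fitted model above:
```{r}
# Convert the fitted log-odds into predicted probabilities of am = 1
probs <- predict(glm, type = "response")

# Compare the first few predicted probabilities with the observed values
head(round(probs, 3))
head(mtcars$am)
```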
## Resources
- "Lecture 9 - Linear regression in R" by Professor Alexandra Chouldechova at Carnegie Mellon University: <https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture09/lecture09-94842.html>
- "Logistic Regression" by Erin Bugbee and Jared Wilber: <https://mlu-explain.github.io/logistic-regression/>
- "Visualizing OLS Linear Regression Assumptions in R" by Trevor French <https://medium.com/trevor-french/visualizing-ols-linear-regression-assumptions-in-r-e762ba7afaff>