9  Logistic Regression

Logistic regression is a modeling technique used when the outcome variable is binary, meaning it has two possible values (such as 0/1, yes/no, or success/failure). Instead of predicting a continuous value, logistic regression estimates the probability that an observation belongs to a particular class.
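The mapping from a linear predictor to a probability is done by the logistic (sigmoid) function, p = 1 / (1 + exp(-eta)). A minimal sketch (base R's `plogis()` computes the same function):

```r
# The logistic (sigmoid) function maps any real-valued linear predictor
# eta = b0 + b1*x1 + ... onto a probability between 0 and 1
sigmoid <- function(eta) 1 / (1 + exp(-eta))

sigmoid(0)    # 0.5: a linear predictor of 0 corresponds to a 50% probability
sigmoid(4)    # close to 1
sigmoid(-4)   # close to 0
```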

9.1 Create Train and Test Sets

As before, we split the data so that we can evaluate the model on unseen data.

set.seed(123)

train_index_cls <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train_cls <- df[train_index_cls, ]
test_cls  <- df[-train_index_cls, ]

In this example, the outcome variable is:

  • large_body = 1 → above-median body mass
  • large_body = 0 → below-median body mass
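The text does not show how large_body was created; a plausible sketch of the above-median definition, using toy values in place of the penguin measurements (the column name body_mass_g is assumed):

```r
# Hypothetical sketch: large_body is 1 if body mass is above the median,
# 0 otherwise. In the actual analysis, df holds the penguin measurements.
df <- data.frame(body_mass_g = c(3000, 4000, 5000, 6000))
df$large_body <- as.integer(df$body_mass_g > median(df$body_mass_g, na.rm = TRUE))
df$large_body  # 0 0 1 1
```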

9.2 Fit the Logistic Regression Model

glm_model <- glm(
  large_body ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_cls,
  family = "binomial"
)
  • glm() is used for generalized linear models
  • family = "binomial" specifies logistic regression

9.3 Inspect the Model

summary(glm_model)

Call:
glm(formula = large_body ~ bill_length_mm + bill_depth_mm + flipper_length_mm, 
    family = "binomial", data = train_cls)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -42.64099    7.12539  -5.984 2.17e-09 ***
bill_length_mm     -0.06164    0.04502  -1.369    0.171    
bill_depth_mm      -0.01402    0.12162  -0.115    0.908    
flipper_length_mm   0.22738    0.03350   6.787 1.15e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 367.79  on 265  degrees of freedom
Residual deviance: 163.89  on 262  degrees of freedom
AIC: 171.89

Number of Fisher Scoring iterations: 6
  • Coefficients represent how each predictor affects the log-odds of the outcome
  • A positive coefficient increases the probability of class 1 as the predictor increases, holding the other predictors fixed
  • A negative coefficient decreases it
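Because coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (in general, exp(coef(glm_model)) does this for every term). For example, using the flipper_length_mm estimate from the summary above:

```r
# An odds ratio is the multiplicative change in the odds of class 1
# for a one-unit increase in the predictor
odds_ratio <- exp(0.22738)
round(odds_ratio, 2)  # 1.26: each extra mm of flipper length multiplies the odds by ~1.26
```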

9.4 Predict Probabilities

prob_preds <- predict(glm_model, newdata = test_cls, type = "response")
  • Predictions are probabilities between 0 and 1
  • Each value is the estimated probability that the observation belongs to class 1
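The type = "response" argument asks predict() to apply the inverse-logit to the linear predictor; type = "link" would return the raw log-odds instead. A sketch of the relationship, with a toy dataset standing in for the penguin data:

```r
# type = "response" returns probabilities; type = "link" returns log-odds.
# The two are related by the inverse-logit (plogis).
toy <- data.frame(x = c(-2, -1, 0, 1, 2), y = c(0, 1, 0, 1, 1))
m <- glm(y ~ x, data = toy, family = "binomial")

log_odds <- predict(m, type = "link")
probs    <- predict(m, type = "response")
all.equal(probs, plogis(log_odds))  # TRUE: probabilities are the inverse-logit of log-odds
```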

9.5 Convert Probabilities to Classes

class_preds <- ifelse(prob_preds >= 0.5, 1, 0)
  • A threshold (here, 0.5) is used to assign class labels
  • Probabilities at or above the threshold become class 1; the rest become class 0
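The 0.5 cutoff is a convention, not a requirement; lowering the threshold labels more observations as class 1, which trades one kind of error for another. A sketch with illustrative probabilities:

```r
# Lowering the threshold flips borderline cases from class 0 to class 1
prob_sketch <- c(0.10, 0.45, 0.55, 0.90)
ifelse(prob_sketch >= 0.5, 1, 0)  # 0 0 1 1
ifelse(prob_sketch >= 0.3, 1, 0)  # 0 1 1 1: the 0.45 case now becomes class 1
```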

9.6 Evaluate the Model

mean(class_preds == test_cls$large_body)
[1] 0.8358209
  • This computes accuracy, the proportion of correct predictions
table(actual = test_cls$large_body, predicted = class_preds)
      predicted
actual  0  1
     0 29  2
     1  9 27
  • This is a confusion matrix
  • Diagonal cells count correct predictions; off-diagonal cells count errors
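Accuracy can hide how the errors are distributed across the two classes. Rebuilding the confusion matrix above by hand shows how per-class rates follow from the same four counts (note that matrix() fills column by column):

```r
# Rows are actual classes, columns are predicted classes, as in the table above
cm <- matrix(c(29, 9, 2, 27), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

sum(diag(cm)) / sum(cm)        # accuracy: (29 + 27) / 67, ~0.836
cm["1", "1"] / sum(cm["1", ])  # sensitivity (true-positive rate): 27 / 36 = 0.75
cm["0", "0"] / sum(cm["0", ])  # specificity (true-negative rate): 29 / 31, ~0.935
```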

9.7 Key Takeaways

  • Logistic regression is used for binary outcomes
  • The model predicts probabilities, not just classes
  • A threshold is used to convert probabilities into predictions
  • Evaluation should be done on a test dataset

Like linear regression, the goal is not just to fit the data,
but to make reliable predictions on new observations.