8  Training and Testing

When building models, it is important to evaluate how well they perform on new, unseen data. If we only evaluate a model on the same data it was trained on, we can get overly optimistic results.

8.1 Why Split the Data?

If we train and test on the same dataset:

  • the model has already “seen” the data
  • it can appear more accurate than it really is
  • we cannot tell how well it will perform in practice

To address this, we split the data into:

  • Training set → used to fit the model
  • Test set → used to evaluate performance

8.2 Create Train and Test Sets

set.seed(123)

train_index <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_index, ]
test  <- df[-train_index, ]
  • 80% of the data is used for training
  • 20% is held out for testing
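A quick sanity check: the training and test sets should be disjoint and together cover every row, which is exactly what negative indexing with `-train_index` guarantees. A minimal sketch, using a small synthetic data frame as a stand-in for the real `df`:

```r
# Synthetic stand-in for df; the real data works the same way
df <- data.frame(x = rnorm(100), y = rnorm(100))

set.seed(123)
train_index <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_index, ]
test  <- df[-train_index, ]

nrow(train)  # 80
nrow(test)   # 20
# The two index sets are disjoint: no row is in both train and test
length(intersect(train_index, setdiff(seq_len(nrow(df)), train_index)))  # 0
```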

8.3 Train the Model

lm_train <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train
)

The model is fit only on the training data.


8.4 The Wrong Way: Evaluate on Training Data

train_preds_wrong <- predict(lm_train, newdata = train)
sqrt(mean((train$body_mass_g - train_preds_wrong)^2))
[1] 397.3888
  • This often produces a low error
  • But it is misleading because the model has already seen this data

This is like studying with the exact answers to the exam: the score looks great, but it says nothing about how you would do on new questions.


8.5 The Correct Way: Evaluate on Test Data

test_preds <- predict(lm_train, newdata = test)
sqrt(mean((test$body_mass_g - test_preds)^2))
[1] 364.3985
  • This gives a more realistic estimate of performance
  • It reflects how the model behaves on new data
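The error metric used in both evaluations is the root-mean-square error (RMSE): the square root of the average squared prediction error, reported in the same units as the outcome (grams, for body mass). Since the same expression appears twice, it can be factored into a small helper; `rmse()` is a hypothetical name defined here, not a base R function:

```r
# Root-mean-square error: typical size of a prediction error,
# in the same units as the outcome variable
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(10, 20, 30), c(12, 18, 33))  # sqrt((4 + 4 + 9) / 3), about 2.38
```

With this helper, the two evaluations become `rmse(train$body_mass_g, train_preds_wrong)` and `rmse(test$body_mass_g, test_preds)`.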

8.6 Visualizing Predictions

eval_df <- test
eval_df$predicted_body_mass_g <- test_preds

library(ggplot2)

ggplot(eval_df, aes(x = body_mass_g, y = predicted_body_mass_g)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Test Set Predictions")

  • Points close to the dashed line indicate good predictions
  • Larger deviations indicate prediction errors

8.7 Key Takeaways

  • Always evaluate models on data they have not seen
  • Training error is often misleading
  • Test error provides a better estimate of real-world performance

A good model is not one that fits the training data perfectly,
but one that generalizes well to new data.
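The whole workflow can be reproduced end to end without the penguin data. The sketch below uses synthetic data with a known linear relationship; on data like this the training error is usually (though not always) a little lower than the test error, illustrating the gap discussed above:

```r
set.seed(42)
n <- 200
# Synthetic data: a known linear relationship plus noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# 80/20 split, as in Section 8.2
idx   <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Fit on training data only, as in Section 8.3
fit <- lm(y ~ x1 + x2, data = train)

# Evaluate on both sets, as in Sections 8.4 and 8.5
rmse_train <- sqrt(mean((train$y - predict(fit, newdata = train))^2))
rmse_test  <- sqrt(mean((test$y  - predict(fit, newdata = test))^2))

c(train = rmse_train, test = rmse_test)
```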