6  Extrapolation

When we use a model to make predictions, it matters whether the predictor values we supply fall within the range of data we actually observed. Predictions made inside the observed range are generally more trustworthy, while predictions outside that range require much more caution.

6.1 Interpolation vs. Extrapolation

  • Interpolation means predicting at predictor values inside the range observed in the dataset.
  • Extrapolation means predicting at predictor values outside the range observed in the dataset.

Interpolation is usually safer because the model is working in a part of the data space it has actually seen. Extrapolation is riskier because it assumes the same relationship continues beyond the observed data, which may not be true.
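To make the risk concrete, here is a minimal sketch on synthetic data (not the penguin data). It assumes a curved true relationship that is observed only for x between 0 and 5, fits a straight line, and compares the prediction error at one point inside the observed range with the error at one point outside it. The names fit, truth, and new_x are illustrative.

```r
# Synthetic illustration: the true relationship is quadratic, but x is
# only observed between 0 and 5.
set.seed(1)
x <- runif(100, min = 0, max = 5)
y <- x^2 + rnorm(100, sd = 1)

# A straight-line fit, which is only an approximation of the curve
fit <- lm(y ~ x)

# The true mean function, known here only because the data are simulated
truth <- function(x) x^2

# x = 2.5 lies inside the observed range, x = 10 lies outside it
new_x <- data.frame(x = c(2.5, 10))
pred <- predict(fit, newdata = new_x)

# Absolute error of each prediction against the true mean
err <- abs(pred - truth(new_x$x))
data.frame(x = new_x$x, predicted = pred, true_mean = truth(new_x$x), error = err)
```

In this setup the error at x = 10 is far larger than the error at x = 2.5, because the straight line keeps going while the true curve bends away from it.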

6.2 Check the Observed Range

Before making predictions, it is helpful to inspect the observed range of each predictor the model uses. Here we check flipper length:

range(df$flipper_length_mm)
[1] 172 231

The smallest flipper length observed in the data is 172 mm and the largest is 231 mm.
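Since the model in this chapter uses three predictors, the same check can be applied to all of them at once. The sketch below uses a synthetic stand-in called df_demo with made-up values (only the column names follow the text), so the numbers it prints are not the real observed ranges.

```r
# Synthetic stand-in for df: values are made up, column names follow the text
set.seed(42)
df_demo <- data.frame(
  bill_length_mm    = runif(50, 35, 55),
  bill_depth_mm     = runif(50, 14, 20),
  flipper_length_mm = runif(50, 175, 230)
)

predictors <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")

# range() returns c(min, max); sapply() stacks the results into a
# 2 x 3 matrix with one column per predictor. na.rm = TRUE guards
# against missing values, which would otherwise make range() return NA.
obs_range <- sapply(df_demo[predictors], range, na.rm = TRUE)
rownames(obs_range) <- c("min", "max")
obs_range
```

With the real data, replacing df_demo with df would give the observed minimum and maximum of every predictor in one table.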

6.3 Prediction Inside the Observed Range

Here we make a prediction using values that fall within the range of the dataset.

predict(lm_model, newdata = data.frame(
  bill_length_mm = 45,
  bill_depth_mm = 17,
  flipper_length_mm = 200
))
       1 
4158.348 

The flipper length of 200 mm lies inside the observed range of 172 to 231 mm. Assuming the two bill measurements also lie within their own observed ranges, this is an example of interpolation.

6.4 Prediction Outside the Observed Range

Now we make a prediction using values outside the observed range. For example, a flipper length of 260 mm is well beyond the observed maximum of 231 mm.

predict(lm_model, newdata = data.frame(
  bill_length_mm = 65,
  bill_depth_mm = 25,
  flipper_length_mm = 260
))
       1 
7412.624 

This is an example of extrapolation. R returns a prediction without any warning, but we should be cautious about interpreting it because the model is being asked to predict in a region where it has no data.
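Because predict() does not flag extrapolation on its own, one option is a small guard that compares new values against the observed ranges before predicting. The helper below, outside_range(), is hypothetical (it is not part of base R or of this chapter's code), and the train data frame holds made-up values standing in for the real dataset.

```r
# Hypothetical helper: for each row of newdata, report which predictors
# fall outside the [min, max] seen in the training data.
outside_range <- function(train, newdata, vars = names(newdata)) {
  flags <- sapply(vars, function(v) {
    r <- range(train[[v]], na.rm = TRUE)
    newdata[[v]] < r[1] | newdata[[v]] > r[2]
  })
  # sapply() returns a plain vector when newdata has one row,
  # so force a matrix with one row per new observation
  matrix(flags, nrow = nrow(newdata), dimnames = list(NULL, vars))
}

# Made-up training values (stand-in for df)
train <- data.frame(
  bill_length_mm    = c(35, 45, 55),
  bill_depth_mm     = c(14, 17, 20),
  flipper_length_mm = c(172, 200, 231)
)

# The two new observations used in the text
new_points <- data.frame(
  bill_length_mm    = c(45, 65),
  bill_depth_mm     = c(17, 25),
  flipper_length_mm = c(200, 260)
)

flags <- outside_range(train, new_points)
flags

# TRUE for any row that would require extrapolation in at least one predictor
any_outside <- apply(flags, 1, any)
any_outside
```

A check like this looks at one predictor at a time. A new point can sit inside every individual range and still be an unusual combination of values, so passing the check does not guarantee that the prediction is an interpolation.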

6.5 Visualizing the Risk

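The dashed vertical lines below mark the observed range of flipper lengths in the data. The fitted line is a simple regression of body mass on flipper length alone, not the three-predictor lm_model used above. Any prediction to the left or right of the dashed lines would be an extrapolation.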

ggplot(df, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_vline(xintercept = range(df$flipper_length_mm), linetype = "dashed") +
  labs(title = "Extrapolation Warning")
`geom_smooth()` using formula = 'y ~ x'
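A numeric companion to the plot is the width of the prediction interval, which predict() reports for linear models when called with interval = "prediction". The sketch below uses synthetic data with made-up coefficients (only the column names and the 172 to 231 range follow the text) and compares the interval width at points inside and outside that range.

```r
# Synthetic data: flipper lengths span the range from the text,
# body mass is generated from made-up coefficients plus noise
set.seed(7)
flipper <- runif(80, min = 172, max = 231)
mass <- -5800 + 50 * flipper + rnorm(80, sd = 400)
demo <- data.frame(flipper_length_mm = flipper, body_mass_g = mass)

fit <- lm(body_mass_g ~ flipper_length_mm, data = demo)

# Two points inside the observed range (200, 231) and two outside (260, 300)
grid <- data.frame(flipper_length_mm = c(200, 231, 260, 300))

# 95% prediction intervals: columns fit, lwr, upr
pred_int <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

# Width of each interval
width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(grid, width)
```

The interval gets wider as the new value moves away from the center of the data. Even so, these intervals assume the straight-line relationship continues to hold, so they likely understate the true uncertainty of an extrapolated prediction.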

6.6 Key Takeaways

  • Interpolation uses the model within the range of observed data.
  • Extrapolation uses the model outside the range of observed data.
  • Extrapolated predictions may look precise, but they are often much less reliable.
  • Just because a model can return a prediction does not mean that prediction should be trusted.