How sure are you? Confidence v. Prediction intervals

Two common uses for statistical (or machine learning) models are:

  1. Something happened. Why?
  2. Something will happen. What?

Spoiler Alert

  • When reporting results, tell your audience how certain you are. This is essential context.
  • "Why?": When presenting parameters (drivers, slopes, causal effects, and the like) use something like a confidence interval, Bayesian credible interval, or bootstrapping the data.
  • "What?": When presenting predictions, use a conformal prediction interval, Bayesian posterior predictive interval, or the like. Forecasts must use these.

"Why?" = inference about drivers

When we ask "why?", we are trying to make inferences about how the data was generated; specifically, inferences about the parameters of the model, which can be thought of as measures of the relationships between the features and the target.

For example, if we're using linear regression to fit a line to \(x\) and \(y\), our model looks like:

$$y=\text{slope}\cdot x + \text{intercept} + \text{error}$$

The parameters of this model are the slope, intercept, and standard deviation of the error. In many cases, we're interested in how \(x\) is related to \(y\), which means the interesting parameter is the slope.

For example, we might use a dataset to estimate the parameters and find:

$$y=2.6x+5.7+\mathcal{N}(0,0.52)$$

So, for every additional \(x\) we get 2.6 more \(y\). How sure are we that it's 2.6? Is it \(2.6\pm 0.1\) or \(2.6\pm 3\)?

The frequentist way to describe this uncertainty is to use a confidence interval. You can also use a Bayesian credible interval, a bootstrap interval generated by resampling the data, or other methods. Each has a somewhat different interpretation, but all are ways to convey how certain we are about how much more \(y\) is associated with one more \(x\).

You can also perform a hypothesis test to estimate how sure you are that the slope is measurably different from zero ("reject the null hypothesis"). A Bayesian alternative is to measure how likely it is that the slope is positive (counting the fraction of posterior draws greater than zero).
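As a sketch, here's what bootstrapping the data looks like for the slope. Everything below is illustrative: the data are synthetic draws from this post's example model, and `np.polyfit` stands in for whatever fitting routine you prefer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn from the example model in this post:
# y = 2.6*x + 5.7 + error, with error ~ Normal(0, sd=0.52).
n = 200
x = rng.uniform(0, 10, n)
y = 2.6 * x + 5.7 + rng.normal(0, 0.52, n)

# Bootstrap the data: refit the slope on resampled (x, y) pairs.
n_boot = 2000
slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, n)  # sample row indices with replacement
    slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]

# 95% bootstrap interval for the slope.
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"slope 95% bootstrap interval: [{lo:.2f}, {hi:.2f}]")

# Bayesian-flavored check: what fraction of bootstrap slopes are positive?
print("fraction of slopes > 0:", (slopes > 0).mean())
```

The fraction of bootstrap slopes above zero plays the same role as counting posterior draws greater than zero in the Bayesian version.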

Excursus: How does a slope answer "why"?

It doesn't. Except when it does.

A lot of ink has been spilled explaining the difference between correlation and causation. More generally, an "association" is not necessarily a "cause". Yet, one key use of statistics, especially in science, is to help us decide if phenomenon X causes phenomenon Y. It seems unlikely that all of this effort is wasted.

To make a causal claim from statistics, more context is required about the data-generating process, such as that:

  • Values of X were assigned randomly, yielding a randomized controlled trial (experiment).
  • We can describe likely causes and interactions in a directed acyclic graph (DAG) or we can leverage other causal inference techniques.
  • Previous discoveries or assumptions give us a theory or narrative connecting X and Y, one that makes falsifiable predictions.

Given this additional context and assuming that the model (here, a line) is not too far from the truth, the slope can be a meaningful explanation for why Y changes: because X did. Confidence intervals and their kin can tell us how certain we are about those relationships.

"What?" = prediction of the future

Suppose we have our linear model above, and now we have a new situation described by new data, \(x=6.8\). We want to predict \(y\) in this new situation. We can plug 6.8 in for \(x\):

$$y = 2.6(6.8)+5.7 = 23.38$$

So far, so good. How sure are we that \(y=23.38\)? Can we use a confidence interval?

No. Just no. Don't use a confidence interval for predictions. Especially those about the future.

When making predictions, we need to account for two kinds of uncertainty:

  1. Uncertainty about the parameters.
  2. Uncertainty about the error term.

Uncertainty about the parameters is caused by (foolishly!) failing to collect an infinite amount of data. If you had an infinite amount of data, the standard error \(=\frac{\text{standard deviation of the error}}{\sqrt{\text{amount of data}}}\) would go to zero and your confidence interval would have zero width: you'd be certain about the mean effect of \(x\) on \(y\).
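A quick numerical illustration of that shrinkage, using the error standard deviation of 0.52 from the fitted model (the standard error of a mean scales as \(1/\sqrt{n}\)):

```python
import numpy as np

# The standard error shrinks like 1/sqrt(n) as the sample size n grows.
# 0.52 is the error standard deviation from the fitted model above.
sd = 0.52
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: standard error = {sd / np.sqrt(n):.5f}")
```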

Uncertainty about the error term is caused by (hopefully random) factors not included in the model. Even if you collect an infinite amount of data, there are still fluctuations in the observed values, deviations from the line fit to the \((x,y)\) points. These deviations don't get smaller just because we see more of them!

Predictions have (1) some uncertainty that gets smaller when you have more data (uncertainty about parameters) and (2) some uncertainty that doesn't go to zero (uncertainty in the error term). This means that prediction intervals are generally wider than confidence intervals. Put another way, using a confidence interval for predictions always overstates your certainty.

There are a variety of ways to calculate prediction intervals. My favorite is conformal prediction. A full tutorial is outside the scope of this post, but the idea is straightforward: resample the residuals.

  1. Train your model on (X_train, y_train).
  2. Score the model on X_validation to get y_validation_prediction.
  3. Compute the residuals: residuals_validation = y_validation - y_validation_prediction.
  4. Sample from residuals_validation with replacement, say 1k times.
  5. Compute the desired quantiles, say (2.5%, 97.5%). One should be negative and the other positive.
  6. Make your predictions using X_future in the trained model.
  7. For each X_future, add the quantiles to get the desired intervals.
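The steps above can be sketched in a few lines. Everything here is illustrative: the data are synthetic draws from this post's linear model, and `np.polyfit` stands in for any black-box regression model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic train and validation splits from this post's model:
# y = 2.6*x + 5.7 + error, with error ~ Normal(0, sd=0.52).
x_train = rng.uniform(0, 10, 500)
y_train = 2.6 * x_train + 5.7 + rng.normal(0, 0.52, 500)
x_val = rng.uniform(0, 10, 500)
y_val = 2.6 * x_val + 5.7 + rng.normal(0, 0.52, 500)

# 1. Train the model on (X_train, y_train).
slope, intercept = np.polyfit(x_train, y_train, 1)

# 2-3. Score the validation set and compute residuals.
resid = y_val - (slope * x_val + intercept)

# 4. Sample from the residuals with replacement, say 1,000 times.
samples = rng.choice(resid, size=1000, replace=True)

# 5. Compute the desired quantiles; one negative, one positive.
q_lo, q_hi = np.percentile(samples, [2.5, 97.5])

# 6-7. Predict for new data and add the quantiles to get the interval.
x_future = 6.8
y_hat = slope * x_future + intercept
print(f"prediction {y_hat:.2f}, "
      f"95% interval [{y_hat + q_lo:.2f}, {y_hat + q_hi:.2f}]")
```

Note that the interval is calibrated on held-out residuals: the model trained in step 1 never sees the validation data used to size the interval.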

This method works with any black-box model -- OLS, XGBoost, neural network, time series -- any regression model. It is non-parametric and makes no distributional assumptions (e.g., normality) about the residuals, only that they are exchangeable -- roughly, that future residuals look like the validation residuals. It is calibrated to give the desired coverage, so as long as the data-generating process doesn't change, it should be reliable.

Care to roll the dice on your understanding?

Let's nail down the difference between a confidence interval and a prediction interval with an example.

Suppose we roll a 6-sided die 100 times.

What is the expected value for one roll, i.e., the mean? For our experiment of 100 rolls, I got a sum of 338, or a mean of 3.38. How sure are we that that is the "true" mean for a fair die? The standard deviation of the observed rolls is 1.67, so the standard error of the mean is 0.167, and a 95% confidence interval is [3.05, 3.71]. This does contain the "true" mean of 3.5. Excellent!

Now, let's predict the next roll. There's a 95% chance that the next roll will be in [3.05, 3.71], right? Um, no: exactly 0% of rolls will land in that range, since no face of the die falls between 3.05 and 3.71. The 95% conformal prediction interval is [1, 6]. It seems reasonable that 95% of rolls would fall in that range.
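A minimal simulation of this example (a fresh set of rolls, so the exact numbers will differ from those above, but the contrast between the two intervals holds):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 100 rolls of a fair six-sided die.
rolls = rng.integers(1, 7, size=100)
mean = rolls.mean()
se = rolls.std(ddof=1) / np.sqrt(len(rolls))

# 95% confidence interval for the MEAN roll: narrow.
ci = (mean - 1.96 * se, mean + 1.96 * se)

# 95% interval for the NEXT roll, from empirical quantiles: wide.
pi = np.percentile(rolls, [2.5, 97.5])

print(f"95% CI for the mean:        [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"95% interval for next roll: [{pi[0]:.0f}, {pi[1]:.0f}]")
```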

The derivation is left as an exercise for the reader.