Tip #2 - Is your model calibrated? Probability calibration: why it matters and how to measure it

By Oren Razon
August 27, 2020

Model calibration analysis is a crucial tool for probability estimation use cases, but it is also a valuable evaluation and ongoing monitoring technique for classification use cases, one that data scientists tend to overlook.

In a classification task, calibration analysis helps you identify the regions or predictions where the model's output is highly uncertain, and estimate the classification error rate to expect for a given set of predictions. Both give more visibility into, and trust in, the model's ongoing decision making.

So how do you do calibration analysis?

Most classification algorithms produce a predicted probability between 0 and 1 for the predicted class; the model then turns this probabilistic output into a class prediction based on a given threshold.
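As a minimal sketch of that thresholding step (the function name `to_class`, the sample probabilities, and the 0.5 default threshold are illustrative assumptions, not taken from any specific library):

```python
def to_class(probabilities, threshold=0.5):
    """Turn predicted probabilities for the positive class into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs for five samples.
probs = [0.10, 0.45, 0.50, 0.75, 0.92]

labels_default = to_class(probs)                # threshold of 0.5
labels_strict = to_class(probs, threshold=0.8)  # stricter threshold
```

Note that the class labels throw away how confident the model was: 0.50 and 0.92 both become 1 under the default threshold, which is why calibration analysis works on the probabilities themselves.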

To analyze model calibration, we use the predicted probability instead of the predicted class. Based on that, we plot the model's calibration curve, a.k.a. reliability chart or reliability diagram (see image below). To build the curve, the predictions are grouped into a fixed number of bins by predicted probability. The X-axis shows the mean predicted probability in each bin, while the Y-axis shows the observed frequency of the positive class (Y=1) in that bin.
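The points of such a curve can be computed in a few lines. The sketch below assumes equal-width bins and uses made-up toy data; the helper name `reliability_curve` is hypothetical, and this is not necessarily how the linked code reference does it. In practice, scikit-learn's `calibration_curve` in `sklearn.calibration` performs a similar binning.

```python
def reliability_curve(y_true, y_prob, n_bins=4):
    """Return (mean_predicted, observed_rate, counts) for each non-empty bin.

    Bins are equal-width over [0, 1]; a probability of exactly 1.0 is
    placed in the last bin.
    """
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1

    mean_predicted, observed_rate, kept_counts = [], [], []
    for s, pos, c in zip(sums, positives, counts):
        if c > 0:  # skip empty bins, they have no defined mean
            mean_predicted.append(s / c)   # X-axis value
            observed_rate.append(pos / c)  # Y-axis value
            kept_counts.append(c)
    return mean_predicted, observed_rate, kept_counts


# Toy data (illustrative only): true labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

mean_pred, observed, counts = reliability_curve(y_true, y_prob, n_bins=4)
for x, y, n in zip(mean_pred, observed, counts):
    print(f"mean predicted={x:.2f}  observed rate={y:.2f}  n={n}")
```

Plotting `observed` against `mean_pred` together with the diagonal y = x gives the reliability chart.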

The position of the curve relative to the diagonal (a perfectly calibrated model) helps interpret the probabilities: where the curve falls below the diagonal, the predicted probabilities are too large (the model over-estimates the positive rate); where it falls above the diagonal, the predicted probabilities are too small.
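This reading of the curve can be expressed as a simple per-bin check. The sketch below is illustrative only: the function name `diagnose` and the 0.05 tolerance are assumptions chosen for the example, not an established convention.

```python
def diagnose(mean_predicted, observed_rate, tolerance=0.05):
    """Describe one point of the calibration curve relative to the diagonal.

    `tolerance` is a hypothetical margin within which the point is treated
    as lying on the diagonal.
    """
    gap = observed_rate - mean_predicted
    if gap < -tolerance:
        # Point is below the diagonal: predictions are too large.
        return "over-estimated"
    if gap > tolerance:
        # Point is above the diagonal: predictions are too small.
        return "under-estimated"
    return "roughly calibrated"


print(diagnose(0.75, 0.64))  # below the diagonal
print(diagnose(0.20, 0.35))  # above the diagonal
print(diagnose(0.50, 0.52))  # close to the diagonal
```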

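It is usually useful to also plot the number of predictions in each probability bin, since bins with few samples give noisy estimates. To quantify the calibration level with a single number, we can use the Brier score: the mean squared error between the predicted probabilities and the observed binary outcomes (lower is better).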
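For a binary problem, the Brier score is the average of (p_i - y_i)^2 over all samples, where p_i is the predicted probability of the positive class and y_i is the observed 0/1 outcome. A minimal sketch with made-up data follows; scikit-learn offers the same metric as `brier_score_loss` in `sklearn.metrics`.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (illustrative only).
y_true = [1, 0, 1, 0]

confident_and_right = brier_score(y_true, [0.9, 0.1, 0.8, 0.3])
always_half = brier_score(y_true, [0.5, 0.5, 0.5, 0.5])
perfect = brier_score(y_true, [1.0, 0.0, 1.0, 0.0])

print(confident_and_right, always_half, perfect)
```

A score of 0 means every probability matched its outcome exactly, while always predicting 0.5 yields 0.25 regardless of the labels, which makes 0.25 a handy baseline to beat.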

For example, let's assume that the model behind the figure above is a binary classifier that identifies users who are likely to perform a specific intent in our product in the next 7 days.

Take all the cases the model predicted as positive (meaning its prediction was that these users will perform the intent) with a probability of ~75%. Using the chart above (the red vertical line), we can assess that the actual expected rate of intent among these users is 64%, lower than the predicted probability.
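To make the gap concrete, here is the arithmetic for a hypothetical group of 1,000 users in that ~75% bin. The group size is an assumption for illustration; the 75% and 64% figures are the ones read off the chart.

```python
n_users = 1000            # hypothetical number of users scored at ~75%
predicted_prob = 0.75     # mean predicted probability in the bin
observed_rate = 0.64      # observed rate read from the calibration curve

naive_expected = round(n_users * predicted_prob)      # trusting the raw score
calibrated_expected = round(n_users * observed_rate)  # using the observed rate
overestimate = naive_expected - calibrated_expected

print(naive_expected, calibrated_expected, overestimate)
```

Taking the raw score at face value would overstate the number of users expected to perform the intent by the difference between the two counts.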

Code reference - https://s-w.ai/2EmTKkj

Stay tuned for next week’s tip!

Recommended Readings: