Weekly Tip #3 - What is "good enough" performance?

Oren Razon
September 3, 2020

👉🏽 Tip: Use an ongoing baseline model in production to benchmark your performance

When you develop a new algorithm during research, it's always good practice to first build a naive "baseline" model to estimate the minimal level of performance and get a solid understanding of what can be considered good performance.

Yet this practice is usually overlooked in the production phase, where things are continuously changing. In production, just as in the research and training phase, you should keep a naive baseline model running in shadow mode. This gives you an ongoing benchmark for determining whether your model is truly optimized or merely as good as a simple heuristic.
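As a minimal sketch of the idea (the class and the heuristic below are hypothetical, not from any particular library): the production model keeps serving predictions, while a naive baseline scores the same inputs in shadow mode, purely so the two can be compared later.

```python
class ShadowBenchmark:
    """Serve the production model; score a naive baseline in shadow mode."""

    def __init__(self, production_model, baseline_model):
        self.production_model = production_model
        self.baseline_model = baseline_model
        self.log = []  # in practice: a metrics store / monitoring system

    def predict(self, features):
        prod_score = self.production_model(features)
        base_score = self.baseline_model(features)  # shadow: never served
        self.log.append({"prod": prod_score, "baseline": base_score})
        return prod_score  # only the production prediction is returned
```

For fraud detection, the baseline could be as simple as a rule like "flag any transaction above some amount threshold" — if the production model can't beat that on the same traffic, it is not earning its keep.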

Let’s use an example from the fraud detection domain, assuming a binary classification model that detects fraudulent transactions. When the model was developed, the a priori probability of a transaction being fraudulent was P(y = "fraudulent") = 0.01.

Now let’s assume the performance metric being optimized is "precision in the top K" (e.g. precision among the 100 most suspicious transactions). After the model is deployed to production, a label drift may occur, indicating an increase in fraudulent activity and making the a priori label distribution more balanced: P(y = "fraudulent") = 0.05.
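Precision in the top K is simply the fraction of true frauds among the K highest-scored transactions. A small illustrative implementation (the function name is ours, not from the post):

```python
def precision_at_k(scores, labels, k=100):
    """Fraction of positive labels among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k
```

Running this on both the production model's scores and the shadow baseline's scores over the same transactions yields the benchmark comparison described above.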

Under these new circumstances, we would expect the model to achieve a higher precision rate for the same top K. But without proper label drift monitoring, and in the absence of a naive ongoing benchmark model, this shift could easily be overlooked: since the model's precision may remain unchanged, the risk is believing the model is still optimized when it isn't, and missing the opportunity for better performance, as illustrated below.
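To see why a baseline exposes this, note that a naive model ranking transactions at random has an expected precision in the top K equal to the fraud base rate itself. A quick simulation (with the hypothetical base rates from the example) shows the benchmark moving with the drift, while a stale model's flat precision would stand out against it:

```python
import random

def random_baseline_precision(base_rate, k=10_000, n=100_000, seed=0):
    """Simulate precision@k of a random-ranking baseline.

    Under a random ranking, taking the "top" k is equivalent to taking
    any k items, so the expected precision equals the fraud base rate.
    """
    rng = random.Random(seed)
    labels = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    sample = rng.sample(labels, k)  # random ranking == any k items
    return sum(sample) / k

# Before the drift the random baseline's precision sits near 0.01;
# after the drift it rises toward 0.05. A production model whose
# precision@K has not moved at all is therefore a warning sign.
```

The baseline precision climbing while the model's stays flat is exactly the signal that the model is no longer extracting the extra fraud signal now present in the data.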

Model performance scores over time vs. label proportion drift

Recommended Readings: