Tip #1 - Benefits of using multiple test sets: gain better insights into your model's robustness

By Oren Razon
August 27, 2020

When developing ML models, we build a process that generalizes patterns from historical experience (the training data) in order to make good predictions on new, unseen data. To test this during research, when evaluating different models, we look for the model with the best performance on a new, unseen, independent test dataset. But is that really enough?

In production, the chosen model will operate in a constantly changing environment, responding to different incoming data streams. We need to ensure that the deployed model is robust enough under the different situations it may face.

So instead of using only one test set during model evaluation, create multiple smaller test sets and evaluate the best model not only on a single or mean performance indicator, but rather on the performance distribution across the different test cases.

These smaller test sets can be randomly sampled from the full test set, taken from the embedded test folds of cross-validation, or constructed to represent different cases that are relevant to your domain.
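As a minimal sketch of the first approach, here is how one might randomly split a single test set into several smaller ones and compute per-split accuracy. The helper `split_test_accuracies` is a hypothetical name, not from the original post, and the labels are toy data:

```python
import numpy as np

def split_test_accuracies(y_true, y_pred, n_splits=5, seed=42):
    """Randomly partition one test set into n_splits smaller sets
    and return the accuracy on each (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))  # shuffle indices once
    return [
        float(np.mean(y_true[fold] == y_pred[fold]))
        for fold in np.array_split(idx, n_splits)
    ]

# toy ground-truth labels and model predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
accs = split_test_accuracies(y_true, y_pred, n_splits=5)
print(accs)  # one accuracy per smaller test set
```

Looking at the spread of these per-split accuracies, rather than only their mean, is what surfaces robustness problems.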

For example,

I ran a K-fold cross-validation (K=20) for two different learning strategies I’m researching, and got the following results:

  • Strategy 1 avg. accuracy: 71%
  • Strategy 2 avg. accuracy: 76%

Based on the average accuracy alone, one would assume that Strategy 2 is better. But when we dive in and look at the accuracy distribution across the different folds, the picture changes (see figure below). Strategy 2 has much higher performance variance across the folds. Knowing that, and considering the temporality of your domain, Strategy 1 may actually be the safer way to go, as it seems less overfitted and more robust.
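The comparison above can be sketched with a few summary statistics over the per-fold accuracies. The numbers below are made up to mimic the scenario: two strategies whose means roughly match the 71% and 76% reported above, but with very different variance:

```python
import numpy as np

# illustrative per-fold accuracies (fabricated for the sketch):
# Strategy 1 is stable around its mean, Strategy 2 swings widely
strategy_1 = np.array([0.70, 0.72, 0.71, 0.69, 0.73,
                       0.70, 0.72, 0.71, 0.70, 0.72])
strategy_2 = np.array([0.95, 0.55, 0.90, 0.60, 0.92,
                       0.58, 0.88, 0.62, 0.91, 0.69])

for name, accs in [("Strategy 1", strategy_1), ("Strategy 2", strategy_2)]:
    # report the distribution, not just the mean
    print(f"{name}: mean={accs.mean():.2f}, std={accs.std():.2f}, "
          f"min={accs.min():.2f}, max={accs.max():.2f}")
```

A box plot or histogram of the same per-fold values (as in the figure) makes the variance gap even more apparent than the printed summary.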

Code reference - https://s-w.ai/3hau2NX
