Today's enterprises rely on machine learning-powered predictions to guide business strategy, such as by forecasting demand and mitigating risk. For an increasing number of businesses, machine learning (ML) underpins their core business model, as with financial institutions that use ML models to approve or reject loan applications.
As ML is drastically different from other software or traditional IT, models risk degrading the moment they are pushed into production – where the hyperdynamic nature of the data meets the hypersensitivity of the models. These “drifts” in the data structure, or other properties that cause model degradation, are too often silent and unobservable.
In the last few months, triggered by the COVID-19 crisis, we have all witnessed companies struggling to fix corrupted, business-critical models. One of the best-documented cases was Instacart, whose inventory prediction model’s accuracy declined from 93% to 61%, leaving a sour aftertaste for their customers and their teams.
Few data science and engineering teams are prepared for this “Day 2”, the day their models meet the real world, because they invest the majority of their time in researching, training, and evaluating models. While it’s clear that teams want to address any potential issues before they arise, there is a lack of clear processes, tools, and requirements for production systems. Today, the industry still lacks guidelines for what an optimal ML infrastructure should look like.
That's why we've gathered best practices for data science and engineering teams to create an efficient framework to monitor ML models. The ebook provides a framework for anyone who has an interest in building, testing, and implementing a robust monitoring strategy in their organization or elsewhere. The article below briefly covers some of the key points.
It's instinctive to focus on model performance as a key metric, but degraded performance may be detected too late, or not at all, because measuring it requires the ground truth, which in many cases is missing or is only collected once your business has suffered a blow.
It's crucial to monitor the stability of the entire ML flow, including input, inferences, and output. This allows you to spot the earliest indication of drifts in real time. Additionally, monitoring input delivers useful insights into the cause of drifts, which speeds up diagnostics and remediation.
Tracking multiple models comes with the challenge of building the relevant metrics for each case. Therefore, a centralised monitoring solution should take into account different types of data and use multiple performance metrics. For instance, for numeric features, the mean, standard deviation, min, max, outliers, etc., must be monitored, while for categorical features, the number of unique values, new values, entropy, share of the most frequent values, etc., are what matter the most.
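As a rough illustration of the per-type metrics listed above, here is a minimal Python sketch using only the standard library. The function names, the returned dictionary keys, and the three-standard-deviation outlier rule are assumptions made for this example, not a prescribed implementation.

```python
import math
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric feature: mean, std, min, max, outlier count."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Assumption: an outlier is a value more than 3 standard deviations from the mean.
    outliers = sum(1 for v in values if std > 0 and abs(v - mean) > 3 * std)
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        "outliers": outliers,
    }


def categorical_summary(values, known_values):
    """Summarize a categorical feature against the values seen at training time."""
    counts = Counter(values)
    total = len(values)
    # Shannon entropy (in bits) of the observed category distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    top_value, top_count = counts.most_common(1)[0]
    return {
        "unique": len(counts),
        "new_values": sorted(set(counts) - set(known_values)),
        "entropy": entropy,
        "top_value": top_value,
        "top_share": top_count / total,
    }
```

Computing these summaries per time window (for example, daily) for both training and production data gives you comparable snapshots to track over time.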
Overall data can obscure serious segmental drifts, since many data changes are quite subtle and difficult to detect. Sometimes, drift only affects a certain subset of your dataset, or appears only in specific seasons, making it easy to miss if you stick to a high-level overview.
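To make the point concrete, the sketch below compares a baseline and a live sample both overall and per segment. It uses the Population Stability Index (PSI) as the drift score, which is a choice made for this example rather than something the article prescribes; the row layout (dicts with a feature key and a segment key), the bin edges, and the `"__all__"` label are likewise assumptions.

```python
import math
from collections import defaultdict


def psi(expected, actual, edges):
    """Population Stability Index between two numeric samples over fixed bin edges."""
    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for v in sample:
            # Bin index = number of edges the value is greater than or equal to.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def psi_by_segment(baseline_rows, live_rows, feature, segment_key, edges):
    """Compute PSI for `feature` overall ("__all__") and for each segment."""
    def group(rows):
        groups = defaultdict(list)
        for row in rows:
            groups["__all__"].append(row[feature])
            groups[row[segment_key]].append(row[feature])
        return groups

    base, live = group(baseline_rows), group(live_rows)
    # Only segments present in both samples can be compared.
    return {seg: psi(base[seg], live[seg], edges) for seg in base if seg in live}
```

With a large stable segment and a small shifted one, the overall score can stay low while the small segment's score is high, which is exactly the case a high-level overview misses.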
There's no efficient way to manually monitor your ML models. Every model could have dozens of features for multiple segments or sub-populations, each requiring multiple metrics, and each with different natural distribution changes over time. An effective monitoring framework needs to be both automated and smart. With time series anomaly detection and causality analysis methods, teams can easily extrapolate data and aggregate metrics in a way that indicates the level of urgency of the detected deviation, and promptly identify its potential root cause.
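One simple form of time series anomaly detection is a rolling z-score over a monitored metric, sketched below. This is only an illustration of the idea, not the method any particular product uses; the window size, the two thresholds, and the "warning"/"critical" labels are assumptions chosen for the example.

```python
import statistics


def rolling_zscore_alerts(series, window=7, warn=2.0, critical=4.0):
    """Flag points that deviate from the mean of the trailing window.

    Returns a list of (index, z_score, severity) tuples.
    """
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        if std == 0:
            continue  # a flat history gives no scale to judge the deviation
        z = (series[i] - mean) / std
        if abs(z) >= critical:
            alerts.append((i, z, "critical"))
        elif abs(z) >= warn:
            alerts.append((i, z, "warning"))
    return alerts
```

Running this on each metric and segment, then sorting alerts by severity, gives a first approximation of urgency. Production systems typically also account for seasonality and trend, which this sketch ignores.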
If you're validating new versions, or benchmarking your production model, you'll need to monitor them alongside your existing version for some time, using either a shadow model, A/B testing, or the multi-armed bandit (MAB) approach.
Whichever approach you use, ensure that you can carry out advanced testing for a number of use cases, and gather and analyze the status and KPIs of both versions, to check for serious divergences between the two, where those divergences take place, whether they are the differences you expected, etc. You can consult our extensive article here.
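For the shadow-model case, a comparison along these lines could be sketched as follows, assuming both versions score the same requests and the results are logged together. The record keys (`"segment"`, `"prod"`, `"shadow"`), the tolerance, and the report fields are hypothetical names used for illustration only.

```python
def compare_versions(records, tolerance=0.1):
    """Compare production and shadow scores made on the same requests.

    `records` is an iterable of dicts with hypothetical keys
    "segment", "prod", and "shadow" (numeric scores).
    Returns {segment: {"n", "mean_abs_diff", "disagreement_rate"}}.
    """
    totals = {}
    for r in records:
        diff = abs(r["prod"] - r["shadow"])
        t = totals.setdefault(r["segment"], {"n": 0, "sum_diff": 0.0, "disagree": 0})
        t["n"] += 1
        t["sum_diff"] += diff
        # Count a disagreement when the two scores differ by more than the tolerance.
        t["disagree"] += int(diff > tolerance)
    return {
        seg: {
            "n": t["n"],
            "mean_abs_diff": t["sum_diff"] / t["n"],
            "disagreement_rate": t["disagree"] / t["n"],
        }
        for seg, t in totals.items()
    }
```

Breaking the report down by segment shows where the two versions diverge, not just whether they do.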
The first step in preventing biases in models is to look for them in the training data in the research phase, before training and deploying to production. However, no matter how careful the data scientist is and even if the model was developed and evaluated to ensure it’s free of biases, the data in production continuously changes. There is always a risk that a bias which hasn't been seen so far may show up and increase, or even be amplified by the model itself. Therefore, it is important to continuously test your model for biases with the live production data.
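As one possible way to test for bias on live data, the sketch below tracks the rate of positive predictions (for example, approved loans) per group and the gap between the highest and lowest rate. This gap, often called the demographic parity difference, is only one of several fairness measures and is chosen here purely for illustration; the function names and input layout are assumptions.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Share of positive (1) predictions for each group label."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(predictions, groups):
        counts[group][0] += int(pred)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def parity_gap(rates):
    """Difference between the highest and lowest group positive rates."""
    return max(rates.values()) - min(rates.values())
```

Computing the gap on each production window and watching it as a time series lets you notice a bias that grows after deployment, even if the model looked balanced during evaluation.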
Frequently, data science and ML engineering teams within the same organization use different platforms to develop and deploy their ML models (custom Python microservices, SageMaker, TensorFlow Serving, ...). Your ML monitoring framework needs to be decoupled from the ML platform so that you can apply it to Python, Java, R, and other types of technologies backing your models.
Most AI models involve multiple stakeholders, but while they're all invested in the model operating at peak accuracy, they often struggle to communicate with each other.
Every team needs independent access and visibility, including the operational team, who have to be able to monitor models without the data science team’s involvement. Your teams need a single dashboard and a common language to track and discuss ongoing issues.
To truly maximize the value of your monitoring framework, you need to feed the results back into the entire ML pipeline. For instance, your monitoring framework should guide you to the most accurate datasets, so when you come to retrain models, you can exclude those that show anomalies, or use only data that reflects a long-term change.
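A minimal sketch of that feedback loop might look like the following, assuming the monitoring system labels each data batch with a date and an anomaly flag. The batch layout (`"day"`, `"anomalous"`) and the `change_date` parameter are hypothetical names introduced for this example.

```python
from datetime import date


def select_retraining_batches(batches, change_date=None):
    """Pick data batches for retraining based on monitoring results.

    Drops batches flagged as anomalous. If `change_date` is given
    (the start of a confirmed long-term change), keeps only batches
    on or after that date.
    """
    selected = []
    for batch in batches:
        if batch["anomalous"]:
            continue
        if change_date is not None and batch["day"] < change_date:
            continue
        selected.append(batch)
    return selected
```

Whether a deviation is a temporary anomaly to exclude or a long-term change to train on is a judgment the monitoring results can inform but not make for you.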