ML Infra Best Practices: Monitoring Overview

Manjot Pahwa
4 min read · Feb 25, 2020

One of the first questions to answer before deploying a model is: what is the business risk of deploying it?

Besides the obvious signals related to the health of the model, there are several aspects of a model's performance that you will need to monitor in a production setup.

Operational Health

This involves monitoring the basic health of the model service: uptime, latency, resource usage and the like. Traffic is also worth watching, for example sudden changes in the request rate. There are many tools, both open source and licensed, available to monitor the health of your machine learning jobs. Aggregated metrics such as overall CPU usage, number of evaluations, etc. can be collected and pushed to a time-series database such as InfluxDB or Elasticsearch.
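
As a minimal sketch (assuming the influxdb_client Python library and an InfluxDB 2.x instance; the URL, token, org, bucket, and field names below are placeholders), serving code can push such aggregated metrics as points to the time-series database:

```python
# Illustrative only: push basic serving-health metrics to InfluxDB 2.x.
# Connection details and measurement/field names are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

def report_serving_metrics(model_version, latency_ms, cpu_percent, num_evaluations):
    point = (
        Point("model_serving")                 # measurement name
        .tag("model_version", model_version)   # tag for slicing dashboards per model
        .field("latency_ms", float(latency_ms))
        .field("cpu_percent", float(cpu_percent))
        .field("num_evaluations", int(num_evaluations))
    )
    write_api.write(bucket="ml-monitoring", record=point)

report_serving_metrics("v1", latency_ms=42.0, cpu_percent=63.5, num_evaluations=1200)
```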

Offline Model Evaluations

Offline model evaluation is done during and after training, in the validation and testing phases of the model-building lifecycle. It typically involves evaluating certain metrics of the model you've just built (which metrics depends on the type of model) before deploying it to production.

For testing, the data scientist typically splits the golden dataset into train and test sets at the outset, and evaluates the model on the pristine test set, which the model has never seen before. This gives a fair picture of the model's performance on unseen data.

There are various methods today for validating the model: cross-validation, hold-out validation, bootstrap and jackknife.

Results from validation and training are fed back into the hyperparameter tuner to improve the performance of the model.
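
As a minimal sketch of this loop (scikit-learn is assumed purely for illustration; the model, metric and hyperparameter grid are arbitrary choices), a pristine test set is split off up front, cross-validation on the training portion serves as the validation phase, and its scores drive hyperparameter selection:

```python
# Illustrative only: hold out a test set, tune with cross-validation,
# and score the chosen model once on the untouched test data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Split off the test set first; it is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation on the training data acts as the validation phase,
# and its results feed the hyperparameter search.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Only after tuning do we evaluate on the pristine test set.
print("best CV accuracy:", search.best_score_)
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```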

There are plenty of catches when doing both test and validation evaluation. Some of them include:

  • One of the most obvious pitfalls is the separation of the test data. The validation dataset is different from the test dataset: the validation dataset can be carved out of the training dataset and used for tuning the hyperparameters, whereas the test data should be pristine and not seen at all during model training, so that it gives a fair picture of model performance.
  • Data is unevenly split: this can be one of the biggest reasons for underfitting or overfitting a model, and it happens when the train, validation and test datasets end up with completely different data distributions. A common cause is class imbalance (particularly in multiclass classification problems); a small stratified-split sketch follows this list. In a later post we will talk about how to get around these problems.
  • Training data itself is unrepresentative of reality: in this case you need to ensure you collect enough data to come close to real-life data distributions.
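
To guard against uneven splits, a stratified split keeps class proportions consistent across train and test sets. Here is a minimal sketch, assuming scikit-learn; the synthetic imbalanced dataset is purely illustrative:

```python
# Illustrative only: stratified splitting to avoid unevenly split data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A synthetic, imbalanced dataset (90% / 10% classes) stands in for real data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the 90/10 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

for name, labels in [("train", y_train), ("test", y_test)]:
    print(name, np.bincount(labels) / len(labels))
```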

Online Model Evaluations

Online evaluation involves comparing new models in real time with models that have already been deployed.

Moreover, if that production system is based on a rendezvous-style architecture or a shadow-mode deployment, it will be easier and safer to deploy new models so that they score live data in real time. In fact, with a rendezvous architecture it is probably easier to deploy a model into a production setting than it is to gather training data and do offline evaluation. With a rendezvous architecture, it is also possible to replicate the input stream to a development machine without visibly affecting the production system. That lets you deploy models against real data with even lower risk.
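
To make the shadow-mode idea concrete, here is a minimal sketch (not the rendezvous architecture itself): the deployed champion model answers the request, while a challenger scores the same live input in the background and its output is only logged for comparison. The champion and challenger objects and the logging destination are placeholders:

```python
# Illustrative only: shadow-mode serving. The user-facing response always comes
# from the champion; the challenger's predictions are logged, never served.
import concurrent.futures
import logging

logging.basicConfig(level=logging.INFO)
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def serve(request_features, champion, challenger):
    # Production response comes from the already-deployed champion model.
    prediction = champion.predict(request_features)

    def shadow():
        # Failures or slowness in the shadow path never affect the response.
        try:
            shadow_prediction = challenger.predict(request_features)
            logging.info(
                "shadow comparison: champion=%s challenger=%s features=%s",
                prediction, shadow_prediction, request_features,
            )
        except Exception:
            logging.exception("challenger model failed on live input")

    executor.submit(shadow)  # score the same live data asynchronously
    return prediction
```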

Drift Detection

There are roughly three types of drift known today in machine learning systems:

Data drift / Shifting decision boundary

  • This is the most common type of drift. Fortunately, it is also easier to detect than the other types.
  • It shows up as differences between the statistical properties of the training data and the data in production, especially the statistical properties of the features on which you trained your model.
  • Example: if we don't have enough training data for certain subspaces of the feature space, the model's decisions there may be less accurate, and in production we might start seeing more data in those subspaces than was present in our training dataset.

Techniques to detect this are well known (a small sketch follows below):

  • Mean, standard deviations and correlations
  • Hellinger distance
  • Divergence tests etc.

In other words, when the statistical properties of the data change, the decision boundary needs to move with them.
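
As a small sketch of these checks on a single numeric feature (NumPy and SciPy assumed; the bin count and alert threshold are arbitrary, illustrative choices), we can compare summary statistics, a Hellinger distance and a two-sample Kolmogorov-Smirnov test between the training and production distributions:

```python
# Illustrative only: simple data-drift checks for one numeric feature.
import numpy as np
from scipy import stats

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def drift_report(train_values, prod_values, bins=20, threshold=0.1):
    # Cheapest signals first: shifts in mean and standard deviation.
    print("mean shift:", abs(train_values.mean() - prod_values.mean()))
    print("std shift:", abs(train_values.std() - prod_values.std()))

    # Histogram both samples on a shared grid and compare them as distributions.
    edges = np.histogram_bin_edges(np.concatenate([train_values, prod_values]), bins=bins)
    p, _ = np.histogram(train_values, bins=edges)
    q, _ = np.histogram(prod_values, bins=edges)
    p, q = p / p.sum(), q / q.sum()
    h = hellinger(p, q)

    # Two-sample Kolmogorov-Smirnov test as a divergence-style check.
    ks_stat, p_value = stats.ks_2samp(train_values, prod_values)
    print(f"Hellinger={h:.3f}  KS={ks_stat:.3f} (p={p_value:.3f})")
    return h > threshold  # illustrative alerting rule, not a universal cutoff

# Example: production data has drifted to a higher mean.
rng = np.random.default_rng(0)
print("drift flagged:", drift_report(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))
```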

Target variable drift

  • This is a lot more sinister and harder to detect, because the properties of the target variable itself change even though the input data might not.
  • The statistical properties of the target variable change due to factors such as environmental or cultural shifts.
  • Examples: the definition of a fraudulent transaction could change over time as new ways are developed to conduct illegal transactions, or new regulations could introduce tax credits for electric cars to encourage certain behaviour, and car-sales prediction models would have to take that into account (a small monitoring sketch follows this list).
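
One hedged sketch of watching for this (everything here is illustrative: the weekly window, the baseline rate and the alert threshold are arbitrary) is to track the observed positive rate, e.g. the fraction of transactions labelled fraudulent, per time window once ground-truth labels arrive, and compare it with the training-time baseline:

```python
# Illustrative only: compare the observed positive-label rate per week against
# the rate seen at training time; a large shift hints at target drift.
from collections import defaultdict
from datetime import datetime

TRAINING_POSITIVE_RATE = 0.02  # placeholder: fraction of fraud at training time

def weekly_positive_rates(labelled_events):
    """labelled_events: iterable of (timestamp, label) pairs with label in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # week -> [positives, total]
    for ts, label in labelled_events:
        week = ts.strftime("%Y-W%U")
        counts[week][0] += label
        counts[week][1] += 1
    return {week: pos / total for week, (pos, total) in sorted(counts.items())}

def flag_target_drift(rates, tolerance=2.0):
    # Flag weeks whose positive rate is more than `tolerance` times the baseline.
    return {week: rate for week, rate in rates.items()
            if rate > tolerance * TRAINING_POSITIVE_RATE}

events = [
    (datetime(2020, 2, 3), 0), (datetime(2020, 2, 4), 1),
    (datetime(2020, 2, 10), 1), (datetime(2020, 2, 11), 1),
]
rates = weekly_positive_rates(events)
print(rates, flag_target_drift(rates))
```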


Data Collection Changes

  • If the way in which data is collected and processed changes upstream in the data pipeline, you might experience upstream data drift.
  • Examples: changes to feature encoding in the upstream pipeline, a change of units or metrics, and so on (a small validation sketch follows this list).
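
One illustrative way to catch such upstream changes early is to validate incoming records against the schema, units and category vocabulary observed at training time. The expected-schema dictionary below is a made-up example, not a standard format:

```python
# Illustrative only: a lightweight check that production records still match
# the feature types, ranges and categories seen at training time.
EXPECTED_SCHEMA = {
    "age": {"dtype": float, "min": 0, "max": 120},
    "distance_km": {"dtype": float, "min": 0, "max": 2000},   # kilometres at training time
    "country": {"dtype": str, "categories": {"US", "IN", "DE"}},
}

def validate_record(record):
    issues = []
    for feature, spec in EXPECTED_SCHEMA.items():
        if feature not in record:
            issues.append(f"missing feature: {feature}")
            continue
        value = record[feature]
        if not isinstance(value, spec["dtype"]):
            issues.append(f"{feature}: expected {spec['dtype'].__name__}, got {type(value).__name__}")
        elif "categories" in spec and value not in spec["categories"]:
            issues.append(f"{feature}: unseen category {value!r}")
        elif "min" in spec and not (spec["min"] <= value <= spec["max"]):
            # e.g. distances suddenly arriving in metres would blow past this range
            issues.append(f"{feature}: value {value} outside training-time range")
    return issues

# A record whose distance is suddenly reported in metres and whose country
# code was never seen at training time trips both checks.
print(validate_record({"age": 34.0, "distance_km": 12500.0, "country": "FR"}))
```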

In a later post we will talk about how to detect and counter each type of drift in detail.


Manjot Pahwa

VC at Lightspeed, ex-@Stripe India head, ex-@Google engineer and Product Manager for Kubernetes