ML Infra Best Practices: Top Challenges Today

Manjot Pahwa
7 min read · Feb 14, 2020


With data science becoming ubiquitous across all types of companies and industries (AI-first, cloud-native and traditional enterprise), the rise of an infrastructure stack dedicated to AI is inevitable. There are many places where current software engineering tooling falls short: data science development is fundamentally different from software engineering, and that difference forces us to rethink several layers of this stack. The broad differences include:

Data science is experimental and nondeterministic in nature, unlike traditional software engineering. There is a lot of iteration and no deterministic guarantee around end performance or timeline.

Data science involves far more complexity in terms of the stakeholders involved. To produce a single output, a working model, you need data engineers for data pipelines, data scientists for model building, ML engineers for productionizing, product managers for the business perspective, and executive-level approval.

Data science involves a lot more artifacts than traditional software engineering. Traditional software development mostly concerns itself with code, while machine learning development needs to track plots from the data exploration phase, the code for model training and feature development, the models themselves, the features created, and finally the metrics of model performance.

Despite the hype that data science and machine learning have reached, machine learning infrastructure is still in its infancy. When I refer to ML infrastructure, I mean infrastructure that helps manage, mitigate and prevent the issues one has with data science in production or at scale.

Below are some of the broad problem areas for ML infrastructure: model versioning and management, model monitoring, explainability, and model deployment and serving.

[Figure: Hierarchy of needs for data science]

Model Versioning & Management

Model versioning is similar in principle to code versioning; however, it is a lot more complex because code is not the only output of machine learning development. An ideal versioning solution would track code, model, data, features and metrics as a single unit: the smallest reproducible capsule for an ML solution. Besides tracking all these artifacts, tracking the lineage of data and models, especially data, is critical for understanding the context in which those models and features were created. Lineage matters for reusing someone else's feature or dataset, for audit trails of data transformations, for compliance requirements, and more.
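As a rough illustration, here is a minimal sketch, assuming MLflow and scikit-learn, of logging the code version, a data hash, hyperparameters, metrics and the model itself as one tracked run; the run name, tag value and commit hash below are illustrative, not prescribed:

```python
# A minimal sketch of capturing code, data, model and metrics as one unit with MLflow.
import hashlib

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="iris-baseline"):
    # Code version: tag the run with the Git commit you trained from (illustrative value).
    mlflow.set_tag("git_commit", "deadbeef")
    # Data version: a content hash of the training data makes the exact dataset traceable.
    mlflow.log_param("train_data_sha256", hashlib.sha256(X_train.tobytes()).hexdigest())
    # Hyperparameters.
    mlflow.log_param("C", 1.0)
    model = LogisticRegression(C=1.0, max_iter=200).fit(X_train, y_train)
    # Metrics.
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, model.predict(X_test)))
    # The model artifact itself.
    mlflow.sklearn.log_model(model, "model")
```

Everything logged inside the run then lives together in the tracking server, which is one way to approximate the "single reproducible capsule" described above.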

Currently most data scientists use a combination of notebooks and Git for version control. Most data science teams have ad-hoc solutions in place to version control the way they want, such as in this Stackoverflow thread. This is due to the lack of first-class support in GitHub for machine learning artifacts and the lack of seamless integration of notebooks with Git (the option to track only inputs or outputs, tracking associations, and tracking lineage). This manual operation has several problems, including:

The lack of seamless integration with Git for machine learning models means data scientists save the training code in GitHub, the dataset somewhere like S3, and the actual model somewhere else again. GitHub and GitLab recently added Git LFS support so that large files can live in the same repo, but version control of the dataset and of the training code still works in isolation. This makes it very difficult to associate a dataset with a model.

Custom versioning of only inputs or outputs (such as plots and metrics) is done either completely manually or by ad-hoc scripts that the data science team has to maintain. Unsuccessful attempts are almost always lost and usually repeated in the next iteration.

Collaboration is still a nightmare for many reasons. The versioning of the training code and of the dataset transformations might be completely isolated. Currently there is no automated way to associate code, data, models and metrics in the same place outside of notebooks, which are unsearchable and have to be parsed manually.

Model management is an area in itself. As an organization matures and starts continuously developing, deploying and maintaining models, a model management solution becomes inevitable as the number of data science artifacts and models grows.

Data version control is very different from code version control. For data, you need to be able to see all the diffs: how new features were generated; how data was cleaned, munged and wrangled; provenance; and so on.
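One tool-agnostic way to approximate this, sketched below under the assumption of a simple file-based workflow, is to record a content hash of each dataset version together with its parent dataset and the transformation script that produced it; all file names here are hypothetical:

```python
# A minimal, tool-agnostic sketch of recording data lineage: each dataset version is
# identified by a content hash, along with the script and parent dataset that produced it.
import hashlib
import json
from pathlib import Path


def sha256_of(path: str) -> str:
    """Content hash that uniquely identifies this exact version of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_lineage(output_csv: str, parent_csv: str, transform_script: str,
                   manifest: str = "data_manifest.json") -> None:
    """Append one lineage entry to a JSON manifest kept alongside the data."""
    entry = {
        "dataset": output_csv,
        "dataset_sha256": sha256_of(output_csv),
        "parent": parent_csv,
        "parent_sha256": sha256_of(parent_csv),
        "transform": transform_script,  # e.g. the cleaning / feature-building script
    }
    history = json.loads(Path(manifest).read_text()) if Path(manifest).exists() else []
    history.append(entry)
    Path(manifest).write_text(json.dumps(history, indent=2))


# Usage with hypothetical files:
# record_lineage("features_v2.csv", "raw_v1.csv", "build_features.py")
```

Purpose-built tools go much further than this, but even a manifest like the one above answers the "where did this dataset come from" question that plain Git cannot.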

Model Monitoring

One of the first questions to answer before deploying a model is: what is the business risk of deploying this model? Besides the obvious things to monitor related to the health of the model, there are several aspects of a model's performance that you will need to monitor in a production setup.

Overall health of the model: This involves monitoring the basic health, uptime and metrics of the model. There are many tools, both open source and licensed, available to monitor the health of your machine learning jobs.
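For instance, here is a minimal sketch of instrumenting a prediction function with request counts and latency, assuming the prometheus_client library; the metric names and port are arbitrary choices for the example:

```python
# A minimal sketch of exposing basic health metrics for a model service.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")


@LATENCY.time()  # records how long each prediction takes
def predict(features):
    PREDICTIONS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference work
    return 0


if __name__ == "__main__":
    start_http_server(8000)  # metrics are then scrapeable at http://localhost:8000/metrics
    while True:
        predict([0.1, 0.2])
```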

Offline model evaluations: Offline model evaluation is done post-training, in the validation and testing phase of the model building lifecycle. It typically involves evaluating certain model metrics (depending on the type of model) on the model you've just built before deploying it to production. There are various methods today for validating the model: cross-validation, hold-out validation, bootstrap and jackknife. For testing the model, the data scientist typically splits the golden dataset into a train and a test dataset at the beginning and tests the model on the pristine test dataset, which the model has never seen before. This gives a fair picture of the model's performance on unseen data.
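A minimal sketch of this workflow with scikit-learn, using a held-out test set plus k-fold cross-validation; the dataset and model here are just stand-ins:

```python
# A minimal sketch of offline evaluation: hold-out test set plus k-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out a pristine test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
# Cross-validation on the training portion estimates generalization before touching the test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

model.fit(X_train, y_train)
print("Held-out test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```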

Online model evaluations: Online evaluation involves comparing models in real time with models that have already been deployed. Moreover, if that production system is based on a rendezvous-style architecture, it will be very easy and safe to deploy new models so that they score live data in real time. In fact, with a rendezvous architecture it is probably easier to deploy a model into a production setting than it is to gather training data and do offline evaluation. With a rendezvous architecture, it is also possible to replicate the input stream to a development machine without visibly affecting the production system. That lets you deploy models against real data with even lower risk.
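As a simplified illustration rather than a full rendezvous implementation, here is a sketch of shadow-scoring a challenger model on the same live inputs as the champion, where only the champion's answer is returned to the caller; the models and logger are stand-ins:

```python
# A minimal sketch of shadow ("challenger") scoring: the champion answers the request,
# the challenger scores the same input, and both results are logged for later comparison.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("online_eval")


def serve(request_features, champion, challenger):
    champion_pred = champion(request_features)      # this is what the caller receives
    challenger_pred = challenger(request_features)  # scored on the same live input, never returned
    log.info("features=%s champion=%s challenger=%s",
             request_features, champion_pred, challenger_pred)
    return champion_pred


# Usage with toy functions standing in for real models.
champion = lambda x: sum(x) > 1.0
challenger = lambda x: sum(x) > 0.8
serve([0.4, 0.7], champion, challenger)
```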

Drift monitoring: Drift is the process by which a model trained earlier stops giving the same level of performance it once did. There are broadly three types of drift: data drift, concept drift and data collection changes. Data drift happens when the statistical properties of the input data change, and it is the easiest of the three to catch. Target variable drift, or concept drift, is more sinister and occurs when the statistical properties of the target variable change due to environmental, cultural, regulatory and other such factors. Data collection changes occur when the process of collecting the data changes, which unintentionally leads to a change in model performance.
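As one illustrative approach, here is a sketch of catching data drift by comparing a feature's live distribution against its training distribution with a two-sample Kolmogorov-Smirnov test; the 0.05 threshold is an arbitrary choice for the example:

```python
# A minimal sketch of data drift detection on a single feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # distribution seen at training time
live_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)   # slightly shifted production data

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.3g})")
else:
    print("No significant drift detected")
```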

Explainability & Interpretability

With the increasing amount of regulation that machine learning models in certain industries are subjected to, explainability becomes vital before deploying them in production. Definitions for both:

Interpretability: being able to tell how the output would move as the inputs change.

Explainability: being able to explain in depth how the model arrives at a particular solution.

Certain types of models are explainable from the get-go; however, deep learning models and neural nets are complete black boxes as of this date. Many tools, both proprietary and open source, are now coming up to solve for this. However, there is usually a tradeoff between how explainable a model is and its accuracy.
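As a small, model-agnostic illustration, assuming scikit-learn, here is a sketch of permutation importance, which measures how much a test metric degrades when each feature is shuffled; it is one interpretability technique among many, not a full explanation of individual predictions:

```python
# A minimal sketch of permutation importance for a fitted classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Print the five most influential features by mean importance.
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[idx]}: {result.importances_mean[idx]:.4f}")
```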

Model Deployment, Maintenance & Serving

After building a model, data teams face several challenges in deploying, maintaining and using those models in production. Machine learning systems are a lot subtler and more complex than traditional software systems, which makes deploying and maintaining these models harder. Since this area is still evolving as well, there are no silver bullets or a set of prescribed best practices for the problems mentioned here.

Staging or bringing a new model to production: This involves using Kubernetes or serverless technologies to actually deploy models to production in popular public clouds, creating an API, ensuring there are enough replicas and enough resources for the model, and ensuring that the latency of the model itself is satisfactory. Since the model might be developed in Python, this step often requires a complete rewrite of the model in a language like C++ or Java. Services such as SageMaker help you easily deploy your estimator object as an API endpoint.
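For illustration, a minimal sketch of wrapping a model in an HTTP prediction endpoint with FastAPI; the model is trained inline purely for the example (in practice it would be loaded from a registry or object store), and the module name in the run command is hypothetical:

```python
# A minimal sketch of serving a trained model behind an HTTP prediction endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)  # stand-in for a model loaded from storage

app = FastAPI()


class Features(BaseModel):
    values: list[float]  # expects the four iris measurements


@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": int(prediction)}

# Run with (module name is hypothetical):
#   uvicorn serve_model:app --host 0.0.0.0 --port 8080
```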

Transitioning from an older model to a new champion model: This is the online evaluation piece of the model lifecycle, or blue-green deployment. It also involves retiring a model completely, running different versions of the model in parallel to test their long-term effects on business metrics, and so on. SageMaker again helps you connect several models to the same endpoint.
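A toy sketch of the idea, gradually shifting a share of traffic from the old champion to the new model; the percentages and model stand-ins are illustrative, and real routing usually lives in the serving layer rather than in application code:

```python
# A minimal sketch of weighted traffic splitting between two model versions.
import random


def route(request_features, champion, challenger, challenger_share: float = 0.1):
    """Send roughly `challenger_share` of live traffic to the new model."""
    chosen = challenger if random.random() < challenger_share else champion
    return chosen(request_features)


champion = lambda x: "old-model-prediction"
challenger = lambda x: "new-model-prediction"
print(route([1.0, 2.0], champion, challenger, challenger_share=0.2))
```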

Monitoring the model and evaluating it with production inputs, as mentioned above.

CI/CD for ML models: As of today, there are no established best practices for continuous learning of ML models, and there is no easy way to recreate machine learning models reproducibly and deterministically.
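One pragmatic stopgap, sketched here under the assumption of a scikit-learn model and an illustrative accuracy threshold, is a quality-gate script that retrains with a fixed seed and fails the CI pipeline if the metric drops:

```python
# A minimal sketch of a CI quality gate for retrained models: retrain with a fixed seed,
# evaluate, and exit non-zero (failing the pipeline) if the metric falls below a threshold.
import sys

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_FLOOR = 0.90  # minimum acceptable test accuracy for this hypothetical pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # fixed seed

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.4f}")

if accuracy < ACCURACY_FLOOR:
    sys.exit("Model failed the quality gate; blocking the release.")  # non-zero exit fails CI
```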

Conclusion

In summary, ML infra is still in its infancy and there are broad swathes of areas where no best practices have been established yet. Since model development is a fundamentally different process from traditional software development, most of today's tools are suboptimal for data science teams. The problems span model versioning and management, model monitoring, explainability, and model deployment and serving.
