While it is clear where we are headed, there seems to be a roadblock that I will address in this blog. Sometimes perspective is an inspiration: I recently stumbled upon a research paper by Google researchers titled Hidden Technical Debt in Machine Learning Systems. It highlights how small the ML code is in the big picture of a software system, and how the larger parts are often ignored (often due to lack of focus and competencies), leading to technical debt, ineffectiveness and frustration for organisations.
Pic credits: Hidden Technical Debt in Machine Learning Systems (paper authors)
In production systems, it usually turns out to be roughly 20% machine learning code and 80% software engineering code.
With traditional, mundane ways of working, outdated tools and a lack of process-driven software development, it takes a whole lot of non-ML coding and plumbing to set up a production-ready system.
As more and more machine-learned services make their way into software applications, which are themselves part of business processes, robust lifecycle management of these machine-learned models becomes critical for ensuring the integrity of the business processes that rely on them. On top of this, according to Gartner, companies struggle to operationalize machine learning models:
“The Gartner Data Science Team Survey of January 2018 found that over 60% of models developed with the intention of operationalizing them were never actually operationalized.”
So how do we systematically approach this hidden technical debt in Machine Learning? By implementing Machine Learning lifecycle management in your operations.
Machine Learning lifecycle management is an efficient way of working for building, deploying, and managing machine learning models, and it is critical for ensuring the integrity of the business processes that rely on them.
This way of working can take your team into high-performance mode. But first, make sure the foundations are right: the key is to align your AI strategy with your culture and business strategy, and to integrate AI systematically into your business with a clear proof of value. On this, find more in How Enterprises will thrive in the Era of Artificial Intelligence (credits: Dr. Christian Guttmann).
Now, let's look at Machine Learning lifecycle management from the process and architectural point of view.
From the process point of view...
This is an overview of the processes in the Machine Learning lifecycle, in three phases.
1. Code meets data (CI/CD)
This phase is developed and managed by DevOps or Machine Learning Engineer(s) together with Data Engineer(s). "Code meets data" is enabled by seamless Continuous Integration and Continuous Deployment capabilities, which facilitate and manage this phase.
- Source code management: Using Git or another source code management (SCM) system, we manage source code that integrates seamlessly with the CI, CD and data pipelines. All our code resides in the SCM setup.
- Continuous integration and deployment triggers: CI/CD triggers connect everything from commit to deploy. CI/CD pipeline helps you automate steps in your software delivery process, such as initiating code builds, running automated tests, and deploying to a staging or production environment. CI/CD triggers remove manual errors, provide standardised development feedback loops and enable fast product iterations.
- Data pipelines: Data in traditional software applications tends to be transactional in nature and mostly structured, whereas data for machine learning models can be structured or unstructured. Unstructured data can further come in multiple forms such as text, audio, video, and images. In addition, data management in the machine learning pipeline has multiple stages, namely data acquisition, data annotation, data cataloging, data preparation, data quality checking, data sampling and data augmentation, each involving its own lifecycle and thereby necessitating a whole new set of processes and tools.
Managed by: DevOps or Machine Learning Engineer + Data Engineer
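To make the data-management stages above concrete, here is a minimal sketch of a data pipeline in which acquisition, preparation and quality checking are composed as plain functions. The stage names, record fields and quality rule are hypothetical illustrations, not a prescribed API:

```python
# Minimal data pipeline sketch: each stage takes and returns a list of
# records, so stages can be composed, versioned, and tested independently.

def acquire():
    # In practice this would pull from a database, object store, or API.
    return [{"text": "good product", "label": 1},
            {"text": " BAD service ", "label": 0},
            {"text": "", "label": 1}]          # an invalid record

def prepare(records):
    # Normalise raw fields before they reach the model.
    return [{**r, "text": r["text"].strip().lower()} for r in records]

def quality_check(records):
    # Drop records that fail a basic quality rule (empty text).
    return [r for r in records if r["text"]]

def run_pipeline():
    return quality_check(prepare(acquire()))
```

Each stage can then be scheduled and monitored separately by the CI/CD triggers described above, so a failing quality check blocks the downstream training run.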
2. Machine Learning pipelines
These pipelines are developed and managed by Data Scientist(s) and Machine Learning Engineer(s).
- Model training: Decisions are made about which algorithms to experiment with on the prepared data and feature assets. This includes choosing frameworks (scikit-learn, TensorFlow, PyTorch, Keras, etc.) and, if neural networks are involved, how many hidden layers to use and which activation functions to apply at each layer. Data scientists/ML engineers then split the labeled data into train/dev/test sets, train the models, and run multiple experiments before finally making the model selection. Throughout the training process, many decisions are made about hyperparameters, while striving to optimize the network architecture of the training algorithm to achieve the best results.
- Model testing: The finalized model is tested against multiple collected datasets, and can also be batch tested if necessary. The model is additionally tested against competitor services, where accessible and applicable. Comparing the quality and runtime performance of each model version with competitors' services and all known competing AI models to establish its quality is a critical aspect of the testing phase.
All of this is facilitated using the needed compute (GPUs, TPUs or CPUs) and storage resources.
Managed by: Data Scientist or Machine Learning Engineer
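The train/dev/test split and model selection described above can be sketched end to end. This toy example stands in for a real experiment loop: the "training" merely fixes a decision threshold (where a real run would fit scikit-learn or PyTorch models), and the data, split ratios and candidate hyperparameters are all illustrative assumptions:

```python
import random

# Toy labeled data: one feature x, and the label is 1 exactly when x > 0.5.
random.seed(0)
data = [(x, int(x > 0.5)) for x in [random.random() for _ in range(100)]]

# Train/dev/test split, as in the text (here 60/20/20).
train, dev, test = data[:60], data[60:80], data[80:]

def fit_threshold_model(train_data, threshold):
    # "Training" here trivially fixes a decision threshold; a real
    # experiment would fit framework models on train_data instead.
    return lambda x: int(x > threshold)

def accuracy(model, split):
    # Fraction of examples the model classifies correctly.
    return sum(model(x) == y for x, y in split) / len(split)

# Run several "experiments" (candidate hyperparameters) and select the
# model that performs best on the dev set; the test set stays untouched
# until final evaluation.
candidates = {t: fit_threshold_model(train, t) for t in (0.2, 0.5, 0.8)}
best_t = max(candidates, key=lambda t: accuracy(candidates[t], dev))
best_model = candidates[best_t]
```

The point of the structure is the separation of concerns: candidates compete only on the dev set, and the test set is reserved for the final, unbiased quality estimate used in the model testing step.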
3. Continuous deployment and monitoring (Ops)
Robust and scalable Ops is facilitated by a well-chronicled model registry (a repository of models with versions and tags) with deployment and monitoring capabilities to fuel the models' use in test and production environments. Ops is developed and managed by DevOps Engineer(s) and Machine Learning Engineer(s).
- Model management: Trained models are maintained in the model registry with versioning and traceability of which source code each model uses, which data it was trained on and which parameters were used. These models are packaged with the needed dependencies and artifacts (serialized model and architecture spec files) and are ready to be deployed to test or production environments on request.
- Deploy (test and production): Trained models need to be tested before being deployed to production systems to make business decisions. To ensure this, we deploy them to a test environment that replicates the production environment. The model under test is deployed as an API service to deployment targets such as Kubernetes clusters, container instances or scalable virtual machines, depending on the need and use case. It is then tested by running inference on test datasets drawn from production-like environments. The results are evaluated automatically or by a QA expert and, if the model performs better than the threshold, it is approved for deployment to production, where it will serve predictions in batches or in real time to make business decisions.
- Monitor: Once models are deployed to a test or production environment, it is critical to monitor model performance in real time or in batches to ensure the model is making optimal, efficient and correct business decisions. Monitoring is facilitated by a user interface that tracks model fairness, trust, transparency and error analysis. In some cases, the statistical properties of the target variable, which the model is trying to predict, may change over time in unforeseen ways. This is called drift. For example, users' preferences or sentiments may change rapidly in certain cases, making a recommendation model trained on historical preferences no longer relevant for predicting current preferences. Therefore, we have to watch for such shifts in target variables. When they occur, depending on the need we can a) switch or update the model, b) alert the product owner, or c) activate the re-training pipeline.
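One simple way to watch for the drift described above is to compare a live window of the monitored variable against a training-time baseline. The sketch below flags drift when the live mean moves too many standard errors from the baseline mean; the data, the z-score rule and the threshold are deliberately simple illustrations (production systems often use tests such as the population stability index or Kolmogorov-Smirnov instead):

```python
from statistics import mean, stdev

def detect_drift(baseline, live, z_threshold=3.0):
    # Flag drift when the live window's mean moves more than
    # z_threshold standard errors away from the baseline mean.
    standard_error = stdev(baseline) / len(baseline) ** 0.5
    z = abs(mean(live) - mean(baseline)) / standard_error
    return z > z_threshold

# Hypothetical monitored values (e.g. average predicted churn probability).
baseline = [0.1, 0.2, 0.15, 0.12, 0.18, 0.14, 0.16, 0.13]
stable   = [0.15, 0.14, 0.17, 0.12]   # looks like the training data
shifted  = [0.6, 0.7, 0.65, 0.72]     # target variable has moved
```

When `detect_drift` fires, the monitoring layer can take the actions listed above: switch or update the model, alert the product owner, or activate the re-training pipeline.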
All these phases are backed by the lineage of a machine learning model, referring to the origins of the model: which source code the model uses, which data it was trained on and which parameters were used. Having the full lineage available means that when a problem occurs it is easier to audit what caused it. Because machine learning models generate data when making predictions, this lineage can be added to the lineage of the data itself, which is important for certain compliance requirements.
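A lineage record like the one just described can be as simple as a versioned entry tying each model to its source commit, a fingerprint of its training data, and its parameters. The sketch below is a hypothetical in-memory registry, not the API of any particular tool:

```python
import hashlib
import json

def register_model(registry, name, commit, data_snapshot, params):
    # Record a model version with its full lineage: the source commit,
    # a fingerprint of the data it was trained on, and its parameters.
    version = len(registry.get(name, [])) + 1
    entry = {
        "version": version,
        "source_commit": commit,
        # Hashing the data snapshot lets an auditor verify later that
        # the exact same training data is being referenced.
        "data_fingerprint": hashlib.sha256(
            json.dumps(data_snapshot, sort_keys=True).encode()).hexdigest(),
        "params": params,
    }
    registry.setdefault(name, []).append(entry)
    return entry

registry = {}
register_model(registry, "churn-model", "a1b2c3d", [{"x": 1}], {"lr": 0.01})
entry = register_model(registry, "churn-model", "d4e5f6a",
                       [{"x": 1}, {"x": 2}], {"lr": 0.005})
```

With every version carrying this metadata, answering "which code and data produced the model that made this prediction?" becomes a lookup rather than an investigation.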
From an architectural point of view...
Credits: Microsoft Reference Architecture
This is a reference architecture from Microsoft for MLOps that applies to generic use cases. In this way of working:
- A data scientist commits a change to the repo and the build pipeline is triggered.
- The build pipeline runs steps such as code and data quality checks, provisions the required compute and triggers the ML pipeline.
- After a successful ML pipeline run, the trained model is evaluated (checked against previously trained models) and registered in the model repository, which triggers the model testing pipeline. Upon passing the testing pipeline, the model is approved and ready for deployment in the QA testing phase.
- The model is packaged into a Docker image, deployed as a web service on a Kubernetes cluster and put through quality assurance testing.
- Upon passing QA testing, the product owner manually approves a release to deploy the model into production.
- Test results and model inference input and output are monitored and stored for future reference.
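The QA gate in the steps above (evaluate in the test environment, then let the result drive the promotion decision) can be sketched as a small function. The toy model, test data and accuracy threshold here are hypothetical stand-ins for a real evaluation harness:

```python
def qa_gate(model, test_set, threshold=0.9):
    # Evaluate a candidate model on production-like test data and decide
    # whether it may be promoted; a human approval step would normally
    # follow an "approved" result, as in the reference architecture.
    accuracy = sum(model(x) == y for x, y in test_set) / len(test_set)
    return {"accuracy": accuracy, "approved": accuracy >= threshold}

def model(x):
    # Hypothetical trained model: predicts 1 when the feature exceeds 0.5.
    return int(x > 0.5)

# Production-like test examples: (feature, expected label).
test_set = [(0.9, 1), (0.1, 0), (0.7, 1), (0.3, 0), (0.6, 0)]

decision = qa_gate(model, test_set)
```

Here the model scores 0.8, below the 0.9 threshold, so the gate withholds approval; in the full lifecycle that outcome would route back to re-training rather than on to production.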
To replicate such ML lifecycle management, here are some recommended tools and frameworks that I have personally tested:
- GitLab (source code management and DevOps tool: GitLab is a web-based DevOps lifecycle tool that provides a Git repository manager with wiki, issue-tracking and CI/CD pipeline features.)
- Azure DevOps (source code management and DevOps tool: Azure DevOps provides source code repositories, version control, reporting, requirements management, agile project management, automated builds, testing and release management capabilities. It covers the entire application lifecycle, enables DevOps capabilities and is specially tailored for the Azure cloud.)
- Jenkins (CI/CD tool: Jenkins is a free and open-source automation server. It helps automate the non-human parts of the software development process with continuous integration, and facilitates the technical aspects of continuous delivery.)
- Azure ML (ML tool for the Azure cloud: Azure Machine Learning provides a cloud-based environment you can use to prep data, train, test, deploy, manage, and monitor machine learning models. This service is comprehensive but provisioned only for the Azure cloud.)
- Valohai (cloud-agnostic ML tool: a tool-agnostic machine/deep learning management platform where developers can manage experiments and infrastructure. It orchestrates end-to-end ML services and is cloud agnostic/independent.)
- MLflow (open-source ML tool: MLflow is a framework that supports the machine learning lifecycle. It has components to monitor your model during training and serving, the ability to store models, load them in production code and create pipelines. MLflow is cloud agnostic/independent.)
- Kubeflow (deployment tool: Kubeflow makes deployments of machine learning workflows on Kubernetes simple, portable and scalable.)
First get the process right; the tools then follow. With my AI team at Tieto I have explored and implemented this area in depth, and I would be glad to help you and your AI team ace the process and get up to speed with an efficient way of working. By selecting the right process and tools, you are all set to implement your Machine Learning lifecycle. With this, your team can be assured of a robust, scalable and repeatable way of working for your organization.
Feel free to get in touch and please share your thoughts via social media.