Understanding MLOps: A Comprehensive Guide to Best Practices
In recent years, I have been involved in deploying machine learning systems in actual production settings for clients in the Consumer Packaged Goods (CPG) and Healthcare sectors.
While grasping business needs, crafting models, and engaging with raw data remain significant hurdles, the most thrilling aspect has been the project's industrialization.
A persistent challenge we face is identifying the best industry practices for maintaining machine learning systems in a production environment. A recent study by Kreuzberger et al. [1] offers insights into what Machine Learning Operations (MLOps) entails, including its principles, the architectural components that support these principles, the roles involved in MLOps projects, and a potential architecture for such systems. It also outlines best practices from an academic viewpoint.
Principles of MLOps
To gain a clearer understanding of MLOps and its objectives, several guiding principles have been identified for projects:
- P1 - CI/CD Automation: This principle facilitates the rapid building, testing, and deployment of code, enhancing team productivity.
- P2 - Workflow Orchestration: Essential for coordinating all steps in an ML workflow pipeline, such as data ingestion, preprocessing, model training/testing, and deployment.
- P3 - Reproducibility: It must be possible to replicate past experiments and obtain the same results (both code and models).
- P4 - Versioning: Tracking versions of code, data, and models is vital and supports reproducibility.
- P5 - Collaboration: This involves both technical cooperation (working on the same code, pipelines, etc.) and business engagement, where data scientists must understand business problems and set clear expectations.
- P6 - Continuous ML Training & Evaluation: A key principle that emphasizes the need for regular retraining and evaluation of models.
- P7 - ML Metadata Tracking and Logging: Attaching metadata to each model, including evaluation metrics and code versions, is essential for managing production systems.
- P8 - Continuous Monitoring: Monitoring helps assess model performance, providing insights into retraining needs or the necessity for entirely new models.
- P9 - Feedback Loops: The iterative nature of the process means feedback loops enhance the overall system.
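Principles P3, P4, and P7 all hinge on being able to tie a model back to the exact code, data, and parameters that produced it. As a minimal sketch of that idea (the commit hash and parameter names below are hypothetical), one can fingerprint each artifact with a content hash and bundle the versions into run metadata:

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Return a short, deterministic hash for any JSON-serializable artifact."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

# Tie together the versions a training run depends on (P3/P4/P7).
run_metadata = {
    "code_version": "git:3f2a9c1",  # hypothetical commit hash
    "data_version": fingerprint([[1.0, 2.0], [3.0, 4.0]]),
    "params": {"learning_rate": 0.01, "epochs": 20},
}
run_metadata["run_id"] = fingerprint(run_metadata)
```

In practice, tools such as Git, DVC, or MLflow provide this bookkeeping out of the box; the point is that every run carries enough metadata to be reproduced.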
Components
Each principle is supported by specific technical components:
- C1 - CI/CD Component: Supports continuous integration and delivery, enabling continuous ML training and evaluation, while accelerating feedback loops.
- C2 - Source Code Repository: Vital for collaboration and versioning.
- C3 - Workflow Orchestration Component: Facilitates pipeline orchestration, reproducibility, and continuous ML training and evaluation.
- C4 - Feature Store System: Comprising two databases—one for offline training and another for online predictions—this component is critical. For example, when forecasting sales, past sales data must be readily accessible for predictions.
- C5 - Model Training Infrastructure: Necessary infrastructure (CPUs, GPUs, etc.) is required for continual model training and evaluation.
- C6 - Model Registry: A repository for storing model images and metadata, aiding in model distribution and deployment.
- C7 - ML Metadata Stores: Governs the various metadata generated by components like CI/CD and workflow orchestration.
- C8 - Model Serving Component: Responds to requests, typically via REST API, and must be scalable, often utilizing Kubernetes.
- C9 - Monitoring Component: Assesses model performance and tracks production activities, facilitating feedback loops.
These concepts are summarized visually below:
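To make the feature store (C4) more concrete, here is a deliberately tiny sketch of its two-database idea: an append-only offline log that serves historical rows for training, and a key-value online view that holds only the latest values for low-latency predictions. The class and entity names are illustrative, not from the paper:

```python
from collections import defaultdict

class MinimalFeatureStore:
    """Toy feature store: an append-only offline log for training and
    a key-value online view holding only the latest feature values."""

    def __init__(self):
        self.offline = defaultdict(list)  # entity_id -> full history
        self.online = {}                  # entity_id -> latest row

    def ingest(self, entity_id, features: dict):
        self.offline[entity_id].append(features)
        self.online[entity_id] = features  # refresh low-latency serving view

    def training_frame(self, entity_id):
        return self.offline[entity_id]     # historical rows for training

    def serve(self, entity_id):
        return self.online[entity_id]      # latest row for online predictions

store = MinimalFeatureStore()
store.ingest("store_42", {"weekly_sales": 120.0, "promo": 0})
store.ingest("store_42", {"weekly_sales": 150.0, "promo": 1})
```

Real feature stores (e.g., Feast) add point-in-time joins, TTLs, and backfills, but the offline/online split shown here is the core design.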
Roles
Structuring an ML project using Agile methodologies can be challenging; however, MLOps aids in this regard. The diverse engineering roles allow for rapid experimentation and iteration. In an MLOps project, several roles mirror those found in typical Agile projects:
- R1 - Business Stakeholder: Identifies business objectives and communicates between the team and company stakeholders.
- R2 - Solution Architect: Responsible for technology design and selection.
- R3 - Data Scientist: Converts business needs into analytical requirements and develops ML models.
- R4 - Data Engineer: Builds and manages data and feature pipelines.
- R5 - Software Engineer: Implements software design patterns and best practices for project engineering.
- R6 - DevOps Engineer: Constructs pipelines, ensuring effective CI/CD automation and ML workflow orchestration.
- R7 - ML/MLOps Engineer: A cross-domain role focused on building and automating ML infrastructure, workflows, and model deployment.
The following illustration summarizes these roles and their interactions:
As depicted, the MLOps Engineer plays a vital cross-functional role, ensuring seamless integration of all components. Collaboration among all roles is essential for delivering a high-quality MLOps project.
Architecture and Workflow
According to Kreuzberger et al. [1], a general, technology-agnostic architecture is proposed, aligning closely with the Team Data Science Process (TDSP).
We will first outline key steps related to the TDSP and then provide an overall architecture overview.
- MLOps Project Initiation
In this phase, referred to as Business Understanding (TDSP), the Business Stakeholder articulates project goals. The Solution Architect identifies suitable technologies, while the Data Scientist collaborates with the Product Owner to clarify the business problem and objectives, assessing the availability of data.
- Feature Engineering Pipeline
The study identifies a Feature Engineering Pipeline where Data Engineers and Data Scientists work together to identify features, ingest data, and preprocess it. This phase aligns with Data Acquisition and Understanding (TDSP) but emphasizes the importance of building automated data extraction and preprocessing pipelines.
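A common way to keep such a pipeline automated and testable is to express each stage (ingestion, cleaning, feature derivation) as a small function and compose them. The sketch below assumes rows arrive as dictionaries with hypothetical `units` and `price` fields; a real pipeline would be run by the workflow orchestration component (C3):

```python
def ingest(raw_rows):
    """Pull rows from the source system (stubbed here as a list)."""
    return list(raw_rows)

def drop_incomplete(rows):
    """Basic cleaning: discard rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def add_features(rows):
    """Derive features from the cleaned data."""
    for r in rows:
        r["revenue"] = r["units"] * r["price"]  # derived feature
    return rows

def run_pipeline(raw_rows, steps=(ingest, drop_incomplete, add_features)):
    data = raw_rows
    for step in steps:  # each stage is independently testable
        data = step(data)
    return data

rows = [
    {"units": 3, "price": 2.0},
    {"units": None, "price": 4.0},  # incomplete row, dropped
]
clean = run_pipeline(rows)
```

Because every stage is a pure function, each one can be unit-tested in CI (P1) and rerun deterministically (P3).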
- Experimentation
During experimentation, a Data Scientist analyzes the data, prepares it, creates a model, and eventually registers it in a model registry.
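The model registry (C6) at the end of this phase can be pictured as a versioned store with lifecycle stages. The following in-memory sketch (class and stage names are my own, not the paper's) shows the register-then-promote flow that registries such as MLflow's implement:

```python
class ModelRegistry:
    """Minimal registry: versioned models with staging/production stages."""

    def __init__(self):
        self._models = {}        # (name, version) -> entry
        self._next_version = {}  # name -> last version number

    def register(self, name, artifact, metrics):
        version = self._next_version.get(name, 0) + 1
        self._next_version[name] = version
        self._models[(name, version)] = {
            "artifact": artifact,
            "metrics": metrics,
            "stage": "staging",  # new models start outside production
        }
        return version

    def promote(self, name, version):
        self._models[(name, version)]["stage"] = "production"

    def production_model(self, name):
        # Return the newest version currently marked as production.
        for (n, v), entry in sorted(self._models.items(), reverse=True):
            if n == name and entry["stage"] == "production":
                return entry["artifact"]
        return None

registry = ModelRegistry()
v1 = registry.register("sales_forecaster", artifact="model-v1.bin",
                       metrics={"rmse": 12.3})
registry.promote("sales_forecaster", v1)
```

Keeping metrics alongside each version is what later lets the CI/CD component (C1) gate promotions on evaluation results.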
- Deployment
Once trained, the model is deployed to the serving layer.
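Conceptually, the model serving component (C8) wraps the trained model behind a request/response interface. The sketch below keeps that interface as a plain method so it stays self-contained; in a real deployment this logic would sit behind a REST framework (e.g., FastAPI or Flask) running on Kubernetes. `DummyModel` is a stand-in, not a real model:

```python
import json

class ModelServer:
    """Wraps a trained model behind a request/response interface."""

    def __init__(self, model):
        self.model = model  # loaded from the model registry in practice

    def handle(self, request_body: str) -> str:
        """Parse a JSON request, run inference, return a JSON response."""
        payload = json.loads(request_body)
        prediction = self.model.predict(payload["features"])
        return json.dumps({"prediction": prediction})

class DummyModel:
    def predict(self, features):
        return sum(features)  # stand-in for real inference

server = ModelServer(DummyModel())
response = server.handle('{"features": [1.0, 2.0, 3.0]}')
```

Separating the serving wrapper from the model itself is what lets the registry swap in a newly promoted model without changing the API.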
Putting It All Together
We have reviewed several components of this architecture. Now, let’s examine how these pieces interact:
Upon project initiation, data sources are shared with the data scientist. Connecting to raw data (B2) can be challenging due to the varied nature of data storage systems, leading to a Data Ingestion/Feature Engineering Pipeline.
Processed data should be stored in a feature store system that supports both offline training and online predictions. As new data arrives, an event-based approach can trigger the retraining/deployment pipeline, or retraining can be scheduled periodically based on data or model drift.
The monitoring component enables proactive measures and assesses model performance in production.
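One widely used way to turn "data drift" into a concrete retraining trigger is the Population Stability Index (PSI), which compares the binned distribution of a feature in training data against what the model currently sees in production. The threshold of 0.2 below is a common rule of thumb, not a value prescribed by the paper:

```python
import math

def psi(expected_counts, actual_counts):
    """Population Stability Index between two binned distributions.
    PSI < 0.1 is usually read as no significant shift; > 0.2 as major shift."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, 1e-6)  # clamp to avoid log(0)
        a_pct = max(a / a_total, 1e-6)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value

def should_retrain(expected_counts, actual_counts, threshold=0.2):
    """Drift check a monitoring job could run to trigger the retraining pipeline."""
    return psi(expected_counts, actual_counts) > threshold
```

A scheduled monitoring job would run this check per feature and, on breach, kick off the retraining/deployment pipeline described above.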
Conclusion
This study demonstrates an academic interest in understanding the behavior of machine learning models in real-world environments, offering a comprehensive overview of what a machine learning system should encompass and defining Machine Learning Operations.
It represents one of the initial frameworks that can enhance the success of machine learning projects.
Nonetheless, various factors contribute to the failure of machine learning initiatives, including the organizational impact of such changes. Transforming work practices and restructuring processes is a formidable challenge.
If you found this article valuable, please consider giving it a clap! To see more content, follow me on Medium!
Main Reference: [1] Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl, Machine Learning Operations (MLOps): Overview, Definition, and Architecture
Further Readings on Major Cloud Providers:
- Machine Learning Operations (MLOps) Framework with Azure (docs.microsoft.com): This project assisted a Fortune 500 food company in enhancing demand forecasting.
- MLOps: Continuous Delivery and Automation in Machine Learning (cloud.google.com): Discusses techniques for CI/CD implementation.
- MLOps - Machine Learning Operations on AWS (aws.amazon.com): Provides tools for delivering high-performance production ML models.
For those starting a new MLOps project, Microsoft offers accelerators that can assist:
- GitHub - Azure/mlops-v2 (github.com): A repository for Azure MLOps (v2) solution accelerators.