1 of 7

MLOps Deep Dive

Dan Sun

2 of 7

Before we start …

Technologies are ever-changing, and different people have different opinions. The goal here is to focus on the fundamentals, which barely change over time (hopefully).

Terminology, not just in ML but in software development in general, is overloaded. This discussion tries to:

(1) include all searchable and accurate terms;

(2) provide short yet precise definitions of these terms;

(3) stick to one term while explaining a specific concept, to avoid confusion and ensure terminological consistency.

3 of 7

MLOps stands for Machine Learning Operations

MLOps = Data Engineering + Machine Learning + DevOps (e.g., MLInfra)

MLOps covers aspects related to automating, testing, managing, monitoring, and maintaining ML workflows in production environments.

The landscape is messy, with a huge number of tools, practices, and industry standards. However, it is not that hard to figure out what to learn or look at once we clearly see the core components of ML systems.

4 of 7

Data Engineering

Knowing Modern Data Architecture (MDA, also referred to as the Modern Data Stack (MDS)) is crucial, as data is the building block of machine learning. What on earth is MDA or MDS, then? Think of it as a collection of tools that allow you to do all possible actions you would want on your data.

  • Data Storage - storage engines
    • Data Lake (e.g., S3, GCS)
    • Data Warehouse (e.g., Redshift, BigQuery)
    • Feature Store (e.g., Feast, Tecton)
    • Other Databases
  • Data Processing - ETL/ELT data pipelines
    • Batch/Offline/Periodic Processing (e.g., Spark/Beam)
    • Stream/Near-Online/Near-Real-Time/Nearline Processing (e.g., Spark Streaming/Kafka/Flink)
    • Online/Real-Time/Immediate Processing
  • Data Versioning (e.g., DVC)
  • Orchestration (e.g., Airflow/Beam, Kubeflow) *
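To make the batch (offline) processing bullet concrete, here is a minimal extract-transform-load sketch in pure Python. All function names and records are hypothetical; in production this logic would run on an engine like Spark or Beam and read from a data lake (e.g., S3) rather than in-memory lists.

```python
# Hypothetical batch ETL sketch -- a stand-in for a Spark/Beam job.
from collections import defaultdict

def extract():
    # Stand-in for reading raw event records from a data lake.
    return [
        {"user_id": 1, "amount": 10.0},
        {"user_id": 1, "amount": 5.0},
        {"user_id": 2, "amount": 7.5},
    ]

def transform(records):
    # Aggregate spend per user -- a typical feature-engineering step.
    totals = defaultdict(float)
    for r in records:
        totals[r["user_id"]] += r["amount"]
    return dict(totals)

def load(features):
    # Stand-in for writing features to a warehouse or feature store.
    return features

def run_batch_pipeline():
    return load(transform(extract()))
```

An orchestrator such as Airflow would schedule these three steps as tasks in a DAG; the structure (extract → transform → load) stays the same.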

5 of 7

Machine Learning

This stage is often referred to as ML Experimentation, where ML scientists/researchers (1) perform EDA in an interactive notebook environment (e.g., Jupyter Notebook), (2) create prototype model architectures (e.g., a 2-layer LSTM model), and (3) implement transformation and training routines.

  • Data Ingestion & Validation (e.g., TFDV)
  • Exploratory Data Analysis (e.g., Matplotlib, Seaborn, Scipy)
  • Data Transformation & Feature Engineering (e.g., Numpy, Pandas, Dask, Spark)
  • Model Training & Model Development (e.g., Scikit-Learn, XGBoost, TensorFlow/Keras)
    • Distributed training (e.g., multiple GPUs)
  • Model Optimization (e.g., Scikit-Optimize)
    • Efficient hyperparameter tuning at scale
    • Automated (1) feature selection (2) feature engineering (3) model architecture selection (e.g., AutoML)
  • Model Evaluation
    • Perform batch scoring on the validation set (≠ live data) in a validation environment (one level below the production env) at scale
    • Calculate pre-defined metrics on different pieces of data
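A minimal sketch of the train-then-evaluate loop described above, using scikit-learn on a synthetic dataset (the dataset, model choice, and split sizes are illustrative, not prescribed by this deck):

```python
# Illustrative model training + batch scoring on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real training dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Batch scoring on the validation set (not live data),
# then pre-defined metrics on the results.
y_pred = model.predict(X_val)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
```

In a real system the same evaluation routine would run at scale in a validation environment, and the metrics would gate whether the model moves toward deployment.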

6 of 7

DevOps - Automated Deployment & Serving

ML Deployment - release executable ML applications (for later ML Serving) produced by ML experimentation, by pushing (deploying) them to pre-determined target envs (e.g., cloud-based servers, on-premises servers, edge devices).

ML Serving - expose target envs to receive live traffic (e.g., unseen data, prediction requests) and trigger the ML inference pipeline to generate predictions (inference results) in order to serve end users.
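To make the serving side concrete, here is a hedged sketch of an online-serving request handler: a deployed artifact (the "model") is loaded once, and incoming prediction requests are parsed, scored, and serialized back. The model, payload format, and function names are all hypothetical; a real system would load a serialized model from a registry and sit behind a web framework or model server.

```python
import json

# Hypothetical deployed "model": in reality this would be a serialized
# artifact (e.g., a pickled scikit-learn model) loaded from a registry.
def model_predict(features):
    # Toy threshold scorer standing in for real inference.
    return sum(features) > 1.0

def handle_request(raw_body: str) -> str:
    # Parse a prediction request, run inference, serialize the result.
    payload = json.loads(raw_body)
    prediction = model_predict(payload["features"])
    return json.dumps({"prediction": bool(prediction)})
```

For example, `handle_request('{"features": [0.7, 0.9]}')` returns a JSON body with `"prediction": true`, since the toy score exceeds the threshold.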

  • ML Serving (based on certain SLAs)
    • Offline Serving / Batch Prediction (Inference)
    • Stream Serving / Near-Real-Time Prediction (Inference)
    • Online Serving / Real-Time Prediction (Inference)
  • ML Deployment (based on certain SLAs)
    • What to push (deploy)?
      • Executable ML applications: containerized ML pipelines aka container images (stored in container registry, e.g., GCR); Python packages (stored in artifact registry)
    • Where to push (deploy)? (aka where are those target envs)
      • Deploy in Cloud-based Servers
        • Deploy as a web app in Heroku (PaaS) or in a compute instance (IaaS)
        • Deploy as a container in a K8s cluster (running in GCP ⇔ GKE) *
      • Deploy in On-premises Servers
      • Deploy in Edge Devices
  • CI/CD (e.g., concourse)
  • ML Testing
    • (Pipeline) Code (unit and integration test; smoke test)
    • Data (schema; individual values)
    • Model (artifacts)
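As an illustration of the data-testing bullet above, here is a minimal schema check in the spirit of tools like TFDV: validate column presence, types, and value ranges for each record. The schema and records are made up for the example.

```python
# Minimal data validation sketch: column presence, type, and range checks.
SCHEMA = {
    "age":    {"type": int,   "min": 0,   "max": 130},
    "amount": {"type": float, "min": 0.0, "max": 1e6},
}

def validate_record(record):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for col, spec in SCHEMA.items():
        if col not in record:
            errors.append(f"missing column: {col}")
            continue
        value = record[col]
        if not isinstance(value, spec["type"]):
            errors.append(f"{col}: wrong type {type(value).__name__}")
        elif not (spec["min"] <= value <= spec["max"]):
            errors.append(f"{col}: value {value} out of range")
    return errors
```

In a CI/CD pipeline, a check like this would run before training or deployment and fail the build when incoming data violates the expected schema.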

7 of 7

DevOps - Automated Monitoring

Monitoring ML systems includes (1) model effectiveness (identifying different types of drift before model performance degrades), and (2) model serving efficiency (low latency for online serving; high throughput for offline serving).

  • Resource utilization (e.g., CPUs, GPUs, memory)
  • Statistics (e.g., t-test, p-value, ANOVA, Kolmogorov-Smirnov (KS) test)
  • ML-related metrics (e.g., precision/recall, SHAP (Shapley) values)
  • Business metrics or other drift detection methodologies
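The statistical checks listed above can be sketched as a simple drift detector. This example uses a two-sample Kolmogorov-Smirnov test via SciPy to compare a live feature distribution against the training reference; the threshold, sample sizes, and synthetic data are illustrative choices, not recommendations.

```python
# Drift detection sketch: two-sample KS test between a reference
# (training-time) feature distribution and the live distribution.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.01):
    # A small p-value means the live distribution differs
    # significantly from the reference -> flag drift.
    _, p_value = ks_2samp(reference, live)
    return bool(p_value < alpha)

# Synthetic stand-in for a monitored feature.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=500)
```

Comparing the reference against itself yields no drift, while a clearly shifted distribution (e.g., `reference + 3.0`) is flagged; in production such a flag would feed an alerting or retraining pipeline.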