1 of 7

MLOps Deep Dive

Dan Sun

2 of 7

Before we start …

Technologies are ever-changing, and different people have different opinions. The goal here is to focus on the fundamentals, which barely change over time (hopefully).

Terminology, not just in ML but in software development in general, is overloaded. This discussion tries to:

(1) include all searchable and accurate terms;

(2) provide short yet precise definitions of these terms;

(3) stick to one term while explaining a specific concept, to avoid confusion and ensure terminological consistency.

3 of 7

MLOps stands for Machine Learning Operations

MLOps = Data Engineering + Machine Learning + DevOps (e.g., MLInfra)

MLOps covers aspects related to automating, testing, managing, monitoring, and maintaining ML workflows in production environments.

The landscape is messy, with a huge number of tools, practices, and industry standards. However, it is not that hard to figure out what to learn or look at once we clearly see the core components of ML systems.

4 of 7

Data Engineering

Knowing Modern Data Architecture (MDA, also referred to as the Modern Data Stack (MDS)) is crucial, as data is the building block of machine learning. What on earth is MDA or MDS, then? Think of it as a collection of tools that allow you to do all possible actions you would want on your data.

  • Data Storage - storage engines
    • Data Lake (e.g., S3, GCS)
    • Data Warehouse (e.g., Redshift, BigQuery)
    • Feature Store (e.g., Feast, Tecton)
    • Other Databases
  • Data Processing - ETL/ELT data pipelines
    • Batch/Offline/Periodic Processing (e.g., Spark/Beam)
    • Stream/Near-Online/Near-Real-Time/Nearline Processing (e.g., Spark Streaming/Kafka/Flink)
    • Online/Real-Time/Immediate Processing
  • Data Versioning (e.g., DVC)
  • Orchestration (e.g., Airflow/Beam, Kubeflow) *
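To make the batch (offline) processing bullet concrete, here is a minimal extract-transform-load sketch in pure Python. All function names and records are hypothetical; in production this logic would run on an engine like Spark or Beam and read from a data lake (e.g., S3) rather than in-memory lists.

```python
# Hypothetical batch ETL sketch -- a stand-in for a Spark/Beam job.
from collections import defaultdict

def extract():
    # Stand-in for reading raw event records from a data lake.
    return [
        {"user_id": 1, "amount": 10.0},
        {"user_id": 1, "amount": 5.0},
        {"user_id": 2, "amount": 7.5},
    ]

def transform(records):
    # Aggregate spend per user -- a typical feature-engineering step.
    totals = defaultdict(float)
    for r in records:
        totals[r["user_id"]] += r["amount"]
    return dict(totals)

def load(features):
    # Stand-in for writing features to a warehouse or feature store.
    return features

def run_batch_pipeline():
    return load(transform(extract()))
```

An orchestrator such as Airflow would schedule these three steps as tasks in a DAG; the structure (extract → transform → load) stays the same.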

5 of 7

Machine Learning

This stage is often referred to as ML Experimentation, where ML scientists/researchers (1) perform EDA in an interactive notebook environment (e.g., Jupyter Notebook), (2) create prototype model architectures (e.g., a 2-layer LSTM model), and (3) implement transformation and training routines.

  • Data Ingestion & Validation (e.g., TFDV)
  • Exploratory Data Analysis (e.g., Matplotlib, Seaborn, Scipy)
  • Data Transformation & Feature Engineering (e.g., Numpy, Pandas, Dask, Spark)
  • Model Training & Model Development (e.g., Scikit-Learn, XGBoost, TensorFlow/Keras)
    • Distributed training (e.g., multiple GPUs)
  • Model Optimization (e.g., Scikit-Optimize)
    • Efficient hyperparameter tuning at scale
    • Automated (1) feature selection (2) feature engineering (3) model architecture selection (e.g., AutoML)
  • Model Evaluation
    • Perform batch scoring on the validation set (≠ live data) in a validation environment (one level below the production env) at scale
    • Calculate pre-defined metrics on different pieces of data
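A minimal sketch of the train-then-evaluate loop described above, using scikit-learn on a synthetic dataset (the dataset, model choice, and split sizes are illustrative, not prescribed by this deck):

```python
# Illustrative model training + batch scoring on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real training dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Batch scoring on the validation set (not live data),
# then pre-defined metrics on the results.
y_pred = model.predict(X_val)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
```

In a real system the same evaluation routine would run at scale in a validation environment, and the metrics would gate whether the model moves toward deployment.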

6 of 7

DevOps - Automated Deployment & Serving

ML Deployment - release executable ML applications (for later ML Serving) produced by ML experimentation, by pushing (deploying) them to pre-determined target envs (e.g., cloud-based servers, on-premises servers, edge devices).

ML Serving - expose target envs to receive live traffic (e.g., unseen data, prediction requests) and trigger the ML inference pipeline to generate predictions (inference results) in order to serve end users.
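To make the serving side concrete, here is a hedged sketch of an online-serving request handler: a deployed artifact (the "model") is loaded once, and incoming prediction requests are parsed, scored, and serialized back. The model, payload format, and function names are all hypothetical; a real system would load a serialized model from a registry and sit behind a web framework or model server.

```python
import json

# Hypothetical deployed "model": in reality this would be a serialized
# artifact (e.g., a pickled scikit-learn model) loaded from a registry.
def model_predict(features):
    # Toy threshold scorer standing in for real inference.
    return sum(features) > 1.0

def handle_request(raw_body: str) -> str:
    # Parse a prediction request, run inference, serialize the result.
    payload = json.loads(raw_body)
    prediction = model_predict(payload["features"])
    return json.dumps({"prediction": bool(prediction)})
```

For example, `handle_request('{"features": [0.7, 0.9]}')` returns a JSON body with `"prediction": true`, since the toy score exceeds the threshold.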

  • ML Serving (based on certain SLAs)
    • Offline Serving / Batch Prediction (Inference)
    • Stream Serving / Near-Real-Time Prediction (Inference)
    • Online Serving / Real-Time Prediction (Inference)
  • ML Deployment (based on certain SLAs)
    • What to push (deploy)?
      • Executable ML applications: containerized ML pipelines aka container images (stored in container registry, e.g., GCR); Python packages (stored in artifact registry)
    • Where to push (deploy)? (aka where are those target envs)
      • Deploy in Cloud-based Servers
        • Deploy as a web app in Heroku (PaaS) or in a compute instance (IaaS)
        • Deploy as a container in a K8s cluster (running in GCP ⇔ GKE) *
      • Deploy in On-premises Servers
      • Deploy in Edge Devices
  • CI/CD (e.g., concourse)
  • ML Testing
    • (Pipeline) Code (unit and integration test; smoke test)
    • Data (schema; individual values)
    • Model (artifacts)
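As an illustration of the data-testing bullet above, here is a minimal schema check in the spirit of tools like TFDV: validate column presence, types, and value ranges for each record. The schema and records are made up for the example.

```python
# Minimal data validation sketch: column presence, type, and range checks.
SCHEMA = {
    "age":    {"type": int,   "min": 0,   "max": 130},
    "amount": {"type": float, "min": 0.0, "max": 1e6},
}

def validate_record(record):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for col, spec in SCHEMA.items():
        if col not in record:
            errors.append(f"missing column: {col}")
            continue
        value = record[col]
        if not isinstance(value, spec["type"]):
            errors.append(f"{col}: wrong type {type(value).__name__}")
        elif not (spec["min"] <= value <= spec["max"]):
            errors.append(f"{col}: value {value} out of range")
    return errors
```

In a CI/CD pipeline, a check like this would run before training or deployment and fail the build when incoming data violates the expected schema.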

7 of 7

DevOps - Automated Monitoring

Monitoring ML systems includes (1) model effectiveness (identifying different types of drift before model performance degrades), and (2) model serving efficiency (low latency for online serving; high throughput for offline serving).

  • Resource utilization (e.g., CPUs, GPUs, memory)
  • Statistics (e.g., t-test, p-value, ANOVA, Kolmogorov-Smirnov (KS) test)
  • ML-related metrics (e.g., precision/recall, SHAP (Shapley) values)
  • Business metrics or other drift detection methodologies
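The statistical checks listed above can be sketched as a simple drift detector. This example uses a two-sample Kolmogorov-Smirnov test via SciPy to compare a live feature distribution against the training reference; the threshold, sample sizes, and synthetic data are illustrative choices, not recommendations.

```python
# Drift detection sketch: two-sample KS test between a reference
# (training-time) feature distribution and the live distribution.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.01):
    # A small p-value means the live distribution differs
    # significantly from the reference -> flag drift.
    _, p_value = ks_2samp(reference, live)
    return bool(p_value < alpha)

# Synthetic stand-in for a monitored feature.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=500)
```

Comparing the reference against itself yields no drift, while a clearly shifted distribution (e.g., `reference + 3.0`) is flagged; in production such a flag would feed an alerting or retraining pipeline.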