1 of 11

ML Model Deployment

Triton Inference Server

With a pinch of ONNX, TorchScript and TensorRT

2 of 11

About this Talk

  01 Theory lesson (Intro and Theory): We will start with the intro and explore what Triton solves.
  02 Hands-On: A Jupyter notebook walkthrough exploring the deployment.
  03 More to explore: A quick glimpse of additional exciting features.
  04 Resources: Quick references to learn and explore more.

3 of 11

What do we need?

  1. A simple and unified framework for all kinds of model deployment.
  2. Maximized throughput and minimized latency.
  3. Maximized device utilization.
  4. Runs on both CPU and GPU.
  5. Scaled inference, i.e., an architecture that can schedule requests when running on a cluster.
  6. Batch inference, ensemble inference, stream inference, and multi-model inference.
  7. Open source and backed by a large organization and research community.

Triton can handle all of the above. In short, it solves the server side of model deployment in production and provides a client API to query that server for inference. To set up the server, all you need to do is write the configuration files for your requirements; a minimal sketch follows.
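For illustration, here is a minimal, hypothetical model repository layout and config.pbtxt for a ResNet-50 exported to ONNX; the model name, tensor names, shapes, and data types are placeholders that depend on your actual model.

  model_repository/
    resnet50/
      1/
        model.onnx
      config.pbtxt

  # config.pbtxt (hypothetical ONNX classifier)
  name: "resnet50"
  backend: "onnxruntime"
  max_batch_size: 8
  input [
    {
      name: "input__0"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "output__0"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]

Point the server at the repository (e.g. tritonserver --model-repository=/path/to/model_repository) and it loads every model it finds there.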

4 of 11

Architecture Overview

  • Just build your client (see the sketch below).
  • Supports both HTTP and gRPC.
  • Built-in scheduler and model versioning.
  • Supports multiple backends: ONNX, TorchScript, TensorRT, PyTorch, and TensorFlow.
  • Hardware agnostic.
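As a sketch of what "just build your client" can look like in Python, assuming the hypothetical resnet50 model above and the tritonclient package (pip install tritonclient[http]):

  import numpy as np
  import tritonclient.http as httpclient

  # Connect to Triton's HTTP endpoint (port 8000 by default).
  client = httpclient.InferenceServerClient(url="localhost:8000")

  # Describe the input tensor and attach a dummy batch of data.
  inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
  inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

  # Request the output tensor and run inference.
  out = httpclient.InferRequestedOutput("output__0")
  result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
  print(result.as_numpy("output__0").shape)

The gRPC client (tritonclient.grpc) follows the same pattern.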

5 of 11

A pinch of ONNX, TorchScript and TensorRT

ONNX

  • Software agnostic: runs in C++ and Python.
  • Hardware agnostic: runs on CPU and GPU.

TorchScript

  • Creates serializable and optimizable models from PyTorch code.
  • Software agnostic: runs in C++ and Python.
  • If your model has control flow, TorchScript is superior to ONNX, because ONNX requires the model to be a DAG.
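As a rough sketch of producing both artifacts from the same PyTorch model (torchvision's resnet18 is just an arbitrary stand-in):

  import torch
  import torchvision

  model = torchvision.models.resnet18(weights=None).eval()
  example = torch.randn(1, 3, 224, 224)

  # TorchScript: trace the model (use torch.jit.script if it has control flow).
  scripted = torch.jit.trace(model, example)
  scripted.save("model.pt")

  # ONNX: export the graph as a DAG with a dynamic batch dimension.
  torch.onnx.export(
      model,
      example,
      "model.onnx",
      input_names=["input__0"],
      output_names=["output__0"],
      dynamic_axes={"input__0": {0: "batch"}, "output__0": {0: "batch"}},
  )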

6 of 11

TensorRT

  • TensorRT is highly optimized to run on NVIDIA GPUs.
  • It's likely the fastest way to run a model at the moment.
  • Only supported on GPUs.
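A rough sketch of building a TensorRT engine from an ONNX file with the TensorRT Python API (the exact calls vary between TensorRT versions; this follows the 8.x-style interface, and the trtexec command-line tool can do the same job):

  import tensorrt as trt

  logger = trt.Logger(trt.Logger.WARNING)
  builder = trt.Builder(logger)
  network = builder.create_network(
      1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
  )
  parser = trt.OnnxParser(network, logger)

  # Parse the ONNX graph into a TensorRT network definition.
  with open("model.onnx", "rb") as f:
      if not parser.parse(f.read()):
          raise RuntimeError("Failed to parse ONNX model")

  # Enable FP16 kernels where the hardware supports them.
  config = builder.create_builder_config()
  config.set_flag(trt.BuilderFlag.FP16)

  # Build and serialize the engine so Triton's TensorRT backend can load it.
  engine = builder.build_serialized_network(network, config)
  with open("model.plan", "wb") as f:
      f.write(engine)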

7 of 11


Hands On

8 of 11

More to explore

  • Model analyzer: Understand the compute and memory requirements of your models.
  • Performance analyzer: Measure changes in performance as you experiment with different optimization strategies.
  • Dynamic batching: Dynamically batch incoming requests to increase throughput (see the config sketch after this list).
  • Metrics: Triton provides Prometheus metrics indicating GPU and request statistics.
  • Ensemble Models: If you have multiple models in a single pipeline, you can leverage the Python runtime alongside compiled models.
  • Multi-Model Pipeline: Model can't fit on a single GPU? Use multi-GPU inference with pipeline parallelism.
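For example, dynamic batching is switched on with a small addition to a model's config.pbtxt (the batch sizes and queue delay below are arbitrary and should be tuned with the performance analyzer):

  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }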

9 of 11

Resources

10 of 11

Reach out

  • If you are working on Generative AI, Micro SaaS, or ML model deployments, do reach out if you want to share ideas or brainstorm.
  • If you need recommendations on how to contribute to open source, or want to know about ML projects you can contribute to, reach out as well.

11 of 11

Thank you

My name is Rohit Gupta

  • ML Engineer @ Mazaal AI
  • Ex-PyTorch Lightning Core

Feel free to reach out/follow :)