ML Model Deployment
Triton Inference Server
With a pinch of ONNX, TorchScript and TensorRT
About this Talk
01 Intro and Theory: We will start with the intro and explore what Triton solves (a short theory lesson)
02 Hands-On: A Jupyter notebook walkthrough exploring the deployment
03 More to explore: A quick glimpse of additional exciting features
04 Resources: Quick references to learn and explore more
What do we need?
Triton can handle all of the above. In short, it solves the server side of model deployment in production and gives you a client API to ping that server for inference. To set up the server, all you need to do is write the configuration files to match your requirements.
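To make this concrete, here is a minimal sketch of a model repository and config.pbtxt, written from Python as in the notebook walkthrough; the model name "my_model", the ONNX backend and the tensor names/shapes are assumptions for illustration, not values from the talk.

# Minimal sketch of a Triton model repository, assuming an ONNX model
# named "my_model" with a single FP32 image input and a single output.
# Layout expected by Triton: <repo>/<model_name>/<version>/model.onnx
from pathlib import Path

repo = Path("model_repository/my_model")
(repo / "1").mkdir(parents=True, exist_ok=True)
# The exported model file would be copied to: repo / "1" / "model.onnx"

config = """
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"        # assumed tensor name from the export step
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"       # assumed tensor name
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""
(repo / "config.pbtxt").write_text(config)

The server itself is typically started from NVIDIA's Triton container, with tritonserver --model-repository pointing at this directory.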
Architecture Overview
Just build your client (a minimal client sketch follows this list)
Supports both HTTP and gRPC
Built-in scheduler and model versioning
Supports all major backends: ONNX, TorchScript, TensorRT, PyTorch and TensorFlow
Hardware agnostic.
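A minimal client sketch using the tritonclient HTTP API; the model and tensor names match the assumed config above and are illustrative only.

# Minimal client sketch, assuming the server configured above is running
# locally on the default HTTP port (8000).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: one FP32 image-shaped tensor (batch of 1).
infer_input = httpclient.InferInput("input", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
requested_output = httpclient.InferRequestedOutput("output")

# Ping the server for inference and read the result back as a NumPy array.
response = client.infer(model_name="my_model",
                        inputs=[infer_input],
                        outputs=[requested_output])
print(response.as_numpy("output").shape)

The gRPC flavour (tritonclient.grpc, default port 8001) exposes essentially the same calls.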
A pinch of ONNX, TorchScript and TensorRT (export sketch below)
ONNX
TorchScript
TensorRT
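A rough export sketch of the three formats above, assuming a stock torchvision ResNet-18 purely as a stand-in model:

# Export sketch: the same PyTorch model to ONNX and TorchScript.
# ResNet-18 is only a placeholder; any nn.Module works the same way.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)

# ONNX: a framework-neutral graph that onnxruntime (and TensorRT) can load.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})

# TorchScript: a serialized, Python-free version of the model for the
# PyTorch (libtorch) backend.
torch.jit.trace(model, dummy).save("model.pt")

# TensorRT: an engine is usually built afterwards from the ONNX file,
# e.g. with the trtexec tool or the TensorRT Python API (not shown here).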
Hands-On
More to explore
Model Analyzer: Understand the compute and memory requirements of your models
Performance Analyzer: Measure changes in performance as you experiment with different optimization strategies
Dynamic Batching: Dynamically batch requests to increase throughput (config sketch below)
Metrics: Triton provides Prometheus metrics indicating GPU and request statistics
Ensemble Models / Multi-Model Pipelines: If you have multiple models in a single pipeline, you can leverage the Python runtime together with compiled models. Model can't fit on a single GPU? Use multi-GPU inference with pipeline parallelism
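For example, dynamic batching is just one more stanza in the (assumed) config.pbtxt from earlier; the batch sizes and queue delay below are illustrative values, not recommendations from the talk.

# Dynamic batching sketch: append a dynamic_batching stanza to the
# assumed config.pbtxt created above.
dynamic_batching = """
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
"""
with open("model_repository/my_model/config.pbtxt", "a") as f:
    f.write(dynamic_batching)

# The effect on throughput and latency can then be measured with the
# perf_analyzer CLI, and the Prometheus metrics are scraped from the
# server's metrics endpoint (http://localhost:8002/metrics by default).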
Resources
Reach out