MIL Team Compression
Machine Intelligence Lab, MIPT,
GVA Ltc, Skolkovo Residency
2017-2022
Compression Group Overview

Lower RAM, energy and CPU/GPU consumption;
Fewer resources needed to find the best model.

✔ Compression methods for complex architectures such as Transformers;
✔ Measurable optimisation results and fair measurements;
✔ Broad capability to develop compression methods for new models;
✔ Contracts: 7 research projects with compression targets.
The Compression Group helps clients reduce the computational complexity of DL models at both the learning and inference stages, which delivers cost reduction and technology dissemination.

Our Modules:
● Infrastructure for Fair Measurements and Device Transfer;
● Methods for Inference Stage Optimisation - delivers the optimal model architecture and structure;
● Methods for Learning Stage Optimisation.

Product Applicability:
● Device Transfer;
● Models Learning Process;
● Models Inference Process;
● New Processors;
● Fair or low-bit quantization of DL models;
● Fixed number of MAC operations.

Our Clients:
● Examples of topics: deep tech (from biometrics to software), health & entertainment, fintech & AR/VR, etc.;
● Examples of industries: car manufacturers, electronics and devices, internet companies, innovation, etc.;
● Their problem: rapid growth of ML products combined with a lack of infrastructure and computational resources.
For companies that:

● are Technological Companies whose Core Business is strongly linked to AI R&D Results: huge computational resources are spent on training and inference, and there is a trend toward transferring models to devices;
● are High-tech companies whose solutions are built on complex Deep Learning technologies, such as Transformers or img2img;
● are Start-ups developing an ML-based Product, where the internal team is focused on the Product rather than on Optimization.
Compression: What for?

Model Development Optimization - motivation for learning improvement:
● The process of searching for the best architecture and solution is time-consuming, unpredictable and non-deterministic;
● Researchers waste expensive time developing models for uniform tasks, similar datasets, nearby domains, etc.;
● Researchers often do not carry out a complete search, missing more effective solutions for the problem statement.

Model Optimization - motivation for implementing optimization:
● Needless overparameterization of Deep Learning models;
● Complexity of the transfer-to-device process: original high-quality models do not fit on the device;
● The full advantages of low-bit processors are not realized on uncompressed models;
● Real-time apps require fast inference;
● High infrastructure costs due to the consumption of RAM, CPU, SSD, GPU, etc.
Why Outsource?

Clients' Restrictions:
● Lack of in-house Compression Experts;
● Ready-made optimization methods do not make it possible to stay within the constraints;
● Difficult choice of the most efficient method for each problem statement;
● The optimality of a solution obtained by an internal team is questionable;
● Long-term development process for such complex problem statements;
● High risk of ending up without an applicable result.

Why can Compression be outsourced?
● It is a task that can be handed off, with an understandable and measurable result;
● In-house specialists stay focused on quality and solution principles, not on the optimization problem;
● The validation process for the result is unified and portable across domains;
● Expertise in the problem accumulates with those who have worked on it extensively;
● Financial planning, research transparency and control of software deliveries;
● It is easy to calculate the economics of the proposed solution.
Our Vision and Proposition

The Compression team delivers a final result for the Deep Learning model compression task that meets both the quality and the computational limits.

Custom Compression Research - benefits of custom:
● Any architecture can be compressed;
● A fixed team of experts works on the task;
● Uniqueness of the final solution.

Platform with ready-made scripts - benefits of readiness:
● The most popular architectures are already covered;
● Timelines and costs are lower than for custom work;
● Understandable limitations and final quality.

Our Technologies:
● Fast Model Assessment Methods
● Neural Architecture Search
● Quantization
● Pruning
● Knowledge Distillation
● Effective Training Methods
Compression: Model Optimization with Neural Architecture Search

NAS (Neural Architecture Search): methods for the automatic design of neural network architectures.

NAS methods:
● One-shot and weight-sharing
● Differentiable Architecture Search
● Hyper-Networks
● BigNAS Methods
● etc…

Outcomes:
● Modern NAS methods achieve SotA results; some do not require fine-tuning or retraining;
● Search cost drops from 40,000 GPU-hours to 4-200 GPU-hours;
● Many architectures can be evaluated, and the best one chosen according to the restrictions;
● One training run - many ready-made models;
● Model ranking is an important step in one-shot NAS.
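To make the one-shot weight-sharing idea concrete, here is a minimal, hypothetical PyTorch sketch (illustrative only, not our production framework): all candidate operations live in one supernet, a random sub-architecture is sampled per batch, and many architectures can later be ranked from the single training run.

import random
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    # One searchable layer: every candidate op shares the supernet's weights.
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])

    def forward(self, x, choice):
        return self.candidates[choice](x)

class SuperNet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(MixedOp(channels) for _ in range(depth))

    def forward(self, x, arch):
        # `arch` is a list of op indices, one per layer.
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        return x

supernet = SuperNet()
# Single-path training step: sample one random sub-architecture per batch.
arch = [random.randrange(3) for _ in range(4)]
out = supernet(torch.randn(2, 16, 8, 8), arch)
# After training, candidate architectures are ranked with the shared weights
# ("one training run - many ready-made models") and the best one is selected.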
Compression: Model Optimization with Quantization

Quantization is used as a powerful method for specific hardware (low-bit processors).

Quantization methods:
● Post-training Quantization
● Quantization-aware Training
● Additive Powers-of-Two Quantization
● Learned Step-size Quantization
● etc…

Outcomes:
● Significant reduction in size and faster inference;
● Portability to mobile devices and low-bit processors;
● The latest methods make it possible to avoid accuracy loss on certain tasks;
● Large networks with more aggressive quantization are often better than small networks with soft quantization.
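To illustrate the quantization-aware training direction above, here is a minimal sketch of fake quantization with a learned step size, in the spirit of LSQ (the 4-bit signed range and the initial step value are illustrative assumptions, not tuned settings):

import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    # Simulates low-bit quantization in the forward pass; gradients flow
    # through the rounding via the straight-through estimator (STE).
    def __init__(self, bits=4):
        super().__init__()
        self.qmin = -(2 ** (bits - 1))
        self.qmax = 2 ** (bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(0.1))  # learned step size

    def forward(self, x):
        q = torch.clamp(x / self.step, self.qmin, self.qmax)
        q = q + (torch.round(q) - q).detach()  # STE: skip rounding in backward
        return q * self.step

quant = FakeQuant(bits=4)
w = torch.randn(8, requires_grad=True)
quant(w).sum().backward()  # gradients reach both the weights and the step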
Compression: Model Optimization with Pruning

The idea: zero out certain weights, either grouped (as filters, layers or blocks) or unstructured.

Pruning methods:
● Importance Based
● Weight Clustering
● Pruning-aware Training
● Structure Search
● etc…

Outcomes:
● A solution to the NN overparameterization problem;
● Pruning speeds up a NN if it changes the architecture parameters;
● Should be considered an effective method of tuning NN architecture hyperparameters;
● The more variable the output space, the more variability the network should provide.
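As a concrete instance of the importance-based variant, here is a minimal magnitude-pruning sketch (unstructured, with weight magnitude as the importance score; note that real inference speed-ups usually require structured pruning that removes whole filters or channels):

import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.5) -> None:
    # Zero out the `sparsity` fraction of weights with the lowest magnitude.
    w = layer.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).to(w.dtype)
    w.mul_(mask)  # small weights become exact zeros

layer = nn.Linear(128, 64)
magnitude_prune(layer, sparsity=0.7)
print((layer.weight == 0).float().mean())  # roughly 0.7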
Compression: Model Optimization with Knowledge Distillation

Knowledge distillation transfers “knowledge” from one network to another.

Distillation methods:
● Soft-labels method
● Convergence in Distillation
● “Teacher” assistants
● Ensembles of teachers and Self-distillation
● etc…

Outcomes:
● Increased accuracy of small networks;
● The possibility of retraining networks in the absence of the initial training data;
● Used as an auxiliary operation in many tasks;
● Accelerates student network convergence.
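As a reference point for the soft-labels method, here is a minimal Hinton-style distillation loss sketch (the temperature T and the mixing weight alpha are illustrative defaults, not tuned values):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Blend hard-label cross-entropy with soft-label KL divergence.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps gradient magnitudes comparable
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()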
Competitiveness and Uniqueness:
● Internal Frameworks: Quantization (from APoT to LSQ), Pruning (from HRank to Magnitude);
● Unified method usability; transition to end-to-end optimization and on-device deployment.
Experience and Results: 10+ Projects in Compression.

Features:
● Outperform TensorFlow & PyTorch tools;
● Our own frameworks and a high reuse rate;
● Experienced experts and a polished methodology.
Team Results Storyline:
● 2018 - First projects and Pruning expertise;
● 2020 - Several frameworks for fast development;
● Now - Compression Product for B2B clients;
● Future - New topics for effective compression.
Examples:
● Low-bit quantization of Transformer-based ASR models for cost reduction on low-bit CPUs;
● Quantization and Distillation of real-time audio denoising models for porting complex models to device;
● Pruning of PyNet and HRNet architectures to speed up calculations;
● etc…
Challenges Solved:
● Minimal loss of model quality with significant compression for complex architectures, from Transformer-based to img2img;
● Fast choice of an approach applicable under the problem statement's limitations.
Benefits from Compression

Model Development Optimization - benefits for learning improvement:
● The process of automated model selection for uniform tasks reduces research costs by 60%;
● Predictable results when selecting optimal architectures, with a timeline 3x faster than the classic one;
● Predictable architecture-selection results at 93%+ of the quality of the best architecture.

Model Optimization - benefits of model optimization:
● Transfer-ready models that meet the quality and computational-resource limits;
● 2-bit, 3-bit or 4-bit Quantization with expected accuracy drops of 7%, 3% and 1% respectively on complex models;
● Decrease in the number of operations of up to 5x;
● Inference acceleration of up to 10x;
● Infrastructure cost reduction of up to 70% in inference mode.
Story 1

Project goals:
Select a strategy for quantizing a Transformer-based ASR model, making changes to the network architecture if necessary, so that the quality of the quantized model (WER metric) is not much worse than that of the full-precision model (a quality drop of a few percent is allowed).
MIL Team's solution:
Take an available SOTA implementation of a Transformer-based ASR architecture and embed quantization into it. For successful and fair quantization it is necessary, first, to implement quantized versions of all modules used inside the Torch model. Second, a convenient tool is needed to replace the original Torch modules with quantized ones. Third, a series of experiments must be carried out to select the best quantization strategy.
To build the model, we used:
● the ASR Transformer architecture implemented in the open Fairseq repository, presented in the article “Transformers with convolutional context for ASR”;
● the LibriSpeech dataset for training the speech recognition model, consisting of pairs of audio and text files with English speech.
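The module-replacement tool described above can be pictured with a short, hedged sketch (QuantLinear, QuantConv1d and their from_float constructors are hypothetical stand-ins, not the project's actual classes):

import torch.nn as nn

def replace_modules(model: nn.Module, mapping: dict) -> nn.Module:
    # Recursively swap original Torch modules for their quantized versions.
    for name, child in model.named_children():
        if type(child) in mapping:
            # Assumes each quantized class can be built from its float original.
            setattr(model, name, mapping[type(child)].from_float(child))
        else:
            replace_modules(child, mapping)
    return model

# Hypothetical usage on the ASR model:
# replace_modules(asr_model, {nn.Linear: QuantLinear, nn.Conv1d: QuantConv1d})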
Story 2

Project goals:
Develop and implement methods for fast and accurate comparison of neural network architectures. The implemented methods should be significantly faster than the direct approach of fully training each network individually, with only a slight drop in ranking quality. In particular, a tenfold acceleration of architecture comparison should be obtained with a loss of ranking quality of no more than 10% (in terms of the ranking metric, Kendall Tau).
MIL Team's solution:
Implement, analyze and improve various methods for assessing the quality of architectures. One of the implemented methods is the creation of a super-network spanning the space of considered architectures. This approach allows training all models from the space in one-shot mode and can significantly save time and computational resources. In addition, techniques such as evaluating model quality on less training data, stopping training early, and using classifiers and regression models were considered.
To build the model, we used:
● the open datasets ImageNet and CIFAR-10;
● a dataset of architectures annotated with the quality of their full training on ImageNet.
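For the ranking metric named in the goals, here is a minimal sketch of how a fast assessment method can be scored against full training with Kendall Tau (the accuracy values below are made up for illustration):

from scipy.stats import kendalltau

# Quality from full single-network training vs. the quality estimated by a
# fast method (e.g. a super-network) for the same candidate architectures.
full_training_acc = [71.2, 69.8, 73.5, 70.1, 72.4]  # illustrative values
fast_estimate_acc = [70.0, 68.5, 72.9, 70.3, 71.8]  # illustrative values

tau, _ = kendalltau(full_training_acc, fast_estimate_acc)
print(f"Kendall Tau = {tau:.3f}")  # 1.0 would mean a perfect ranking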
MIL Team Cooperation

Email: alex.goncharov@mil-team.com
Phone: +7 (915) 116 22 72
Site: mil-team.com