1 of 18

MIL Team Compression

Machine Intelligence Lab, MIPT,

GVA Ltc, Skolkovo Residency

2017-2022

2 of 18

Overview


3 of 18

Compression Group Overview

The Compression Group helps clients reduce the computational complexity of DL models at both the learning and inference stages, which delivers cost reduction and technology dissemination:

✔ Lower RAM, energy, and CPU/GPU consumption;
✔ Fewer resources needed to find the best model;
✔ Methods for complex architectures such as Transformers;
✔ Measured optimisation results and fair measurements;
✔ Broad applicability to new model compression methods;
✔ Contracts: 7 research projects with compression targets.

Our Modules:

● Infrastructure for Fair Measurements and Device Transfer;
● Methods for Inference Stage Optimisation: deliver the most optimal model architecture and structure;
● Methods for Learning Stage Optimisation.

Product Applicability:

● Device Transfer;
● Model Learning Process;
● Model Inference Process;
● New Processors: fair or low-bit DL model quantization, fixed number of MAC operations.

4 of 18

Clients

For companies that fit one of two profiles:

Technological Companies, whose core business is strongly linked to AI R&D results:

01 The company's solutions are built on complex Deep Learning technologies such as Transformers or img2img;
02 Huge computational resources are spent on training and inference, and there is a trend of transferring models to devices;
03 Examples of industries: car manufacturers, electronics and devices, internet companies, innovation, etc…

High-tech Start-ups that develop an ML-based product:

01 The internal team is focused on the Product, not on Optimization;
02 The ML product grows rapidly while infrastructure and computational resources are lacking;
03 Examples of topics: deep tech (from biometry to software), health & entertainment, fintech & AR/VR, etc…

5 of 18

Compression: What for?

Motivation for learning improvement (Model Development Optimization):

01 Searching for the best architecture and solution is time-consuming, unpredictable, and non-deterministic;
02 Researchers waste expensive time developing models for uniform tasks, similar datasets, and nearby domains;
03 Researchers often do not carry out a complete search, missing more effective solutions to the problem statement.

Motivation for implementing optimization (Model Optimization):

01 Useless overparameterization of Deep Learning models;
02 Complexity of the transfer-to-device process: original high-quality models do not fit on the device;
03 The full range of advantages of low-bit processors is not revealed on non-compressed models;
04 Real-time apps require fast inference;
05 High infrastructure costs due to the consumption of RAM, CPU, SSD, GPU, etc.

6 of 18

Why Outsource?

Client restrictions:

01 Lack of in-house Compression experts;
02 Ready-made optimization methods do not keep within the constraints;
03 Difficult choice of the most efficient method for each problem statement;
04 The optimality of a solution obtained by the internal team is questionable;
05 High risk of ending up with an inapplicable result;
06 Long-term development process for such complex problem statements.

Why can Compression be outsourced?

01 It is an alienable task with an understandable and measurable result;
02 In-house specialists stay focused on quality and solution principles, not on the optimization problem;
03 The validation process for the result is unified and portable to any area;
04 Expertise accumulates with those who have worked on the problem a lot;
05 Financial planning, research transparency, and control of software deliveries;
06 It is easy to calculate the economics of the proposed solution.

7 of 18

Solution


8 of 18

Our Vision and Proposition

The Compression team delivers the final result for a Deep Learning model compression task that meets the quality and computational limits.

Custom Compression Research. Benefits of custom:

01 A fixed team of experts works on the task;
02 Any architecture can be compressed;
03 Uniqueness of the final solution.

Platform with pre-ready scripts. Benefits of readiness:

01 The most popular architectures are already covered;
02 Timing and costs are lower in comparison with custom research;
03 Understandable limitations and final quality.

Our Technologies:

● Neural Architecture Search;
● Quantization;
● Pruning;
● Knowledge Distillation;
● Fast Model Assessment Methods;
● Effective Training Methods.

9 of 18

Compression: Model Optimization with Neural Architecture Search

NAS (Neural Architecture Search): methods for the automatic design of architectures.

NAS Methods:

01 One-shot and weight-sharing;
02 Differentiable Architecture Search;
03 Hyper-Networks;
04 BigNAS methods;
05 etc…

Outcomes:

● Modern NAS methods reach SotA results; some do not require fine-tuning or retraining;
● 40,000 GPU-hours -> 4-200 GPU-hours;
● Evaluate many architectures and choose the best one according to the restrictions;
● One training run yields many ready-made models;
● Model ranking is an important step in one-shot NAS.
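To make the one-shot, weight-sharing idea concrete, here is a minimal sketch in PyTorch: one oversized convolution whose weight slices act as several candidate widths, so a single training run covers many sub-networks. The layer, candidate widths, and placeholder loss are illustrative assumptions, not MIL Team's internal framework.

```python
# Minimal weight-sharing (one-shot NAS) sketch: candidate sub-networks
# differ only in channel width and share slices of one weight tensor.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableConv(nn.Module):
    """One convolution whose weight tensor is shared by several width choices."""
    def __init__(self, max_in: int, max_out: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in, k, k) * 0.01)

    def forward(self, x: torch.Tensor, out_w: int) -> torch.Tensor:
        in_w = x.shape[1]
        # Slice the shared tensor: every sub-network reuses the same weights.
        return F.conv2d(x, self.weight[:out_w, :in_w], padding=1)

widths = [16, 32, 64]                      # hypothetical candidate widths
conv = SlimmableConv(max_in=3, max_out=max(widths))
opt = torch.optim.SGD(conv.parameters(), lr=0.1)

x = torch.randn(2, 3, 8, 8)
for step in range(3):
    out_w = random.choice(widths)          # sample one sub-network per step
    y = conv(x, out_w)
    loss = y.pow(2).mean()                 # placeholder loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, each width is a ready-made model; ranking them on
# validation data is the final step mentioned in the outcomes above.
```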

10 of 18

Compression: Model Optimization with Quantization

Quantization is a powerful method for specific hardware (low-bit processors).

Quantization Methods:

01 Post-training Quantization;
02 Quantization-aware Training;
03 Additive Powers-of-Two Quantization;
04 Learned Step-size Quantization;
05 etc…

Outcomes:

● A significant reduction in size & faster inference;
● Portability to mobile devices & low-bit processors;
● The latest methods can avoid accuracy loss on certain tasks;
● Large networks with more aggressive quantization are often better than small networks with soft quantization.
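As a concrete illustration, below is a minimal uniform quantize-dequantize (fake quantization) sketch, the basic building block behind both post-training and aware-training schemes. The bit widths and tensors are illustrative assumptions.

```python
# Minimal uniform fake-quantization sketch: map values onto a low-bit
# grid and back, so the quantization error becomes observable in float.
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Map x onto a uniform grid of 2**bits levels, then back to floats."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale) + qmin
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale  # dequantized values, error included

w = torch.randn(64, 64)
for bits in (8, 4, 2):
    err = (w - fake_quantize(w, bits)).abs().mean().item()
    print(f"{bits}-bit mean abs error: {err:.4f}")  # error grows as bits shrink
```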

11 of 18

Compression: Model Optimization with Pruning

The idea: zero out certain weights, either grouped (as filters, layers, or blocks) or unstructured.

Pruning Methods:

01 Importance-based Pruning;
02 Weight Clustering;
03 Aware-training Pruning;
04 Structure Search;
05 etc…

Outcomes:

● A solution to the NN overparameterization problem;
● Pruning speeds up a NN if it changes the architecture parameters;
● It can be considered an effective method of tuning NN architecture hyperparameters;
● The more variable the output space, the more variability the network should provide.
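As a concrete illustration, below is a minimal unstructured magnitude-pruning sketch, the simplest importance-based criterion (magnitude pruning is one of the methods in our internal framework). The layer and sparsity level are illustrative assumptions.

```python
# Minimal magnitude pruning: zero out the weights with the smallest
# absolute values, leaving the layer's shape unchanged (unstructured).
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the least magnitude."""
    w = layer.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_(w.abs() > threshold)  # in-place: pruned weights become exactly zero

layer = nn.Linear(128, 128)
magnitude_prune(layer, sparsity=0.7)
density = layer.weight.data.ne(0).float().mean().item()
print(f"remaining weights: {density:.0%}")  # roughly 30% survive
```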

12 of 18

Compression: Model Optimization with Knowledge Distillation

Knowledge distillation transfers "knowledge" from one network to another.

Distillation Methods:

01 The soft-labels method;
02 "Teacher" assistants;
03 Ensembles of teachers and Self-distillation;
04 Convergence in Distillation;
05 etc…

Outcomes:

● Increased accuracy of small networks;
● The possibility of retraining networks in the absence of the initial training data;
● It is used as an auxiliary operation in many tasks;
● It accelerates student network convergence.
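As a concrete illustration, below is a minimal soft-labels distillation sketch: the student matches the teacher's temperature-softened output distribution while also fitting the hard labels. The stand-in models, temperature, and loss weight are illustrative assumptions.

```python
# Minimal soft-labels distillation: KL divergence between temperature-
# softened teacher and student outputs, mixed with the usual hard loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    """Weighted sum of the soft (KL to teacher) and hard (CE to labels) terms."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft term keeps a comparable gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

teacher = nn.Linear(32, 10)   # stand-ins for the large and the small network
student = nn.Linear(32, 10)
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
with torch.no_grad():
    teacher_logits = teacher(x)  # teacher is frozen
loss = distillation_loss(student(x), teacher_logits, y)
loss.backward()                  # gradients flow into the student only
```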

13 of 18

Reasoning


14 of 18

Competitiveness and Uniqueness

Features:

● Internal frameworks: Quantization (from APoT to LSQ), Pruning (from HRank to Magnitude);
● Unified method usability; a transition to end-to-end optimization and on-device deployment.

Experience and Results:

● 10+ projects in Compression;
● Our frameworks outperform the TensorFlow & PyTorch instruments;
● Our own frameworks and a high reuse rate.

Team Results Storyline:

● 2018: first projects and Pruning expertise;
● 2020: several frameworks for fast development;
● Now: experienced experts and a polished methodology;
● Future: a Compression product for B2B clients and new topics for effective compression.

Challenges Solved:

● Minimal loss of model quality with significant compression for complex architectures: from Transformer-based to img2img;
● Fast choice of an applicable approach given the problem statement's limitations.

Examples:

● Low-bit quantisation of Transformer-based ASR models for cost reduction via low-bit CPU usage;
● Quantization and Distillation of real-time audio denoising models for porting complex models to device;
● Pruning of the PyNet and HRNet architectures for speeding up calculations;
● etc…

15 of 18

Benefits from Compression

Benefits of learning improvement (Model Development Optimization):

01 Automated model selection for uniform tasks reduces research costs by 60%;
02 Predictable results of selecting optimal architectures, with a timeline 3x faster than the classic one;
03 Predictable results of selecting optimal architectures at 93%+ of the quality of the best one.

Benefits of model optimization (Model Optimization):

01 A decrease in the number of operations of up to 5x;
02 Transfer-ready models that meet the stated quality and computational-resource limits;
03 2-bit, 3-bit, or 4-bit Quantization with expected accuracy drops of 7%, 3%, and 1% respectively on complex models;
04 Inference acceleration of up to 10x;
05 Infrastructure cost reduction of up to 70% in inference mode.

16 of 18

Story 1

Project goals:

Select a strategy for quantizing a Transformer-based ASR model, making changes to the network architecture if necessary, so that the quality of the quantized model (the WER metric) is not much worse than the quality of the full-precision model (a quality decrease of a few percent is allowed).

MIL Team's solution:

Take an available SOTA implementation of a Transformer-based ASR model architecture and embed quantization into it. For successful and fair quantization it is necessary, first, to implement quantized versions of all the modules used inside the Torch model; second, to have a convenient tool for replacing the original Torch modules with quantized ones (sketched below); and third, to carry out a series of experiments to select the best quantization strategy.

To build the model, we used:

● the ASR Transformer architecture implemented in the open Fairseq repository, presented in the article "Transformers with convolutional context for ASR";
● the LibriSpeech dataset for training the speech recognition model, consisting of pairs of audio and text files with English speech.
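A minimal sketch of the module-replacement tool mentioned above, assuming PyTorch: walk the model tree and swap every nn.Linear for a quantized counterpart. QuantLinear here is a hypothetical 8-bit placeholder, not MIL Team's production module.

```python
# Minimal module-swap sketch: recursively replace selected Torch modules
# with quantized versions that keep the original weights.
import torch
import torch.nn as nn

class QuantLinear(nn.Linear):
    """nn.Linear with fake-quantized weights (placeholder implementation)."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().max().clamp(min=1e-8) / 127   # symmetric 8-bit grid
        w_q = torch.clamp(torch.round(w / scale), -128, 127) * scale
        return nn.functional.linear(x, w_q, self.bias)

def quantize_modules(model: nn.Module) -> None:
    """Recursively replace every nn.Linear with a QuantLinear copy."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            q = QuantLinear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            q.load_state_dict(child.state_dict())  # keep trained weights
            setattr(model, name, q)
        else:
            quantize_modules(child)                # recurse into submodules

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
quantize_modules(model)
print(model)  # both Linear layers are now QuantLinear
```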

17 of 18

Story 2

Project goals:

Develop and implement methods for fast and accurate comparison of neural network architectures. The implemented methods should significantly outperform the direct method of fully training each network in speed, with only a slight drop in ranking quality. In particular, a tenfold acceleration of the architecture comparison should be obtained, with a loss of ranking quality of no more than 10% (in terms of the ranking metric, Kendall's Tau; see the sketch after this story).

MIL Team's solution:

Implement, analyze, and improve various methods for assessing the quality of architectures. One of the implemented methods is the creation of a super-network based on the space of the considered architectures. This approach allows training all models from the space in one-shot mode and can significantly save time and computational resources. In addition, techniques such as evaluating model quality on less training data, stopping training early, and using classifiers and regression models were considered.

To build the model, we used:

● the open datasets ImageNet and CIFAR-10;
● a dataset of architectures annotated with the quality of their full training on ImageNet.
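A minimal sketch of the ranking-quality check used as the acceptance criterion above: compare a fast assessment method's scores against full-training accuracies with Kendall's Tau (scipy.stats.kendalltau). The score lists are made-up illustrative numbers.

```python
# Minimal ranking-quality check: Kendall's Tau between ground-truth
# full-training accuracies and the cheap proxy scores of a fast method.
from scipy.stats import kendalltau

# Hypothetical accuracies from full training (ground truth) and the
# proxy scores produced by a fast assessment method.
full_training_acc = [71.2, 74.8, 69.5, 76.1, 73.0, 75.4]
proxy_scores      = [0.62, 0.71, 0.55, 0.74, 0.70, 0.69]

tau, p_value = kendalltau(full_training_acc, proxy_scores)
print(f"Kendall's Tau: {tau:.3f} (p={p_value:.3f})")
# Tau = 1 means identical rankings; the project target was to keep the
# drop in ranking quality within 10% while comparing architectures ~10x faster.
```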

18 of 18

MIL Team Cooperation

E-mail: alex.goncharov@mil-team.com
Phone: +7 (915) 116 22 72
Site: mil-team.com