Lecture 2:
You Only Look Once: Unified, Real-Time Object Detection
https://arxiv.org/abs/1506.02640
Fully Convolutional Networks for Semantic Segmentation
https://arxiv.org/abs/1411.4038
TSM: Temporal Shift Module for Efficient Video Understanding
https://arxiv.org/abs/1811.08383

Lecture 3:
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Language Models are Unsupervised Multitask Learners
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971

Lecture 4:
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
https://arxiv.org/abs/1510.00149
Learning Structured Sparsity in Deep Neural Networks
https://arxiv.org/abs/1608.03665
Rethinking the Value of Network Pruning
https://arxiv.org/abs/1810.05270

Lecture 5:
Trained Ternary Quantization
https://openreview.net/pdf?id=S1_pAu9xl
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
https://openreview.net/pdf?id=HyQJ-mclg

Lecture 6:
Learning from Multiple Teacher Networks
https://arxiv.org/abs/2103.04062
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
MnasNet: Platform-Aware Neural Architecture Search for Mobile
https://arxiv.org/abs/1807.11626

Lecture 7:
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
A Simple and Effective Pruning Approach for Large Language Models
https://arxiv.org/abs/2306.11695
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
https://arxiv.org/pdf/2404.00456

Lecture 8:
Neural Gradients Are Near-Lognormal: Improved Quantized and Sparse Training
https://arxiv.org/abs/2006.08173
LoRA: Low-Rank Adaptation of Large Language Models
https://arxiv.org/abs/2106.09685
COAT: Compressing Optimizer States and Activations for Memory-Efficient FP8 Training
https://arxiv.org/abs/2410.19313

Lecture 9:
Federated Optimization in Heterogeneous Networks
https://arxiv.org/abs/1812.06127
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
https://arxiv.org/abs/1705.07878
MoDNN: Local Distributed Mobile Computing System for Deep Neural Network
https://ieeexplore.ieee.org/document/7927211

Lecture 10:
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
https://arxiv.org/abs/2404.18911
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
https://openreview.net/pdf?id=PEpbUobfJv
Efficient Memory Management for Large Language Model Serving with PagedAttention
https://arxiv.org/abs/2309.06180

Lecture 11:
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
https://ieeexplore.ieee.org/abstract/document/8192478
DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning
https://dl.acm.org/doi/pdf/10.1145/2654822.2541967
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
https://ieeexplore.ieee.org/document/7551378

Lecture 12:
GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference
https://arxiv.org/abs/2005.03842
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
https://ieeexplore.ieee.org/document/10609625
FlashDecoding++: Faster Large Language Model Inference on GPUs
https://arxiv.org/abs/2311.01282