Lecture 2:
You Only Look Once: Unified, Real-Time Object Detection
https://arxiv.org/abs/1506.02640
Fully Convolutional Networks for Semantic Segmentation
https://arxiv.org/abs/1411.4038
TSM: Temporal Shift Module for Efficient Video Understanding
https://arxiv.org/abs/1811.08383

Lecture 3:
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Language Models are Unsupervised Multitask Learners
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971

Lecture 4:
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
https://arxiv.org/abs/1510.00149
Learning Structured Sparsity in Deep Neural Networks
https://arxiv.org/abs/1608.03665
Rethinking the Value of Network Pruning
https://arxiv.org/abs/1810.05270

Lecture 5:
Trained Ternary Quantization
https://openreview.net/pdf?id=S1_pAu9xl
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
https://arxiv.org/abs/1712.05877
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
https://openreview.net/pdf?id=HyQJ-mclg

Lecture 6:
Learning from Multiple Teacher Networks
https://arxiv.org/abs/2103.04062
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
MnasNet: Platform-Aware Neural Architecture Search for Mobile
https://arxiv.org/abs/1807.11626

Lecture 7:
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
A Simple and Effective Pruning Approach for Large Language Models
https://arxiv.org/abs/2306.11695
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
https://arxiv.org/pdf/2404.00456

Lecture 8:
Neural Gradients Are Near-Lognormal: Improved Quantized and Sparse Training
https://arxiv.org/abs/2006.08173
LoRA: Low-Rank Adaptation of Large Language Models
https://arxiv.org/abs/2106.09685
COAT: Compressing Optimizer States and Activations for Memory-Efficient FP8 Training
https://arxiv.org/abs/2410.19313

Lecture 9:
Federated Optimization in Heterogeneous Networks
https://arxiv.org/abs/1812.06127
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
https://arxiv.org/abs/1705.07878
MoDNN: Local Distributed Mobile Computing System for Deep Neural Network
https://ieeexplore.ieee.org/document/7927211

Lecture 10:
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
https://arxiv.org/abs/2404.18911
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
https://openreview.net/pdf?id=PEpbUobfJv
Efficient Memory Management for Large Language Model Serving with PagedAttention
https://arxiv.org/abs/2309.06180

Lecture 11:
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
https://ieeexplore.ieee.org/abstract/document/8192478
DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning
https://dl.acm.org/doi/pdf/10.1145/2654822.2541967
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
https://ieeexplore.ieee.org/document/7551378

Lecture 12:
GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference
https://arxiv.org/abs/2005.03842
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
https://ieeexplore.ieee.org/document/10609625
FlashDecoding++: Faster Large Language Model Inference on GPUs
https://arxiv.org/abs/2311.01282