1 of 102


劉晉良

Jinn-Liang Liu

清華大學動力機械工程學系

Department of Power Mechanical Engineering, National Tsing Hua University, Taiwan

Dec 17, 2024 - Dec 8, 2025

Attention, Transformer, LLM

2 of 102

2024/12

203627, 2025/11

3 of 102

No Recurrence, No Convolution,

Global Dependency, Highly Parallel

4 of 102

Tokens, Embedding, Position Encoding

"dog bites man" vs "man bites dog"

N Tokens

D Features (Vectors)

5 of 102

Vector dimensions can range from a few hundred (e.g., 384 for all-MiniLM-L6-v2) to several thousand (e.g., 3072 for text-embedding-3-large)
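
A minimal sketch of how a sentence becomes the N×D matrix fed to a Transformer: token IDs index an embedding table, and the sinusoidal positional encoding of Vaswani et al. (2017) is added so that "dog bites man" and "man bites dog" give different inputs. The toy vocabulary, D = 8, and the random embedding table are illustrative assumptions, not values from any real model.

```python
import numpy as np

# Toy vocabulary and sizes (illustrative only)
vocab = {"man": 0, "bites": 1, "dog": 2}
N, D = 3, 8                                  # N tokens, D features per token

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), D))   # one D-vector per token (normally learned)

def positional_encoding(n_positions, d_model):
    """Sinusoidal PE: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = ["dog", "bites", "man"]             # vs. "man bites dog": same embeddings, different positions
ids = [vocab[t] for t in tokens]
X = embedding_table[ids] + positional_encoding(N, D)   # (N, D) input matrix for the Transformer
print(X.shape)                               # (3, 8)
```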

Old

6 of 102

Transformer

New

7 of 102

8 of 102

Retrieval, Augmented, Generation

Vector Indexing for DB

Hierarchical Navigable Small World (HNSW)

NN Layers

IVF (Inverted File)

Clustering
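
The sketch below illustrates the IVF idea in plain NumPy: vectors are clustered (here with a tiny k-means), each cluster keeps an inverted list of its members, and a query scans only the few nearest clusters. Sizes, cluster counts, and function names are made up for illustration; this is not the API of any particular vector database, and HNSW (which instead builds layered nearest-neighbor graphs) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))        # 1000 toy embeddings, 64-D

def kmeans(x, k, iters=10):
    """Very small Lloyd's k-means, enough to partition the vectors."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([x[assign == j].mean(0) if np.any(assign == j) else centroids[j]
                              for j in range(k)])
    return centroids, assign

k = 16
centroids, assign = kmeans(vectors, k)
inverted_lists = {j: np.where(assign == j)[0] for j in range(k)}   # the "inverted file"

def search(query, n_probe=2, top_k=5):
    # 1) pick the n_probe closest clusters, 2) scan only their inverted lists
    nearest_clusters = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    candidates = np.concatenate([inverted_lists[j] for j in nearest_clusters])
    dists = ((vectors[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:top_k]]

print(search(rng.normal(size=64)))           # indices of the approximate nearest neighbors
```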

9 of 102

Tokens, Embedding, Position Encoding

10 of 102

Transformer: Model Architecture

11 of 102

Attention

12 of 102

Attention

13 of 102

Attention

14 of 102

Attention

15 of 102

Self-Attention

Symmetric

Asymmetric

(caulking iron vs tool)
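
As a concrete reference, here is single-head scaled dot-product self-attention, Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V, in NumPy. The shapes and random weights are toy assumptions; the point is that separate Wq and Wk projections make the N×N attention matrix asymmetric, so token i can attend strongly to token j without the reverse holding, which is presumably the point of the caulking-iron vs. tool example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d_k = 4, 8, 8                          # N tokens with D features (toy sizes)
X = rng.normal(size=(N, D))                  # token embeddings (plus positions)
Wq, Wk, Wv = (rng.normal(size=(D, d_k)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d_k))          # N x N attention matrix; rows sum to 1, A is not symmetric
out = A @ V                                  # each output row is a weighted mix of the value vectors
print(A.shape, out.shape)                    # (4, 4) (4, 8)
```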

16 of 102

Perceptron (Feed Forward Network)

Mark I Perceptron (Wikipedia): First AI Machine

"Devices of this sort are expected ultimately to be capable of concept formation, language translation, collation of military intelligence, and the solution of problems through inductive logic." Rosenblatt, 1957

17 of 102

Transformer Block


18 of 102

Multi-Head Self-Attention

(Figure: the Q, K, V inputs to scaled dot-product attention, with self-attention, masked self-attention, and cross-attention variants)
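
To make the figure's labels concrete: multi-head attention runs h scaled dot-product attentions in parallel on learned projections of Q, K, and V and concatenates the results (Vaswani et al., 2017):

```latex
\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right)
```

In self-attention Q, K, and V all come from the same sequence; the decoder's masked self-attention additionally blocks attention to future positions; in cross-attention Q comes from the decoder while K and V come from the encoder output.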

19 of 102

Multi-Head Self-Attention

20 of 102

(Figure: self-attention in the encoder, masked self-attention and cross-attention in the decoder)

21 of 102

2015 Luong et al.

2017 Vaswani et al.

Attention Matrix
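
For comparison, the two formulations side by side (a summary of the cited papers, with h_t a decoder state and h̄_s an encoder state):

```latex
% Luong et al. (2015): alignment score between decoder state h_t and encoder state \bar{h}_s
\mathrm{score}(h_t,\bar{h}_s) =
\begin{cases}
h_t^{\top}\bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top}\tanh\!\left(W_a[h_t;\bar{h}_s]\right) & \text{(concat)}
\end{cases}

% Vaswani et al. (2017): the whole attention matrix at once
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```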

22 of 102

Performance, Complexity
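
The usual comparison here is Table 1 of Vaswani et al. (2017), with n the sequence length, d the representation dimension, and k the convolution kernel width:

```latex
\begin{array}{lccc}
\text{Layer type} & \text{Complexity per layer} & \text{Sequential ops} & \text{Max path length} \\
\text{Self-attention} & O(n^{2} \cdot d) & O(1) & O(1) \\
\text{Recurrent} & O(n \cdot d^{2}) & O(n) & O(n) \\
\text{Convolutional} & O(k \cdot n \cdot d^{2}) & O(1) & O(\log_{k} n)
\end{array}
```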

23 of 102

Understanding

24 of 102

The GPT-2 model contains N Transformer decoder blocks. Each block includes a multi-head masked attention layer, a multi-layer perceptron (feed-forward) layer, normalization, and dropout layers. The residual connection (the line branching to the addition operator) lets the block build on the previous block's output. The multi-head masked attention layer (right panel) computes attention scores from Q, K, and V vectors to capture sequential relationships in the input sequence. Transformers are typically pre-trained on enormous corpora in a self-supervised manner before being fine-tuned.

No Encoder!
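
A compact NumPy sketch of the data flow in one such GPT-2-style decoder block follows. It keeps the pieces named above (layer normalization, masked multi-head self-attention, a two-layer MLP, residual connections) but omits dropout, uses ReLU instead of GPT-2's GELU, and uses random weights, so it illustrates the architecture rather than reproducing a trained GPT-2.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def masked_multi_head_attention(x, n_heads, Wqkv, Wo):
    N, D = x.shape
    d_h = D // n_heads
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)              # joint QKV projection, each (N, D)
    causal_mask = np.triu(np.full((N, N), -1e9), k=1)     # block attention to future tokens
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_h) + causal_mask
        heads.append(softmax(scores) @ v[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo            # concat heads, project back to D

def mlp(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2           # two-layer perceptron (ReLU here for simplicity)

def gpt2_block(x, p, n_heads=4):
    x = x + masked_multi_head_attention(layer_norm(x), n_heads, p["Wqkv"], p["Wo"])   # residual 1
    x = x + mlp(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])                    # residual 2
    return x

N, D = 5, 16
params = {"Wqkv": rng.normal(size=(D, 3 * D)) * 0.02, "Wo": rng.normal(size=(D, D)) * 0.02,
          "W1": rng.normal(size=(D, 4 * D)) * 0.02, "b1": np.zeros(4 * D),
          "W2": rng.normal(size=(4 * D, D)) * 0.02, "b2": np.zeros(D)}
print(gpt2_block(rng.normal(size=(N, D)), params).shape)   # (5, 16): same shape in, same shape out
```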

51 of 102

00:00 - Introduction to RAG

00:24 - Why Traditional Search Methods Don't Work

00:55 - The RAG Method Explained

01:54 - Step 1: Retrieval Process

02:25 - Step 2: Augmentation Explained

03:15 - Step 3: Generation Process

03:54 - Strategies for RAG Calibration

05:01 - Practical Lab Demo Introduction

05:27 - Demo - Set up Development Environment

06:10 - Demo - Initialize Vector Database

06:29 - Demo - Chunking Strategy and Embedding

07:19 - Demo - Feed AI Brain

07:50 - Demo - Semantic Search

08:16 - Demo - Launch a Simple Web Interface

09:43 - Conclusion & Free Lab Access
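
A toy end-to-end version of the pipeline in these chapters, written in plain NumPy so it stays self-contained: chunking, embedding, retrieval, augmentation, and generation. The embed() and generate() functions are hypothetical stand-ins; a real system would call an embedding model and an LLM, and would use a proper vector database instead of a NumPy array.

```python
import numpy as np

documents = ["HNSW builds layered proximity graphs for fast nearest-neighbor search.",
             "IVF clusters vectors and scans only the closest inverted lists."]

def chunk(text, size=8):                       # chunking strategy: fixed window of words
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text, dim=64):                       # placeholder embedding: hashed bag-of-words
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

chunks = [c for d in documents for c in chunk(d)]          # feed the "AI brain"
index = np.stack([embed(c) for c in chunks])               # toy vector database

def retrieve(query, top_k=2):                  # step 1: retrieval (cosine similarity)
    sims = index @ embed(query)
    return [chunks[i] for i in np.argsort(-sims)[:top_k]]

def generate(prompt):                          # step 3: generation (an LLM call would go here)
    return f"[LLM answer conditioned on]\n{prompt}"

query = "How does IVF indexing work?"
context = "\n".join(retrieve(query))           # step 2: augmentation of the prompt
print(generate(f"Context:\n{context}\n\nQuestion: {query}"))
```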

97 of 102

Help Desk

98 of 102

Processing

99 of 102

Expertise

Complexity

Computational

Maintenance

Catastrophic Forgetting
