1 of 43

Recurrent networks, sequences, CTC, attention.

KNN - Convolutional Neural Networks

Michal Hradiš - Brno University of Technology

2 of 43

Sequence processing

  • Text
  • Sound
  • Image
  • Video
  • Documents

3 of 43

Classification

  • sentiment
  • language

4 of 43

Tasks - element classification (segmentation)

5 of 43

Sequence generation

  • Chatbots
  • Document question answering
  • Image description
  • Speech to text
  • Text to speech
  • Translation
  • What is the probability that world population will reach 12 billion before 2100?

The probability is 6 %.

6 of 43

How to work with sequences

  • Convolutions
  • Recurrent networks
  • Attention
  • Graph networks

7 of 43

Sequence processing

8 of 43

Recurrent layers - start with a single fully connected layer

Input feature vector

Output activation vector

f(x) = sigmoid(Wx + b)
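As an illustration (not from the slides), such a layer could be written, e.g., in PyTorch; the 64/32 dimensions are made up:

```python
import torch
import torch.nn as nn

# A single fully connected layer: f(x) = sigmoid(Wx + b).
layer = nn.Linear(in_features=64, out_features=32)   # holds W and b
x = torch.randn(8, 64)                                # batch of 8 input feature vectors
y = torch.sigmoid(layer(x))                           # output activation vectors, shape (8, 32)
```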

9 of 43

Sequence processing - communication

10 of 43

Sequence processing - communication

Recurrent

Convolution

Attention

Graph networks

11 of 43

Recurrent layers

12 of 43

Recurrent layers

Input vector sequence

13 of 43

Recurrent layers

Input vector sequence

Input state vector

Output state vector

14 of 43

Recurrent layers

Input vector sequence

Start state vector

Final state vector

Output vector sequence

15 of 43

Vanilla RNN

Christopher Olah: Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Fully connected layer: f(x) = sigmoid(Wx + b)

RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
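A minimal sketch of a vanilla RNN cell following this standard formulation; the framework (PyTorch), dimensions and sequence length are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """h_t = tanh(W_h h_{t-1} + W_x x_t + b) - a fully connected layer with a recurrent input."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.x2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.x2h(x_t) + self.h2h(h_prev))

cell = VanillaRNNCell(input_size=64, hidden_size=128)
h = torch.zeros(8, 128)                      # start state vector
for x_t in torch.randn(10, 8, 64):           # input vector sequence (T=10, batch=8)
    h = cell(x_t, h)                         # h ends up as the final state vector
```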

16 of 43

Recurrent layers - training

[Figure: RNN unrolled through time steps t1 … t8 and tend; a loss is computed at every time step.]

Final objective function is a sum of all the loss functions.

Valid computation graph - directed, acyclic.

Gradient backpropagation is “standard” (backpropagation through time).

Optimization algorithm is “standard”.
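A hedged sketch of this training setup - per-step losses combined into one objective, then standard backpropagation; the task (per-step classification), loss and sizes are made up:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
head = nn.Linear(128, 10)                    # per-step classifier (10 classes assumed)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 20, 64)                   # batch of 8 sequences, 20 time steps
targets = torch.randint(0, 10, (8, 20))      # a label for every time step

outputs, _ = rnn(x)                          # network unrolled over all time steps
logits = head(outputs)                       # (8, 20, 10)

# The objective aggregates the per-step losses (mean here; a sum behaves the same for
# backprop). Backpropagation through time is just standard backprop on the unrolled graph.
loss = criterion(logits.reshape(-1, 10), targets.reshape(-1))
loss.backward()
```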

17 of 43

Recurrent layers - long sequences and gradients

[Figure: for long sequences, gradients from late losses (at t8, tend) must propagate back through many time steps.]

18 of 43

GRU - Gated Recurrent Unit

How to get better long distance gradients?

It can use a similar “bypass” principle to the one in residual networks.

Christopher Olah: Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
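For reference, the standard GRU gating equations (as in the cited post) and the ready-made PyTorch module; sizes are assumed:

```python
import torch
import torch.nn as nn

# Gating ("bypass") idea, standard GRU formulation:
#   z_t = sigmoid(W_z [h_{t-1}, x_t])          update gate
#   r_t = sigmoid(W_r [h_{t-1}, x_t])          reset gate
#   h~_t = tanh(W [r_t * h_{t-1}, x_t])        candidate state
#   h_t = (1 - z_t) * h_{t-1} + z_t * h~_t     mix old and new state -> better gradient flow
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(8, 20, 64)
outputs, h_n = gru(x)      # outputs: (8, 20, 128), h_n: final state (1, 8, 128)
```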

19 of 43

LSTM - Long Short Term Memory

Christopher Olah: Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
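The LSTM is likewise available as a ready-made module; a minimal usage sketch with assumed sizes:

```python
import torch
import torch.nn as nn

# The LSTM keeps a separate cell state c_t that is updated almost additively
# (forget gate * c_{t-1} + input gate * candidate), which helps gradients
# survive over long distances.
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(8, 20, 64)
outputs, (h_n, c_n) = lstm(x)   # per-step outputs, final hidden and cell states
```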

20 of 43

Bidirectional recurrent layer

Deep Dive into Bidirectional LSTM

https://www.i2tutorials.com/technology/deep-dive-into-bidirectional-lstm/
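A bidirectional recurrent layer runs one pass left-to-right and one right-to-left and concatenates their outputs; in PyTorch this is a constructor flag (sizes assumed):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 64)
outputs, _ = bilstm(x)          # (8, 20, 256): forward and backward states concatenated
```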

21 of 43

1D convolution (temporal convolution)
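A 1D (temporal) convolution slides a kernel along the time axis; a minimal sketch with assumed channel counts and kernel size:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=5, padding=2)
x = torch.randn(8, 64, 20)      # (batch, channels, time) - note the channel-first layout
y = conv(x)                     # (8, 128, 20): each output sees a 5-step temporal window
```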

22 of 43

Attention?

[Figure: input tokens (John, went, home, through, snow, very, deep, .) each pass through an embedding look-up table (LUT); the per-token outputs carry tags such as Subject, Verb, Object, …]

23 of 43

Attention?

[Figure: the same tagging example; each per-token LUT output is multiplied by an attention weight (e.g. 0.1, 0, 0.2, 0, 0, 0.3, 0.4, 0) and the weighted vectors are summed.]
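The “multiply with a weight and sum” in the figure is the core of dot-product attention; a minimal sketch (shapes assumed, softmax producing the weights):

```python
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    """query: (d,), keys/values: (T, d) -> weighted sum of values."""
    scores = keys @ query / keys.shape[-1] ** 0.5   # one score per position
    weights = F.softmax(scores, dim=0)              # e.g. 0.1, 0.0, 0.2, ... summing to 1
    return weights @ values                         # multiply each value by its weight and sum

T, d = 8, 64
context = attention(torch.randn(d), torch.randn(T, d), torch.randn(T, d))
```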

24 of 43

Attention

25 of 43

Attention Is All You Need (Vaswani et al., NIPS 2017)

Tool: Tensor2tensor

26 of 43

Transformer - positional encoding
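A sketch of the sinusoidal positional encoding from the Transformer paper; sequence length and model size are assumed:

```python
import torch

def sinusoidal_positional_encoding(length, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(length).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(length=20, d_model=64)   # added to the input embeddings
```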

27 of 43

ALiBi

Press et al.: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2022.
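ALiBi drops positional embeddings and instead adds a distance-proportional penalty to the attention scores; a hedged single-head sketch (the paper uses a head-specific geometric sequence of slopes, here a single slope is assumed):

```python
import torch

def alibi_bias(length, slope=0.5):
    """Bias added to causal attention scores: 0 for the current position,
    increasingly negative for keys farther in the past."""
    i = torch.arange(length).unsqueeze(1)       # query positions
    j = torch.arange(length).unsqueeze(0)       # key positions
    return -slope * (i - j).clamp(min=0)        # add to Q K^T / sqrt(d) before the softmax
                                                # (positions j > i are removed by the causal mask)

print(alibi_bias(5))
```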

28 of 43

Rotary position encoding

Jianlin Su et al.: RoFormer: Enhanced Transformer with Rotary Position Embedding
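Rotary position embedding rotates pairs of query/key dimensions by a position-dependent angle before the dot product, so attention scores depend only on relative positions; a hedged sketch of the interleaved-pair variant (base 10000 as in the paper, sizes assumed):

```python
import torch

def rotary_embed(x, base=10000.0):
    """x: (T, d) with even d; rotate each (even, odd) dimension pair by pos * theta_i."""
    T, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)     # per-pair rotation frequencies
    angles = torch.arange(T).float().unsqueeze(1) * theta    # (T, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out                                 # applied to both queries and keys

q = rotary_embed(torch.randn(20, 64))
```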

29 of 43

Mix convolution and recurrent layers

[Figure: architecture combining 1D conv., 1D pooling, recurrent and fully connected layers.]

30 of 43

Mix convolution and recurrent layers

[Figure: architecture combining 1D point conv., attention and fully connected layers.]

31 of 43

OCR - text line transcription

[Figure: CNN backbone (CONV 24, CONV 24, POOL 2×, CONV 48, CONV 48, POOL 2×, CONV 96, CONV 96, POOL 2×, CONV 256) followed by an LSTM, a 1D CONV and a SOFTMAX producing per-position character probabilities, e.g. P(“c”|image, position) = 0.94, P(“n”|image, position) = 0.97.]
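A hedged sketch of such a CRNN-style text-line recognizer; kernel sizes, input height, activations and the placement of the 256-channel stage are assumptions, only the overall conv → pool → LSTM → 1D conv → softmax structure follows the diagram:

```python
import torch
import torch.nn as nn

class TextLineOCR(nn.Module):
    """Conv + pool backbone -> LSTM over the width axis -> per-position char probabilities."""
    def __init__(self, num_chars):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 48, 3, padding=1), nn.ReLU(),
            nn.Conv2d(48, 48, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(48, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(96 * 4, 256, batch_first=True, bidirectional=True)
        self.head = nn.Conv1d(512, num_chars, kernel_size=3, padding=1)

    def forward(self, images):                 # images: (B, 1, 32, W)
        f = self.backbone(images)              # (B, 96, 4, W/8)
        f = f.flatten(1, 2).transpose(1, 2)    # (B, W/8, 96*4) - sequence along the width
        f, _ = self.lstm(f)
        logits = self.head(f.transpose(1, 2))  # (B, num_chars, W/8)
        return logits.log_softmax(dim=1)       # per-position character log-probabilities

model = TextLineOCR(num_chars=80)
out = model(torch.randn(2, 1, 32, 256))        # (2, 80, 32)
```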

32 of 43

OCR - text line transcription with CTC

[Figure: per-position probabilities over the characters A … H plus the blank symbol #; the raw per-frame output reads like “#######fe#aa###dd#...”, which CTC decoding collapses into the transcription.]

33 of 43

CTC - Connectionist temporal classification

Loss function

Labels: sequence of class ids

Net output: sequence of class probability vectors

Idea (see the sketch after this list):

  • try all possible alignments,
  • compute a “cross-entropy loss” for each alignment,
  • average them, weighted by each alignment’s probability,
  • use dynamic programming to make it fast.
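In practice the dynamic-programming loss is available out of the box, e.g. torch.nn.CTCLoss; a minimal usage sketch with assumed shapes (class 0 reserved for the blank “#”):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                          # class 0 is the blank symbol '#'

T, B, C = 50, 4, 80                                # frames, batch, alphabet size incl. blank
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(2)  # class log-probs per frame
targets = torch.randint(1, C, (B, 12))             # label sequences of class ids (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

# Sums over all alignments of the labels onto the T frames via dynamic programming.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```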

34 of 43

Sequence Regression

[Figure: the tokens of a product review (“The product is the best I know.”) each pass through a LUT and a sequence model that outputs a score of 4.2/5.]

35 of 43

Sequence Classification

[Figure: the tokens (John, went, home, through, snow, very, deep, .) pass through LUTs and a sequence model; a single output classifies the whole sequence as English language.]

36 of 43

Sequence Classification (better)

[Figure: the same input tokens pass through LUTs; the per-token outputs are aggregated with a SUM and the pooled vector is classified as English language.]

37 of 43

Classification

[Figure: a stack of 1D conv., 1D pooling, recurrent and fully connected layers classifies a text as Czech language.]

38 of 43

Word tagging / text transcription / sequence segmentation

[Figure: input tokens (John, went, home, through, snow, very, deep, .) pass through LUTs; each position is given a tag such as Subject, Verb, Object, …]

39 of 43

Reading Comprehension / conditioned sequence segmentation

[Figure: a question (“When?”) is passed through LUTs and a question encoder to produce a question embedding; this embedding conditions a sequence model over the tokens (He, killed, him, in, 1985, July, of, .), and every token is labelled out / start / in / end to mark the answer span.]

40 of 43

Learn long distance dependencies?

[Figure: the word-tagging example from before - tokens (John, went, home, through, snow, very, deep, .) through LUTs with per-token tags (Subject, Verb, Object, object, …).]

41 of 43

BERT - Bidirectional Encoder Representations from Transformers

Trained to fill in masked words in a sentence.
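A hedged illustration of masked-word prediction with the Hugging Face transformers fill-mask pipeline; the model name bert-base-uncased and the example sentence are assumptions:

```python
from transformers import pipeline

# Predict the most likely words for the [MASK] position.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("John went home through very deep [MASK]."):
    print(candidate["token_str"], candidate["score"])
```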

42 of 43

BERT

Trained to estimate text continuity.

Are two sentences sequential in the text corpus?

[Figure: two sentences fed into a Transformer encoder; the output predicts sequential / random.]

43 of 43

Multi-modal

Huang et al.: LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking