Multi-Modal Large Language Models

Sreyan Ghosh

About Me

Current: 2nd Year C.S. Ph.D. Student

Advisors: Dr. Dinesh Manocha and Dr. Ramani Duraiswami

Interests: Resource-Efficient Deep Learning (applied to SLP)

Past -

https://sreyan88.github.io/

What are Large Language Models?

Large Language Models are simply large neural networks that can act as language models, trained on web-scale data. This scale gives them:

  1. World knowledge.
  2. Reasoning.
  3. Emergent abilities.
  4. Multi-modal reasoning.

(L)LMs come in various types.

  1. Decoder-only LMs (e.g., GPT)

  2. Encoder-decoder LMs (e.g., T5, BART)

  3. Encoder-only LMs (e.g., BERT)
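The three families differ mainly in which tokens are allowed to attend to which. A minimal illustrative sketch in plain Python (not any particular model's code) makes the attention patterns concrete:

```python
def causal_mask(n):
    """Decoder-only (e.g., GPT): token i attends only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-only (e.g., BERT): every token attends to every token."""
    return [[True] * n for _ in range(n)]

def encoder_decoder_masks(n_src, n_tgt):
    """Encoder-decoder (e.g., T5): bidirectional over the source, causal
    over the target, plus full cross-attention from target to source."""
    return {
        "encoder_self": bidirectional_mask(n_src),
        "decoder_self": causal_mask(n_tgt),
        "cross": [[True] * n_src for _ in range(n_tgt)],
    }
```

The choice of mask is largely what makes one family good at generation (causal) and another at understanding tasks (bidirectional).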

And Sizes.

Common Architectures

How do they work?

  1. An auto-regressive n-layer Transformer decoder.
  2. Each token is conditioned only on the preceding context.
  3. BPE tokenization.
  4. Pre-trained on raw text as a language model (maximizing the probability of the next word).
  5. Fine-tuned on labeled data (and language modeling). Fine-tuning may include alignment tuning, safety tuning, human-preference tuning, etc.
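The pre-training objective can be illustrated with a toy counting model: for a bigram language model, maximizing the probability of the next word reduces to counting. This is a sketch for intuition only — real LLMs are transformers trained by gradient descent, not counts:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Toy language model: maximum-likelihood estimation of P(next | prev)
    by counting bigrams -- the same 'predict the next word' objective."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # normalize counts into conditional probabilities
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

lm = train_bigram_lm(["the cat sat", "the cat ran"])
# each token is predicted only from its preceding context
print(lm["cat"]["sat"])  # 0.5 -- "cat" is followed by "sat" half the time
```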

ChatGPT was released in November 2022

And achieved a lot!

And it kept getting better.

lmsys.org

Evolution of (Large) Language Models

GPT-4 with vision (GPT-4V) was announced in March 2023

What are Multi-Modal LLMs?

Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information.

Timeline of MM-LLMs

Who does it better?

The community has come a long way.

A generic architecture of an MM-LLM

Each new token is conditioned on the previous context plus the other modality
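A minimal sketch of this generic recipe, with made-up toy dimensions: a modality encoder's features are linearly projected into the LLM's embedding space and prepended to the text embeddings, so the decoder sees both. Function names and shapes here are illustrative, not a real model's API:

```python
def project(features, W):
    """Map one encoder feature vector (len d_enc) into the LLM embedding
    space (len d_llm) with a learned linear projection W (d_enc x d_llm)."""
    d_llm = len(W[0])
    return [sum(f * W[i][j] for i, f in enumerate(features))
            for j in range(d_llm)]

def build_mm_input(image_features, text_embeddings, W):
    """Generic MM-LLM input: projected image tokens are prepended to the
    text embeddings; the decoder then conditions every text token on the
    preceding context *and* the other modality."""
    image_tokens = [project(f, W) for f in image_features]
    return image_tokens + text_embeddings

# toy shapes: 2 image patches (d_enc=3) projected to d_llm=2, plus 2 text tokens
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
seq = build_mm_input([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                     [[0.1, 0.2], [0.3, 0.4]], W)
print(len(seq))  # 4 tokens total: 2 image + 2 text
```

In many open models only the projection (and sometimes the LLM) is trained, while the modality encoder stays frozen.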

Over time, open-source models kept getting better!

And applied to other domains!

Different Use-Cases of MM-LLMs

MM-LLMs for more modalities (Audio)

MM-LLMs for more modalities (Video)

Other modalities are catching up!

Graph LLMs

Time-series LLMs

The Dark Side.

(MM)LLMs Hallucinate Often.

Needle in the Haystack Problem.

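The probe behind this problem is easy to sketch: plant one "needle" fact at a chosen depth inside a long filler context and ask the model to retrieve it; retrieval accuracy often drops for needles buried deep in very long contexts. Function and parameter names here are illustrative, not a standard API:

```python
def build_haystack(needle, filler, n_sentences, depth):
    """Needle-in-a-haystack probe: bury one key fact (the 'needle') at a
    relative depth (0.0 = start, 1.0 = end) inside long filler text."""
    haystack = [filler] * n_sentences
    haystack.insert(int(depth * n_sentences), needle)
    return " ".join(haystack)

prompt = build_haystack(
    needle="The secret code is 4217.",
    filler="The sky was a pleasant shade of blue that day.",
    n_sentences=1000,
    depth=0.5,
) + "\nQuestion: What is the secret code?"
```

Sweeping `depth` and context length produces the familiar retrieval-accuracy heatmaps used to compare long-context models.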
Evaluation of LLMs is hard!

https://arxiv.org/pdf/2306.05685

Prior to 2022, NLP research was mostly focused on discriminative tasks.

  1. LLM research widely employs GPT-4 as a judge.

     a) Human evaluation is expensive.

     b) Does GPT-4 know everything?

     c) Non-determinism is always there.

     d) What does GPT-4 itself use?
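A minimal sketch of point 1, loosely following the pairwise-comparison setup from the linked paper; the prompt wording below is an assumption for illustration, not the paper's exact template:

```python
def judge_prompt(question, answer_a, answer_b):
    """Pairwise LLM-as-a-judge prompt: the judge model (e.g., GPT-4) is
    asked to compare two answers and emit a constrained verdict."""
    return (
        "You are an impartial judge. Compare the two answers to the "
        "question below and reply with exactly 'A', 'B', or 'tie'.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )
```

In practice, evaluations typically run the judge with both answer orders and average, since judge models exhibit position bias on top of their non-determinism.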

Holistic alignment is also hard!

Alignment to human preferences spans safety alignment, factual alignment, engagement alignment … and the list goes on.

Lack of good-quality training data!

Getting harder for academia!

Conclusion

  1. (MM-)LLMs are great, but very difficult to get right.
  2. More data and compute always win.
  3. Lots of problems still to work on.
  4. LLMs have yet to disrupt other fields.
