Multi-Modal Large Language Models

Sreyan Ghosh

About Me

Current: 2nd Year C.S. Ph.D. Student

Advisors: Dr. Dinesh Manocha and Dr. Ramani Duraiswami

Interests: Resource-Efficient Deep Learning (applied to SLP)

Past -

https://sreyan88.github.io/

What are Large Language Models?

Large Language Models are simply large neural networks that can act as language models, trained on web-scale data. This scale gives them:

  1. World knowledge.
  2. Reasoning.
  3. Emergent abilities.
  4. Multi-modal reasoning.

(L)LMs come in various types.

  1. Decoder-only LMs (e.g., GPT)

  2. Encoder-decoder LMs (e.g., T5, BART)

  3. Encoder-only LMs (e.g., BERT)
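The three families differ mainly in which tokens are allowed to attend to which. A minimal illustrative sketch in plain Python (not any particular model's code) makes the attention patterns concrete:

```python
def causal_mask(n):
    """Decoder-only (e.g., GPT): token i attends only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-only (e.g., BERT): every token attends to every token."""
    return [[True] * n for _ in range(n)]

def encoder_decoder_masks(n_src, n_tgt):
    """Encoder-decoder (e.g., T5): bidirectional over the source, causal
    over the target, plus full cross-attention from target to source."""
    return {
        "encoder_self": bidirectional_mask(n_src),
        "decoder_self": causal_mask(n_tgt),
        "cross": [[True] * n_src for _ in range(n_tgt)],
    }
```

The choice of mask is largely what makes one family good at generation (causal) and another at understanding tasks (bidirectional).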

And Sizes.

Common Architectures

How do they work?

  1. An auto-regressive n-layer Transformer decoder.
  2. Each token is conditioned only on the preceding context.
  3. BPE tokenization.
  4. Pre-trained on raw text as a language model (maximizing the probability of the next word).
  5. Fine-tuned on labeled data (and language modeling). Fine-tuning may include alignment tuning, safety tuning, human-preference tuning, etc.
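The pre-training objective can be illustrated with a toy counting model: for a bigram language model, maximizing the probability of the next word reduces to counting. This is a sketch for intuition only — real LLMs are transformers trained by gradient descent, not counts:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Toy language model: maximum-likelihood estimation of P(next | prev)
    by counting bigrams -- the same 'predict the next word' objective."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # normalize counts into conditional probabilities
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

lm = train_bigram_lm(["the cat sat", "the cat ran"])
# each token is predicted only from its preceding context
print(lm["cat"]["sat"])  # 0.5 -- "cat" is followed by "sat" half the time
```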

ChatGPT was released in November 2022

And achieved a lot!

And it kept getting better.

lmsys.org

Evolution of (Large) Language Models

GPT-4 with vision (GPT-4V) was announced in March 2023

What are Multi-Modal LLMs?

Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information.

Timeline of MM-LLMs

Who does it better?

The community has come a long way.

A generic architecture of an MM-LLM

Each new token is conditioned on the previous context plus the other modality
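A minimal sketch of this generic recipe, with made-up toy dimensions: a modality encoder's features are linearly projected into the LLM's embedding space and prepended to the text embeddings, so the decoder sees both. Function names and shapes here are illustrative, not a real model's API:

```python
def project(features, W):
    """Map one encoder feature vector (len d_enc) into the LLM embedding
    space (len d_llm) with a learned linear projection W (d_enc x d_llm)."""
    d_llm = len(W[0])
    return [sum(f * W[i][j] for i, f in enumerate(features))
            for j in range(d_llm)]

def build_mm_input(image_features, text_embeddings, W):
    """Generic MM-LLM input: projected image tokens are prepended to the
    text embeddings; the decoder then conditions every text token on the
    preceding context *and* the other modality."""
    image_tokens = [project(f, W) for f in image_features]
    return image_tokens + text_embeddings

# toy shapes: 2 image patches (d_enc=3) projected to d_llm=2, plus 2 text tokens
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
seq = build_mm_input([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                     [[0.1, 0.2], [0.3, 0.4]], W)
print(len(seq))  # 4 tokens total: 2 image + 2 text
```

In many open models only the projection (and sometimes the LLM) is trained, while the modality encoder stays frozen.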

Over time, open-source models kept getting better!

And applied to other domains!

Different Use-Cases of MM-LLMs

MM-LLMs for more modalities (Audio)

MM-LLMs for more modalities (Video)

Other modalities are catching up!

Graph LLMs

Time-series LLMs

The Dark Side.

(MM)LLMs Hallucinate Often.

Needle in the Haystack Problem.

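The probe behind this problem is easy to sketch: plant one "needle" fact at a chosen depth inside a long filler context and ask the model to retrieve it; retrieval accuracy often drops for needles buried deep in very long contexts. Function and parameter names here are illustrative, not a standard API:

```python
def build_haystack(needle, filler, n_sentences, depth):
    """Needle-in-a-haystack probe: bury one key fact (the 'needle') at a
    relative depth (0.0 = start, 1.0 = end) inside long filler text."""
    haystack = [filler] * n_sentences
    haystack.insert(int(depth * n_sentences), needle)
    return " ".join(haystack)

prompt = build_haystack(
    needle="The secret code is 4217.",
    filler="The sky was a pleasant shade of blue that day.",
    n_sentences=1000,
    depth=0.5,
) + "\nQuestion: What is the secret code?"
```

Sweeping `depth` and context length produces the familiar retrieval-accuracy heatmaps used to compare long-context models.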
Evaluation of LLMs is hard!

https://arxiv.org/pdf/2306.05685

Prior to 2022, NLP research was mostly focused on discriminative tasks.

  1. LLM research widely employs GPT-4 as a judge.

     a) Human evaluation is expensive.

     b) Does GPT-4 know everything?

     c) Non-determinism is always there.

     d) What does GPT-4 itself use?
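A minimal sketch of point 1, loosely following the pairwise-comparison setup from the linked paper; the prompt wording below is an assumption for illustration, not the paper's exact template:

```python
def judge_prompt(question, answer_a, answer_b):
    """Pairwise LLM-as-a-judge prompt: the judge model (e.g., GPT-4) is
    asked to compare two answers and emit a constrained verdict."""
    return (
        "You are an impartial judge. Compare the two answers to the "
        "question below and reply with exactly 'A', 'B', or 'tie'.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )
```

In practice, evaluations typically run the judge with both answer orders and average, since judge models exhibit position bias on top of their non-determinism.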

Holistic alignment is also hard!

Alignment to human preferences spans safety alignment, factual alignment, engagement alignment … and the list goes on.

Lack of good-quality training data!

Getting harder for academia!

Conclusion

  1. (MM-)LLMs are great, but very difficult to get right.
  2. More data and compute always win.
  3. Lots of problems still to work on.
  4. LLMs have yet to disrupt other fields.
