1 of 27

Vision Language Models & PaliGemma

Nitin Tiwari

Google Developer Expert – Machine Learning
Associate Data Scientist @ Colgate-Palmolive

2 of 27

$whoami

  • Associate Data Scientist at Colgate-Palmolive India Ltd.
  • Google Developer Expert in Machine Learning
  • Contributor to Gemma Cookbook, TensorFlow.js, Google Dev Library
  • Technical Speaker (50+ talks across India and internationally)
  • Writes blogs on AI/ML on Medium

3 of 27

Content

  • Introduction to Vision Language Models
  • Architecture of a typical VLM
  • What is PaliGemma?
  • Examples of PaliGemma
  • PaliGemma 2: What’s new?
  • Resources for you to get started

4 of 27

Vision Language Models

5 of 27

Vision Language Models

  • Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs.

  • They generalize well, offer strong zero-shot capabilities, and handle a variety of tasks such as image recognition, question answering, and document understanding.

  • These models can also capture spatial properties and output bounding boxes or segmentation masks for specific subjects.

6 of 27

Vision Language Models

7 of 27

Architecture of a typical VLM

[Architecture diagram] The input image goes through an Image Encoder (CLIP, SigLIP, etc.) and then a Multimodal Projector; the projected visual tokens are concatenated with the embedded tokens of the Text Prompt and fed to the Text Decoder, which produces the Text Output. The legend distinguishes frozen (not trainable) blocks from blocks that are not frozen; typically the image encoder is kept frozen while the projector and decoder are trained.
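To make the flow concrete, here is a minimal, hypothetical PyTorch sketch of the encoder → projector → decoder pipeline described above; all module names, sizes, and the toy decoder are illustrative assumptions, not PaliGemma's actual implementation.

# Minimal, hypothetical sketch of a VLM forward pass (illustrative only).
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, text_dim=2048, vocab_size=32000):
        super().__init__()
        self.image_encoder = nn.Identity()                 # stand-in for a frozen CLIP/SigLIP encoder
        self.projector = nn.Linear(vision_dim, text_dim)   # multimodal projector (trainable)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerDecoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_features, prompt_ids):
        # Project visual features into the text embedding space.
        visual_tokens = self.projector(self.image_encoder(image_features))
        # Concatenate projected visual tokens with embedded text-prompt tokens.
        sequence = torch.cat([visual_tokens, self.text_embed(prompt_ids)], dim=1)
        # Decode over the combined sequence to get next-token logits (the text output).
        hidden = self.text_decoder(tgt=sequence, memory=sequence)
        return self.lm_head(hidden)

# Example: 196 visual patches + an 8-token prompt.
logits = TinyVLM()(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 8)))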

8 of 27

Example: Contrastive Pre-training
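Since the slide's figure cannot be reproduced here, a rough code sketch of what contrastive pre-training optimizes may help: a CLIP-style softmax loss over the batch similarity matrix, and the sigmoid pairwise variant used by SigLIP (mentioned later in the deck). Shapes, temperature, and bias values are illustrative assumptions.

# Hedged sketch of contrastive pre-training losses on a batch of (image, text) embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Softmax contrastive loss: matching (image, text) pairs lie on the diagonal.
    image_emb, text_emb = F.normalize(image_emb, dim=-1), F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_style_loss(image_emb, text_emb, t=10.0, b=-10.0):
    # Sigmoid pairwise loss: every (image i, text j) pair is treated as a binary example.
    image_emb, text_emb = F.normalize(image_emb, dim=-1), F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0)) - 1   # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()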

9 of 27

PaliGemma

10 of 27

PaliGemma

  • PaliGemma is an open Vision Language Model from Google with just 3B parameters, designed to perform a variety of tasks such as image captioning, object detection, image segmentation, visual question answering, optical character recognition, and more.

  • Inspired by PaLI-3, PaliGemma uses SigLIP (Sigmoid Loss for Language Image Pre-Training) as the image encoder and Gemma 2B as the underlying language model.

[Diagram] SigLIP (image encoder: “I’ll handle the vision capabilities”) + Gemma 2B (text decoder: “I’ll handle the text part”) = PaliGemma

11 of 27

PaliGemma

PaliGemma is available in different variants:

  • Pre-trained (PT) model: the general-purpose base model trained on a variety of tasks.
  • Fine-tuned (FT) model: customized/fine-tuned for a specific downstream task.
  • Mix model: a pre-trained model fine-tuned on a mixture of tasks, suited for general-purpose use out of the box.

PaliGemma comes in 3 resolutions – 224x224, 448x448 and 896x896.
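The variant and resolution show up in the Hugging Face checkpoint id; the naming pattern sketched below is based on the public PaliGemma checkpoints, but verify the exact ids on the Hugging Face Hub before use.

# Checkpoint id pattern (verify on the Hub): google/paligemma-3b-<variant>-<resolution>,
# where <variant> is pt (pre-trained), ft-<task> (fine-tuned), or mix,
# and <resolution> is 224, 448, or 896.
PRETRAINED_448 = "google/paligemma-3b-pt-448"   # general-purpose base model at 448x448
MIX_448 = "google/paligemma-3b-mix-448"         # the checkpoint used later in this deck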

But… when do we need a fine-tuned model?🤔

12 of 27

When do we need to fine-tune?

Prompt: “detect car” → Output: [image with all the cars detected]

Pre-trained models are best suited for general-purpose tasks. As long as the pre-trained model works well, there’s no need to fine-tune it.

13 of 27

When do we need to fine-tune?

Prompt: “detect red car” → Output: [image where the red cars are not detected correctly]

The pre-trained model fails to detect the red cars specifically, so we need to fine-tune it for this task.

14 of 27

Examples of PaliGemma

* All the images/videos used are real and tested without editing. *

15 of 27

Visual Question Answering

Given an image, PaliGemma can answer questions based on it.

Prompt: What is present in this image?

Response: Google Cloud Platform

16 of 27

Zero-shot Object Detection

PaliGemma is capable of zero-shot object detection: it can detect objects from just a text prompt, without being trained on a task-specific dataset.
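As a hedged sketch of how the detection output can be consumed: the model's answer typically looks like "<loc….><loc….><loc….><loc….> car", with four location tokens on a 0–1023 grid (y_min, x_min, y_max, x_max). The parser below assumes that format; double-check it against the official documentation.

# Hedged sketch: parse detection strings such as "<loc0123><loc0456><loc0789><loc1012> car".
# The 4-digit, 0-1023 token format is an assumption; verify it before relying on it.
import re

def parse_detections(text, img_width, img_height):
    boxes = []
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
    for y1, x1, y2, x2, label in re.findall(pattern, text):
        boxes.append({
            "label": label.strip(),
            "box": (int(x1) / 1024 * img_width, int(y1) / 1024 * img_height,
                    int(x2) / 1024 * img_width, int(y2) / 1024 * img_height),
        })
    return boxes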

17 of 27

Instance Segmentation

PaliGemma can also segment objects in images and videos using just a text prompt.

18 of 27

How does it work?

Text Prompt: segment person, mug, book

Model output:
<loc100><loc200><loc300><loc400><seg058><seg071> … person
<loc50><loc80><loc120><loc400><seg064><seg082> … book

Parse the output from the PaliGemma model and draw the segmentation masks on the input image.
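A rough sketch of the parsing step is below; it assumes each object in the answer is a run of <loc…> tokens (bounding box) followed by <seg…> codebook indices and the object name, and it leaves mask decoding to the reference VQ-VAE decoder released with the model (not reimplemented here). The exact token format is an assumption.

# Hedged sketch: split a segmentation answer into per-object location and seg-token indices.
import re

def parse_segmentation(text):
    objects = []
    for chunk in text.split(";"):                       # objects are often separated by ";"
        locs = [int(v) for v in re.findall(r"<loc(\d+)>", chunk)]
        segs = [int(v) for v in re.findall(r"<seg(\d+)>", chunk)]
        label = re.sub(r"<[^>]+>", "", chunk).strip()
        if locs:
            objects.append({"label": label, "box_tokens": locs, "seg_tokens": segs})
    return objects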

19 of 27

PaliGemma using Hugging Face Transformers

20 of 27

Import libraries, load the pre-trained PaliGemma model, give an input image and text prompt, and pass the input to the PaliGemma model:

# Import libraries
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained PaliGemma model and processor
model_id = "google/paligemma-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
model.to(device)

# Give an input image and text prompt
input_image = Image.open("image.jpg")
input_text = "detect person"

# Pass the input to the PaliGemma model
inputs = processor(text=input_text, images=input_image, padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
inputs = inputs.to(dtype=model.dtype)

21 of 27

Get output and parse the model response:

# Get output
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

paligemma_response = processor.decode(output[0], skip_special_tokens=True)[len(input_text):].lstrip("\n")
print(paligemma_response)

# Parse the PaliGemma model response as per need.
def parse_response(paligemma_response):
    …..
    …..
    …..

The same generate-and-parse flow applies across tasks:

  • Visual Question Answering
  • Object Detection
  • Instance Segmentation
  • Image Captioning
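As an illustrative usage example, the decoded response for a detection prompt could be handed to a parser like the hypothetical parse_detections helper sketched on the zero-shot detection slide; both the helper and the call below are sketches, not part of the official API.

# Illustrative only: reuse the hypothetical parse_detections helper from earlier.
detections = parse_detections(paligemma_response, input_image.width, input_image.height)
print(detections)  # e.g. [{"label": "person", "box": (x_min, y_min, x_max, y_max)}]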

22 of 27

What’s new in PaliGemma 2?

23 of 27

PaliGemma 2

  • Uses Gemma 2 as the language model; the vision encoder (SigLIP) remains the same.

  • Available in 3B, 10B, and 28B parameter sizes and (224 x 224), (448 x 448), (896 x 896) resolutions.

  • PaliGemma 2 generates detailed, contextually relevant captions for images.

  • Evaluated on a new set of transfer tasks:

Transfer Task                                  Dataset used
Table Structure Recognition                    PubTabNet, FinTabNet
Chemical Molecular Structure Recognition       PubChem
Music Score Recognition                        GrandStaff
Radiography Detection                          MIMIC-CXR
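Loading PaliGemma 2 with Transformers follows the same pattern as the earlier code; the short sketch below assumes the google/paligemma2-3b-pt-224 checkpoint id and a transformers version that supports the PaliGemma 2 architecture.

# Minimal sketch: loading a PaliGemma 2 checkpoint (checkpoint id is an assumption;
# verify on the Hugging Face Hub).
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

model_id = "google/paligemma2-3b-pt-224"  # also released in 10B/28B and 448/896 variants
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)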

24 of 27

PaliGemma 2

[Example outputs] Table Structure Recognition · Radiography Detection · Music Score Recognition · Chemical Molecular Structure Recognition

25 of 27

Resources

26 of 27

Resources

  • Vision Language Models: https://huggingface.co/blog/vlms

  • Gemma official documentation: https://ai.google.dev/gemma/

  • Get started with inferencing PaliGemma with examples: https://github.com/NSTiwari/PaliGemma

27 of 27

Thank You.

github.com/NSTiwari

medium.com/@tiwarinitin1999

twitter.com/@NSTiwari21

linkedin.com/in/tiwari-nitin