1 of 27

Vision Language Models & PaliGemma

Nitin Tiwari

Google Developer Expert – Machine Learning
Associate Data Scientist @ Colgate-Palmolive

2 of 27

$whoami

  • Associate Data Scientist at Colgate-Palmolive India Ltd.
  • Google Developer Expert in Machine Learning
  • Contributor to Gemma Cookbook, TensorFlow.js, Google Dev Library
  • Technical Speaker (50+ talks across India and internationally)
  • Writes blogs on AI/ML on Medium

3 of 27

Content

  • Introduction to Vision Language Models
  • Architecture of a typical VLM
  • What is PaliGemma?
  • Examples of PaliGemma
  • PaliGemma 2: What’s new?
  • Resources for you to get started

4 of 27

Vision Language Models

5 of 27

Vision Language Models

  • Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs.

  • They generalize well, offer strong zero-shot capabilities, and handle a variety of tasks such as image recognition, question answering, and document understanding.

  • These models can also capture spatial properties and output bounding boxes or segmentation masks for specific subjects.

6 of 27

Vision Language Models

7 of 27

Architecture of a typical VLM

[Architecture diagram] The input image goes through an Image Encoder (CLIP, SigLIP, etc.) and then a Multimodal Projector; the projected visual tokens are concatenated with the embedded tokens of the Text Prompt and fed to the Text Decoder, which produces the Text Output. The legend distinguishes frozen (not trainable) blocks from blocks that are not frozen; typically the image encoder is kept frozen while the projector and decoder are trained.
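To make the flow concrete, here is a minimal, hypothetical PyTorch sketch of the encoder → projector → decoder pipeline described above; all module names, sizes, and the toy decoder are illustrative assumptions, not PaliGemma's actual implementation.

# Minimal, hypothetical sketch of a VLM forward pass (illustrative only).
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, text_dim=2048, vocab_size=32000):
        super().__init__()
        self.image_encoder = nn.Identity()                 # stand-in for a frozen CLIP/SigLIP encoder
        self.projector = nn.Linear(vision_dim, text_dim)   # multimodal projector (trainable)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerDecoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_features, prompt_ids):
        # Project visual features into the text embedding space.
        visual_tokens = self.projector(self.image_encoder(image_features))
        # Concatenate projected visual tokens with embedded text-prompt tokens.
        sequence = torch.cat([visual_tokens, self.text_embed(prompt_ids)], dim=1)
        # Decode over the combined sequence to get next-token logits (the text output).
        hidden = self.text_decoder(tgt=sequence, memory=sequence)
        return self.lm_head(hidden)

# Example: 196 visual patches + an 8-token prompt.
logits = TinyVLM()(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 8)))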

8 of 27

Example: Contrastive Pre-training
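Since the slide's figure cannot be reproduced here, a rough code sketch of what contrastive pre-training optimizes may help: a CLIP-style softmax loss over the batch similarity matrix, and the sigmoid pairwise variant used by SigLIP (mentioned later in the deck). Shapes, temperature, and bias values are illustrative assumptions.

# Hedged sketch of contrastive pre-training losses on a batch of (image, text) embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Softmax contrastive loss: matching (image, text) pairs lie on the diagonal.
    image_emb, text_emb = F.normalize(image_emb, dim=-1), F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_style_loss(image_emb, text_emb, t=10.0, b=-10.0):
    # Sigmoid pairwise loss: every (image i, text j) pair is treated as a binary example.
    image_emb, text_emb = F.normalize(image_emb, dim=-1), F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0)) - 1   # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()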

9 of 27

PaliGemma

10 of 27

PaliGemma

  • PaliGemma is an open Vision Language Model from Google with just 3B parameters, designed to perform a variety of tasks such as image captioning, object detection, image segmentation, visual question answering, optical character recognition, and more.

  • Inspired by PaLI-3, PaliGemma uses SigLIP (Sigmoid Loss for Language Image Pre-Training) as the image encoder and Gemma 2B as the underlying language model.

[Diagram] SigLIP (image encoder: “I’ll handle the vision capabilities”) + Gemma 2B (text decoder: “I’ll handle the text part”) = PaliGemma

11 of 27

PaliGemma

PaliGemma is available in different variants:

  • Pre-trained (PT) model: the general-purpose base model trained on a variety of tasks.
  • Fine-tuned (FT) model: customized/fine-tuned for a specific downstream task.
  • Mix model: a pre-trained model fine-tuned on a mixture of tasks, suited for general-purpose use out of the box.

PaliGemma comes in 3 resolutions – 224x224, 448x448 and 896x896.
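The variant and resolution show up in the Hugging Face checkpoint id; the naming pattern sketched below is based on the public PaliGemma checkpoints, but verify the exact ids on the Hugging Face Hub before use.

# Checkpoint id pattern (verify on the Hub): google/paligemma-3b-<variant>-<resolution>,
# where <variant> is pt (pre-trained), ft-<task> (fine-tuned), or mix,
# and <resolution> is 224, 448, or 896.
PRETRAINED_448 = "google/paligemma-3b-pt-448"   # general-purpose base model at 448x448
MIX_448 = "google/paligemma-3b-mix-448"         # the checkpoint used later in this deck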

But… when do we need a fine-tuned model?🤔

12 of 27

When do we need to fine-tune?

Prompt: “detect car” → Output: [image with all the cars detected]

Pre-trained models are best suited for general-purpose tasks. As long as the pre-trained model works well, there’s no need to fine-tune it.

13 of 27

When do we need to fine-tune?

Prompt: “detect red car” → Output: [image where the red cars are not detected correctly]

The pre-trained model fails to detect the red cars specifically, so we need to fine-tune it for this task.

14 of 27

Examples of PaliGemma

* All the images/videos used are real and tested without editing. *

15 of 27

Visual Question Answering

Given an image, PaliGemma can answer questions based on it.

Prompt: What is present in this image?

Response: Google Cloud Platform

16 of 27

Zero-shot Object Detection

PaliGemma is capable of zero-shot object detection: it can detect objects from just a text prompt, without being trained on a task-specific dataset.
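As a hedged sketch of how the detection output can be consumed: the model's answer typically looks like "<loc….><loc….><loc….><loc….> car", with four location tokens on a 0–1023 grid (y_min, x_min, y_max, x_max). The parser below assumes that format; double-check it against the official documentation.

# Hedged sketch: parse detection strings such as "<loc0123><loc0456><loc0789><loc1012> car".
# The 4-digit, 0-1023 token format is an assumption; verify it before relying on it.
import re

def parse_detections(text, img_width, img_height):
    boxes = []
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
    for y1, x1, y2, x2, label in re.findall(pattern, text):
        boxes.append({
            "label": label.strip(),
            "box": (int(x1) / 1024 * img_width, int(y1) / 1024 * img_height,
                    int(x2) / 1024 * img_width, int(y2) / 1024 * img_height),
        })
    return boxes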

17 of 27

Instance Segmentation

PaliGemma can also segment objects in images and videos using just a text prompt.

18 of 27

How does it work?

Text Prompt: segment person, mug, book

Model output:
<loc100><loc200><loc300><loc400><seg058><seg071> … person
<loc50><loc80><loc120><loc400><seg064><seg082> … book

Parse the output from the PaliGemma model and draw the segmentation masks on the input image.
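A rough sketch of the parsing step is below; it assumes each object in the answer is a run of <loc…> tokens (bounding box) followed by <seg…> codebook indices and the object name, and it leaves mask decoding to the reference VQ-VAE decoder released with the model (not reimplemented here). The exact token format is an assumption.

# Hedged sketch: split a segmentation answer into per-object location and seg-token indices.
import re

def parse_segmentation(text):
    objects = []
    for chunk in text.split(";"):                       # objects are often separated by ";"
        locs = [int(v) for v in re.findall(r"<loc(\d+)>", chunk)]
        segs = [int(v) for v in re.findall(r"<seg(\d+)>", chunk)]
        label = re.sub(r"<[^>]+>", "", chunk).strip()
        if locs:
            objects.append({"label": label, "box_tokens": locs, "seg_tokens": segs})
    return objects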

19 of 27

PaliGemma using Hugging Face Transformers

20 of 27

Import libraries, load the pre-trained PaliGemma model, give an input image and text prompt, and pass the input to the PaliGemma model:

# Import libraries
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained PaliGemma model and processor
model_id = "google/paligemma-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
model.to(device)

# Give an input image and text prompt
input_image = Image.open("image.jpg")
input_text = "detect person"

# Pass the input to the PaliGemma model
inputs = processor(text=input_text, images=input_image, padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
inputs = inputs.to(dtype=model.dtype)

21 of 27

Get output and parse the model response:

# Get output
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

paligemma_response = processor.decode(output[0], skip_special_tokens=True)[len(input_text):].lstrip("\n")
print(paligemma_response)

# Parse the PaliGemma model response as per need.
def parse_response(paligemma_response):
    …..
    …..
    …..

The same generate-and-parse flow applies across tasks:

  • Visual Question Answering
  • Object Detection
  • Instance Segmentation
  • Image Captioning
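As an illustrative usage example, the decoded response for a detection prompt could be handed to a parser like the hypothetical parse_detections helper sketched on the zero-shot detection slide; both the helper and the call below are sketches, not part of the official API.

# Illustrative only: reuse the hypothetical parse_detections helper from earlier.
detections = parse_detections(paligemma_response, input_image.width, input_image.height)
print(detections)  # e.g. [{"label": "person", "box": (x_min, y_min, x_max, y_max)}]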

22 of 27

What’s new in PaliGemma 2?

23 of 27

PaliGemma 2

  • Uses Gemma 2 as the language model; the vision encoder (SigLIP) remains the same.

  • Available in 3B, 10B, and 28B parameter sizes and (224 x 224), (448 x 448), (896 x 896) resolutions.

  • PaliGemma 2 generates detailed, contextually relevant captions for images.

  • Evaluated on a new set of transfer tasks:

Transfer Task                                  Dataset used
Table Structure Recognition                    PubTabNet, FinTabNet
Chemical Molecular Structure Recognition       PubChem
Music Score Recognition                        GrandStaff
Radiography Detection                          MIMIC-CXR
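Loading PaliGemma 2 with Transformers follows the same pattern as the earlier code; the short sketch below assumes the google/paligemma2-3b-pt-224 checkpoint id and a transformers version that supports the PaliGemma 2 architecture.

# Minimal sketch: loading a PaliGemma 2 checkpoint (checkpoint id is an assumption;
# verify on the Hugging Face Hub).
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

model_id = "google/paligemma2-3b-pt-224"  # also released in 10B/28B and 448/896 variants
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)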

24 of 27

PaliGemma 2

[Example outputs] Table Structure Recognition · Radiography Detection · Music Score Recognition · Chemical Molecular Structure Recognition

25 of 27

Resources

26 of 27

Resources

  • Vision Language Models: https://huggingface.co/blog/vlms

  • Gemma official documentation: https://ai.google.dev/gemma/

  • Get started with inferencing PaliGemma with examples: https://github.com/NSTiwari/PaliGemma

27 of 27

Thank You.

github.com/NSTiwari

medium.com/@tiwarinitin1999

twitter.com/@NSTiwari21

linkedin.com/in/tiwari-nitin