Vision Language Models & PaliGemma
Nitin Tiwari
Google Developer Expert – Machine Learning | Associate Data Scientist @ Colgate-Palmolive
$whoami
Content
Vision Language Models
Architecture of a typical VLM
Image → Image Encoder {CLIP, SigLIP, etc.} → Multimodal Projector → Text Decoder → Text Output
The text prompt is tokenized and embedded; the projected visual tokens + text embedded tokens are fed together to the text decoder, which generates the text output.
The image encoder is typically frozen (not trainable); the multimodal projector and text decoder are not frozen.
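To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline above. It is illustrative only: TinyVLM, the layer sizes, and the nn.Identity stand-ins are placeholders, not PaliGemma's actual implementation.

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, text_dim=2048):
        super().__init__()
        # Image encoder (e.g. CLIP/SigLIP); typically frozen during training.
        self.image_encoder = nn.Identity()   # stand-in for a real ViT
        # Multimodal projector: maps visual features into the text embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        # Text decoder (e.g. Gemma); autoregressive, not frozen.
        self.text_decoder = nn.Identity()    # stand-in for a real LM

    def forward(self, image_features, text_embeddings):
        visual_tokens = self.projector(self.image_encoder(image_features))
        # Projected visual tokens are concatenated with the embedded text tokens.
        sequence = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.text_decoder(sequence)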
Example: Contrastive Pre-training
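As a sketch of what contrastive pre-training optimizes: SigLIP trains its image and text towers with a pairwise sigmoid loss over a batch of image-text pairs (simplified below; the real recipe also learns the temperature t and bias b).

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # L2-normalise so the logits are scaled cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b       # (batch, batch) pair logits
    # Matching image-text pairs sit on the diagonal (+1); all others are -1.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is an independent binary classification problem.
    return -F.logsigmoid(labels * logits).mean()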
PaliGemma
SigLIP (image encoder): “I’ll handle the vision capabilities.”
+
Gemma 2B (text decoder): “I’ll handle the text part.”
=
PaliGemma
PaliGemma is available in different variants: pre-trained (pt), mix, and task-specific fine-tuned (ft) checkpoints.
Each variant comes in 3 resolutions: 224x224, 448x448, and 896x896.
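For reference, the Hugging Face model ids follow a variant + resolution naming scheme; a few examples (check the PaliGemma collection on Hugging Face for the full list):

model_ids = [
    "google/paligemma-3b-pt-224",   # pre-trained, 224x224
    "google/paligemma-3b-mix-448",  # mix (multi-task), 448x448
    "google/paligemma-3b-pt-896",   # pre-trained, 896x896
]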
But… when do we need a fine-tuned model? 🤔
When do we need to fine-tune?
Prompt: detect car → Output
Pre-trained models are best suited for general-purpose tasks. As long as the pre-trained model works well, there’s no need to fine-tune it.
When do we need to fine-tune?
Prompt: detect red car → Output
The pre-trained model fails to detect red cars specifically, so we need to fine-tune it for this task, as sketched below.
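A minimal fine-tuning sketch, following the frozen/not-frozen split from the architecture slide: freeze the SigLIP vision tower and train the projector and Gemma decoder. The attribute names (vision_tower, multi_modal_projector) follow the Hugging Face PaliGemma implementation; verify them against your transformers version.

from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-448")

# Image encoder stays frozen (not trainable).
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Multimodal projector and text decoder remain trainable.
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

# Plug the model into your usual training loop (e.g. transformers.Trainer)
# with (image, prompt, target) pairs for the task, such as "detect red car".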
Examples of PaliGemma
* All the images/videos used are real and were tested without editing. *
Visual Question Answering
Given an image, PaliGemma can answer questions based on it.
Prompt: What is present in this image?
Response: Google Cloud Platform
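The mix checkpoints expect a task prefix in the prompt; a few common ones from the PaliGemma model card (the language code, e.g. "en", selects the answer language):

prompts = [
    "answer en What is present in this image?",  # visual question answering
    "caption en",                                # image captioning
    "detect car",                                # object detection
    "segment person",                            # instance segmentation
]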
Zero-shot Object Detection
PaliGemma is capable of zero-shot object detection: without being trained on a task-specific dataset, it can detect objects from just a text prompt, as in the parsing sketch below.
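Detections come back as text: four <locXXXX> tokens per box (y_min, x_min, y_max, x_max on a 0–1023 grid, per the published PaliGemma format) followed by the label. A parsing sketch, assuming that format:

import re

def parse_detections(text, img_w, img_h):
    # Example input: "<loc0094><loc0256><loc0871><loc0933> car ; ... person"
    boxes = []
    pattern = r"<loc(\d+)><loc(\d+)><loc(\d+)><loc(\d+)>\s*([^<;]+)"
    for y1, x1, y2, x2, label in re.findall(pattern, text):
        boxes.append({
            "label": label.strip(),
            # Rescale the 0-1023 grid to pixel coordinates (x_min, y_min, x_max, y_max).
            "box": (int(x1) / 1024 * img_w, int(y1) / 1024 * img_h,
                    int(x2) / 1024 * img_w, int(y2) / 1024 * img_h),
        })
    return boxes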
Instance Segmentation
PaliGemma can also segment objects in images and videos using just a text prompt.
How does it work?
Text Prompt: Segment person, mug, book
<loc100><loc200><loc300><loc400><seg058><seg071> … person
<loc50><loc80><loc120><loc400><seg064><seg082> … book
Parse the output from the PaliGemma model and draw the segmentation masks on the input image; see the sketch below.
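A sketch of the parsing step, assuming the published output format: each object carries 4 <locXXXX> tokens (its bounding box) and 16 <segXXX> tokens (codebook indices for its mask; the example above elides most of them). Turning the indices into an actual mask additionally requires Google's separate VQ-VAE mask decoder, not shown here.

import re

def parse_segmentation(text):
    objects = []
    pattern = r"((?:<loc\d+>){4})((?:<seg\d{3}>){16})\s*([^<;]+)"
    for locs, segs, label in re.findall(pattern, text):
        objects.append({
            "label": label.strip(),
            "box_tokens": [int(v) for v in re.findall(r"<loc(\d+)>", locs)],
            "seg_indices": [int(v) for v in re.findall(r"<seg(\d{3})>", segs)],
        })
    return objects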
PaliGemma using Hugging Face Transformers
Import libraries
Load pre-trained PaliGemma model
Give input image and text prompt
Pass the input to PaliGemma model
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained PaliGemma model and its processor.
model_id = "google/paligemma-3b-mix-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
model.to(device)

# Input image and text prompt.
input_image = Image.open("image.jpg")
input_text = "detect person"

# Preprocess and move the inputs to the model's device and dtype.
inputs = processor(text=input_text, images=input_image, padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
inputs = inputs.to(dtype=model.dtype)
Get output
Parse the model response
# Generate the output without tracking gradients.
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

# Decode the output and strip the echoed input prompt.
paligemma_response = processor.decode(output[0], skip_special_tokens=True)[len(input_text):].lstrip("\n")
print(paligemma_response)
# Parse the PaliGemma model response as per need,
# e.g. extract <loc>/<seg> tokens and labels (see the parsing sketches above).
def parse_response(paligemma_response):
    ...
Visual Question Answering
Object Detection
Instance Segmentation
Image Captioning
What’s new in PaliGemma 2?
PaliGemma 2
Transfer Task | Dataset used |
Table Structure Recognition | PubTabNet, FinTabNet |
Chemical Molecular Structure Recognition | PubChem |
Radiography Detection | MIMIC-CXR |
Music Score Recognition | GrandStaff |
PaliGemma 2
Table Structure Recognition
Radiography Detection
Music Score Recognition
Chemical Molecular Structure Recognition
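PaliGemma 2 checkpoints load through the same Transformers API as before; only the model id changes. The id below is an example; check Hugging Face for the available sizes (3B/10B/28B) and resolutions (224/448/896).

from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

model_id = "google/paligemma2-3b-pt-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)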
Resources
Thank You.
github.com/NSTiwari
medium.com/@tiwarinitin1999
twitter.com/@NSTiwari21
linkedin.com/in/tiwari-nitin