Training and Inferencing Multimodal Models in Computer Vision
UIC AI Ecosystem Symposiums - September 13, 2024
What is a “Model in Computer Vision”?
Vision Modalities
Unimodal: The input x comes from a single sensor, such as an RGB camera that measures incident photons at various wavelengths and reports intensity values at those wavelengths
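As a concrete sketch, a unimodal RGB input can be represented as a height × width × 3 array of intensity values. The toy 4×4 image below is a hypothetical stand-in; a real input would come from a camera or image file.

```python
import numpy as np

# A unimodal RGB input: one intensity value per pixel per color channel.
# Hypothetical 4x4 toy image (random values), not real sensor data.
x = np.random.default_rng(0).integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(x.shape)  # (height, width, channels) -> (4, 4, 3)
```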
Video
Audio
Multimodal Models with Language Input: VQA
https://vision-explorer.allenai.org/visual_question
[Architecture diagram: an Image Encoder and a Text Encoder feed a Transformer; Text Guidance is fused with the image features via a Hadamard product before the Decoder/Classifier.]
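The fusion step can be sketched as an element-wise (Hadamard) product of the two encoder outputs. The 512-dimensional embedding size below is an illustrative assumption, not the actual architecture's dimension.

```python
import numpy as np

def hadamard_fusion(img_emb, txt_emb):
    """Fuse image and text embeddings with an element-wise (Hadamard) product."""
    return img_emb * txt_emb

rng = np.random.default_rng(0)
img_emb = rng.standard_normal(512)  # hypothetical image-encoder output
txt_emb = rng.standard_normal(512)  # hypothetical text-encoder output
fused = hadamard_fusion(img_emb, txt_emb)
print(fused.shape)  # (512,) -- same dimension as the inputs
```

The Hadamard product keeps the fused vector the same size as each encoder output, which is one reason it is a common lightweight fusion choice before a decoder/classifier head.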
Enhancing Text with Knowledge Graph
Caption: "A man riding a bicycle down a city street."
Parse to triplets: (man, riding, bicycle), (bicycle, down, street)
Query from KG: backpack, wheel, car, truck, fire hydrant, traffic light, permission
Question: Is this person crossing illegally or legally?
Examples from our VK-OOD paper
Caveat: KG text may give us misleading embeddings, so we need Outlier Detection in the VQA architecture
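A minimal sketch of the KG expansion plus a simple distance-based outlier check. The KG entries and the thresholding rule here are illustrative assumptions for exposition, not the actual VK-OOD method.

```python
import numpy as np

# Toy knowledge graph as an adjacency list (hypothetical entries).
kg = {
    "bicycle": ["wheel", "permission", "traffic light"],
    "man": ["backpack"],
    "street": ["car", "truck", "fire hydrant"],
}

def expand_caption_entities(entities):
    """Query the KG for neighbors of each caption entity."""
    return sorted({n for e in entities for n in kg.get(e, [])})

def flag_outliers(embeddings, k=2.0):
    """Flag embeddings far from the batch mean (simple z-score-style check)."""
    center = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)
    return dists > dists.mean() + k * dists.std()

print(expand_caption_entities(["man", "bicycle", "street"]))
# ['backpack', 'car', 'fire hydrant', 'permission', 'traffic light', 'truck', 'wheel']

# One embedding far from the rest gets flagged as a potential outlier.
emb = np.vstack([np.zeros((5, 3)), 100 * np.ones((1, 3))])
print(flag_outliers(emb))  # only the last embedding is flagged
```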
Training Multimodal Models: Finding parameters W
Training graphs
Loss functions
Cross-entropy Loss
Figures from our RLO paper
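For reference, the cross-entropy loss for a single example is -log softmax(logits)[target]. A minimal numpy sketch with hypothetical logits:

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy loss for one example: -log softmax(logits)[target]."""
    z = logits - logits.max()  # subtract max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

logits = np.array([2.0, 0.5, -1.0])  # hypothetical classifier outputs
print(round(cross_entropy(logits, target=0), 4))
```

During training, the parameters W are adjusted (typically by gradient descent) to reduce this loss averaged over the training set.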
Recent focus: Analyzing Patents from the USPTO
Caption: "The image is a white outline of a Transition section for a turbocharged engine. The shape of the image is a combination of a cylinder and a cone, with a curved surface connecting the two."
[Figure: 2D and 3D patent drawings of an engine component.]
Examples from our IMPACT patent dataset and the public MSVD dataset
Ongoing Work: Multimodal Models for Patents
Examples from our IMPACT patent dataset
Image-text contrastive learning models
[Diagram: an Image Encoder and a Text Encoder map paired images and text (labels such as "Fruit", "Gum") into a shared embedding space.]
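Image-text contrastive training (in the style of CLIP-like models) can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. The batch size, embedding dimension, and temperature below are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize rows so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matching pairs sit on the diagonal

    def ce(l):  # row-wise cross-entropy against the diagonal targets
        z = l - l.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (ce(logits) + ce(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.standard_normal((4, 64)),
                             rng.standard_normal((4, 64)))
print(float(loss) > 0)  # True: random pairs give a nonzero loss
```

Training pulls matching image/text pairs together (large diagonal similarities) and pushes mismatched pairs apart, which drives the loss toward zero.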
Multimodal Large Language Models
[Diagram: an Image Encoder and a Text Encoder (e.g., input "Engine") feed an Input Projector, whose output conditions a language-model Generator.]
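The Input Projector can be sketched as a learned linear map from image-encoder features into the generator's token-embedding space. All dimensions and weights below are illustrative assumptions, not the actual model's.

```python
import numpy as np

def input_projector(img_features, W, b):
    """Linear projector: map image-encoder features into the text-embedding
    space expected by the language-model generator."""
    return img_features @ W + b

rng = np.random.default_rng(0)
img_features = rng.standard_normal((16, 768))  # hypothetical patch features
W = 0.02 * rng.standard_normal((768, 4096))    # hypothetical learned projection
b = np.zeros(4096)
tokens = input_projector(img_features, W, b)
print(tokens.shape)  # (16, 4096): 16 "visual tokens" fed to the generator
```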
Multimodal Generative Models
[Diagram: an Image Encoder feeds an Input Projector, alongside a Text Encoder for text such as "Fruit", "Gum"; their outputs jointly condition generation.]
Reference