1 of 9

Training and Inferencing Multimodal Models in Computer Vision

UIC AI Ecosystem Symposiums - September 13, 2024


2 of 9

What is a “Model in Computer Vision”?

  • From Azure: Machine learning models consist of the binary files that represent a machine learning model and any corresponding metadata (OK, but not very useful)


  • From Google: In general, (a model is) any mathematical construct that processes input data and returns output
  • From AWS: A computer vision model (f) takes an image (x) as input and outputs information about the objects (y) that it detects, such as the type of object and its location
  • Mathematically, a computer vision model is a function f that takes an image x as input and returns details y = f(x) about the semantic information contained in the image (see the sketch below)
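To make the function view concrete, here is a minimal PyTorch sketch; the model, its sizes, and its names are illustrative assumptions, not from any specific system:

    import torch
    import torch.nn as nn

    # A computer vision model is a function f: image x -> semantic output y.
    class TinyClassifier(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel RGB input
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            self.head = nn.Linear(16, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.backbone(x))  # y = f(x)

    x = torch.randn(1, 3, 224, 224)  # a batch containing one RGB image
    y = TinyClassifier()(x)          # class scores: semantic details about x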

3 of 9

Vision Modalities


Unimodal: The input x comes from a single sensor, such as an RGB camera that processes photons at various wavelengths to give intensity values at those wavelengths

[Figure: example sensor modalities — video and audio]

4 of 9

Multimodal Models with Language Input: VQA


https://vision-explorer.allenai.org/visual_question

[Architecture diagram: an Image Encoder and a Text Encoder feed into a Transformer; text guidance is combined with the image features via a Hadamard product, and a Decoder/Classifier produces the answer]
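As a minimal sketch of the fusion step in this kind of architecture, assuming pooled image and question embeddings (all module names and dimensions here are illustrative assumptions):

    import torch
    import torch.nn as nn

    class HadamardVQA(nn.Module):
        """Fuse image and question features with an elementwise (Hadamard)
        product, then classify over a fixed answer vocabulary."""
        def __init__(self, dim: int = 512, num_answers: int = 3000):
            super().__init__()
            self.img_proj = nn.Linear(2048, dim)  # projects image-encoder features
            self.txt_proj = nn.Linear(768, dim)   # projects text-encoder features
            self.classifier = nn.Linear(dim, num_answers)

        def forward(self, img_feat, txt_feat):
            fused = self.img_proj(img_feat) * self.txt_proj(txt_feat)  # Hadamard fusion
            return self.classifier(fused)  # scores over candidate answers

    logits = HadamardVQA()(torch.randn(2, 2048), torch.randn(2, 768))

Full VQA models place the fusion inside a transformer stack; the elementwise product above only illustrates the "Hadamard" box in the diagram.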

5 of 9

Enhancing Text with Knowledge Graph


[Pipeline figure: the caption "A man riding a bicycle down a city street." is parsed into triplets such as (man, riding, bicycle) and (bicycle, down, street); querying the knowledge graph with these entities returns related concepts (backpack, wheel, car, truck, fire hydrant, traffic light, permission) that enrich the question "Is this person crossing illegally or legally?"]

Examples from our VK-OOD paper [2]

Caveat: KG text may give us misleading embeddings, so we need outlier detection in the VQA architecture; a toy sketch of the KG expansion step follows below
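A hypothetical sketch of the caption → triplet → KG-neighbor step (the toy knowledge graph and helper below are illustrative, not the ones used in the paper):

    # Triplets parsed from "A man riding a bicycle down a city street."
    CAPTION_TRIPLETS = [("man", "riding", "bicycle"), ("bicycle", "down", "street")]

    KG = {  # toy knowledge graph: entity -> related concepts
        "street": ["car", "truck", "fire hydrant", "traffic light"],
        "bicycle": ["wheel"],
        "man": ["backpack", "permission"],
    }

    def expand_with_kg(triplets, kg):
        """Collect KG neighbors of every entity mentioned in the triplets."""
        entities = {e for (s, _, o) in triplets for e in (s, o)}
        return sorted({n for e in entities for n in kg.get(e, [])})

    extra_concepts = expand_with_kg(CAPTION_TRIPLETS, KG)
    # The extra concepts are appended to the question text before encoding;
    # outlier detection then filters embeddings that the expansion made misleading.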

6 of 9

Training Multimodal Models: Finding parameters W


[Figures from our RLO paper [3]: training curves and loss functions, including cross-entropy loss]

  • Cross-entropy loss upper bounds the 0-1 loss (which is 0 when f_k(x) equals y_k and 1 otherwise). It is the standard choice, but it is nearly linear and often ill-conditioned with respect to W, so regularization and other losses are used in tandem
  • Parameters are continuous rather than discrete, so a grid search is not possible; we have to use continuous optimization methods with appropriate step-size/learning-rate strategies
  • f is usually nonsmooth with respect to the parameters W. For example, some blocks/modules in the architecture may be defined implicitly (say, as an optimization problem, a differential equation, etc.), so we have to think carefully about computing gradients, because even evaluation can be NP-hard (a minimal training-step sketch follows this list)
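An illustrative training loop showing these pieces together: cross-entropy loss, weight decay as regularization, and an explicit learning-rate schedule (the model and data below are placeholders, not from the paper):

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 10)  # stand-in for f with parameters W
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # regularization
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)    # step-size strategy
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 512)         # placeholder features
        y = torch.randint(0, 10, (32,))  # placeholder labels
        opt.zero_grad()
        loss = loss_fn(model(x), y)      # differentiable surrogate for the 0-1 loss
        loss.backward()                  # gradients with respect to W
        opt.step()
        sched.step()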

7 of 9

Recent focus: Analyze Patents from USPTO

  • Classification: What is the patent about? Assign labels y such as Medical, Computing, or Chemical to individual patents using the text, images, and flowcharts x contained in the patent (see the sketch after this list)
  • Generative: Produce detailed descriptions of the images in patents; pretrained language models (PLMs) can be used for this
  • Challenging: Enhance or fill in the sketches contained in patents
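A minimal sketch of the classification task; the label set, dimensions, and fusion-by-concatenation are illustrative assumptions:

    import torch
    import torch.nn as nn

    LABELS = ["Medical", "Computing", "Chemical"]  # illustrative label set

    class PatentClassifier(nn.Module):
        """Score labels y for a patent from pooled text and image/flowchart embeddings x."""
        def __init__(self, txt_dim: int = 768, img_dim: int = 512):
            super().__init__()
            self.head = nn.Linear(txt_dim + img_dim, len(LABELS))

        def forward(self, txt_feat, img_feat):
            # Concatenate the modalities and score each label.
            return self.head(torch.cat([txt_feat, img_feat], dim=-1))

    scores = PatentClassifier()(torch.randn(4, 768), torch.randn(4, 512))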


[Figure: examples from our IMPACT patent dataset and the public MSVD dataset — a 2D image, a 3D image, and an engine component, with the generated caption: "The image is a white outline of a transition section for a turbocharged engine. The shape of the image is a combination of a cylinder and a cone, with a curved surface connecting the two."]

8 of 9

Ongoing Work: Multimodal Models for Patents


Examples from our IMPACT patent dataset

[Diagrams of three model families, with sample labels from the dataset ("Fruit", "Gum", "Engine"):
  • Image-text contrastive learning models — paired Image Encoder and Text Encoder
  • Multimodal Large Language Models — Image Encoder and Text Encoder feeding an Input Projector and a Generator
  • Multimodal Generative Models — Image Encoder and Text Encoder feeding an Input Projector]
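As a minimal sketch of the image-text contrastive objective (a symmetric InfoNCE loss in the style of CLIP; the embedding sizes and temperature are illustrative):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Matched (image, text) pairs are positives; every other pairing
        in the batch serves as a negative."""
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature  # pairwise cosine similarities
        targets = torch.arange(len(img))      # the i-th image matches the i-th text
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))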

9 of 9

References

  • [1] Wu, Z., Yao, T., Fu, Y., & Jiang, Y. G. (2017). Deep learning for video classification and captioning. In Frontiers of Multimedia Research (pp. 3-29).
  • [2] Wang, Z., Medya, S., & Ravi, S. (2023). Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis. Advances in Neural Information Processing Systems, 36, 13854-13872.
  • [3] Wang, Z., Veluswami, P. R., Mishra, H., & Ravi, S. N. (2023). Accelerated Neural Network Training with Rooted Logistic Objectives. arXiv preprint arXiv:2310.03890.
