This work was supported by the Microsoft Academic Partnership Grant (MAPG) 2023.
Paper and code are available here:
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Shreya Shukla¹*, Nakul Sharma¹*, Manish Gupta², Anand Mishra¹
¹Indian Institute of Technology Jodhpur, India ²Microsoft, India
{shukla.12, sharma.86, mishra}@iitj.ac.in, gmanish@microsoft.com
* denotes equal contribution
PatentDesc-355K
A comprehensive dataset of 355K patent figure images with corresponding brief and detailed descriptions, drawn from 60K+ unique patents crawled from Google Patents.
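To make the dataset contents concrete, here is a minimal Python sketch of one way a PatentDesc-355K record could be represented and loaded. The field names and the JSONL layout are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for a PatentDesc-355K entry (illustrative only).
import json
from dataclasses import dataclass

@dataclass
class PatentFigureRecord:
    patent_id: str             # e.g. a patent publication number
    figure_path: str           # path to the patent figure image
    brief_description: str     # one-line "FIG. N illustrates ..." text
    detailed_description: str  # multi-paragraph text with reference numerals

def load_records(jsonl_path: str) -> list[PatentFigureRecord]:
    """Load records from a JSON-lines file, one figure per line."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            records.append(PatentFigureRecord(**json.loads(line)))
    return records
```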
Example brief and detailed descriptions from PatentDesc-355K:

Brief description:
FIG. 10 illustrates an example process for video content erasure in accordance with various embodiments;

Detailed description:
FIG. 10 illustrates an example of inpainting applied to the input video of FIG. 7. Within video frame 702, the desirable content of a unicorn 704 and undesirable content of a gun in hand 706 is discernible. The content filter performs a step of segmentation 1008 of the undesirable feature. A representation of the segmented video frame shows boundaries 1012 of the regions of undesirable content. The content filter performs a step of erasure 1014. A representation of the video frame with the undesirable content erased 1016 shows the segmented boundary with the pixels of the erased undesirable content shown in black 1018. Next, the filter performs a step of inpainting 1020. This is possible using various image inpainting techniques. The resulting video frame 1022 illustrates the desirable unicorn content 704 unchanged but the undesirable gun in hand content 706 replaced by a pointing finger 1024. Such inpainting works well when the generative neural network that produces the pixels within the erased region is trained on a sufficient number of sample videos that contain similar content, such as hands, but no undesirable content, such as hands with guns. If the training content includes hands with pointing fingers, the inpainting neural network might find that a pointing finger most closely matches with the parts of the hand that are not erased. Therefore, it will predict fill pixel values that produce resulting media that looks like a pointing…
Proposed Method

PatentMME (Multimodal Transformer Encoder)

[Architecture figure: the input patent image is patchified and flattened, then projected through a linear layer into patch embeddings, while OCR words and parser output provide token embeddings. Token/patch embeddings are summed with 1D and 2D position embeddings, masked ([MASK] tokens, with layout-aware masking on the image patches), and fed through the multimodal transformer encoder. The resulting contextual embeddings h drive three pre-training heads: an MLM head (MLM loss), an LA-MIM head (LA-MIM loss), and a PC head (PC loss). A visual element detection module supplies GT annotations in the form of elements extracted from the input patent image.]
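The figure's input pipeline can be summarized in a short sketch. Below is a hedged PyTorch illustration of combining patch, token, and position embeddings, assuming typical sizes (576 patches, 768-d hidden) and bounding boxes as the 2D positions; the module names and dimensions are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of PatentMME's input construction (sizes are assumptions).
import torch
import torch.nn as nn

class PatentMMEInput(nn.Module):
    def __init__(self, vocab=30522, d=768, patches=576, patch_dim=3 * 16 * 16, max_len=512):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d)   # "Patchify and Flatten" -> "Linear Layer"
        self.tok_emb = nn.Embedding(vocab, d)       # word embedding for OCR tokens
        self.pos_1d = nn.Embedding(patches + max_len + 2, d)  # 1D (sequence) positions
        self.pos_2d = nn.Linear(4, d)               # 2D positions from (x0, y0, x1, y1) boxes

    def forward(self, patches, patch_boxes, ocr_ids, ocr_boxes):
        # Visual side: patch embedding + 2D position of each patch.
        v = self.patch_proj(patches) + self.pos_2d(patch_boxes)
        # Text side: OCR token embedding + 2D position of each word box.
        t = self.tok_emb(ocr_ids) + self.pos_2d(ocr_boxes)
        x = torch.cat([v, t], dim=1)                # special tokens ([CLS]/[SEP]) omitted
        idx = torch.arange(x.size(1), device=x.device)
        return x + self.pos_1d(idx)                 # add 1D position embedding
```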
PatentMME is pre-trained on ~900K patent figure images crawled from the HUPD corpus in a weakly supervised setting, using the following objectives (a combined-loss sketch follows this list):

- Masked Language Modeling (MLM)
- Layout-Aware Masked Image Modeling (LA-MIM)
- Patch Classification (PC)
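A minimal sketch of how the three objectives could be combined into one pre-training loss, assuming each head emits per-position logits and unmasked positions carry the label -100; the equal weighting is an assumption, not the paper's reported configuration.

```python
# Hedged sketch of the weakly supervised multi-task pre-training loss.
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, mim_logits, mim_labels, pc_logits, pc_labels):
    # MLM: predict masked OCR tokens (-100 marks unmasked positions).
    l_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # LA-MIM: predict targets for patches hidden by layout-aware masking,
    # which preferentially masks patches inside detected visual elements.
    l_mim = F.cross_entropy(mim_logits.flatten(0, 1), mim_labels.flatten(), ignore_index=-100)
    # PC: classify each patch into one of the weakly labeled element categories.
    l_pc = F.cross_entropy(pc_logits.flatten(0, 1), pc_labels.flatten(), ignore_index=-100)
    return l_mlm + l_mim + l_pc   # equal weights assumed for illustration
```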
PatentLMM

[Architecture figure: the patent image is encoded by PatentMME into contextual embeddings h; a projection layer maps them to embeddings p in the LLM input space. Wrapped in the special tokens <image> ... </image> and concatenated with the language instruction, these are fed to PatentLLaMA, which produces the generated description. Trainable and frozen layers are marked in the figure.]
PatentLLaMA: We continue pre-training LLaMA-2 7B, with added LoRA adapters, on HUPD patent text for domain adaptation.
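A minimal sketch of this setup using Hugging Face transformers and peft; the LoRA hyperparameters and target modules below are placeholders, not the paper's reported values.

```python
# Continued pre-training of LLaMA-2 7B with LoRA on patent text (sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Placeholder LoRA configuration; only the adapter weights are trainable.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)

# Standard next-token (causal LM) objective on a snippet of patent text.
batch = tokenizer("An apparatus comprising a processor ...", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
```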
PatentLMM is trained in two stages.
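To illustrate how the pieces fit together, here is a hedged LLaVA-style sketch of a PatentLMM forward pass; the class and argument names are illustrative, and keeping PatentMME frozen is an assumption consistent with the figure's trainable/frozen legend, not a detail stated in this section.

```python
# Sketch: project PatentMME embeddings into the LLM space and splice them
# between <image> ... </image> special tokens before the language instruction.
import torch
import torch.nn as nn

class PatentLMM(nn.Module):
    def __init__(self, encoder, llm, d_vision=768, d_llm=4096):
        super().__init__()
        self.encoder = encoder                   # PatentMME (assumed frozen here)
        self.llm = llm                           # PatentLLaMA
        self.proj = nn.Linear(d_vision, d_llm)   # projection layer

    def forward(self, image_inputs, img_open_emb, img_close_emb, instr_emb):
        with torch.no_grad():
            h = self.encoder(**image_inputs)     # contextual embeddings h
        p = self.proj(h)                         # projected embeddings p
        # <image> p_1 ... p_N </image> [language instruction] -> LLM
        inputs = torch.cat([img_open_emb, p, img_close_emb, instr_emb], dim=1)
        return self.llm(inputs_embeds=inputs)    # generates the description
```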
Results

Generated and ground-truth brief and detailed descriptions for a sample figure:

Brief descriptions:

FIG. 1 illustrates a mobile smartphone in accordance with an aspect of the invention.

FIG. 1 shows the details of an exemplary electronic device in accordance with aspects of the invention.

Detailed descriptions:

Referring now to FIG. 1, an exemplary electronic device 104 is illustrated. It should be understood that the electronic device 104 may be any one of a number of electronic devices including, but not limited to, a cell phone, a smart phone, a personal digital assistant, a laptop computer, a desktop computer, a netbook computer, a server computer, etc. The electronic device 104 includes a processor 114, a memory 116, a display 118, a user interface 122, a transceiver 120, and an optional touch screen 160. The processor 114 is configured to execute instructions received from the memory 116, and is, for example, a general purpose processor, a field programmable gate array, or any other suitable processor. The processor 114 is configured to execute instructions received from the memory 116, including instructions for displaying a user interface on the display 118.…

FIG. 1 shows the details of an exemplary electronic device in accordance with aspects of the invention. The electronic device 104 includes a processor 114, memory 116, display 118, user interface 122, and the like. The processor 114 may be a central processing unit configured to execute instructions including instructions related to software programs. The display 118 may be a liquid crystal display having a backlight to illuminate the various color liquid crystals to provide a colorful display. The user interface 122 may be any type of physical input having buttons and further may be implemented as a touchscreen 180. The electronic device 104 may further include in the memory 116, an operating system 148, a communication component 150, a contact/motion component 152, a graphics component 154, and the like. The operating system 148 together with the various components providing…
[Table 1: Quantitative evaluation results for brief and detailed descriptions.]
[Table 2: Automated evaluation of descriptions using GPT-4V.]
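The poster does not list the exact metric set behind Table 1; purely as an illustration, here is how two standard captioning metrics (BLEU via nltk, ROUGE-L via rouge-score) could be computed for a generated/reference pair.

```python
# Illustrative caption-metric computation; the paper's actual metric set may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score(generated: str, reference: str) -> dict:
    smooth = SmoothingFunction().method1
    bleu = sentence_bleu([reference.split()], generated.split(),
                         smoothing_function=smooth)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, generated)["rougeL"].fmeasure
    return {"BLEU": bleu, "ROUGE-L": rouge_l}

print(score("FIG. 1 illustrates a mobile smartphone.",
            "FIG. 1 shows the details of an exemplary electronic device."))
```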
Visual Element Detection: We train a Faster R-CNN model on 350 manually annotated patent images to detect five categories of visual elements. The extracted elements serve as weak labels for the LA-MIM and PC objectives used in PatentMME pre-training.
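A minimal sketch of fine-tuning a torchvision Faster R-CNN for this detector; the ResNet-50 FPN backbone and COCO-pretrained weights are assumptions (this section specifies only Faster R-CNN, 350 annotated images, and five categories).

```python
# Faster R-CNN fine-tuning sketch for patent visual element detection.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained detector (backbone choice is an assumption).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor head: 5 element categories + 1 background class.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes=6)

# Training step outline (images: list of CHW tensors; targets: dicts with
# "boxes" and "labels"):  losses = model(images, targets)
#                         sum(losses.values()).backward()
```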
Summary

Our PatentLMM comprises two components: PatentMME, a patent-specific multimodal transformer encoder, and PatentLLaMA, a domain-adapted LLaMA-2.