This work was supported by the Microsoft Academic Partnership Grant (MAPG) 2023.
Paper and code are available here:
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Shreya Shukla¹*, Nakul Sharma¹*, Manish Gupta², Anand Mishra¹
¹Indian Institute of Technology Jodhpur, India ²Microsoft, India
{shukla.12, sharma.86, mishra}@iitj.ac.in, gmanish@microsoft.com
* denotes equal contribution
PatentDesc-355K
A comprehensive dataset of 355K patent figure images with corresponding brief and detailed descriptions, drawn from 60K+ unique patents crawled from Google Patents.
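To make the dataset contents concrete, here is a minimal Python sketch of one way a PatentDesc-355K record could be represented and loaded. The field names and the JSONL layout are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for a PatentDesc-355K entry (illustrative only).
import json
from dataclasses import dataclass

@dataclass
class PatentFigureRecord:
    patent_id: str             # e.g. a patent publication number
    figure_path: str           # path to the patent figure image
    brief_description: str     # one-line "FIG. N illustrates ..." text
    detailed_description: str  # multi-paragraph text with reference numerals

def load_records(jsonl_path: str) -> list[PatentFigureRecord]:
    """Load records from a JSON-lines file, one figure per line."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            records.append(PatentFigureRecord(**json.loads(line)))
    return records
```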
Example brief and detailed descriptions from PatentDesc-355K:

Brief description:
FIG. 10 illustrates an example process for video content erasure in accordance with various embodiments;

Detailed description:
FIG. 10 illustrates an example of inpainting applied to the input video of FIG. 7. Within video frame 702, the desirable content of a unicorn 704 and undesirable content of a gun in hand 706 is discernible. The content filter performs a step of segmentation 1008 of the undesirable feature. A representation of the segmented video frame shows boundaries 1012 of the regions of undesirable content. The content filter performs a step of erasure 1014. A representation of the video frame with the undesirable content erased 1016 shows the segmented boundary with the pixels of the erased undesirable content shown in black 1018. Next, the filter performs a step of inpainting 1020. This is possible using various image inpainting techniques. The resulting video frame 1022 illustrates the desirable unicorn content 704 unchanged but the undesirable gun in hand content 706 replaced by a pointing finger 1024. Such inpainting works well when the generative neural network that produces the pixels within the erased region is trained on a sufficient number of sample videos that contain similar content, such as hands, but no undesirable content, such as hands with guns. If the training content includes hands with pointing fingers, the inpainting neural network might find that a pointing finger most closely matches with the parts of the hand that are not erased. Therefore, it will predict fill pixel values that produce resulting media that looks like a pointing…
Proposed Method

PatentMME (Multimodal Transformer Encoder)

[Architecture figure: the input patent image is patchified and flattened, then projected through a linear layer into patch embeddings, while OCR words and parser output provide token embeddings. Token/patch embeddings are summed with 1D and 2D position embeddings, masked ([MASK] tokens, with layout-aware masking on the image patches), and fed through the multimodal transformer encoder. The resulting contextual embeddings h drive three pre-training heads: an MLM head (MLM loss), an LA-MIM head (LA-MIM loss), and a PC head (PC loss). A visual element detection module supplies GT annotations in the form of elements extracted from the input patent image.]
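The figure's input pipeline can be summarized in a short sketch. Below is a hedged PyTorch illustration of combining patch, token, and position embeddings, assuming typical sizes (576 patches, 768-d hidden) and bounding boxes as the 2D positions; the module names and dimensions are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of PatentMME's input construction (sizes are assumptions).
import torch
import torch.nn as nn

class PatentMMEInput(nn.Module):
    def __init__(self, vocab=30522, d=768, patches=576, patch_dim=3 * 16 * 16, max_len=512):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d)   # "Patchify and Flatten" -> "Linear Layer"
        self.tok_emb = nn.Embedding(vocab, d)       # word embedding for OCR tokens
        self.pos_1d = nn.Embedding(patches + max_len + 2, d)  # 1D (sequence) positions
        self.pos_2d = nn.Linear(4, d)               # 2D positions from (x0, y0, x1, y1) boxes

    def forward(self, patches, patch_boxes, ocr_ids, ocr_boxes):
        # Visual side: patch embedding + 2D position of each patch.
        v = self.patch_proj(patches) + self.pos_2d(patch_boxes)
        # Text side: OCR token embedding + 2D position of each word box.
        t = self.tok_emb(ocr_ids) + self.pos_2d(ocr_boxes)
        x = torch.cat([v, t], dim=1)                # special tokens ([CLS]/[SEP]) omitted
        idx = torch.arange(x.size(1), device=x.device)
        return x + self.pos_1d(idx)                 # add 1D position embedding
```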
PatentMME is pre-trained on ~900K patent figure images crawled from the HUPD corpus in a weakly supervised setting, using the following objectives (a combined-loss sketch follows this list):

- Masked Language Modeling (MLM)
- Layout-Aware Masked Image Modeling (LA-MIM)
- Patch Classification (PC)
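A minimal sketch of how the three objectives could be combined into one pre-training loss, assuming each head emits per-position logits and unmasked positions carry the label -100; the equal weighting is an assumption, not the paper's reported configuration.

```python
# Hedged sketch of the weakly supervised multi-task pre-training loss.
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, mim_logits, mim_labels, pc_logits, pc_labels):
    # MLM: predict masked OCR tokens (-100 marks unmasked positions).
    l_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # LA-MIM: predict targets for patches hidden by layout-aware masking,
    # which preferentially masks patches inside detected visual elements.
    l_mim = F.cross_entropy(mim_logits.flatten(0, 1), mim_labels.flatten(), ignore_index=-100)
    # PC: classify each patch into one of the weakly labeled element categories.
    l_pc = F.cross_entropy(pc_logits.flatten(0, 1), pc_labels.flatten(), ignore_index=-100)
    return l_mlm + l_mim + l_pc   # equal weights assumed for illustration
```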
PatentLMM

[Architecture figure: the patent image is encoded by PatentMME into contextual embeddings h; a projection layer maps them to embeddings p in the LLM input space. Wrapped in the special tokens <image> ... </image> and concatenated with the language instruction, these are fed to PatentLLaMA, which produces the generated description. Trainable and frozen layers are marked in the figure.]
PatentLLaMA: We continue pre-training LLaMA-2 7B, with added LoRA adapters, on HUPD patent text for domain adaptation.
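A minimal sketch of this setup using Hugging Face transformers and peft; the LoRA hyperparameters and target modules below are placeholders, not the paper's reported values.

```python
# Continued pre-training of LLaMA-2 7B with LoRA on patent text (sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Placeholder LoRA configuration; only the adapter weights are trainable.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)

# Standard next-token (causal LM) objective on a snippet of patent text.
batch = tokenizer("An apparatus comprising a processor ...", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
```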
PatentLMM is trained in two stages.
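To illustrate how the pieces fit together, here is a hedged LLaVA-style sketch of a PatentLMM forward pass; the class and argument names are illustrative, and keeping PatentMME frozen is an assumption consistent with the figure's trainable/frozen legend, not a detail stated in this section.

```python
# Sketch: project PatentMME embeddings into the LLM space and splice them
# between <image> ... </image> special tokens before the language instruction.
import torch
import torch.nn as nn

class PatentLMM(nn.Module):
    def __init__(self, encoder, llm, d_vision=768, d_llm=4096):
        super().__init__()
        self.encoder = encoder                   # PatentMME (assumed frozen here)
        self.llm = llm                           # PatentLLaMA
        self.proj = nn.Linear(d_vision, d_llm)   # projection layer

    def forward(self, image_inputs, img_open_emb, img_close_emb, instr_emb):
        with torch.no_grad():
            h = self.encoder(**image_inputs)     # contextual embeddings h
        p = self.proj(h)                         # projected embeddings p
        # <image> p_1 ... p_N </image> [language instruction] -> LLM
        inputs = torch.cat([img_open_emb, p, img_close_emb, instr_emb], dim=1)
        return self.llm(inputs_embeds=inputs)    # generates the description
```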
Results

Generated and ground-truth brief and detailed descriptions for a sample figure:

Brief descriptions:

FIG. 1 illustrates a mobile smartphone in accordance with an aspect of the invention.

FIG. 1 shows the details of an exemplary electronic device in accordance with aspects of the invention.

Detailed descriptions:

Referring now to FIG. 1, an exemplary electronic device 104 is illustrated. It should be understood that the electronic device 104 may be any one of a number of electronic devices including, but not limited to, a cell phone, a smart phone, a personal digital assistant, a laptop computer, a desktop computer, a netbook computer, a server computer, etc. The electronic device 104 includes a processor 114, a memory 116, a display 118, a user interface 122, a transceiver 120, and an optional touch screen 160. The processor 114 is configured to execute instructions received from the memory 116, and is, for example, a general purpose processor, a field programmable gate array, or any other suitable processor. The processor 114 is configured to execute instructions received from the memory 116, including instructions for displaying a user interface on the display 118.…

FIG. 1 shows the details of an exemplary electronic device in accordance with aspects of the invention. The electronic device 104 includes a processor 114, memory 116, display 118, user interface 122, and the like. The processor 114 may be a central processing unit configured to execute instructions including instructions related to software programs. The display 118 may be a liquid crystal display having a backlight to illuminate the various color liquid crystals to provide a colorful display. The user interface 122 may be any type of physical input having buttons and further may be implemented as a touchscreen 180. The electronic device 104 may further include in the memory 116, an operating system 148, a communication component 150, a contact/motion component 152, a graphics component 154, and the like. The operating system 148 together with the various components providing…
[Table 1: Quantitative evaluation results for brief and detailed descriptions.]
[Table 2: Automated evaluation of descriptions using GPT-4V.]
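The poster does not list the exact metric set behind Table 1; purely as an illustration, here is how two standard captioning metrics (BLEU via nltk, ROUGE-L via rouge-score) could be computed for a generated/reference pair.

```python
# Illustrative caption-metric computation; the paper's actual metric set may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score(generated: str, reference: str) -> dict:
    smooth = SmoothingFunction().method1
    bleu = sentence_bleu([reference.split()], generated.split(),
                         smoothing_function=smooth)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, generated)["rougeL"].fmeasure
    return {"BLEU": bleu, "ROUGE-L": rouge_l}

print(score("FIG. 1 illustrates a mobile smartphone.",
            "FIG. 1 shows the details of an exemplary electronic device."))
```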
Visual Element Detection: We train a Faster R-CNN model on 350 manually annotated patent images to detect five categories of visual elements. The extracted elements serve as weak labels for the LA-MIM and PC objectives used in PatentMME pre-training.
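A minimal sketch of fine-tuning a torchvision Faster R-CNN for this detector; the ResNet-50 FPN backbone and COCO-pretrained weights are assumptions (this section specifies only Faster R-CNN, 350 annotated images, and five categories).

```python
# Faster R-CNN fine-tuning sketch for patent visual element detection.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained detector (backbone choice is an assumption).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor head: 5 element categories + 1 background class.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes=6)

# Training step outline (images: list of CHW tensors; targets: dicts with
# "boxes" and "labels"):  losses = model(images, targets)
#                         sum(losses.values()).backward()
```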
Summary

Our PatentLMM comprises two components: PatentMME, a patent-specific multimodal transformer encoder, and PatentLLaMA, a domain-adapted LLaMA-2.