HTML Atomic UI Elements Extraction from Hand-Drawn Website Images using Mask-RCNN and novel Multi-Pass Inference Technique
Prasang Gupta and Swayambodha Mohapatra
Pricewaterhouse Coopers Advisory
Mumbai, India
PwC
Team Members
Prasang Gupta
Experienced Associate in US Emerging Technologies
prasang.gupta@pwc.com
Swayambodha Mohapatra
Experienced Associate in US Emerging Technologies
swayambodha.mohapatra@pwc.com
Agenda
Problem Statement
Our Solution and Methodology
Performance and Future Scope
The ImageCLEF 2020 problem statement involved detecting a set of atomic user interface (UI) elements in hand-drawn images of websites.
A sample output file with the bounding boxes for different classes and the confidence scores.
The dataset provided was skewed towards classes like Button, Image and Paragraph ...
The development set contained 2363 labelled images spanning 21 classes, and the test set contained 587 unlabelled images. However, the class distribution was heavily skewed.
Only a few classes were prominent in the data, while the others were very scarce.
Out of the 21 classes, 10 were present in fewer than 100 images across the whole development set, while classes like Button, Paragraph and Image were abundant.
… and the dataset also contained skewed images and overlapping UI elements
Several problems were visible in some of the images in the dataset.
Skewed images led to slanting bounding boxes, which would hamper the learning of the model.
In some images, elements overlapped with each other, leading to mixed element boundaries.
Case of repeated images: these two images are essentially the same, but were present in the dataset as two different images.
Applied data pre-processing to deal with the skewness …
Repeated Images
Identified such cases using a distance-thresholding algorithm that compares two images which are the same but slightly shifted.
DLIB model
A DLIB model was trained to identify the labels that were scarce in the dataset, as this model does not require a lot of data to train; it was trained on just 15 instances of the Image class.
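The distance-thresholding idea for flagging repeated images can be sketched as follows. This is a minimal numpy sketch: the `shift_tolerant_distance` helper and the threshold value are illustrative assumptions, since the slides do not specify the exact metric used.

```python
import numpy as np

def shift_tolerant_distance(img_a, img_b, max_shift=2):
    """Mean absolute pixel distance between two same-size grayscale images,
    minimised over small translations, so images that are the same but
    slightly shifted still score as close."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    best = np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            best = min(best, float(np.abs(a - shifted).mean()))
    return best

def is_duplicate(img_a, img_b, dist_thresh=5.0):
    # The threshold value is an assumption, not taken from the slides.
    return shift_tolerant_distance(img_a, img_b) < dist_thresh
```

A pair of identical but shifted scans yields a distance near zero and is flagged, while two unrelated sketches score far above any reasonable threshold.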
… and converted the images to black and white to optimise training
To ensure uniformity across the training dataset, we converted all images to black and white. The conversion pipeline consists of three stages:
Colour to grayscale
Grayscale to black & white (Adaptive Gaussian thresholding)
Noise removal
[Figures: grayscale image; output after the Adaptive Gaussian algorithm; final noise-reduced image]
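The three conversion stages can be sketched as below. This is a dependency-free approximation: a plain local-mean threshold stands in for the Adaptive Gaussian algorithm (which, e.g., OpenCV implements with a Gaussian-weighted window), a median filter stands in for the noise-removal step, and the block sizes and constant `c` are assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    # Stage 1: colour to grayscale using ITU-R BT.601 luma weights.
    return rgb.astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def adaptive_threshold(gray, block=11, c=2.0):
    # Stage 2: binarise each pixel against its local-neighbourhood mean,
    # approximating Adaptive Gaussian thresholding without OpenCV.
    pad = block // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=np.uint8)
    h, w = gray.shape
    for i in range(h):
        for j in range(w):
            local_mean = padded[i:i + block, j:j + block].mean()
            out[i, j] = 255 if gray[i, j] > local_mean - c else 0
    return out

def denoise(bw, block=3):
    # Stage 3: median filter to remove salt-and-pepper noise.
    pad = block // 2
    padded = np.pad(bw, pad, mode="edge")
    out = np.zeros_like(bw)
    h, w = bw.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + block, j:j + block])
    return out
```

On a light page with dark pen strokes, the adaptive threshold keeps strokes black and the background white, and the median pass cleans up isolated speckles.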
Implemented a vanilla Mask R-CNN model as a baseline
We initially tried both Mask R-CNN and YOLOv3 models. Since there was no need for real-time detection, we chose Mask R-CNN for its better results.
We implemented transfer learning by starting from a Mask R-CNN model pre-trained on the COCO dataset and trained it for 200 epochs.
The model was able to detect large UI elements in the image, but did not perform well on smaller UI elements, which led to a lower recall score.
Output generated by Mask RCNN Model (Run 1) on one of the images belonging to the test split of the dataset.
mAP value: 57.34%
Overall Precision: 94.04%
Overall Recall: 41.7%
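The Overall Precision and Overall Recall figures come from matching predicted boxes to ground-truth boxes. A minimal sketch of IoU-based greedy matching, assuming a standard 0.5 IoU threshold (the exact matching rule of the ImageCLEF evaluation is not stated in the slides):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thresh=0.5):
    # Greedily match each prediction to the first unmatched ground-truth
    # box with IoU above the threshold; count matches as true positives.
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Missing a small ground-truth element entirely costs recall but not precision, which matches the baseline's high-precision, low-recall profile above.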
Our vanilla model had high precision, but missed the smaller elements
We devised a novel technique to solve the issue of small UI elements going undetected.
STEP 1
Pass the image through the inference model and get the bounding box predictions.
STEP 2
Fill all the predicted bounding boxes with the background colour (white).
[Figure legend: Correctly Detected / Missed Out]
STEP 3
Pass the edited image through the inference model again to “force” the model to predict the missed elements.
Detected the missed elements!
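The three steps can be sketched as follows; `detect_fn` is a hypothetical stand-in for the Mask R-CNN inference call (the real model also returns class labels and confidence scores).

```python
import numpy as np

def fill_boxes(image, boxes, bg=255):
    """STEP 2: paint every predicted box with the background colour."""
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = bg
    return out

def multi_pass_inference(image, detect_fn, passes=2, bg=255):
    """STEPS 1-3: run inference, white out the detections, run again on
    the edited image, and append the new predictions.
    detect_fn(image) -> list of (x1, y1, x2, y2) boxes."""
    all_boxes, current = [], image
    for _ in range(passes):
        boxes = detect_fn(current)
        if not boxes:
            break
        all_boxes.extend(boxes)
        current = fill_boxes(current, boxes, bg)
    return all_boxes
```

With a detector that only ever reports its single strongest element per pass, a second pass recovers an element the first pass left behind.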
Improving our model’s detection capability with our novel “Multi-Pass Inference Technique”
STEP 1: Output generated after passing the input image through the model once (1st pass).
STEP 2: Output generated after the 2nd pass through the model; smaller UI elements get detected.
STEP 3: Output generated after appending the results from both passes.
Success!
… and then improved on this “Multi-Pass Inference Technique” for better performance
Only the bounding boxes with the highest confidence scores from the second pass were added to the final results. This was done to ensure that stray elements detected after the white-space replacement in the first step are not added.
Intermediate output generated by the model on one of the test-split images after the first pass.
Final output generated by the model on the same image after the second pass. Most of the elements missed in the first pass are captured in the second pass.
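The improved merging rule can be sketched as follows; the `(box, score)` representation and the 0.9 threshold are assumptions for illustration, as the slides do not report the cut-off used.

```python
def merge_passes(first_pass, second_pass, conf_thresh=0.9):
    """Append only high-confidence second-pass detections to the
    first-pass results, so stray boxes hallucinated on the whited-out
    image are dropped. Detections are (box, score) pairs."""
    kept = [det for det in second_pass if det[1] >= conf_thresh]
    return list(first_pass) + kept
```

All first-pass detections survive unchanged; only the second pass is filtered, which trades a little recall (Run 3's 49.6 vs 50.1) for cleaner final predictions and a higher mAP.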
Performance
Future Scope
There is scope to expand the viability of the Multi-Pass Inference technique and to study the effect of the number of passes on performance.
Explore attention models to enhance the explainability of predictions.
Better-performing base models like EfficientDet can be explored to improve the metrics.
| Run | Description | mAP | OP | OR |
| --- | --- | --- | --- | --- |
| Run 1 | Baseline MRCNN | 57.3 | 94.0 | 41.7 |
| Run 2 | Multi-Pass Inference | 63.7 | 91.8 | 50.1 |
| Run 3 | Improved Run 2 | 64.1 | 91.7 | 49.6 |
Questions?
Appendix
Using the “MPI technique” improved the performance of our model
Future Work
| Run | Description | mAP | OP | OR |
| --- | --- | --- | --- | --- |
| Run 1 | Baseline MRCNN | 57.3 | 94.0 | 41.7 |
| Run 2 | Multi-Pass Inference | 63.7 | 91.8 | 50.1 |
| Run 3 | Improved Run 2 | 64.1 | 91.7 | 49.6 |
Improved on the vanilla model by implementing a novel idea ...
Output generated after passing the input image through the model once (1st pass).
This technique involves getting predictions on the input image and then filling the corresponding bounding box regions with the background colour.
The edited image is then passed through the model again to essentially ‘force’ the model to make predictions on the missed elements.
Output generated after the 2nd pass through the model; smaller UI elements get detected.
New predictions from the 2nd pass are appended to the earlier predictions of the 1st pass to get the final results for the image.
Output generated after appending the results from both passes.