HTML Atomic UI Elements Extraction from Hand-Drawn Website Images using Mask-RCNN and novel Multi-Pass Inference Technique
Prasang Gupta and Swayambodha Mohapatra
Pricewaterhouse Coopers Advisory
Mumbai, India
PwC
Team Members
Prasang Gupta
Experienced Associate in US Emerging Technologies
prasang.gupta@pwc.com
Swayambodha Mohapatra
Experienced Associate in US Emerging Technologies
swayambodha.mohapatra@pwc.com
Agenda
Problem Statement
Our Solution and Methodology
Performance and Future Scope
The ImageCLEF 2020 problem statement involved detecting a set of atomic user interface (UI) elements in hand-drawn images of websites.
A sample output file with the bounding boxes for different classes and the confidence scores.
The dataset provided was skewed towards classes like Button, Image and Paragraph ...
The development set contained 2363 labelled images spanning 21 classes, and the test set contained 587 unlabelled images. However, the class distribution was heavily skewed.
Only a few classes were prominent in the data, while the others were very scarce.
Out of the 21 classes, 10 were present in fewer than 100 images across the whole development set, while classes like Button, Paragraph and Image were abundant.
… and the dataset also contained skewed images and overlapping UI elements
Several problems were visible in some of the images in the dataset.
Skewed images led to slanting bounding boxes, which would hamper the learning of the model.
In some images, elements overlapped with each other, leading to mixed element boundaries.
Case of repeated images: these two images are essentially the same, but were present in the dataset as two different images.
Applied data pre-processing to deal with the skewness …
Repeated Images
Identified such cases using a distance-thresholding algorithm that compares two images which are the same but slightly shifted.
DLIB model
A DLIB model was trained to identify the labels that were scarce in the dataset, as this model does not require a lot of data to train; it was trained on just 15 instances of the Image class.
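The distance-thresholding idea for flagging repeated images can be sketched as follows. This is a minimal numpy sketch: the `shift_tolerant_distance` helper and the threshold value are illustrative assumptions, since the slides do not specify the exact metric used.

```python
import numpy as np

def shift_tolerant_distance(img_a, img_b, max_shift=2):
    """Mean absolute pixel distance between two same-size grayscale images,
    minimised over small translations, so images that are the same but
    slightly shifted still score as close."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    best = np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            best = min(best, float(np.abs(a - shifted).mean()))
    return best

def is_duplicate(img_a, img_b, dist_thresh=5.0):
    # The threshold value is an assumption, not taken from the slides.
    return shift_tolerant_distance(img_a, img_b) < dist_thresh
```

A pair of identical but shifted scans yields a distance near zero and is flagged, while two unrelated sketches score far above any reasonable threshold.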
… and converted the images to black and white to optimise training
To ensure uniformity across the training dataset, we converted all images to black and white. The conversion pipeline consists of three stages:
Colour to grayscale
Grayscale to black & white (Adaptive Gaussian thresholding)
Noise removal
[Figures: grayscale image; output after the Adaptive Gaussian algorithm; final noise-reduced image]
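The three conversion stages can be sketched as below. This is a dependency-free approximation: a plain local-mean threshold stands in for the Adaptive Gaussian algorithm (which, e.g., OpenCV implements with a Gaussian-weighted window), a median filter stands in for the noise-removal step, and the block sizes and constant `c` are assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    # Stage 1: colour to grayscale using ITU-R BT.601 luma weights.
    return rgb.astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def adaptive_threshold(gray, block=11, c=2.0):
    # Stage 2: binarise each pixel against its local-neighbourhood mean,
    # approximating Adaptive Gaussian thresholding without OpenCV.
    pad = block // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=np.uint8)
    h, w = gray.shape
    for i in range(h):
        for j in range(w):
            local_mean = padded[i:i + block, j:j + block].mean()
            out[i, j] = 255 if gray[i, j] > local_mean - c else 0
    return out

def denoise(bw, block=3):
    # Stage 3: median filter to remove salt-and-pepper noise.
    pad = block // 2
    padded = np.pad(bw, pad, mode="edge")
    out = np.zeros_like(bw)
    h, w = bw.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + block, j:j + block])
    return out
```

On a light page with dark pen strokes, the adaptive threshold keeps strokes black and the background white, and the median pass cleans up isolated speckles.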
Implemented a vanilla Mask R-CNN model as a baseline
We initially tried both Mask R-CNN and YOLOv3 models. Since there was no need for real-time detection, we chose Mask R-CNN for its better results.
We implemented transfer learning by starting from a Mask R-CNN model pre-trained on the COCO dataset and trained it for 200 epochs.
The model was able to detect large UI elements in the image, but did not perform well on smaller UI elements, which led to a lower recall score.
Output generated by Mask RCNN Model (Run 1) on one of the images belonging to the test split of the dataset.
mAP value: 57.34%
Overall Precision: 94.04%
Overall Recall: 41.7%
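The Overall Precision and Overall Recall figures come from matching predicted boxes to ground-truth boxes. A minimal sketch of IoU-based greedy matching, assuming a standard 0.5 IoU threshold (the exact matching rule of the ImageCLEF evaluation is not stated in the slides):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thresh=0.5):
    # Greedily match each prediction to the first unmatched ground-truth
    # box with IoU above the threshold; count matches as true positives.
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Missing a small ground-truth element entirely costs recall but not precision, which matches the baseline's high-precision, low-recall profile above.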
Our vanilla model had high precision, but missed the smaller elements
We devised a novel technique to solve the issue of small UI elements going undetected.
STEP 1
Pass the image through the inference model and get the bounding box predictions.
STEP 2
Fill all the predicted bounding boxes with the background colour (white).
[Figure legend: Correctly Detected / Missed Out]
STEP 3
Pass the edited image through the inference model again to “force” the model to predict the missed elements.
Detected the missed elements!
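The three steps can be sketched as follows; `detect_fn` is a hypothetical stand-in for the Mask R-CNN inference call (the real model also returns class labels and confidence scores).

```python
import numpy as np

def fill_boxes(image, boxes, bg=255):
    """STEP 2: paint every predicted box with the background colour."""
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = bg
    return out

def multi_pass_inference(image, detect_fn, passes=2, bg=255):
    """STEPS 1-3: run inference, white out the detections, run again on
    the edited image, and append the new predictions.
    detect_fn(image) -> list of (x1, y1, x2, y2) boxes."""
    all_boxes, current = [], image
    for _ in range(passes):
        boxes = detect_fn(current)
        if not boxes:
            break
        all_boxes.extend(boxes)
        current = fill_boxes(current, boxes, bg)
    return all_boxes
```

With a detector that only ever reports its single strongest element per pass, a second pass recovers an element the first pass left behind.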
Improving our model’s detection capability with our novel “Multi-Pass Inference Technique”
STEP 1: Output generated after passing the input image through the model once (1st pass).
STEP 2: Output generated after the 2nd pass through the model; smaller UI elements get detected.
STEP 3: Output generated after appending the results from both passes.
Success!
… and then improved on this “Multi-Pass Inference Technique” for better performance
Only the bounding boxes with the highest confidence scores from the second pass were added to the final results. This was done to ensure that stray elements detected after the white-space replacement in the first step are not added.
Intermediate output generated by the model on one of the test-split images after the first pass.
Final output generated by the model on the same image after the second pass. Most of the elements missed in the first pass are captured in the second pass.
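The improved merging rule can be sketched as follows; the `(box, score)` representation and the 0.9 threshold are assumptions for illustration, as the slides do not report the cut-off used.

```python
def merge_passes(first_pass, second_pass, conf_thresh=0.9):
    """Append only high-confidence second-pass detections to the
    first-pass results, so stray boxes hallucinated on the whited-out
    image are dropped. Detections are (box, score) pairs."""
    kept = [det for det in second_pass if det[1] >= conf_thresh]
    return list(first_pass) + kept
```

All first-pass detections survive unchanged; only the second pass is filtered, which trades a little recall (Run 3's 49.6 vs 50.1) for cleaner final predictions and a higher mAP.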
Performance
Future Scope
There is scope to expand the viability of the Multi-Pass Inference technique and to study the effect of the number of passes on performance.
Explore attention models to enhance the explainability of predictions.
Better-performing base models like EfficientDet can be explored to improve the metrics.
| Run | Description | mAP | OP | OR |
| --- | --- | --- | --- | --- |
| Run 1 | Baseline MRCNN | 57.3 | 94.0 | 41.7 |
| Run 2 | Multi-Pass Inference | 63.7 | 91.8 | 50.1 |
| Run 3 | Improved Run 2 | 64.1 | 91.7 | 49.6 |
Questions?
Appendix
Using the “MPI technique” improved the performance of our model
Future Work
| Run | Description | mAP | OP | OR |
| --- | --- | --- | --- | --- |
| Run 1 | Baseline MRCNN | 57.3 | 94.0 | 41.7 |
| Run 2 | Multi-Pass Inference | 63.7 | 91.8 | 50.1 |
| Run 3 | Improved Run 2 | 64.1 | 91.7 | 49.6 |
Improved on the vanilla model by implementing a novel idea ...
Output generated after passing the input image through the model once (1st pass).
This technique involves getting predictions on the input image and then filling the corresponding bounding box regions with the background colour.
The edited image is then passed through the model again to essentially ‘force’ the model to make predictions on the missed elements.
Output generated after the 2nd pass through the model; smaller UI elements get detected.
New predictions from the 2nd pass are appended to the earlier predictions of the 1st pass to get the final results for the image.
Output generated after appending the results from both passes.