
UI element detection from wireframe drawings of websites

Prasang Gupta and Vishakha Bansal

PricewaterhouseCoopers US Advisory

Mumbai, India

PwC


Team Members


Prasang Gupta

Associate in US Emerging Technologies

prasang.gupta@pwc.com

Vishakha Bansal

Senior Associate in US Emerging Technologies

vishakha.bansal@pwc.com



Agenda


Problem Statement

Our Solutions and Methodology

Performance and Future Scope


ImageCLEF 2021 involved detecting a set of atomic user interface (UI) elements in hand-drawn images of websites.


A sample output file with the bounding boxes for different classes and the confidence scores.



The dataset contained commonly occurring elements like paragraph and button in much larger proportions than rare elements like table and list ...


The development set contained 3,218 labelled images (21 classes); the test set contained 1,073 unlabelled images. The development set was split into a 95% train set and a 5% validation set.

The class distribution was heavily imbalanced. Paragraph, button, header and container are the most commonly occurring classes, whereas table, stepperinput, list and video are the rarest classes in the given set of images.


… and the dataset also contained images of various sizes and contrasts


The images came in various sizes and were later resized to 512 x 512 for training and inference. As the images are snapshots of drawings on paper or whiteboards, they showed widely varying contrast.

Distribution of width and height of images in development set

Images in the development set with different contrasts


We applied data pre-processing to improve contrast ...


We used Contrast Limited Adaptive Histogram Equalisation (CLAHE), which limits contrast amplification before applying histogram equalisation. This stretches the pixel values from a very confined range to a much wider distribution, resulting in clearer images.

Original image → Contrast Limited Adaptive Histogram Equalisation → output image


… and converted the images to black and white to optimise training


To ensure uniformity across the training images, we converted all of them to black and white. The algorithms used for the conversion are:

Grayscale image → output after adaptive Gaussian algorithm → final noise-reduced image

Color to Grayscale :

  • Simple Grayscale
  • Refined Sharpened Grayscale

Grayscale to Black & White :

  • Binary thresholding
  • Erosion + Otsu’s algorithm
  • Adaptive Gaussian algorithm

Removing Noise :

  • Custom C++ algorithm that removes small connected components for noise reduction


We selected the YOLOv5 class of models ...

| Model   | mAP@0.5 (val) | Speed on V100 (ms) | Parameters (million) |
| ------- | ------------- | ------------------ | -------------------- |
| YOLOv5s | 61.9          | 4.3                | 12.7                 |
| YOLOv5m | 68.7          | 8.4                | 35.9                 |
| YOLOv5l | 71.1          | 12.3               | 77.2                 |
| YOLOv5x | 72.0          | 22.4               | 141.8                |

Comparison of different YOLOv5 variants trained on the COCO dataset

For modelling, we started by exploring Mask R-CNN, U-Net and YOLOv5.

We discarded U-Net because its very large number of parameters would make it too costly to train. We also discarded Mask R-CNN: we had already explored it and found that it struggled to detect smaller UI elements.

Hence, for this iteration, we selected the YOLOv5 class of models. We established a baseline with YOLOv5s (the small, lightest variant) and obtained decent mAP and recall scores.

Baseline score (YOLOv5s, the small and lightest variant): mAP 0.649, overall recall 0.675


… and tried them all to find the best-performing configuration

| Model                                                      | mAP   | Recall | mAP improvement over baseline |
| ---------------------------------------------------------- | ----- | ------ | ----------------------------- |
| YOLOv5s (baseline)                                         | 0.649 | 0.675  | -                             |
| YOLOv5l (with pre-trained weights)                         | 0.810 | 0.826  | 24.8%                         |
| YOLOv5x (with pre-trained weights, LR, early stopping)     | 0.820 | 0.840  | 26.3%                         |
| YOLOv5x (with pre-trained weights, frozen layers, head trained) | 0.701 | 0.731 | 8%                        |

We started with YOLOv5s (small variant, lightest) and worked our way up to YOLOv5x (xlarge variant, heaviest) to boost performance.

We also tried both training all layers and training just the detection head.

Training just the head is much faster, but it results in significantly reduced performance.

Unsurprisingly, the YOLOv5x model gave the best results in our tests.
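The "frozen layers, head trained" configuration relies on the standard PyTorch mechanism of disabling gradients for part of the network. A minimal sketch on a toy two-part model, not the actual YOLOv5 architecture:

```python
from torch import nn

# Toy two-part model standing in for a detector: a "backbone" whose
# weights we freeze and a "head" that stays trainable.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3), nn.Conv2d(16, 32, 3)),  # backbone
    nn.Conv2d(32, 21, 1),                                       # head
)

# Freeze the backbone: the optimizer will skip these parameters.
for p in model[0].parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Only the head's parameters remain trainable, which is why this mode trains much faster but can underfit compared with full fine-tuning.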


Performance and Future Scope


  • Experimenting with an ensemble of modelling approaches for two types of wireframes: one with compactly placed UI elements and one with sparsely placed UI elements
  • Exploring other object detection models that favour performance over inference speed
  • Visually inspecting predictions to explain model performance on the test dataset, since the model performed close to perfectly (nearly 95% precision and recall) on the validation set

| Run    | Description                           | mAP   | OR    | F1    |
| ------ | ------------------------------------- | ----- | ----- | ----- |
| Run 1  | YOLOv5s (baseline)                    | 0.649 | 0.675 | 0.855 |
| Run 2  | YOLOv5s (pre-trained weights)         | 0.649 | 0.675 | 0.881 |
| Run 3  | YOLOv5l (pre-trained weights)         | 0.810 | 0.826 | 0.954 |
| Run 4  | YOLOv5x (pre-trained weights, LR)     | 0.820 | 0.840 | 0.952 |
| Run 5  | YOLOv5x (pre-trained weights, frozen) | 0.701 | 0.731 | 0.894 |
| Run 6  | Run 4 with 0.2 confidence cutoff      | 0.824 | 0.844 | -     |
| Run 7  | Run 4 with 0.15 confidence cutoff     | 0.824 | 0.844 | -     |
| Run 8  | Run 4 with 0.1 confidence cutoff      | 0.829 | 0.852 | -     |
| Run 9  | Run 4 with 0.05 confidence cutoff     | 0.832 | 0.858 | -     |
| Run 10 | Run 4 with 0.01 confidence cutoff     | 0.836 | 0.865 | -     |

Results on the validation dataset (OR = overall recall)


Thank You

© 2021 PwC. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.

pwc.com


Questions?



Appendix



Multi-Pass Inference Technique


STEP 1: Output generated after passing the input image through the model once (first pass).

STEP 2: Output generated after a second pass through the model; smaller UI elements get detected.

STEP 3: Output generated after appending the results from both passes. Success!
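The slides do not spell out how the two passes are combined. Assuming Step 3 appends second-pass detections that do not duplicate a first-pass box, the merge step could be sketched as:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_passes(first_pass, second_pass, iou_thresh=0.5):
    """Append second-pass detections that don't duplicate a first-pass box.

    Each detection is (class_name, confidence, (x1, y1, x2, y2)).
    """
    merged = list(first_pass)
    for det in second_pass:
        if all(iou(det[2], kept[2]) < iou_thresh for kept in merged):
            merged.append(det)
    return merged

p1 = [("button", 0.9, (10, 10, 60, 30))]
p2 = [("button", 0.8, (12, 11, 58, 29)),    # near-duplicate of pass-1 box
      ("checkbox", 0.6, (70, 70, 80, 80))]  # new small element
result = merge_passes(p1, p2)
```

The duplicate button is dropped because it overlaps the first-pass box above the IoU threshold, while the newly detected small element is appended.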


We improved the score further by using post-processing

| Confidence cutoff (%) | Total dataset labels | % increase in labels | Test mAP |
| --------------------- | -------------------- | -------------------- | -------- |
| 25                    | 101251               | 0.0                  | 0.820    |
| 20                    | 109848               | 8.5                  | 0.824    |
| 15                    | 119881               | 9.1                  | -        |
| 10                    | 130765               | 9.0                  | 0.829    |
| 5                     | 142963               | 9.3                  | 0.832    |
| 1                     | 165013               | 15.4                 | 0.836    |

We employed two post-processing techniques:

  • Multi-Pass Inference
  • Confidence cutoff variation

The multi-pass inference technique recovers missed elements. However, since recall was already high, it did not increase the score much.

Lowering the confidence cutoff increased the score, as the model was allowed to output labels in which it had lower confidence.
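The confidence-cutoff variation amounts to a simple filter over the model's detections. A minimal sketch with illustrative detections (the class names and scores are made up for demonstration):

```python
def apply_cutoff(detections, cutoff):
    """Keep detections whose confidence meets the cutoff.

    Each detection is (class_name, confidence, (x1, y1, x2, y2)).
    Lowering the cutoff keeps more low-confidence labels, which is what
    raised the test mAP from 0.820 (cutoff 0.25) to 0.836 (cutoff 0.01).
    """
    return [d for d in detections if d[1] >= cutoff]

dets = [("paragraph", 0.92, (0, 0, 100, 40)),
        ("table", 0.07, (0, 50, 100, 90)),   # rare class, low confidence
        ("button", 0.18, (10, 95, 40, 110))]

kept_default = apply_cutoff(dets, 0.25)  # only the confident paragraph
kept_low = apply_cutoff(dets, 0.05)      # all three detections survive
```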
