
Building Footprint Extraction Using Convolutional Neural Networks

ABHINANDHAN VELAGAPUDI

LUDDY SCHOOL OF INFORMATICS AND COMPUTING


Introduction

SECTION 1


A building footprint is the outline of a building traced along its exterior walls, forming a polygon that represents the total area the building occupies.

Building footprints are typically generated by manually digitizing high-resolution satellite imagery, and they support a wide range of applications:

  1. Urban Planning and Development
  2. Disaster Response and Management
  3. Environmental Monitoring
  4. Infrastructure Planning
  5. Mapping (e.g., Google Maps)


Problem Statement

Manual extraction of building footprints is time-consuming, labor-intensive, and prone to human errors, especially in large-scale urban areas.

Traditional automated methods, such as thresholding and edge detection, often struggle with complex urban landscapes, occlusions, and varying lighting conditions, leading to suboptimal results.

Need for a Deep Learning Solution

Deep learning techniques, particularly convolutional neural networks (CNNs), have shown remarkable success in image segmentation tasks, including building footprint extraction. By leveraging large-scale datasets and learning complex spatial patterns, deep learning models can significantly improve the accuracy and efficiency of building footprint extraction from satellite imagery.


Aim and Objectives

Aim:

The aim of the project is to develop and evaluate a deep learning model for building footprint extraction from aerial imagery, focusing on the Massachusetts Buildings Dataset.

Objectives:

1. Model Development:

- Develop a deep learning model tailored for building footprint extraction, leveraging architectures such as U-Net or its variants.

2. Model Training and Optimization:

- Train the developed model using the training set of aerial images and corresponding building footprints.

- Optimize model hyperparameters, such as learning rate and batch size, to maximize performance (see the training sketch after this list).

3. Evaluation Metrics:

- Evaluate the performance of the trained model using standard metrics like Intersection over Union (IoU), accuracy, precision, recall, and F1-score.

4. Comparison with Other Models:

- Implement and compare the developed model with existing architectures such as U-Net++, DeepLabV3, and Feature Pyramid Networks (FPN).

- Evaluate the performance of each model variant using the same evaluation metrics and dataset splits.
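To make objective 2 concrete, here is a minimal training-loop sketch in PyTorch. It is illustrative only: the optimizer, loss function, learning rate, and batch size shown are assumed placeholders rather than the settings actually used in this project.

# Illustrative training loop; hyperparameter values are placeholders.
import torch
from torch.utils.data import DataLoader

def train(model, train_ds, val_ds, epochs=50, lr=1e-4, batch_size=4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()  # a Dice loss is a common alternative
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_dl = DataLoader(val_ds, batch_size=batch_size)
    for epoch in range(epochs):
        model.train()
        for images, masks in train_dl:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), masks)  # logits (N, C, H, W) vs masks (N, H, W)
            loss.backward()
            optimizer.step()
        # Validation loss guides tuning of learning rate and batch size.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                           for x, y in val_dl) / len(val_dl)
        print(f"epoch {epoch + 1}: validation loss {val_loss:.4f}")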


Dataset Overview

SECTION 2


Data

Number of Images: 151 aerial images of the Boston area

Image Size: Each image is 1500 × 1500 pixels

Coverage: 2.25 square kilometers per image, totaling approximately 340 square kilometers

Data Split:

- Training Set: 137 images

- Test Set: 10 images

- Validation Set: 4 images

Each mask is an 8-bit image:

- 0 – no building present

- 255 – building present
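For illustration, here is a minimal sketch of loading one image/mask pair and binarizing the 8-bit mask, assuming Pillow and NumPy; the file paths are hypothetical.

# Load one aerial tile and its mask, then binarize the 8-bit mask.
# File names are hypothetical; adjust to the actual dataset layout.
import numpy as np
from PIL import Image

image = np.array(Image.open("train/images/tile_0001.tiff"))  # hypothetical path
mask = np.array(Image.open("train/masks/tile_0001.tif"))     # hypothetical path

print(image.shape)                             # expected (1500, 1500, 3)
binary_mask = (mask == 255).astype(np.uint8)   # 1 = building, 0 = background
print(binary_mask.mean())                      # fraction of building pixels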


Model Architecture

SECTION 3


What is U-Net?

  • U-Net is a convolutional neural network (CNN) architecture originally designed for biomedical image segmentation tasks.
  • Developed by Olaf Ronneberger et al. in 2015, U-Net has become one of the most widely used architectures for image segmentation due to its effectiveness and efficiency.

Architecture Overview:

Encoder:

  • Extracts features from the input image.
  • Consists of convolutional and pooling layers to reduce spatial dimensions and increase feature channels.
  • Each convolutional block typically includes two convolutional layers followed by non-linear activation functions.
  • Pooling layers downsample feature maps to capture larger contextual information.


Decoder:

  • Generates the segmentation mask from features extracted by the encoder.
  • Consists of upsampling and convolutional layers to increase spatial dimensions and reduce feature channels.
  • Each upsampling block typically includes an upsampling operation followed by a convolutional layer.
  • Skip connections established between encoder and decoder layers enable the flow of high-resolution features.

Skip Connections:

  • Facilitate direct propagation of information from early encoder layers to corresponding decoder layers.
  • Mitigate information loss during downsampling and aid in precise localization of objects in the segmentation mask.


Applications:

  • Medical Imaging:
    • Organ segmentation, tumor detection, and lesion segmentation.
    • Effective with small datasets, yielding high-quality segmentation.
  • Satellite Image Analysis:
    • Land cover classification, building footprint extraction, and road segmentation.
    • Ideal for remote sensing, capturing intricate spatial patterns.
  • Biomedical Image Analysis:
    • Cell segmentation, nuclei detection, and morphological analysis.
    • Robust to image variations and noise, suitable for microscopy.
  • Industrial Inspection:
    • Defect detection, surface inspection, and quality control.
    • Detects subtle defects, enhancing product quality and reducing costs.


Deployed U-Net Model

1. Down-sampling Path

Consists of four DownBlocks; each DownBlock:

  • Applies two convolutional layers with ReLU activation and batch normalization.
  • Is followed by max-pooling for downsampling.
  • Purpose: extract high-level features while reducing spatial dimensions.

2. Bottleneck

  • A DoubleConv block with two consecutive convolutional layers.
  • No downsampling.
  • Goal: capture rich context information.


3. Up-sampling Path

Comprises four UpBlocks; each UpBlock:

  • Performs up-sampling using either transposed convolution or bilinear interpolation.
  • Concatenates feature maps from the corresponding DownBlock.
  • Applies two convolutional layers with ReLU activation and batch normalization.
  • Objective: recover spatial information lost during down-sampling.

4. Final Convolution

  • A single convolutional layer with kernel size 1.
  • Outputs the segmentation mask with the desired number of classes.

Total Layers

  • The U-Net architecture includes 21 convolutional layers (excluding batch normalization layers), 4 max-pooling layers, and 1 up-sampling layer.

Conclusion

  • U-Net's design facilitates accurate semantic segmentation by effectively capturing both local and global context information.
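The sketch below illustrates this structure in PyTorch. It is a minimal, hedged reconstruction from the description above: the block names (DoubleConv, DownBlock, UpBlock), channel widths, and the choice of bilinear upsampling are assumptions, not the deployed model's exact code.

# Minimal PyTorch sketch of the blocks described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class DownBlock(nn.Module):
    """Max-pool downsampling followed by a DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool_conv = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))
    def forward(self, x):
        return self.pool_conv(x)

class UpBlock(nn.Module):
    """Bilinear upsampling, skip concatenation, then DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.conv = DoubleConv(in_ch, out_ch)
    def forward(self, x, skip):
        x = self.up(x)
        # Pad in case the skip feature map is slightly larger (odd input sizes).
        dy, dx = skip.size(2) - x.size(2), skip.size(3) - x.size(3)
        x = F.pad(x, [dx // 2, dx - dx // 2, dy // 2, dy - dy // 2])
        return self.conv(torch.cat([skip, x], dim=1))

class UNet(nn.Module):
    def __init__(self, n_channels=3, n_classes=2):
        super().__init__()
        self.inc = DoubleConv(n_channels, 64)
        self.down1 = DownBlock(64, 128)
        self.down2 = DownBlock(128, 256)
        self.down3 = DownBlock(256, 512)
        self.down4 = DownBlock(512, 512)                     # bottleneck
        self.up1 = UpBlock(1024, 256)
        self.up2 = UpBlock(512, 128)
        self.up3 = UpBlock(256, 64)
        self.up4 = UpBlock(128, 64)
        self.outc = nn.Conv2d(64, n_classes, kernel_size=1)  # final 1x1 conv
    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)   # skip connections carry high-resolution features
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        return self.outc(x)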


Why U-Net for Semantic Segmentation?

Semantic segmentation: Precisely label each pixel in an image with its corresponding class.

  • Architectural Advantages
  • Handling Class Imbalance
  • Effective Feature Fusion
  • Robustness to Limited Data
  • Flexibility and Adaptability
  • Real-Time Inference

Conclusion

  • U-Net emerges as the ideal choice for our semantic segmentation project, offering a balance of accuracy, efficiency, and adaptability.
  • Architectural advantages and practical benefits make U-Net a compelling solution for our specific requirements.


Model Execution

SECTION 4


Dataset Sample Visualization

  • Original Image: Presenting the raw image data from the dataset.
  • Ground Truth Mask: Demonstrating the corresponding segmentation mask for the image.
  • One-Hot Encoded Mask: Displaying the one-hot encoded representation of the ground truth mask.

This slide shows a sample image from the dataset along with its ground-truth segmentation mask and one-hot encoded representation. It provides a comprehensive view of the input data used for training the U-Net model, aiding in understanding the segmentation task and evaluating the model's performance.
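A one-hot encoded mask simply adds one channel per class. A minimal NumPy sketch, assuming a binary integer mask:

# One-hot encode a binary (0/1) mask into (H, W, 2) channels:
# channel 0 = background, channel 1 = building.
import numpy as np

def one_hot(mask, num_classes=2):
    return np.eye(num_classes, dtype=np.float32)[mask]

mask = np.array([[0, 1],
                 [1, 0]])
print(one_hot(mask).shape)  # (2, 2, 2)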


Interpretation of Evaluation Metrics

Dice Loss: 0.1431

  • Measures dissimilarity between predicted and ground truth masks.
  • Lower values indicate better performance.

IoU Score: 0.8134

  • Measures overlap between predicted and ground truth masks.
  • Higher scores signify better alignment.

Precision: 0.8612

  • Proportion of true positive predictions out of all positive predictions.
  • High precision minimizes false positives.

Recall: 0.9344

  • Proportion of true positives out of all actual positives.
  • High recall captures most positive instances.

Accuracy: 0.8966

  • Overall correctness of model's predictions.
  • Ratio of correct predictions to total predictions.

F1 Score: 0.5826

  • Harmonic mean of precision and recall.
  • Higher scores indicate better overall performance.

Interpretation

  • Metrics provide a comprehensive assessment.
  • Low Dice Loss suggests good segmentation accuracy.
  • High IoU, precision, recall, and accuracy scores affirm the model's effectiveness, while the comparatively moderate F1 score leaves room for improvement.
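For reference, these metrics can be computed from binary predicted and ground-truth masks as sketched below; the formulas are standard, while the implementation details (per-pixel aggregation) are an assumption about how the reported numbers were obtained.

# Compute the reported metrics from binary predicted/ground-truth masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    iou = tp / (tp + fp + fn + eps)                      # Intersection over Union
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    dice_loss = 1 - (2 * tp) / (2 * tp + fp + fn + eps)  # 1 - Dice coefficient
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"iou": iou, "precision": precision, "recall": recall,
            "f1": f1, "dice_loss": dice_loss, "accuracy": accuracy}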


Graphical Interpretation of Evaluation Metrics


Prediction results on the test set

  • Predicted Masks: presenting the segmentation masks generated by the U-Net model on the test set.
  • Comparison with Ground Truth: overlaying predicted masks on ground-truth masks for evaluation.
  • Building Count: showing the total number of buildings detected in each predicted mask.

Model – U-Net
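Counting detected buildings can be done by labeling connected components in the predicted binary mask; the SciPy-based sketch below is one plausible implementation, not necessarily the one used here.

# Count buildings as connected components of the predicted binary mask.
import numpy as np
from scipy import ndimage

pred_mask = np.zeros((8, 8), dtype=np.uint8)  # stand-in for a real prediction
pred_mask[1:3, 1:3] = 1                       # first "building"
pred_mask[5:7, 5:7] = 1                       # second "building"

labeled, num_buildings = ndimage.label(pred_mask)
print(f"buildings detected: {num_buildings}")  # -> 2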


A few more predicted segmentation masks generated by U-Net


Predicted segmentation masks generated by DeepLabV3 and U-Net++


Comparison & Results

SECTION 5


Models used for comparison:

  • U-Net
  • U-Net++
  • DeepLabV3
  • FPN

U-Net++:

U-Net++ is an extension of the original U-Net architecture, featuring dense skip connections and nested U-Net blocks. It aims to enhance feature propagation and capture more context information, leading to improved segmentation performance.

DeepLabV3:

DeepLabV3 is a deep learning model designed for semantic segmentation tasks. It employs atrous convolution to effectively capture multi-scale information, allowing it to achieve high-resolution segmentation results.

FPN (Feature Pyramid Network):

FPN is a convolutional neural network architecture designed to extract features at multiple scales. It uses a top-down architecture with lateral connections to build a feature pyramid from a single input image, enabling effective object detection and segmentation across different scales.
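If the compared models were built with the segmentation_models_pytorch library (an assumption; the slides do not state the implementation), they could be instantiated as follows. The ResNet-34 encoder and other settings are placeholders.

# Instantiate the four compared architectures with a shared configuration.
import segmentation_models_pytorch as smp

common = dict(encoder_name="resnet34", encoder_weights="imagenet",
              in_channels=3, classes=2)
models = {
    "U-Net": smp.Unet(**common),
    "U-Net++": smp.UnetPlusPlus(**common),
    "DeepLabV3": smp.DeepLabV3(**common),
    "FPN": smp.FPN(**common),
}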


Model Comparison

Model     | F1 Score | IoU Score | Dice Loss | Precision | Accuracy | Recall
U-Net     | 0.5826   | 0.8134    | 0.1431    | 0.8612    | 0.8966   | 0.9344
U-Net++   | 0.5825   | 0.8233    | 0.1494    | 0.8800    | 0.9045   | 0.9260
DeepLabV3 | 0.5830   | 0.7734    | 0.1550    | 0.8331    | 0.8710   | 0.9133
FPN       | 0.5826   | 0.6669    | 0.2060    | 0.7762    | 0.7994   | 0.8139

The evaluation results on the test data reveal the performance of different segmentation models. Across the models tested, U-Net++ achieved the highest mean IoU score of 0.8233, indicating better pixel-level accuracy in segmentation. However, U-Net had the highest mean recall of 0.9344, suggesting its effectiveness in capturing true positive instances.


Conclusion:

  • U-Net achieved a Mean IoU Score of 0.8134 and a Mean F1 Score of 0.5826 on the test data, demonstrating its effectiveness in building segmentation tasks.
  • U-Net++ and DeepLabV3 also performed well, with Mean IoU Scores of 0.8233 and 0.7734, respectively, showcasing their robustness in capturing intricate details in segmentation.
  • FPN achieved a Mean IoU Score of 0.6669, indicating its capability in extracting features at different scales, although it exhibited noticeably lower performance than the other models.
  • In conclusion, U-Net, U-Net++, and DeepLabV3 are effective architectures for building segmentation tasks, with U-Net showing promising results in this study. Further experimentation and fine-tuning may be beneficial to explore the full potential of these models in various segmentation scenarios.


Scope For Further Research

1. Utilization of PAN Images:

  • Investigate the use of panchromatic (PAN) images for building footprint extraction, leveraging their higher spatial resolution to enhance the delineation of building boundaries and details.

2. PAN Sharpening Techniques:

  • Explore advanced PAN sharpening techniques to fuse PAN images with RGB imagery, aiming to improve the spatial resolution and spectral fidelity of the input data for building footprint extraction models.

3. Hybrid Models:

  • Develop hybrid deep learning models that incorporate both PAN and RGB information, leveraging the complementary strengths of each modality to enhance building footprint extraction accuracy and robustness.
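As a concrete example of item 2, here is a minimal Brovey-transform pan-sharpening sketch in NumPy; the Brovey transform is one simple PAN/RGB fusion technique, and the array conventions below are assumptions.

# Brovey-transform pan-sharpening: rescale each RGB band by the ratio of the
# high-resolution PAN band to the RGB intensity.
import numpy as np

def brovey_pansharpen(rgb, pan, eps=1e-6):
    # rgb: (H, W, 3) float array resampled to the PAN grid; pan: (H, W) float array
    intensity = rgb.mean(axis=2, keepdims=True)
    return rgb * (pan[..., None] / (intensity + eps))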


References

  • Chen, F., Wang, N., Yu, B., & Wang, L. (2022). Res2-Unet, a new deep architecture for building detection from high spatial resolution images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15, 1494–1501. https://doi.org/10.1109/jstars.2022.3146430
  • Tippayamontri, K., & Khunlertgit, N. (2022). Comparison of deep learning-based semantic segmentation models for unmanned aerial vehicle images. 2022 37th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), Phuket, Thailand, 415–418. https://doi.org/10.1109/ITC-CSCC55581.2022.9895074


Thank You