
Lit Review: nnU-Net

For Biomedical Image Segmentation

Selina Liu


Presentation Outline: U-Net → nnU-Net → application

  • ML image segmentation pipeline: image input → image output
  • nnU-Net: designs a U-Net for us
  • Application: lung tumor segmentation


Image Segmentation: extract important info from an image

  • The process of dividing an image into meaningful regions/objects
  • Idea: assign each pixel (or group of pixels) a specific label/class
  • Relevant application: analyze and understand the various structures and features of biomedical images
    • E.g., identify and analyze tumors in CT scans
  • Assists medical research, diagnosis, and treatment planning
  • Many image segmentation methods: manual annotation, thresholding, … (a minimal thresholding sketch follows below)
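As a minimal sketch of the simplest method on that list, thresholding labels each pixel by intensity (NumPy only; the threshold value 0.5 is an arbitrary assumption, not from the slides):

```python
import numpy as np

def threshold_segment(image: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Label each pixel 1 (foreground) or 0 (background) by intensity."""
    return (image > thresh).astype(np.uint8)

# Example: a synthetic 4x4 "image" with a bright 2x2 blob
img = np.zeros((4, 4))
img[1:3, 1:3] = 0.9
mask = threshold_segment(img)  # 1s mark the blob, 0s the background
```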


Biomedical image segmentation pipeline: ML approach

[Pipeline diagram: four stages with their supporting inputs.]

  • Model Architecture: propose a new architecture (CNN, random forest, U-Net), informed by research papers and international competitions
  • Pre-processing: dataset properties, dataset challenges, data splitting
  • Model training: initialization, loss function, monitoring and validation, hyperparameter tuning
  • Post-processing: performance metric, testing, additional steps


Our choice of machine learning architecture: U-Net

Anand, V.; Gupta, S.; Koundal, D.; Nayak, S.R.; Barsocchi, P.; Bhoi, A.K. Modified U-NET Architecture for Segmentation of Skin Lesion. Sensors 2022, 22, 867. https://doi.org/10.3390/s22030867

  • Our approach: U-Net
    • Effective feature representation: captures both low-level and high-level features
    • Reduced need for training data: performs well even with limited data [our case]
    • Relatively flexible: can easily be extended to additional biomedical imaging datasets
    • Promising: has shown significant success in the literature
    • Available pretrained models: transfer learning is faster to train and often yields better results
    • Applies to both 2D and 3D data



U-Net topology: Encoder & Skip Connections & Decoder

Encoder

  • Contracting path
  • Gradually reduces the spatial resolution while increasing the number of captured image features
  • Extracts higher-level / abstract representations from the input image

Decoder

  • Expanding path
  • Gradually increases the spatial resolution, concatenating the skip connections
  • Recovers lower-level / detailed representations of the input image

Skip Connections

  • Directly link corresponding layers between the encoding and decoding paths
  • Enable the flow of information at different scales and preserve fine-grained details



Encoder: capture high-level features hierarchically

  • A series of convolutional layers and pooling layers
  • Convolutional layer:
    • Applies multiple convolutional filters to extract abstract features from the input data
    • Each filter learns to extract a different set of features from the input image
    • Outputs a stack of feature maps
  • Pooling layer:
    • Downsamples the feature maps and captures larger-scale information via max pooling
    • Max pooling retains the most prominent feature while discarding less relevant information (see the sketch below)
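A minimal PyTorch sketch of one encoder stage as just described; the channel counts and two-convolutions-per-stage layout are common conventions, not prescriptions from the slides:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions (feature extraction) followed by 2x2 max pooling (downsampling)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)  # halves spatial resolution, keeps strongest activations

    def forward(self, x):
        features = self.convs(x)  # stack of feature maps (one per filter)
        return self.pool(features), features  # pooled output + pre-pool maps for the skip connection

# One stage: 1-channel input -> 64 feature maps, resolution 128 -> 64
down, skip = EncoderBlock(1, 64)(torch.randn(1, 1, 128, 128))
```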


Skip Connections: enable multi-level information flow

  • Preserve spatial details and enable information flow at multiple levels
  • Skip connections are created by concatenating/merging the feature maps from the encoding path with the corresponding layers in the decoding path (see the example below)
  • These connections allow the decoder to access both local and global information, aiding accurate segmentation
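In PyTorch terms, the merge is a single torch.cat along the channel dimension (the shapes below are illustrative assumptions):

```python
import torch

decoder_maps = torch.randn(1, 64, 64, 64)  # upsampled feature maps in the decoder
encoder_maps = torch.randn(1, 64, 64, 64)  # matching-resolution maps saved from the encoder

# Skip connection: merge encoder detail with decoder context along the channel axis
merged = torch.cat([decoder_maps, encoder_maps], dim=1)  # -> shape (1, 128, 64, 64)
```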



Decoder: reconstruct lower-level features hierarchically

  • A series of upsampling and convolutional layers
  • Upsampling
    • via transposed convolutions (deconvolutions)
  • The upsampled feature maps are concatenated with the corresponding skip connection, which enables the decoder to recover spatial details
  • A series of convolutional filters is then applied to the concatenated feature maps to extract and learn relevant features and further refine the segmentation (see the sketch below)
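A minimal PyTorch sketch of one decoder stage matching the description above; channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Transposed convolution (upsampling) -> concatenate skip -> two 3x3 convolutions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # doubles resolution
        self.convs = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # recover spatial resolution
        x = torch.cat([x, skip], dim=1)  # skip connection restores fine detail
        return self.convs(x)             # refine the merged features

# 32x32 decoder maps + 64x64 encoder skip -> refined 64x64 feature maps
out = DecoderBlock(128, 64)(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64))
```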


Dataset Properties: variability & class imbalance & limited data

  • Large variability
    • Due to variations in imaging modalities, patient populations, disease states, and imaging protocols
    • Poses challenges for designing robust models that generalize well across these variations
  • Class imbalance
    • Regions of interest may be significantly underrepresented compared to others
    • Dealing with class imbalance is crucial so that the model learns to segment all classes effectively and avoids biased predictions
  • Limited annotated data
    • Annotated data is often scarce, which necessitates techniques like data augmentation and/or transfer learning to mitigate the limitations of scarce annotations (a minimal augmentation sketch follows below)
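A minimal NumPy sketch of two simple augmentations (flipping and 90° rotation); note that for segmentation the identical transform must be applied to the image and its label mask. Elastic deformation needs more machinery and is omitted:

```python
import numpy as np

def augment(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    """Apply the same random flip/rotation to an image and its label mask."""
    if rng.random() < 0.5:  # random horizontal flip
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    k = int(rng.integers(0, 4))  # random multiple of 90 degrees
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
img_aug, mask_aug = augment(np.zeros((64, 64)), np.zeros((64, 64), np.uint8), rng)
```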


Model Training: learn from data & extract useful info

[Flow diagram: data pre-processing splits the data into train and test sets. A hyperparameter-initialized model (ML architecture: U-Net) learns from the train input/output pairs, yielding a trained model. The trained model maps the test input to a predicted output, which is compared against the test output for model evaluation.]
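A skeletal sketch of the train/evaluate cycle in the diagram, assuming PyTorch and that the model, loss, metric, and data loaders are defined elsewhere (all names here are placeholders, not from the slides):

```python
import torch

def train_one_epoch(model, loader, loss_fn, optimizer, device="cpu"):
    """One pass over the training set: predict, compare to labels, update weights."""
    model.train()
    for images, labels in loader:  # train input / train output pairs
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        preds = model(images)          # predicted segmentation
        loss = loss_fn(preds, labels)  # compare prediction with ground truth
        loss.backward()                # learn from the error
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader, metric_fn, device="cpu"):
    """Model evaluation: compare predicted output against held-out test output."""
    model.eval()
    scores = [metric_fn(model(x.to(device)), y.to(device)) for x, y in loader]
    return sum(scores) / len(scores)
```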


Model Training: hyper-parameters fine-tuning

[Same flow diagram as the previous slide, with a feedback loop: hyperparameters are fine-tuned and the train/evaluate cycle repeated until the best/final model is obtained.]


Multiple design choices are needed to obtain an optimal model

U-Net and its Variants

  • Attention U-Net
  • Residual U-Net
  • Dense U-Net
  • Ensemble U-Net
  • Adversarial U-Net
  • Inception U-Net
  • Recurrent U-Net
  • 2.5D U-Net

Other Design Choices

  • Data Augmentation
    • Rotation, flipping, scaling
    • Elastic deformation
  • Loss function
    • Dice loss
    • Cross-entropy loss
  • Qualification Metric
    • Dice coefficient
    • Hausdorff distance

(A minimal Dice sketch follows below.)
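A minimal sketch of the Dice coefficient and the corresponding Dice loss from the list above (binary case, with pred assumed to hold probabilities; the eps smoothing term is a common convention, not from the slides):

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|A ∩ B| / (|A| + |B|); 1.0 means perfect overlap."""
    pred, target = pred.flatten(), target.flatten()
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def dice_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Minimizing 1 - Dice maximizes overlap; comparatively robust to class imbalance."""
    return 1 - dice_coefficient(pred, target)
```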



Post-Processing: potential further improvement

  • Additional steps after the initial segmentation output of the U-Net model
  • [Potentially] refine and improve the segmentation results
  • Involves applying various techniques to address specific challenges and enhance the quality of the segmentation output:
    • Smoothing operations
      • Create more coherent segmentations
    • Conditional rules
      • Domain-specific rules [shape constraints]
    • Connected component analysis
      • Better connectivity of segmented objects (see the sketch below)
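A minimal sketch of connected component analysis using SciPy's ndimage; keeping only the largest connected foreground region is one common post-processing rule (the "keep largest" choice is an illustrative assumption, not from the slides):

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask: np.ndarray) -> np.ndarray:
    """Suppress spurious islands by keeping only the largest connected region."""
    labeled, n = ndimage.label(mask)  # assign an integer id to each connected blob
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))  # pixels per blob
    largest = 1 + int(np.argmax(sizes))
    return (labeled == largest).astype(mask.dtype)
```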



… Oops! The U-Net pipeline has major drawbacks

  • Cumbersome U-Net pipeline design
    • Most design choices are highly dependent on each other
      • Many U-Net variants
      • Number of neurons, number of convolutional layers, learning rate, dropout, loss function, qualification metric…
    • Difficult to follow the literature and ascertain design choices that generalize beyond the experiments that demonstrate them

  • Time-consuming
    • Current practice is expert-driven
    • Involves manual trial-and-error experiments
    • Specific to the task at hand


Systematic U-Net pipeline design? No New U-Net!

  • Motivation:
    • Researchers don't want to design a new U-Net pipeline every time they have a new segmentation task.
  • Thought:
    • Is there a higher-level model that designs a U-Net for me?

  • Solution:
    • nnU-Net (no new U-Net)
    • The method designs a U-Net pipeline for the specific dataset and segmentation task
    • Achieves state-of-the-art performance on several medical segmentation benchmarks

Isensee, F.; Jäger, P. F.; Kohl, S. A. A.; Petersen, J.; Maier-Hein, K. H. Automated Design of Deep Learning Methods for Biomedical Image Segmentation. arXiv 2020, arXiv:1904.08128. https://arxiv.org/abs/1904.08128


Pipeline comparison: expert-driven vs. nnU-Net


nnU-Net: an ML to design ML(s) to make predictions

A segmentation algorithm can be formalized as ŷ = f(x; θ), where:

  • f = segmentation algorithm (U-Net)
  • x = input image
  • ŷ = predicted segmentation
  • θ = set of hyper-parameters

nnU-Net formalizes the process of adjusting θ based on the dataset, θ = g(X, Y), where:

  • g = nnU-Net
  • X, Y = dataset properties
  • θ = the formalized optimal set of hyper-parameters


nnU-Net configures the segmentation pipeline using a 3-step recipe

  • Fixed parameters
    • Are not adapted
    • Certain architecture and training properties that can simply be used all the time
    • E.g., nnU-Net's loss function, (most of) the data augmentation strategy, and the learning rate
  • Rule-based parameters
    • Use the dataset fingerprint to adapt certain segmentation pipeline properties by following hard-coded heuristic rules (a toy sketch follows below)
    • E.g., the patch size, network topology, and batch size are optimized jointly given some GPU memory constraint
  • Empirical parameters
    • Are essentially learned by trial and error
    • E.g., the optimization of the post-processing strategy
  • Training time estimate: 18 hrs – 3 days (dependent on dataset)
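As a toy illustration only (this is NOT nnU-Net's actual heuristic, just a sketch of the rule-based idea): derive the patch size from the dataset fingerprint's median image shape, then shrink the batch size until an assumed memory budget is met. All names and numbers below are made up:

```python
def configure(median_shape, mem_budget_voxels: int = 2**18):
    """Toy heuristic: derive patch/batch size from a dataset fingerprint under a memory budget."""
    patch = [min(s, 256) for s in median_shape]  # cap the patch at the median image size
    batch = 8
    voxels = patch[0] * patch[1]
    while batch > 1 and batch * voxels > mem_budget_voxels:
        batch //= 2  # trade batch size for patch size under the fixed budget
    return {"patch_size": patch, "batch_size": batch}

print(configure([512, 512]))  # -> {'patch_size': [256, 256], 'batch_size': 4}
```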


nnU-Net output & its performance in competitions

Based on a given dataset, nnU-Net creates three U-Net configurations:

  • a 2D U-Net (for 2D and 3D datasets)
  • a 3D U-Net that operates at high image resolution (for 3D datasets only)
  • a 3D U-Net cascade, where a first 3D U-Net operates on low-resolution images and a second high-resolution 3D U-Net then refines the predictions of the former (for 3D datasets with large image sizes only)
  • Image segmentation time estimate: <60 s – 10 min (dependent on dataset)

nnU-Net outcompetes many specialized deep learning pipelines, and in the [KiTS challenge] semi-target setting (lung tumor segmentation) it achieved the best performance!


nnU-Net could be sub-optimal: possible further improvements!

nnU-Net can be suboptimal for some segmentation tasks:

  • It was developed with a focus on the Dice coefficient [2D scenarios] as the performance metric, which may not be optimal for other metrics [3D scenarios]
  • Unconsidered dataset properties could exist, which may cause suboptimal segmentation performance
  • Post-processing techniques specific to our dataset may not be included in nnU-Net

For highly domain-specific cases, nnU-Net should be seen as a good starting point for necessary modifications.

E.g., in this study, the proposed modifications to the default nnU-Net pipeline substantially improved the results, both in training-set cross-validation and on the official validation set.


Further Improvements & Next Steps

Further improvements we can make:

  • Train our own nnU-Net on extended benchmark datasets [lung-tumor-oriented] (candidate datasets listed below)
  • Faster reaction (inference) time:
    • 10 min → ~1 s [this study]

Next steps:

  • Pretrained model performance test
  • Train our nnU-Net [lung cancer oriented]
  • Validation on UCSF datasets

Future steps:

  • Extended segmentation tasks
  • User-friendly reaction time
  • 3D interactive feature

Candidate benchmark datasets:

  • Lung Nodule Analysis 2016: 880 patients, 2D
  • Kaggle Data Science Bowl: 1397 patients, 2D
  • The Lung Image Database Consortium dataset [LIDC]: 1024 patients, 2D


Index Page: annotated perspective papers & ML terms

If you are interested, please check out this doc that summarizes the related literature.

Each paper is highlighted at four levels:

  • Important content that is highly related to our project
  • Semi-related examples / explanations / supplements
  • ML terms that are annotated in more detail
  • Alternative methods / research themes that could be further explored

If you are interested, please also check out this doc that gives more detailed information on the machine-learning terms noted in the literature review.

  • The doc is organized along the standard pipeline of biomedical image segmentation
  • For each major action, the functionality of the action and the methods are noted


Summary: nnU-Net for biomedical segmentation task

ML image seg pipeline

  • Model architecture
    • U-Net
  • Pre-processing
    • Data augmentation
  • Model training
    • hyperparameter
  • Post-processing
    • Further improve

nnU-Net to design the U-Net (motivated by the cumbersome manual design)

  • 3-step recipe:
    • Fixed
    • Rule-based
    • Empirical
  • Default output:
    • 2D
    • 3D
    • 3D cascade
  • Good starting point

Application & next steps (since nnU-Net may be sub-optimal)

  • Pretrained model test
  • Train task-specific nnU-Net
  • Validate on UCSF dataset


THANK U

Selina Liu😊