1 of 17

XI INTERNATIONAL CONFERENCE

“INFORMATION TECHNOLOGY AND IMPLEMENTATION” (IT&I-2024)

Enhancing Object Detection and Classification in High-Resolution Images Using SAHI Algorithm and Modern Neural Networks

Oleksii Bychkov, Kateryna Merkulova, Yelyzaveta Zhabska and Andrii Yaroshenko

2 of 17

Fundamental Theoretical Aspects

The Slicing Aided Hyper Inference (SAHI) algorithm is a key component of the proposed approach for object detection and classification in high-resolution images. The main idea behind SAHI is to divide the large input image into smaller, overlapping patches, which can then be processed independently by the object detection models. This approach has several advantages over traditional methods that rely on resizing or cropping the image to fit the input size of the neural network.

The SAHI algorithm consists of the following steps:

1. Slicing:

  • The image is divided into smaller, overlapping patches.
  • Patch size and overlap are adjustable parameters.
  • Tested configurations:
    • Patch sizes: 256×256, 512×512, 1024×1024 pixels
    • Overlap ratios: 0.25 and 0.5
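The slicing step can be sketched in plain Python. This is a simplified illustration with a hypothetical `slice_grid` helper, not the SAHI library's actual implementation:

```python
# Simplified sketch of SAHI-style slicing: compute overlapping windows
# (patches) covering a width×height image.

def slice_grid(width, height, patch, overlap_ratio):
    """Return (x, y, w, h) windows of size `patch` covering the image."""
    stride = max(1, int(patch * (1 - overlap_ratio)))
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    # Append extra windows so the right and bottom borders are always covered.
    if xs[-1] + patch < width:
        xs.append(width - patch)
    if ys[-1] + patch < height:
        ys.append(height - patch)
    return [(x, y, patch, patch) for y in ys for x in xs]

# 512×512 patches with 25% overlap on a 2048×1024 image give a 384 px stride.
windows = slice_grid(2048, 1024, 512, 0.25)
```

With these parameters the grid has 5 window columns and 3 window rows; the overlap ratio directly controls the stride and therefore the number of patches.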

3 of 17

Fundamental Theoretical Aspects (cont.)

2. Inference:

  • Each patch is processed by an object detection model, yielding bounding boxes and class probabilities.
  • Models evaluated: YOLOv5, YOLOv8, YOLOX, Torchvision, RetinaNet.

3. Merging:

  • Outputs from overlapping patches are combined.
  • Duplicate detections are resolved using strategies like Non-Maximum Suppression (NMS).

4. Post-processing:

  • Final detections are refined through probability thresholding and adjustment of bounding boxes to match the original image dimensions.
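Steps 3 and 4 can be illustrated with a minimal sketch, assuming per-patch detections are plain tuples; `merge_patch_detections` is a hypothetical helper for illustration, not part of the SAHI library:

```python
def merge_patch_detections(patch_results, image_w, image_h):
    """Shift per-patch boxes into full-image coordinates and clip them.

    patch_results: list of ((px, py), detections), where (px, py) is the
    patch origin and each detection is (x1, y1, x2, y2, score, cls) in
    patch-local coordinates.
    """
    merged = []
    for (px, py), dets in patch_results:
        for x1, y1, x2, y2, score, cls in dets:
            # Translate by the patch origin, then clip to the image bounds.
            gx1 = min(max(x1 + px, 0), image_w)
            gy1 = min(max(y1 + py, 0), image_h)
            gx2 = min(max(x2 + px, 0), image_w)
            gy2 = min(max(y2 + py, 0), image_h)
            merged.append((gx1, gy1, gx2, gy2, score, cls))
    # Duplicates from overlapping patches are then removed (e.g. with NMS),
    # and low-probability detections are filtered by a score threshold.
    return merged
```

The key point is that every patch-local box must be translated by its patch origin before deduplication, otherwise detections of the same object from neighboring patches cannot be matched.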

4 of 17

Fundamental Theoretical Aspects (cont.)

Non-Maximum Suppression (NMS)

NMS removes redundant bounding boxes, retaining only the most confident detection for each object. This is especially important in sliding-window approaches such as SAHI, where overlapping patches can generate multiple detections of the same object.

How NMS Works:

  1. Sort detections by confidence scores (highest to lowest).
  2. Select the box with the highest score as a final detection.
  3. Compare the selected box with remaining boxes and calculate Intersection over Union (IoU).
  4. Suppress boxes with IoU above a predefined threshold (commonly 0.5).
  5. Repeat until all boxes are processed.

In the IoU illustration, B1 and B2 are the bounding boxes being compared: a high IoU indicates significant overlap, suggesting that both boxes cover the same object.
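The five steps above can be written as a short Python implementation. This is an illustration of the standard algorithm, not any specific library's version:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """detections: list of (box, score); returns the kept detections."""
    # 1. Sort by confidence, highest first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # 2. The highest-scoring box becomes a final detection.
        best = remaining.pop(0)
        kept.append(best)
        # 3-4. Suppress remaining boxes whose IoU with it is too high.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= iou_threshold]
    # 5. Loop until all boxes are processed.
    return kept
```

Two heavily overlapping boxes collapse to the single higher-scoring one, while a distant box survives untouched.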

5 of 17

Experimental Evaluation

We conducted experiments to assess the performance of the SAHI algorithm combined with five object detection models. The evaluation considered several factors:

Key Variables:

  • Patch Size: 256×256, 512×512, and 1024×1024 pixels
    • Smaller patches enable finer detection but increase computational cost.
    • Larger patches reduce computational overhead but risk missing small objects.
  • Overlap Ratio: 0.25 and 0.5
    • Higher overlap reduces object splitting across patch borders but increases processing overhead.
  • Object Detection Models:
    • YOLOv5, YOLOv8, YOLOX, Torchvision, RetinaNet

6 of 17

Experimental Evaluation

Metrics Evaluated:

  1. Execution Time: measures computational speed.
  2. Error Percentage: quantifies deviation from the ground-truth count (lower is better).
  3. Efficiency: a composite metric balancing accuracy and speed (higher is better); it weights error significance more heavily than execution time.
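The slide omits the exact formulas, so the sketch below uses one plausible, assumed formulation for illustration only; the paper's actual definitions of error and efficiency may differ:

```python
# Assumed, illustrative definitions -- not the paper's exact formulas.

def error_percentage(detected, true_count):
    """Relative deviation of the detected count from the ground truth, in %."""
    return abs(detected - true_count) / true_count * 100

def efficiency(error_pct, exec_time, error_weight=2.0):
    """Composite score: higher is better; error is weighted above time."""
    # error_weight > 1 makes accuracy dominate over execution time.
    return 1.0 / (max(error_pct, 1e-6) ** error_weight * exec_time)
```

Under this assumed form, halving the error raises efficiency far more than halving the execution time, matching the slide's note that error significance is emphasized.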

7 of 17

Case Study: Beach Panorama Analysis

A large beach panorama was selected to evaluate the detection of small objects such as people, cars, and boats. The image details are as follows:

  • Dimensions: 19,968×6,144 pixels (0.122 GPixels)
  • Format: PNG
  • Disk Size: 185.56 MiB
  • RAM Usage: 351.00 MiB

Initial Observations:

  • Without the SAHI algorithm, none of the neural networks detected any objects.

SAHI Implementation:

  • Tile Size: multiple sizes tested (256×256, 512×512, and 1024×1024 pixels).
  • Overlap Ratio: assessed at 25% and 50%.
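To get a sense of the computational cost of each configuration, the number of tiles for the 19,968×6,144 panorama can be estimated with a back-of-the-envelope calculation (border handling in the actual SAHI tiling may differ slightly):

```python
import math

def tile_count(width, height, patch, overlap):
    """Approximate number of tiles needed to cover the image."""
    stride = patch * (1 - overlap)
    nx = math.ceil((width - patch) / stride) + 1
    ny = math.ceil((height - patch) / stride) + 1
    return nx * ny

for patch in (256, 512, 1024):
    for overlap in (0.25, 0.5):
        n = tile_count(19968, 6144, patch, overlap)
        print(f"{patch}×{patch}, overlap {overlap}: {n} tiles")
```

The smallest configuration (256×256 at 25% overlap) requires thousands of inference passes per image, which explains why smaller patches trade much longer execution time for finer detection.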

8 of 17

Case Study: Beach Panorama Analysis

Studied Image Preview

9 of 17

Experiment Results: Beach Panorama Detection

Figure 1 - results for patch size 512×512 and overlap ratio of 50%.

Figure 2 - results of various combinations of patch sizes, overlap ratios, and detection models, compared with the true value.

10 of 17

NMS Impact on Object Detection

Figure 3 - detections with NMS on the left and without on the right

Left (With NMS):

  • Redundant bounding boxes suppressed.
  • Only the most confident detections are retained.
  • Cleaner and more precise detection results.

Right (Without NMS):

  • Multiple overlapping bounding boxes for the same objects.
  • Higher detection count but includes many false positives.

This highlights the effectiveness of NMS in reducing redundant detections and improving the clarity of object detection outcomes.

11 of 17

Experiment Results: Beach Panorama Detection

To validate detection accuracy, the number of people in the photo was manually counted at approximately 1,100. The experiment evaluated combinations of neural networks and post-processing methods, with the following key observations:

General Observations:

  • Detection Variability: detected object counts vary significantly across neural networks and post-processing methods.
  • SAHI (RAW): yields the highest number of detections without post-processing, but includes many false positives.
  • Post-Processing Methods: techniques such as NMS, NMM, Greedy NMM, and LSNMS effectively remove redundant detections, significantly reducing object counts (Figure 1).
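Greedy NMM differs from NMS in that overlapping boxes are merged rather than discarded. A minimal sketch of the idea follows; this is an illustration only, and the SAHI library's GREEDYNMM implementation may differ in detail:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nmm(detections, iou_threshold=0.5):
    """detections: list of (box, score). Overlapping boxes are merged into
    the highest-scoring box's cluster (union box) instead of discarded."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    merged = []
    while remaining:
        (bx1, by1, bx2, by2), score = remaining.pop(0)
        rest = []
        for box, s in remaining:
            if iou((bx1, by1, bx2, by2), box) > iou_threshold:
                # Absorb the overlapping box into the cluster's union box.
                bx1, by1 = min(bx1, box[0]), min(by1, box[1])
                bx2, by2 = max(bx2, box[2]), max(by2, box[3])
            else:
                rest.append((box, s))
        merged.append(((bx1, by1, bx2, by2), score))
        remaining = rest
    return merged
```

Merging preserves the full extent of objects that straddle a patch boundary, which is why NMM-style methods are attractive for sliced inference.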

12 of 17

Figure 4 — results for GREEDYNMM

Experiment Results: Beach Panorama Detection

This table summarizes the results of all previous observations, evaluating:

  • Error and Efficiency for each combination of:
    • Detection models (YOLOv5, YOLOv8, YOLOX, Torchvision, RetinaNet)
    • Patch sizes (256×256, 512×512, 1024×1024)
    • Overlap ratios (0.25, 0.5)
  • Post-Processing: the Greedy NMM algorithm applied.

The table highlights the trade-offs between accuracy and computational performance across different configurations.

13 of 17

Experiment Results: Beach Panorama Detection

Neural Network Comparison:

  • YOLOv5 & YOLOv8: similar performance, with YOLOv8 slightly outperforming YOLOv5.
  • YOLOX: detects more objects overall, particularly for rare classes.
  • Torchvision (256×256, 25% overlap): achieved the lowest error rate, but its efficiency was not the highest due to extensive processing time.
  • Torchvision (512×512, 50% overlap): demonstrated the highest efficiency, balancing accuracy and processing time.
  • RetinaNet: lowest detection count among all networks.

These results (Figure 2) highlight the strengths and limitations of each network and post-processing combination in detecting small objects.

14 of 17

Further Experiments

For detailed results and additional experiments, including:

  • comprehensive performance analysis of various neural networks and post-processing methods,
  • comparisons across different patch sizes, overlap ratios, and detection strategies,
  • quantitative and qualitative evaluations of detection outcomes,

please refer to the full paper for more in-depth insights.

15 of 17

Conclusion

This study proposed a novel framework for object detection in high-resolution images, combining the SAHI algorithm with state-of-the-art neural networks (YOLOv5, YOLOv8, YOLOX, Torchvision, and RetinaNet). Key contributions and findings include:

Key Contributions:

  • Challenge Addressed: improved detection of small objects in high-resolution images by dividing them into smaller patches while maintaining detail and resolution.
  • Enhanced Accuracy: integrating SAHI with the YOLOv8 and YOLOX models delivered superior precision and recall compared to traditional methods.
  • Class Imbalance: RetinaNet's performance highlighted the need for strategies addressing small and rare object detection.

16 of 17

Conclusion (cont.)

Real-World Applications:

  • A scalable solution adaptable to domains such as satellite imagery, medical imaging, and surveillance.

Future Directions:

  • Optimizing patching strategies and exploring advanced architectures.
  • Incorporating attention mechanisms and transformer models for further improvements.

This research provides a robust, efficient framework for object detection in high-resolution images, paving the way for accurate and reliable systems across diverse applications.

17 of 17

Department of Software Systems and Technologies
Taras Shevchenko National University of Kyiv
Kyiv, UKRAINE

Oleksii Bychkov
bos.knu@gmail.com

Kateryna Merkulova
kate.don11@gmail.com

Yelyzaveta Zhabska
y.zhabska@gmail.com

Andrii Yaroshenko
andrii.yaroshenko@knu.ua

Thank you for your attention!