1 of 40

Image Recognition and Object Detection with MediaPipe on Android

Anant Chowdhary

Software Engineer

2 of 40

Agenda

  • What is MediaPipe?
  • MediaPipe Tasks and Graphs
  • Vision Tasks
  • Object Detection
  • Image Classification
  • Image Segmentation
  • On-Device Machine Learning and MediaPipe
  • Summary

3 of 40

What is MediaPipe?

  • MediaPipe Solutions provides a robust set of libraries and tools that helps users incorporate AI and ML techniques into their applications.

  • MediaPipe Solutions is part of the MediaPipe Open Source Project, which aids customizability so that solutions can be adapted to particular requirements.

  • MediaPipe is available across multiple platforms, including Android, iOS, Python and the Web.

4 of 40

What is MediaPipe?

  • MediaPipe enables a diverse set of applied ML and AI use cases in applications.

Some of them are:

    • Vision Tasks:
      • Object Detection
      • Image Classification
      • Image Segmentation
    • Text Tasks:
      • Text Classification
      • Text Embedding
      • Language Detection
    • Audio Classification
    • Generative Tasks

5 of 40

Why MediaPipe?

  • End-to-end solutions for many common ML tasks, ready to use and able to significantly speed up development.

  • MediaPipe is highly customizable and can run on CPUs, GPUs or TPUs depending on device tier. Performance and efficiency optimizations are key features of MediaPipe.

  • MediaPipe Solutions is Open Source and benefits from a vibrant community of contributors and users. Continuous support, frequent updates and shared knowledge are an advantage.

6 of 40

MediaPipe Tasks and Graphs

  • MediaPipe Tasks encapsulates the primary programming interface for MediaPipe Solutions.

  • It features a range of libraries that enable the deployment of machine learning solutions on devices with minimal code.

  • MediaPipe Graphs are powerful constructs that allow complicated ML pipelines to be built. They are the backbone of MediaPipe’s modular framework.

7 of 40

MediaPipe Tasks and Graphs

  • Each graph consists of nodes (Calculators) and edges (streams of data)

  • Written in a text-based format (.pbtxt files)

  • Graphs need to be acyclic

8 of 40

MediaPipe Tasks and Graphs

  • Each graph consists of nodes (Calculators) and edges (streams of data)

  • Written in a text-based format (.pbtxt files)

  • Graphs need to be acyclic

(Diagram: Input Video (input_video) → GrayScaleCalculator → PassThroughCalculator → Output Video (output_video))

9 of 40

MediaPipe Tasks and Graphs

  • Each calculator contains the logic used to compute its output, which is then passed on to the next calculator connected to it in the graph.

  • Example:

    input_stream: "input_video"
    output_stream: "output_video"

    node {
      calculator: "GrayScaleCalculator"
      input_stream: "input_video"
      output_stream: "grayscale_video"
    }

    node {
      calculator: "PassThroughCalculator"
      input_stream: "grayscale_video"
      output_stream: "output_video"
    }

10 of 40

Vision Tasks

  • Our focus here is on Vision Tasks
    • Object Detection:
      • Detecting objects of multiple classes within an image with a specified confidence measure.

    • Image Classification:
      • Identifying what an image represents (from a pre-defined set of classes)

    • Image Segmentation:
      • Partitioning an image into distinct regions corresponding to different objects, or objects of different classes.

11 of 40

Vision Tasks

  • Our focus here is on Vision Tasks

(Example images: Object Detection with a detected "cell phone", Image Classification, and Image Segmentation)

12 of 40

Object Detection

13 of 40

Models used for Object Detection

  • MediaPipe’s Object Detector API requires a model to be downloaded into the project’s directory (on-device). Starting with default models is recommended.

  • Custom models must be in TensorFlow Lite format and must include specific metadata describing the operating parameters of the custom model.

  • Some commonly used models for object detection: EfficientDet-Lite0 (default recommended), EfficientDet-Lite2, SSD MobileNetV2.

14 of 40

Models used for Object Detection

  • Choosing a model is highly subjective based on the needs of an application

  • Some commonly used models:

    • EfficientDet-Lite0 (trained on 1.5 million object instances, 80 labels): accurate and lightweight; strikes a balance between latency and accuracy.

    • EfficientDet-Lite2 (trained on 1.5 million object instances, 80 labels): generally more accurate than EfficientDet-Lite0, but slower and requires more memory.

    • SSD MobileNetV2 (trained on 1.5 million object instances, 80 labels): very fast and light, but not as accurate as EfficientDet-Lite0.

15 of 40

Models used for Object Detection

  • There’s generally a tradeoff between latency, accuracy and model size

  • You may decide to use different models based on multiple factors, such as:
    • Cost to your customers
      • MediaPipe models must be downloaded over the internet in most cases.
      • Different models use different amounts of power and affect system performance differently.
    • Device capability: different devices may merit different models based on how powerful they are
    • Accuracy requirements
      • It is generally a good idea to review model evaluations before deciding which model to use.

16 of 40

MediaPipe Object Detection on Android

  • MediaPipe’s Object Detector can be added to your project’s Gradle Dependencies:

dependencies {
    implementation 'com.google.mediapipe:tasks-vision:latest.release'
}

  • Whichever model you choose (custom or default) can be stored in your project's assets directory
      • The task needs several pieces of configuration for the model being used, such as:
        • What hardware to use: CPU/GPU
        • The model's location within the device's storage.
      • This can be configured with BaseOptions

17 of 40

BaseOptions for MediaPipe Tasks

  • BaseOptions.builder() supplies a builder which can be used to set parameters
    • Run on CPU/GPU
      • baseOptionsBuilder.setDelegate(Delegate.CPU)
      • baseOptionsBuilder.setDelegate(Delegate.GPU)

  • Set model path
    • baseOptionsBuilder.setModelAssetPath(MODEL_NAME)
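Putting the two calls above together, a minimal sketch in Kotlin might look as follows (the model filename is an assumption; substitute the model you bundled in assets):

```kotlin
// Sketch: building BaseOptions for a model stored in the app's assets.
// "efficientdet_lite0.tflite" is an assumed filename, not a fixed API value.
val baseOptions = BaseOptions.builder()
    .setDelegate(Delegate.GPU)                       // or Delegate.CPU on lower-tier devices
    .setModelAssetPath("efficientdet_lite0.tflite")  // path relative to the assets directory
    .build()
```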

18 of 40

MediaPipe ObjectDetector

  • ObjectDetectorOptions sets up options for MediaPipe’s object detector

  • Set up configuration options for Object Detection using ObjectDetector.ObjectDetectorOptions.builder()
    • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
    • setRunningMode(runningMode)
      • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM
    • setMaxResults(maxResults) - we often want only a small subset of results.
    • setScoreThreshold(threshold) - set the minimum confidence score a detection must reach to be returned.
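As a sketch, the options above can be assembled like this in Kotlin (assuming a baseOptions value built earlier; the result count and threshold are illustrative):

```kotlin
// Sketch: assembling ObjectDetectorOptions from previously built BaseOptions.
val options = ObjectDetector.ObjectDetectorOptions.builder()
    .setBaseOptions(baseOptions)           // assumed to exist from the BaseOptions step
    .setRunningMode(RunningMode.IMAGE)     // single-image inference
    .setMaxResults(5)                      // illustrative value
    .setScoreThreshold(0.5f)               // illustrative value
    .build()
```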

19 of 40

MediaPipe ObjectDetector

  • Now that the ObjectDetectorOptions are ready, we’re ready to use the detector.

(Diagram: an ObjectDetector is created from ObjectDetectorOptions, which bundle BaseOptions, RunningMode, maxResults, the score threshold, and a category allowlist/denylist.)

20 of 40

MediaPipe ObjectDetector

objectDetector = ObjectDetector.createFromOptions(context, options);

  • Preparing an image for inference
    • MediaPipe Image (or MPImage) is designed as an immutable, cross-platform image container.
    • Convert an image to an MPImage before running inference:

val mpImageForInference = BitmapImageBuilder(bitmapImage).build()

  • Run inference
    • objectDetector.detect(image)
    • objectDetector.detect(image, imageProcessingOptions) in case more flexibility is required - for instance, running inference on only a specific region of the image to improve performance.
    • Example: ImageProcessingOptions.builder().setRegionOfInterest can be used to specify the region to be processed.
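The conversion and inference steps above can be sketched in Kotlin as follows (bitmapImage and objectDetector are assumed to exist from the earlier setup):

```kotlin
// Sketch: wrapping an Android Bitmap in an immutable MPImage and running detection.
val mpImageForInference = BitmapImageBuilder(bitmapImage).build()

// Run inference; returns an ObjectDetectorResult.
val result = objectDetector.detect(mpImageForInference)
```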

21 of 40

MediaPipe ObjectDetector Results

  • ObjectDetectorResult is returned by objectDetector.detect

    • Consists of a List<com.google.mediapipe.tasks.components.containers.Detection>
    • Each Detection has
      • BoundingBox consisting of
        • Upper left corner coordinates (xmin, ymin)
        • height and width of the box
      • score - the confidence score of the detection
      • display_name - Human readable string for the detected object
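A minimal sketch of reading these fields in Kotlin (result is assumed to come from objectDetector.detect, and TAG is an assumed log tag):

```kotlin
// Sketch: iterating over detections in an ObjectDetectorResult.
for (detection in result.detections()) {
    val box = detection.boundingBox()            // android.graphics.RectF
    val topCategory = detection.categories().first()
    Log.d(TAG, "${topCategory.displayName()} (score=${topCategory.score()}) at $box")
}
```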

22 of 40

Some Real World Examples of Object Detection

(Using EfficientDet-lite0.tflite)

23 of 40

Some Real World Examples of Supported Labels

(Using EfficientDet-lite0.tflite)

  1. Laptop
  2. Couch
  3. Chair
  4. Potted Plant
  5. Mouse
  6. Remote
  7. Keyboard
  8. Cell phone
  9. Microwave
  10. Oven
  11. Toaster
  12. Cake
  13. Donut
  14. Pizza
  15. Spoon

24 of 40

Image Classification

(Example: an image classified as "cell phone")

25 of 40

Image Classification

  • (Recap) Identifying what an image represents (from a set of classes that are pre-defined)
  • Initialization steps very similar to Object Detection
    • Set up BaseOptions using BaseOptions.Builder
    • Set model path
    • Create ImageClassifierOptions similar to ObjectDetectorOptions (recap):
        • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
        • setRunningMode(runningMode)
          • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM
        • setMaxResults(maxResults) - we often want only a small subset of results
        • setScoreThreshold(threshold) - set the minimum confidence score a classification must reach to be returned

26 of 40

Image Classification Inference

  • In this case too, inference is run on an MPImage
    • imageClassifier = ImageClassifier.createFromOptions(context, imageClassifierOptions);
    • Perform inference using
      • imageClassifier.classify(image);

  • Similar to Object Detection, one can specify more configuration options, such as which region is to be classified using
      • imageClassifier.classify(image, imageProcessingOptions);
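The classifier flow mirrors the detector flow; a Kotlin sketch under the same assumptions (context, bitmapImage and baseOptions exist; maxResults and the region of interest are illustrative):

```kotlin
// Sketch: classifier setup and inference, mirroring the detector flow.
val imageClassifierOptions = ImageClassifier.ImageClassifierOptions.builder()
    .setBaseOptions(baseOptions)
    .setRunningMode(RunningMode.IMAGE)
    .setMaxResults(3)                       // illustrative value
    .build()
val imageClassifier = ImageClassifier.createFromOptions(context, imageClassifierOptions)

// Classify only a sub-region of the image (illustrative normalized coordinates).
val roiOptions = ImageProcessingOptions.builder()
    .setRegionOfInterest(RectF(0f, 0f, 0.5f, 0.5f))
    .build()
val result = imageClassifier.classify(BitmapImageBuilder(bitmapImage).build(), roiOptions)
```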

27 of 40

Image Classification Results

  • ImageClassifierResult is returned by imageClassifier.classify

  • ImageClassifierResult contains Classification[]
    • Each element of Classification[] is the result of one head of the model
      • Each Classification contains Category[] - an array of predicted categories, sorted by descending score (high to low probability).
        • Each Category in turn contains the label and score of a predicted class
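The hierarchy above can be walked with a short Kotlin sketch (result is assumed to come from imageClassifier.classify, and TAG is an assumed log tag):

```kotlin
// Sketch: one Classifications entry per model head, each holding its categories.
for (classifications in result.classificationResult().classifications()) {
    for (category in classifications.categories()) {
        Log.d(TAG, "${category.displayName()}: ${category.score()}")
    }
}
```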

(Diagram: Classification[] holds one Classification per model head, and each Classification holds its own Category[] array.)

28 of 40

Image Classification Results

ImageClassifierResult:
  Classifications #0:
    head index: 0
    category #0:
      category name: "/m/01f6pd"
      display name: "American Crow"
      score: 0.7140
      index: 28
    category #1:
      category name: "/m/01g1fg"
      display name: "Fish Crow"
      score: 0.00491
      index: 29

29 of 40

Image Segmentation

30 of 40

Image Segmentation

  • (Recap) Partitioning an image into distinct regions corresponding to different objects / objects of different classes.
  • Initialization steps very similar to Object Detection and Image Classification
    • Set up BaseOptions using BaseOptions.Builder
    • Set model path
    • Create ImageSegmenterOptions similar to ObjectDetectorOptions (recap):
        • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
        • setRunningMode(runningMode)
          • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM

31 of 40

Image Segmentation

  • ImageSegmenterOptions

    • setOutputCategoryMask(boolean) - when set, outputs a category mask where each pixel holds the category that "wins" (has the highest confidence) at that pixel

    • setOutputConfidenceMask(boolean) - when set, outputs one mask per category as a float-valued image, where each pixel value represents the confidence score for that category.

32 of 40

MediaPipe Image Segmentation Inference

imageSegmenter = ImageSegmenter.createFromOptions(context, imageSegmenterOptions);

  • Run inference
    • imageSegmenter.segment(image)
    • imageSegmenter.segment(image, imageProcessingOptions) in case more flexibility is required.
    • ImageSegmenterResult is returned by the segmenter
      • Optional<MPImage> categoryMask() returns the category mask, which can be used for follow-up tasks or displayed if required.
      • qualityScores() - quality scores for the mask, per category (values within [0,1])
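A Kotlin sketch of the segmenter flow (context, bitmapImage, baseOptions and TAG are assumed from earlier; the category-mask output is enabled as an illustrative choice):

```kotlin
// Sketch: segmenter setup and inference with category-mask output enabled.
val segmenterOptions = ImageSegmenter.ImageSegmenterOptions.builder()
    .setBaseOptions(baseOptions)
    .setRunningMode(RunningMode.IMAGE)
    .setOutputCategoryMask(true)
    .build()
val imageSegmenter = ImageSegmenter.createFromOptions(context, segmenterOptions)

val segmenterResult = imageSegmenter.segment(BitmapImageBuilder(bitmapImage).build())
segmenterResult.categoryMask().ifPresent { mask ->
    // Each pixel of the mask identifies the winning category at that location.
    Log.d(TAG, "mask size: ${mask.width} x ${mask.height}")
}
```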

33 of 40

Image Segmentation

  • Model Selection

    • Segmentation models exist for multiple sets of categories; choose based on the use case
    • Generally, there's a tradeoff between accuracy, the number of categories that will be segmented, and latency
    • If the use case is very specific, choose a model that segments precisely for that case.

34 of 40

Image Segmentation

  • Examples of different results with different models

(Example outputs from HairSegmenter and DeepLab-V3)

35 of 40

Where do we find models?

  • MediaPipe GitHub is a great source for some models to start off with:

    • Face Detection
    • Pose Tracking
    • Iris Tracking
    • Object Detection
    • Object Tracking
    • Selfie Segmentation and more…

  • More specialized models are available on Kaggle. Some examples are:
    • Bird Classification
    • Food Classification Model
    • Celery Detection Model(!!)

36 of 40

On-Device Machine Learning with MediaPipe

  • Note that the models we covered were stored on-device and generally in the application’s assets.

  • Why on-device?

    • Performant
    • Reduced dependency on network connections
    • Private and Secure
    • Cost efficient, since inference is performed on users' devices - hence very scalable.
    • Accessibility

37 of 40

On-Device Machine Learning with MediaPipe

  • Some pitfalls of on-device machine learning:

    • Memory and device constraints: devices such as smartphones are generally resource-constrained, which can limit the complexity of the models that can be run.

    • Energy consumption: running on-device models can be power intensive.

    • Performance variability and fragmentation: since different models may need to be deployed to different devices, performance may vary a lot across device tiers.

38 of 40

A few things to think about when developing with MediaPipe

  • Which model fits your use case?

  • How will the model's results be evaluated?

  • Depending on the application, how will regressions be detected?

  • Does the model that you're using need to be fine-tuned?

  • How often, and how, will model updates be served to users?

39 of 40

In Summary

  • MediaPipe is a versatile framework for building multimodal (video, image, audio and text) machine learning applications and pipelines.

  • It is Open Source and cross platform

  • Pre-built solutions are a huge advantage with MediaPipe. Applications can be as simple or as complicated as necessary

  • A wide range of models are compatible with MediaPipe, making it a powerful and flexible solution in the ever-changing world of AI and machine learning.

40 of 40

Thank You