1 of 40

Image Recognition and Object Detection with MediaPipe on Android

Anant Chowdhary

Software Engineer

2 of 40

Agenda

  • What is MediaPipe?
  • MediaPipe Tasks and Graphs
  • Vision Tasks
  • Object Detection
  • Image Classification
  • Image Segmentation
  • On-Device Machine Learning and MediaPipe
  • Summary

3 of 40

What is MediaPipe?

  • MediaPipe Solutions provides a robust set of libraries and tools that helps users incorporate AI and ML techniques into their applications.

  • MediaPipe Solutions is part of the MediaPipe Open Source Project, which aids customizability so that solutions can be adapted to particular requirements.

  • MediaPipe is available across multiple platforms, including Android, iOS, Python and the Web.

4 of 40

What is MediaPipe?

  • MediaPipe enables a diverse set of applied ML and AI use cases in applications.

Some of them are:

    • Vision Tasks:
      • Object Detection
      • Image Classification
      • Image Segmentation
    • Text Tasks:
      • Text Classification
      • Text Embedding
      • Language Detection
    • Audio Classification
    • Generative Tasks

5 of 40

Why MediaPipe?

  • End-to-end solutions for many common ML tasks, ready to use and able to significantly speed up development.

  • MediaPipe is highly customizable and can run on CPUs, GPUs or TPUs depending on device tier. Performance and efficiency optimizations are key features of MediaPipe.

  • MediaPipe Solutions is Open Source and benefits from a vibrant community of contributors and users. Continuous support, frequent updates and shared knowledge are an advantage.

6 of 40

MediaPipe Tasks and Graphs

  • MediaPipe Tasks encapsulates the primary programming interface for MediaPipe Solutions.

  • It features a range of libraries that enable the deployment of machine learning solutions on devices with minimal code.

  • MediaPipe Graphs are powerful constructs that allow complicated ML pipelines to be built. They are the backbone of MediaPipe’s modular framework.

7 of 40

MediaPipe Tasks and Graphs

  • Each graph consists of nodes (Calculators) and edges (streams of data)

  • Written in a text-based format (.pbtxt files)

  • Graphs need to be acyclic

8 of 40

MediaPipe Tasks and Graphs

  • Each graph consists of nodes (Calculators) and edges (streams of data)

  • Written in a text-based format (.pbtxt files)

  • Graphs need to be acyclic

(Diagram: Input Video (input_video) → GrayScaleCalculator → PassThroughCalculator → Output Video (output_video))

9 of 40

MediaPipe Tasks and Graphs

  • Each calculator contains the logic used to compute its output, which is then passed on to the next calculator connected to it in the graph.

  • Example:

    input_stream: "input_video"
    output_stream: "output_video"

    node {
      calculator: "GrayScaleCalculator"
      input_stream: "input_video"
      output_stream: "grayscale_video"
    }

    node {
      calculator: "PassThroughCalculator"
      input_stream: "grayscale_video"
      output_stream: "output_video"
    }

10 of 40

Vision Tasks

  • Our focus here is on Vision Tasks
    • Object Detection:
      • Detecting objects of multiple classes within an image with a specified confidence measure.

    • Image Classification:
      • Identifying what an image represents (from a pre-defined set of classes)

    • Image Segmentation:
      • Partitioning an image into distinct regions corresponding to different objects, or objects of different classes.

11 of 40

Vision Tasks

  • Our focus here is on Vision Tasks

(Example images: Object Detection with a detected "cell phone", Image Classification, and Image Segmentation)

12 of 40

Object Detection

13 of 40

Models used for Object Detection

  • MediaPipe’s Object Detector API requires a model to be downloaded into the project’s directory (on-device). Starting with default models is recommended.

  • Custom models must be in TensorFlow Lite format and must include specific metadata describing the operating parameters of the custom model.

  • Some commonly used models for object detection: EfficientDet-Lite0 (default recommended), EfficientDet-Lite2, SSD MobileNetV2.

14 of 40

Models used for Object Detection

  • Choosing a model is highly subjective based on the needs of an application

  • Some commonly used models:

    • EfficientDet-Lite0 (trained on 1.5 million object instances, 80 labels): accurate and lightweight; strikes a balance between latency and accuracy.

    • EfficientDet-Lite2 (trained on 1.5 million object instances, 80 labels): generally more accurate than EfficientDet-Lite0, but slower and requires more memory.

    • SSD MobileNetV2 (trained on 1.5 million object instances, 80 labels): very fast and light, but not as accurate as EfficientDet-Lite0.

15 of 40

Models used for Object Detection

  • There’s generally a tradeoff between latency, accuracy and model size

  • You may decide to use different models based on multiple factors, such as:
    • Cost to your customers
      • MediaPipe models must be downloaded over the internet in most cases.
      • Different models use different amounts of power and affect system performance differently.
    • Device capability: different devices may merit different models based on how powerful they are
    • Accuracy requirements
      • It is generally a good idea to review model evaluations before deciding which model to use.

16 of 40

MediaPipe Object Detection on Android

  • MediaPipe’s Object Detector can be added to your project’s Gradle Dependencies:

dependencies {
    implementation 'com.google.mediapipe:tasks-vision:latest.release'
}

  • Whichever model you choose (custom or default) can be stored in your project's assets directory
      • The task needs several pieces of configuration for the model being used, such as:
        • What hardware to use: CPU/GPU
        • The model's location within the device's storage.
      • This can be configured with BaseOptions

17 of 40

BaseOptions for MediaPipe Tasks

  • BaseOptions.builder() supplies a builder which can be used to set parameters
    • Run on CPU/GPU
      • baseOptionsBuilder.setDelegate(Delegate.CPU)
      • baseOptionsBuilder.setDelegate(Delegate.GPU)

  • Set model path
    • baseOptionsBuilder.setModelAssetPath(MODEL_NAME)
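Putting the two calls above together, a minimal sketch in Kotlin might look as follows (the model filename is an assumption; substitute the model you bundled in assets):

```kotlin
// Sketch: building BaseOptions for a model stored in the app's assets.
// "efficientdet_lite0.tflite" is an assumed filename, not a fixed API value.
val baseOptions = BaseOptions.builder()
    .setDelegate(Delegate.GPU)                       // or Delegate.CPU on lower-tier devices
    .setModelAssetPath("efficientdet_lite0.tflite")  // path relative to the assets directory
    .build()
```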

18 of 40

MediaPipe ObjectDetector

  • ObjectDetectorOptions sets up options for MediaPipe’s object detector

  • Set up configuration options for Object Detection using ObjectDetector.ObjectDetectorOptions.builder()
    • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
    • setRunningMode(runningMode)
      • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM
    • setMaxResults(maxResults) - we often want only a small subset of results.
    • setScoreThreshold(threshold) - set the minimum confidence score a detection must reach to be returned.
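As a sketch, the options above can be assembled like this in Kotlin (assuming a baseOptions value built earlier; the result count and threshold are illustrative):

```kotlin
// Sketch: assembling ObjectDetectorOptions from previously built BaseOptions.
val options = ObjectDetector.ObjectDetectorOptions.builder()
    .setBaseOptions(baseOptions)           // assumed to exist from the BaseOptions step
    .setRunningMode(RunningMode.IMAGE)     // single-image inference
    .setMaxResults(5)                      // illustrative value
    .setScoreThreshold(0.5f)               // illustrative value
    .build()
```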

19 of 40

MediaPipe ObjectDetector

  • Now that the ObjectDetectorOptions are ready, we’re ready to use the detector.

(Diagram: an ObjectDetector is created from ObjectDetectorOptions, which bundle BaseOptions, RunningMode, maxResults, the score threshold, and a category allowlist/denylist.)

20 of 40

MediaPipe ObjectDetector

objectDetector = ObjectDetector.createFromOptions(context, options);

  • Preparing an image for inference
    • MediaPipe Image (or MPImage) is designed as an immutable, cross-platform image container.
    • Convert an image to an MPImage before running inference:

val mpImageForInference = BitmapImageBuilder(bitmapImage).build()

  • Run inference
    • objectDetector.detect(image)
    • objectDetector.detect(image, imageProcessingOptions) in case more flexibility is required - for instance, running inference on only a specific region of the image to improve performance.
    • Example: ImageProcessingOptions.builder().setRegionOfInterest can be used to specify the region to be processed.
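The conversion and inference steps above can be sketched in Kotlin as follows (bitmapImage and objectDetector are assumed to exist from the earlier setup):

```kotlin
// Sketch: wrapping an Android Bitmap in an immutable MPImage and running detection.
val mpImageForInference = BitmapImageBuilder(bitmapImage).build()

// Run inference; returns an ObjectDetectorResult.
val result = objectDetector.detect(mpImageForInference)
```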

21 of 40

MediaPipe ObjectDetector Results

  • ObjectDetectorResult is returned by objectDetector.detect

    • Consists of a List<com.google.mediapipe.tasks.components.containers.Detection>
    • Each Detection has
      • BoundingBox consisting of
        • Upper left corner coordinates (xmin, ymin)
        • height and width of the box
      • score - the confidence score of the detection
      • display_name - Human readable string for the detected object
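A minimal sketch of reading these fields in Kotlin (result is assumed to come from objectDetector.detect, and TAG is an assumed log tag):

```kotlin
// Sketch: iterating over detections in an ObjectDetectorResult.
for (detection in result.detections()) {
    val box = detection.boundingBox()            // android.graphics.RectF
    val topCategory = detection.categories().first()
    Log.d(TAG, "${topCategory.displayName()} (score=${topCategory.score()}) at $box")
}
```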

22 of 40

Some Real World Examples of Object Detection

(Using EfficientDet-lite0.tflite)

23 of 40

Some Real World Examples of Supported Labels

(Using EfficientDet-lite0.tflite)

  1. Laptop
  2. Couch
  3. Chair
  4. Potted Plant
  5. Mouse
  6. Remote
  7. Keyboard
  8. Cell phone
  9. Microwave
  10. Oven
  11. Toaster
  12. Cake
  13. Donut
  14. Pizza
  15. Spoon

24 of 40

Image Classification

(Example: an image classified as "cell phone")

25 of 40

Image Classification

  • (Recap) Identifying what an image represents (from a set of classes that are pre-defined)
  • Initialization steps very similar to Object Detection
    • Set up BaseOptions using BaseOptions.Builder
    • Set model path
    • Create ImageClassifierOptions similar to ObjectDetectorOptions (recap):
        • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
        • setRunningMode(runningMode)
          • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM
        • setMaxResults(maxResults) - we often want only a small subset of results
        • setScoreThreshold(threshold) - set the minimum confidence score a classification must reach to be returned

26 of 40

Image Classification Inference

  • In this case too, inference is run on an MPImage
    • imageClassifier = ImageClassifier.createFromOptions(context, imageClassifierOptions);
    • Perform inference using
      • imageClassifier.classify(image);

  • Similar to Object Detection, one can specify more configuration options, such as which region is to be classified using
      • imageClassifier.classify(image, imageProcessingOptions);
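The classifier flow mirrors the detector flow; a Kotlin sketch under the same assumptions (context, bitmapImage and baseOptions exist; maxResults and the region of interest are illustrative):

```kotlin
// Sketch: classifier setup and inference, mirroring the detector flow.
val imageClassifierOptions = ImageClassifier.ImageClassifierOptions.builder()
    .setBaseOptions(baseOptions)
    .setRunningMode(RunningMode.IMAGE)
    .setMaxResults(3)                       // illustrative value
    .build()
val imageClassifier = ImageClassifier.createFromOptions(context, imageClassifierOptions)

// Classify only a sub-region of the image (illustrative normalized coordinates).
val roiOptions = ImageProcessingOptions.builder()
    .setRegionOfInterest(RectF(0f, 0f, 0.5f, 0.5f))
    .build()
val result = imageClassifier.classify(BitmapImageBuilder(bitmapImage).build(), roiOptions)
```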

27 of 40

Image Classification Results

  • ImageClassifierResult is returned by imageClassifier.classify

  • ImageClassifierResult contains Classification[]
    • Each element of Classification[] is the result of one head of the model
      • Each Classification contains Category[] - an array of predicted categories, sorted by descending score (high to low probability).
        • Each Category in turn contains the label and score of a predicted class
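The hierarchy above can be walked with a short Kotlin sketch (result is assumed to come from imageClassifier.classify, and TAG is an assumed log tag):

```kotlin
// Sketch: one Classifications entry per model head, each holding its categories.
for (classifications in result.classificationResult().classifications()) {
    for (category in classifications.categories()) {
        Log.d(TAG, "${category.displayName()}: ${category.score()}")
    }
}
```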

(Diagram: Classification[] holds one Classification per model head, and each Classification holds its own Category[] array.)

28 of 40

Image Classification Results

ImageClassifierResult:
  Classifications #0:
    head index: 0
    category #0:
      category name: "/m/01f6pd"
      display name: "American Crow"
      score: 0.7140
      index: 28
    category #1:
      category name: "/m/01g1fg"
      display name: "Fish Crow"
      score: 0.00491
      index: 29

29 of 40

Image Segmentation

30 of 40

Image Segmentation

  • (Recap) Partitioning an image into distinct regions corresponding to different objects / objects of different classes.
  • Initialization steps very similar to Object Detection and Image Classification
    • Set up BaseOptions using BaseOptions.Builder
    • Set model path
    • Create ImageSegmenterOptions similar to ObjectDetectorOptions (recap):
        • setBaseOptions(baseOptionsBuilder.build()) - set the base options that were previously created
        • setRunningMode(runningMode)
          • runningMode: RunningMode.IMAGE, RunningMode.VIDEO, RunningMode.LIVE_STREAM

31 of 40

Image Segmentation

  • ImageSegmenterOptions

    • setOutputCategoryMask(boolean) - when set, outputs a category mask where each pixel holds the category that "wins" (has the highest confidence) at that pixel

    • setOutputConfidenceMask(boolean) - when set, outputs one mask per category as a float-valued image, where each pixel value represents the confidence score for that category.

32 of 40

MediaPipe Image Segmentation Inference

imageSegmenter = ImageSegmenter.createFromOptions(context, imageSegmenterOptions);

  • Run inference
    • imageSegmenter.segment(image)
    • imageSegmenter.segment(image, imageProcessingOptions) in case more flexibility is required.
    • ImageSegmenterResult is returned by the segmenter
      • Optional<MPImage> categoryMask() returns the category mask, which can be used for follow-up tasks or displayed if required.
      • qualityScores() - quality scores for the mask, per category (values within [0,1])
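A Kotlin sketch of the segmenter flow (context, bitmapImage, baseOptions and TAG are assumed from earlier; the category-mask output is enabled as an illustrative choice):

```kotlin
// Sketch: segmenter setup and inference with category-mask output enabled.
val segmenterOptions = ImageSegmenter.ImageSegmenterOptions.builder()
    .setBaseOptions(baseOptions)
    .setRunningMode(RunningMode.IMAGE)
    .setOutputCategoryMask(true)
    .build()
val imageSegmenter = ImageSegmenter.createFromOptions(context, segmenterOptions)

val segmenterResult = imageSegmenter.segment(BitmapImageBuilder(bitmapImage).build())
segmenterResult.categoryMask().ifPresent { mask ->
    // Each pixel of the mask identifies the winning category at that location.
    Log.d(TAG, "mask size: ${mask.width} x ${mask.height}")
}
```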

33 of 40

Image Segmentation

  • Model Selection

    • Segmentation models exist for multiple sets of categories; choose based on the use case
    • Generally, there's a tradeoff between accuracy, the number of categories that will be segmented, and latency
    • If the use case is very specific, choose a model that segments precisely for that case.

34 of 40

Image Segmentation

  • Examples of different results with different models

(Example outputs from HairSegmenter and DeepLab-V3)

35 of 40

Where do we find models?

  • MediaPipe GitHub is a great source for some models to start off with:

    • Face Detection
    • Pose Tracking
    • Iris Tracking
    • Object Detection
    • Object Tracking
    • Selfie Segmentation and more…

  • More specialized models are available on Kaggle. Some examples are:
    • Bird Classification
    • Food Classification Model
    • Celery Detection Model(!!)

36 of 40

On-Device Machine Learning with MediaPipe

  • Note that the models we covered were stored on-device and generally in the application’s assets.

  • Why on-device?

    • Performant
    • Reduced dependency on network connections
    • Private and Secure
    • Cost efficient, since inference is performed on users' devices - hence very scalable.
    • Accessibility

37 of 40

On-Device Machine Learning with MediaPipe

  • Some pitfalls of on-device machine learning:

    • Memory and device constraints: devices such as smartphones are generally resource-constrained, which can limit the complexity of the models that can be run.

    • Energy consumption: running on-device models can be power intensive.

    • Performance variability and fragmentation: since different models may need to be deployed to different devices, performance may vary a lot across device tiers.

38 of 40

A few things to think about when developing with MediaPipe

  • Which model fits your use case?

  • How will the model's results be evaluated?

  • Depending on the application, how will regressions be detected?

  • Does the model that you're using need to be fine-tuned?

  • How often, and how, will model updates be served to users?

39 of 40

In Summary

  • MediaPipe is a versatile framework for building multimodal (video, image, audio and text) machine learning applications and pipelines.

  • It is Open Source and cross platform

  • Pre-built solutions are a huge advantage with MediaPipe. Applications can be as simple or as complicated as necessary

  • A wide range of models are compatible with MediaPipe, making it a powerful and flexible solution in the ever-changing world of AI and machine learning.

40 of 40

Thank You