1 of 37

Introduction to ML / AI

Project Performance Metrics

Product Management Perspective

2 of 37

Product Managers' Concerns

PM role in Evaluation

AGENDA

Taxonomy

Overview of Metric Types

Some Common Metrics: Specific Examples

Real World Practice

Evaluation Metrics in Action

Observability Platforms

Measurement Tools

Finishing Up

Conclusion / Questions


Our Goal: Come away with a basic understanding of evaluation metrics for ML / AI models and have great starting places for going deeper in our own self-study.

3 of 37

PRODUCT MANAGEMENT

ML / AI PERFORMANCE CONCERNS

Where Should We Focus

4 of 37

TACTICAL

HOW DEEP?

STRATEGIC

  • We’re primarily concerned with business outcomes.
  • Several categories of analytics will directly impact our goals.
  • AI model performance will roll up to higher level KPIs the same as web, mobile and other analytics.
  • How deep we go depends on our role; both level and task.
  • Regardless of direct involvement, we’ll want to ensure someone is appropriately tasked.

AT WHAT LEVEL IS PRODUCT CONCERNED WITH ML / AI ANALYTICS?

5 of 37

THE CLASSIC PRODUCT MANAGER REALITY…

So…

YOU might not own this task. Maybe it’s a direct report. Or it’s being managed by another department.

But YOU had better make sure the box gets checked.

6 of 37

ML / AI EVALUATION METRICS

TAXONOMY

7 of 37


Purpose: How is the business doing? Achieving desired outcomes?

Users: Senior Leadership, Product, ideally everyone.

Also known as: Goal metrics, True North metrics.

Examples: Revenue, market penetration, the HEART framework (Happiness, Engagement, Adoption, Retention, Task success), AARRR "pirate metrics" (Acquisition, Activation, Retention, Revenue, Referral), CAC vs. LTV.

Business & Marketing

Product

Technical / Operational

Technical / Monitoring

GENERAL ANALYTICS CATEGORIES FOR PM

Purpose: Understand behavior at feature level, what levers move KPIs.

Users: Functional PMs and teams.

Also known as: drivers, signposts, indirect levers, possibly predictive.

Examples: page usage, feature usage, user journey funnels, abandonment points, marketing campaign flow through, etc.

Purpose: General system and app performance, future capacity concerns.

Users: IT/DevOps/Product

Examples: Website page load time, network latency, etc., queries per second (actual and capacity).

Purpose: Assure Operational readiness

Users: IT/DevOps

Examples: System “heartbeat” monitoring, uptime, etc.

Arguably, not strictly analytics, but tightly coupled to them.

ML Performance Metrics

8 of 37

ML MEASUREMENT AT PRODUCT LEVEL

Performance Metrics

  • Data Quality
  • Classification Metrics
  • Regression Metrics
  • Clustering Metrics
  • Performance & Efficiency
  • Large Language Model (LLM) & Other Model Type Specific Metrics
  • Convolutional Neural Network (CNN) Specific Metrics

Business Success Metrics

  • User Experience
  • Operating Metrics
  • Business Performance
  • Suitability to Task

9 of 37

WHY SO MANY SUBCATEGORIES?

A linear model for binary classification that estimates the probability of an outcome using a logistic function. Use Case: Predicting if email is spam or not.

Classifies instances based on majority class among k-nearest neighbors in a feature space. Use Case: Recommending products based on similarities to other customers' purchase histories.

An ensemble model that constructs multiple decision trees during training and outputs the mode or mean prediction of the individual trees. Use Case: Predicting customer churn for subscriptions.

A model composed of interconnected layers of nodes (neurons) that learn to transform input data through non-linear mappings to output predictions. Use Case: Recognizing objects in images.

We have a wide variety of algorithms for our use cases. The math and metrics used to evaluate them will differ, though there is overlap.

Logistic Regression

Neural Network

Random Forest

K-Nearest Neighbors (KNN)

SELECTED popular examples…

A model trained on vast amounts of text data to understand and generate language, capable of a variety of natural language processing tasks. Use Case: Generating human-like text for customer service chatbots or writing assistance tools.

A type of deep learning model designed to process and analyze visual data by automatically learning spatial hierarchies of features. Use Case: Detecting and classifying objects in images for applications like automated medical diagnosis from X-ray images.

Convolutional Neural Network (CNN)

Large Language Model (LLM)

10 of 37

We will not be going over the dozens of KPIs and metrics in depth. Instead, we’re going to focus on a top few for common use cases.

WHAT WE WON’T BE DOING

We will have a comprehensive (though not fully exhaustive) list of metrics posted with our course materials, including: the metrics, their targets (where applicable), general notes, and references to go deeper.

WHAT YOU CAN DO

GETTING INTO SPECIFIC KPIs & METRICS

11 of 37

Metrics

Metrics are quantifiable measurements used to track and assess the performance of a specific process, system, or activity. They provide objective data that can be used to analyze and improve the underlying issue being measured.

Key Performance Indicators (KPIs)

This is also a measurement, but it’s related to a strategic issue. KPIs are a subset of metrics chosen to reflect the most critical or important aspects of performance for a particular goal or objective. KPIs are used to evaluate and communicate progress towards achieving strategic and operational targets and reflect how successful your business is at achieving that goal.

FIRST WE NEED TO RE-VISIT

METRICS VS KPIs

Tactical

Strategic

12 of 37

Suitability to Task

For all the technical assessments we need to do, in the end all that matters are the business outcomes we see. The tools are a means to that end.

Costs / P&L

In the context of business, (vs. not-for-profit or government), the only sustainable value proposition is one that is profitable. (Cost of Acquisition (CAC) vs. Lifetime Value (LTV))

IN THE END ONLY TWO METRICS MATTER

% of customer service tickets fully automated or faster with reps

Less resource-intensive data analysis; from finance to healthcare to… ??

Accurate / useful recommendations; products, services

Initial build

Maintenance / Updates

Per Unit costs?

(e.g., inferences)

As PM, you may or may not own the P&L. But rest assured…

someone does

13 of 37

PRODUCT MANAGEMENT

ML / AI ANALYTICS

Details for Selected Examples

14 of 37

UNDERSTANDING

GROUND TRUTH

What is it?

In the context of AI/ML analytics, ground truth refers to accurate, definitive, and reliable data serving as a benchmark for evaluating the performance of a machine learning model.

It represents the true, known labels or classifications of the data samples, typically manually curated by domain experts.

This data is key to assessing accuracy, precision, recall, and other metrics of a model's predictions, as it provides the authoritative reference point against which model outputs can be compared.

Why it matters?

Without access to quality, well-labeled ground truth data, it becomes challenging to impossible to reliably measure the performance of AI/ML systems, as there would be no definitive way to determine the correctness of a model’s inferences.

15 of 37

DON’T MISS THIS INSIGHT

Understand that after testing and validating a model, there are aspects of live production data which are not effectively testable as a practical matter. At least, not without having high reliability of known labels.

Testing live production data, especially when changing, (drifting), over time, becomes more of a probabilistic exercise.

16 of 37

WHERE DOES

GROUND TRUTH APPLY?

The ground truth issue may seem most applicable to classification models, perhaps because such tasks usually deal with clear labels that feel tangible.

However, ground truth is equally important in other types of models:

  • Regression: We have continuous values that the model aims to predict; house prices, temperature.
  • Object Detection: We use locations and boundaries of objects within image spaces.
  • Natural Language Processing: We have accurate annotations or translations as baselines.
  • Reinforcement Learning: “truths” can be reward signals guiding the learning process.

To sum up, for all our mathematical sophistication in testing, there is nonetheless a potential gap in our ability to do so based on the realities of inherently messy real-world data.

17 of 37

SOME TERMINOLOGY…

With our tests, a lot of times we’ll be right. Other times wrong. Sometimes we think we’re right, but we’re not. Other times, we think we’re wrong, but we were right. Here’s how we sort this all out…

  • True Positives (TP): Correctly predicted positive instances.
  • True Negatives (TN): Correctly predicted negative instances.
  • False Positives (FP): Incorrectly predicted positive instances (actual negatives predicted as positive).
  • False Negatives (FN): Incorrectly predicted negative instances (actual positives predicted as negative).

The sum of TP, TN, FP, and FN adds up to 100% of the instances.

Make sure to look for baseline scores if using popular models and datasets for training. These can help you more quickly determine if custom work you are doing is going in the right direction.
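As a quick illustration, the four counts above can be tallied directly from paired actual/predicted labels. This is a minimal pure-Python sketch; the label lists are made up for the example:

```python
# Tally TP, TN, FP, FN from paired actual vs. predicted labels.
# The label lists below are hypothetical, for illustration only.
def tally(actual, predicted, positive=1):
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp, tn, fp, fn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = tally(actual, predicted)
print(tp, tn, fp, fn)  # → 3 3 1 1

# The four counts always cover 100% of the instances.
assert tp + tn + fp + fn == len(actual)
```

Every instance lands in exactly one of the four buckets, which is why they sum to 100%.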

18 of 37

EXAMPLE FRAUD DETECTION SYSTEM

Evil product thieves have been getting away with fake transactions and hurting our business. So we built a fraud detector!

For testing, we have…

  • Total Transactions: 1000
  • Actual fraudulent transactions: 100
  • Actual legitimate transactions: 900

Model predictions

  • Predicted fraudulent transactions: 130 (true positives plus false positives)
  • Predicted legitimate transactions: 870 (true negatives plus false negatives)

Of the 130 transactions predicted to be fraudulent:

  • 80 were fraudulent (true positives, TP)
  • 50 were legitimate (false positives, FP)

Of the 870 transactions predicted to be legitimate:

  • 850 were legitimate (true negatives, TN)
  • 20 were fraudulent (false negatives, FN)
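The scenario above can be scored in a few lines of Python; the counts are taken straight from this slide, and the formulas are the standard ones for these metrics:

```python
# Fraud-detector counts from the scenario above.
tp, fp, tn, fn = 80, 50, 850, 20

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 930 / 1000
precision = tp / (tp + fp)                    # 80 / 130
recall    = tp / (tp + fn)                    # 80 / 100

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.930 precision=0.615 recall=0.800
```

These three numbers are exactly what the next few slides unpack one at a time.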

19 of 37

Accuracy

Basic performance metric to evaluate overall correctness of a model's predictions. It’s the ratio of predictions correct to the predictions made.

FOCUS ON ACCURACY

 

classification

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (80 + 850) / 1000 = 0.93

TP and TN don’t necessarily sum to 100% because they represent only the correctly classified instances; they exclude False Positives (FP) and False Negatives (FN).

Here, 93% of the total transactions were correctly classified as either fraudulent or legitimate.

The accuracy metric is calculated the same way for both binary and multi-class classification; binary being something like spam vs. not spam emails, and multi-class being something like classifying types of animals, or products into categories. In our example, the 80 are the transactions correctly identified as fraudulent and the 850 are those correctly identified as legitimate.

Our e-commerce system tries to predict fraudulent transactions with an aggressive model…

 

20 of 37

Precision

Shows how often an ML model is correct when predicting the target, (that is, “positive”), class. Or we can say, “how often were we actually right when we thought we were right.”

FOCUS ON PRECISION

 

classification

Precision = TP / (TP + FP) = 80 / (80 + 50) ≈ 0.62

Precision is measured on a scale from 0 to 1, or expressed as a percentage. Higher precision is better, with 1.0 meaning the model is always right when it predicts the target class. (Note: for multi-class problems, precision is typically calculated separately for each class.)

Precision helps us avoid potentially high costs of having false positives. Bad predictions in our example can result in delayed or lost transactions and upset customers. We need to reduce fraud, but our model needs work if it costs us good transactions and customers. We need to reduce the false positive errors.

Be aware of the challenge… if we increase precision, we might reduce accuracy and recall.

 

Our aggressive fraud detector turns out to have caught 50 transactions that weren’t actually fraud.

21 of 37

Recall

Shows how well an ML model can find objects of the target class. (That is the “positive” class.)

FOCUS ON RECALL

 

classification

Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80

Recall is measured on a scale from 0 to 1, or expressed as a percentage. The higher the better, with 1 meaning you found every positive instance, such as every fraudulent transaction in our example.

Recall may also be known as “sensitivity” or the “true positive rate.” In our example: how many other transactions were fraud that we missed? In search, this would be how many relevant documents were not returned. In medicine, was our test sensitive enough to detect an illness?

Here, 80% of the actual fraudulent transactions were correctly identified.

 

 

22 of 37


SUMMARY OF OUR SCENARIO…

classification

We can perhaps adjust a classification threshold.

We can enhance feature selection and engineering to incorporate more informative features that help distinguish between fraudulent and legitimate transactions.

We can try other modeling techniques / algorithms or ensemble methods to better capture fraud patterns.

If we do these things, we want to maintain high recall so that we continue to identify most fraudulent transactions.

We have arguably decent accuracy and recall, and we’re doing somewhat well at predicting fraud. But could we do better?


We caught a lot of fraud, but we seem to have a fairly high rate of false positives. If we took some action on these, (like ask for more customer information), we could lose sales due to customer frustration.

The goal is to increase recall (catch more fraud) while also improving precision (reduce false positives). The exact balance would depend on the specific costs associated with false positives versus false negatives in your business context.
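One way to see the trade-off is to sweep the classification threshold mentioned above. A minimal sketch with made-up fraud scores (hypothetical model outputs, not real data):

```python
# Hypothetical fraud scores (model outputs) and true labels (1 = fraud).
# Sweeping the decision threshold shows the precision/recall trade-off.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.85, 0.5, 0.15):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.50
# threshold=0.50  precision=0.60  recall=0.75
# threshold=0.15  precision=0.50  recall=1.00
```

A strict threshold gives high precision but misses fraud; a loose one catches all fraud but annoys good customers. Where to sit depends on the relative cost of each error.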

23 of 37


FOCUS ON PRECISION / RECALL CONFUSION

classification

Recall

Recall measures how many of the actual positive cases your model correctly identified. It's about completeness: how many relevant items did you catch?

Consider a fishing net: Recall is like casting a wide net. You want to catch all the fish (true positives), even if you catch some seaweed (false positives) too.

Think of a fire alarm system. Recall is like the system’s ability to detect all the actual fires. A high recall means the alarm goes off every time there is a fire, even if it sometimes goes off when there isn’t one.

Precision

Precision measures how many of the cases your model identified as positive were actually positive. It's about accuracy: how reliable are your positive predictions?

Using the same fishing net metaphor, precision is like measuring how many of the things you caught in your net are actually fish (and not weeds or trash). A high precision means that most of what you caught are actual fish.

For the fire alarm system, precision is like the alarm’s ability to avoid false alarms. A high precision means that when the alarm goes off, there is indeed a fire, and it doesn't go off for false positives like burnt toast.

Recall focuses on not missing any positive cases, while Precision focuses on ensuring that what you've identified as positive is correct. In many real-world scenarios, there's often a trade-off between the two.

People sometimes struggle with the difference. Try this…

24 of 37

Confusion Matrix

A tabular, (or grid) summary of the performance of a classification model showing true positives, true negatives, false positives, and false negatives.

FOCUS ON CONFUSION MATRIX

classification

This set of information can offer a deeper understanding of a model’s recall, accuracy, precision, and overall effectiveness. Basic measures, (accuracy, precision, recall), may obscure underlying truths. By looking more closely at the numbers that drive those ratios, we can understand more about where our models may be doing well or not.

We did not go over the F1 score or a Precision-Recall Curve. (Those are in the fuller metrics list.) Both of those tools help view and balance precision and recall trade-off issues. The Confusion Matrix lays out the specific driver numbers behind these insightful tools.

                        Actual Positive                       Actual Negative
Predicted Positive      True Positive (TP)                    False Positive (FP, Type I Error)
Predicted Negative      False Negative (FN, Type II Error)    True Negative (TN)

False Positive (Type I Error): we predicted a positive instance, but it is really negative.

False Negative (Type II Error): we predicted a negative instance, but it is really positive.

This simple example is for a binary classification. (We have “fraud” vs. “not fraud.”)

But we can also use the matrix for more than two classes; you just end up with a larger grid covering the possibilities, e.g., different types of fraud attacks.
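For the multi-class case just described, the matrix is straightforward to build. A minimal pure-Python sketch; the fraud-type class names are hypothetical:

```python
from collections import Counter

# Build a confusion matrix for any number of classes.
# Rows = actual class, columns = predicted class.
def confusion_matrix(actual, predicted, classes):
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

# Hypothetical fraud-type labels for six transactions.
classes   = ["legit", "card_fraud", "account_takeover"]
actual    = ["legit", "legit", "card_fraud", "account_takeover", "card_fraud", "legit"]
predicted = ["legit", "card_fraud", "card_fraud", "account_takeover", "legit", "legit"]

matrix = confusion_matrix(actual, predicted, classes)
for cls, row in zip(classes, matrix):
    print(f"{cls:18s} {row}")
```

Correct predictions sit on the diagonal; everything off-diagonal shows which classes the model confuses with which.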

25 of 37

BLEU (Bilingual Evaluation Understudy)

Evaluates the accuracy of machine-generated translations by comparing them to one or more reference translations.

FOCUS ON BLEU

Large Language Models

In some ways, it’s like precision, in that it looks at how well the generated words match the reference output.

As with most of these scores, there are trade-offs, e.g., semantic similarity. “The plane is in the sky” and “the plane is flying” should be treated as similar, whereas “the plane is in the hangar” means something totally different. A summary that gets this wrong might not be caught by BLEU, because all these phrases share n-grams.

Recall that an n-gram is a sequence of n items; in our case here, words. (a bigram is 2 words, etc.) Here, we compare words from the predicted sentence with target sentences. Words matching target sentences are considered correct.

The simplest version is computed as the fraction of candidate words that appear in the reference. This is only the beginning, though; for details, see the references within the more detailed course reference list of metrics.

We can use BLEU to test effectiveness for tasks like text summaries, translation, speech recognition, image captioning and similar.

 

Scoring is on a 0 – 1 scale, with 1 being perfect; higher scores suggest the translation, for example, is closer to the reference. (BLEU is often reported on a 0 – 100 scale, where below 10 is generally poor and 50+ is good.)
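A highly simplified sketch of the idea, using only unigram precision with count clipping (real BLEU also combines higher-order n-grams and a brevity penalty). It also shows the semantic pitfall noted above; the sentences are the slide's own examples:

```python
from collections import Counter

# Simplified BLEU-style unigram precision with clipping: each candidate
# word counts only up to the number of times it appears in the reference.
def clipped_unigram_precision(candidate, reference):
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return matched / sum(cand_counts.values())

ref = "the plane is in the sky"
print(f"{clipped_unigram_precision('the plane is in the sky', ref):.2f}")     # 1.00
print(f"{clipped_unigram_precision('the plane is in the hangar', ref):.2f}")  # 0.83
```

Note the second score stays high even though the meaning is the opposite: the phrases share almost all their n-grams, which is exactly the weakness described above.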

26 of 37

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Evaluates the quality of text summaries by measuring overlap between candidate summary and reference summaries in terms of n-grams, sequences, and word pairs.

FOCUS ON ROUGE

Large Language Models

ROUGE scores serve as an indicator of similarity based on shared words, either in the form of n-grams or word sequences. The score ranges from 0 to 1, where a higher score indicates a greater similarity. This gives insight into how well the automated summary captures relevant information.

"Gisting" refers to the process of extracting the main ideas or the essence of a text, rather than focusing on all the details.

There are several ROUGE scores:

ROUGE-1: Unigram (single word) overlap

ROUGE-2: Bigram (two-word sequence) overlap

ROUGE-L: Longest common subsequence

ROUGE-W: Weighted LCS (emphasizes longer sequences)

ROUGE-S: Skip-bigram (non-consecutive word pairs) overlap

To get the scores, we need recall and precision:

ROUGE Recall = overlapping n-grams / total n-grams in the reference

ROUGE Precision = overlapping n-grams / total n-grams in the candidate

ROUGE F1 = 2 × (Precision × Recall) / (Precision + Recall)

We will look at an example in code soon.
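As a preview, ROUGE-1 recall, precision, and F1 can be sketched in a few lines of pure Python (a simplification of what dedicated ROUGE libraries compute; the sentences are made up):

```python
from collections import Counter

# Sketch of ROUGE-1: unigram overlap between a candidate summary and a
# reference. Recall divides by reference length, precision by candidate length.
def rouge_1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

r, p, f = rouge_1(candidate, reference)
print(f"recall={r:.2f} precision={p:.2f} f1={f:.2f}")
# recall=0.83 precision=0.83 f1=0.83
```

Five of the six reference unigrams appear in the candidate, so recall and precision both land at 5/6.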

27 of 37


BUT WAIT… THERE’S MORE

  • There are dozens more metrics, and new ones are evolving.
  • For a more comprehensive list of the most common ML / AI performance metrics, see the course information Wiki/Notion page for this session.
  • If you are seeing this presentation outside the context of the ML / AI training course, you can find a metrics summary document here: https://docs.google.com/document/d/1icek9K-m6hNUcjBarAWfWeHUoU7hAHFl7to1WOtXvak/edit?usp=sharing

  • Please see the course materials site for a fuller list of metrics and KPIs, with explanations, use cases and links to learn more about each one.

28 of 37

EVALUATION METRICS

REAL WORLD PRACTICE

29 of 37

COLAB FREE PYTHON CODING NOTEBOOKS

Colab, or "Colaboratory", allows you to write and execute Python in your browser. It's free and easy to play with, especially if you're not a programmer and don't want to install and manage development tools on your PC but do want to try some things.

WHAT

You can plug in copied or AI-generated code to try out ML models, as well as look at basic analytics.

WHY

ACCESS POINTS

colab.research.google.com

Getting Started with Google Colab: A Beginner's Guide

Google Colab Tutorial for Beginners (YouTube)

EXAMPLE FILES WE’LL BE LOOKING AT LATER

We’ll be looking at some specific examples soon.

 

 

 

30 of 37

HUGGINGFACE.CO AI COMMUNITY TOOLS, MODELS, DATA

Hugging Face is a community focused on Natural Language Processing (NLP) and artificial intelligence (AI). There are thousands of free models and data sets.

WHAT

You can plug in copied or AI-generated code to try out ML models, as well as look at basic analytics. It’s an especially good place to focus on NLP and transformers, which are currently hot areas of AI activity.

WHY

 

 

 

31 of 37

PERFORMING EXPERIMENTS

IF YOU WANT TO TRY your own experiments using Colab and Hugging Face, you’ll need a small amount of setup. Colab comes with your free Google/Gmail account, and you can sign up with Hugging Face for free as well. Ask your favorite AI to write code for you if you like.

32 of 37

Classification Metrics

Classification on news articles, with Accuracy, Precision, Recall, and F1 scores

COLAB EXAMPLE PROJECTS

Language Model Test

Language Model Test with PENN dataset with metrics for Perplexity, Accuracy, and BLEU score.

Language Model Test

Language Model Test with ROUGE score.

Language Model Test

Language Model Test with ROUGE, BLEU and METEOR scores.

33 of 37

SUMMARY OF WHAT TESTS DO

In many techniques, we’re trying to fit a curve: a function that best represents the relationship between input variables (features) and the output (target or predictions), based on observed data.

This is generally the case for supervised learning regression and neural networks.

Checking the fit of a function to a curve

Models like decision trees, rule-based models, clustering, ensemble, reinforcement learning and similar are not so much about curve fit. In these cases, we try to evaluate how effectively data is partitioned or optimized or makes decisions.

Checking how well they achieve purpose

34 of 37

EVALUATION METRICS

OBSERVABILITY PLATFORMS

35 of 37

OBSERVABILITY

As in the past, the “battle of the checkboxes” of features will repeat itself in this category.

Most of these tools will have some collection of basic must-have features and then differentiate by types of metrics, automated options, and more.

We are still early in the evolution of these tools. We’ll see consolidation, build-out, and acquisitions by existing analytics companies.

As with all such tools, product, dev, and finance teams will have to assess the ROI on such tools. Continuous monitoring of production products will become mission critical as these products become more embedded in core offerings.

An AI/ML observability platform provides comprehensive monitoring, debugging, and analysis tools to ensure the reliability, performance, and transparency of machine learning models in production.

FIRST WE HAD BI, WEB AND LATER MOBILE & SOCIAL…

THEN IOT, WEB3 / BLOCKCHAIN…

AND NOW, WE ADD ML / AI OBSERVABILITY PLATFORMS


36 of 37

LET’S END NEAR WHERE WE STARTED

YOU might not own this task. Maybe it’s a direct report. Or it’s being managed by another department.

But YOU had better make sure the box gets checked.

As a Product Manager, you are most likely responsible for business outcomes. Your day-to-day may often be about features and tactical issues. In the end though, success and failure rests on outcomes.

Making sure your ML / AI analytics are effective towards your goals will increasingly be part of your world.

37 of 37

Thank You for Attending

this Session!