Introduction to ML / AI
Project Performance Metrics
Product Management Perspective
Product Managers' Concerns
PM role in Evaluation
AGENDA
Taxonomy
Overview of Metric Types
Some Common Metrics: Specific Examples
Real World Practice
Evaluation Metrics in Action
Observability Platforms
Measurement Tools
Finishing Up
Conclusion / Questions
Our Goal: Come away with a basic understanding of evaluation metrics for ML / AI models and have great starting places for going deeper in our own self-study.
PRODUCT MANAGEMENT
ML / AI PERFORMANCE CONCERNS
Where Should We Focus
TACTICAL
HOW DEEP?
STRATEGIC
AT WHAT LEVEL IS PRODUCT CONCERNED WITH ML / AI ANALYTICS?
THE CLASSIC PRODUCT MANAGER REALITY…
So…
YOU might not own this task. Maybe it’s a direct report. Or it’s being managed by another department.
But YOU had better make sure the box gets checked.
ML / AI EVALUATION METRICS
TAXONOMY
Purpose: How is the business doing? Achieving desired outcomes?
Users: Senior Leadership, Product, ideally everyone.
Also known as: Goal metrics, True North metrics.
Examples: Revenue, Market penetration, HEART framework (Happiness, Engagement, etc.), AARRR (Acquisition, Activation, CAC vs LTV, etc.)
Business & Marketing
Product
Technical / Operational
Technical / Monitoring
GENERAL ANALYTICS CATEGORIES FOR PM
Purpose: Understand behavior at feature level, what levers move KPIs.
Users: Functional PMs and teams.
Also known as: drivers, signposts, indirect levers, possibly predictive.
Examples: page usage, feature usage, user journey funnels, abandonment points, marketing campaign flow through, etc.
Purpose: General system and app performance, future capacity concerns.
Users: IT/DevOps/Product
Examples: Website page load time, network latency, etc., queries per second (actual and capacity).
Purpose: Assure Operational readiness
Users: IT/DevOps
Examples: System “heartbeat” monitoring, uptime, etc.
Arguably, not strictly analytics, but tightly related to them.
ML Performance Metrics
ML MEASUREMENT AT PRODUCT LEVEL
Performance Metrics
Business Success Metrics
WHY SO MANY SUBCATEGORIES?
A linear model used for binary classification that estimates probability of a binary outcome using a logistic function. Use Case: Predicting if email is spam or not.
Classifies instances based on majority class among k-nearest neighbors in a feature space. Use Case: Recommending products based on similarities to other customers' purchase histories.
An ensemble model that constructs multiple decision trees during training and outputs the mode or mean prediction of the individual trees. Use Case: Predicting customer churn for subscriptions.
A model composed of interconnected layers of nodes (neurons) that learn to transform input data through non-linear mappings to output predictions. Use Case: Recognizing objects in images.
We have a wide variety of algorithms for our use cases. The math and metrics to evaluate them will differ, though there is overlap.
Logistic Regression
Neural Network
Random Forest
K-Nearest Neighbors (KNN)
SELECTED popular examples…
A model trained on vast amounts of text data to understand and generate language, capable of a variety of natural language processing tasks. Use Case: Generating human-like text for customer service chatbots or writing assistance tools.
A type of deep learning model designed to process and analyze visual data by automatically learning spatial hierarchies of features. Use Case: Detecting and classifying objects in images for applications like automated medical diagnosis from X-ray images.
Convolutional Neural Network (CNN)
Large Language Model (LLM)
We will not be going over the dozens of KPIs and metrics in depth. Instead, we’re going to focus on a top few for common use cases.
WHAT WE WON’T BE DOING
We will have a comprehensive, (though not fully exhaustive), list of metrics posted with our course materials, including: the metrics, their targets (where applicable), general notes, and references to go deeper.
WHAT YOU CAN DO
GETTING INTO SPECIFIC KPIs & METRICS
Metrics
Metrics are quantifiable measurements used to track and assess the performance of a specific process, system, or activity. They provide objective data that can be used to analyze and improve the underlying issue being measured.
Key Performance Indicators (KPIs)
This is also a measurement, but it’s related to a strategic issue. KPIs are a subset of metrics chosen to reflect the most critical or important aspects of performance for a particular goal or objective. KPIs are used to evaluate and communicate progress towards achieving strategic and operational targets and reflect how successful your business is at achieving that goal.
FIRST WE NEED TO RE-VISIT
METRICS VS KPIs
Tactical
Strategic
Suitability to Task
For all the technical assessments we need to do, in the end all that matters are the business outcomes we see. The tools are a means to that end.
Costs / P&L
In the context of business, (vs. not-for-profit or government), the only sustainable value proposition is one that is profitable. (Cost of Acquisition (CAC) vs. Lifetime Value (LTV))
IN THE END ONLY TWO METRICS MATTER
% of customer service tickets fully automated or faster with reps
Less resource-intensive data analysis; from finance to healthcare to… ??
Accurate / useful recommendations; products, services
Initial build
Maintenance / Updates
Per Unit costs?
(e.g., inferences)
As PM, you may or may not own the P&L. But rest assured…
someone does
PRODUCT MANAGEMENT
ML / AI ANALYTICS
Details for Selected Examples
UNDERSTANDING
GROUND TRUTH
What is it?
In the context of AI/ML analytics, ground truth refers to accurate, definitive, and reliable data serving as a benchmark for evaluating the performance of a machine learning model.
It represents the true, known labels or classifications of the data samples, typically manually curated by domain experts.
This data is key to assessing accuracy, precision, recall, and other metrics of a model's predictions, as it provides the authoritative reference point against which model outputs can be compared.
Why it matters?
Without access to quality, well-labeled ground truth data, it becomes challenging to impossible to reliably measure the performance of AI/ML systems, as there would be no definitive way to determine the correctness of a model’s inferences.
DON’T MISS THIS INSIGHT
Understand that after testing and validating a model, there are aspects of live production data which are not effectively testable as a practical matter. At least, not without highly reliable known labels.
Testing live production data, especially when changing, (drifting), over time, becomes more of a probabilistic exercise.
WHERE DOES
GROUND TRUTH APPLY?
The ground truth issue may seem most applicable to classification models, perhaps because such tasks usually deal with clear labels that feel tangible.
However, ground truth is equally important in other types of models:
To sum up, for all our mathematical sophistication in testing, there is nonetheless a potential gap in our ability to do so based on the realities of inherently messy real-world data.
SOME TERMINOLOGY…
With our tests, a lot of times we'll be right. Other times wrong. Sometimes we think we're right, but we're not. Other times, we think we're wrong, but we were right. Here's how we sort this all out…
The sum of TP, TN, FP, and FN adds up to 100% of the instances.
Make sure to look for baseline scores if using popular models and datasets for training. These can help you more quickly determine if custom work you are doing is going in the right direction.
EXAMPLE FRAUD DETECTION SYSTEM
Evil product thieves have been getting away with fake transactions and hurting our business. So we built a fraud detector!
For testing, we have…
Model predictions
Of the 130 transactions predicted to be fraudulent:
Of the 870 transactions predicted to be legitimate:
Accuracy
Basic performance metric used to evaluate the overall correctness of a model's predictions. It's the ratio of correct predictions to total predictions made.
FOCUS ON ACCURACY
classification
TP and TN don't necessarily sum to 100% because they only represent the correctly classified instances; they don't include False Positives (FP) and False Negatives (FN).
Here, 93% of the total transactions were correctly classified as either fraudulent or legitimate.
The accuracy metric is calculated the same for both binary and multi-class classification. I.e., binary being something like spam vs. not spam emails, and multi-class being something like classifying types of animals, or products into categories. In our example, the 80 is “actually fraudulent” and the 850 is “actually legitimate.”
Our e-commerce system tries to predict fraudulent transactions with an aggressive model…
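The accuracy arithmetic can be sketched in a few lines of Python. The counts below are reconstructed from the slide's illustrative scenario (1,000 transactions, 93% accuracy); they are not real data:

```python
# Accuracy = correct predictions / all predictions.
# Counts reconstructed from the slide's illustrative fraud scenario.
tp = 80   # predicted fraud, actually fraud
fp = 50   # predicted fraud, actually legitimate
fn = 20   # predicted legitimate, actually fraud
tn = 850  # predicted legitimate, actually legitimate

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")  # 93%
```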
Precision
Shows how often an ML model is correct when predicting the target, (that is, “positive”), class. Or we can say, “how often were we actually right when we thought we were right.”
FOCUS ON PRECISION
classification
Precision is on a scale from 0 to 1, or expressed as a percentage. Higher precision is better, with 1.0 meaning the model is always right when predicting the target class. (Note: for multi-class problems, precision is typically calculated separately for each class.)
Precision helps us avoid potentially high costs of having false positives. Bad predictions in our example can result in delayed or lost transactions and upset customers. We need to reduce fraud, but our model needs work if it costs us good transactions and customers. We need to reduce the false positive errors.
Be aware of the challenge… if we increase precision, we might reduce accuracy and recall.
Our aggressive fraud detector turns out to have caught 50 transactions that weren’t actually fraud.
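Using the same assumed counts from the scenario (80 true frauds among the 130 transactions flagged), precision can be computed directly:

```python
# Precision = TP / (TP + FP): of the transactions we flagged as fraud,
# how many really were fraud? Counts assumed from the slide's scenario.
tp = 80  # flagged as fraud and actually fraud
fp = 50  # flagged as fraud but actually legitimate

precision = tp / (tp + fp)
print(f"Precision: {precision:.1%}")  # 61.5%
```

So even with 93% accuracy, only about 6 in 10 fraud flags are correct, which is exactly the false-positive cost the slide warns about.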
Recall
Shows how well an ML model can find objects of the target class. (That is the “positive” class.)
FOCUS ON RECALL
classification
Recall is also on a 0 to 1 scale, or a percentage. Higher is better, with 1.0 meaning you found every event of the target class, such as every fraudulent transaction in our example.
Recall may also be known as "sensitivity" or the "true positive rate." In our example, how many other transactions were fraud that we missed? In search, this would be how many other relevant documents were not returned. In medicine, was our test sensitive enough to detect an illness?
Here, 80% of the actual fraudulent transactions were correctly identified.
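With the scenario's assumed counts (80 frauds caught, 20 missed), recall works out as follows:

```python
# Recall = TP / (TP + FN): of all the actual fraud, how much did we catch?
# Counts assumed from the slide's scenario.
tp = 80  # fraudulent transactions the model caught
fn = 20  # fraudulent transactions the model missed

recall = tp / (tp + fn)
print(f"Recall: {recall:.0%}")  # 80%
```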
SUMMARY OF OUR SCENARIO…
classification
We can perhaps adjust a classification threshold.
We can enhance feature selection and engineering to incorporate more informative features that help distinguish between fraudulent and legitimate transactions.
We can try other modeling techniques / algorithms or ensemble methods to better capture fraud patterns.
If we do these things, we want to maintain high recall so that we continue to identify most fraudulent transactions.
We have arguably decent accuracy and recall, and we're doing somewhat well in predicting fraud; but could we do better?
We caught a lot of fraud, but we seem to have a fairly high rate of false positives. If we took some action on these, (like ask for more customer information), we could lose sales due to customer frustration.
The goal is to increase recall (catch more fraud) while also improving precision (reduce false positives). The exact balance would depend on the specific costs associated with false positives versus false negatives in your business context.
FOCUS ON PRECISION / RECALL CONFUSION
classification
Recall
Recall measures how many of the actual positive cases your model correctly identified. It's about completeness - how many relevant items did you catch?
Consider a fishing net: Recall is like casting a wide net. You want to catch all the fish (true positives), even if you catch some seaweed (false positives) too.
Think of a fire alarm system. Recall is like the system’s ability to detect all the actual fires. A high recall means the alarm goes off every time there is a fire, even if it sometimes goes off when there isn’t one.
Precision
Precision measures how many of the cases your model identified as positive were actually positive. It's about accuracy - how reliable are your positive predictions?
Using the same fishing net metaphor, precision is like measuring how many of the things you caught in your net are actually fish (and not weeds or trash). A high precision means that most of what you caught are actual fish.
For the fire alarm system, precision is like the alarm’s ability to avoid false alarms. A high precision means that when the alarm goes off, there is indeed a fire, and it doesn't go off for false positives like burnt toast.
Recall focuses on not missing any positive cases, while Precision focuses on ensuring that what you've identified as positive is correct. In many real-world scenarios, there's often a trade-off between the two.
People sometimes struggle with the difference. Try this…
Confusion Matrix
A tabular, (or grid) summary of the performance of a classification model showing true positives, true negatives, false positives, and false negatives.
FOCUS ON CONFUSION MATRIX
classification
This set of information can offer a deeper understanding of a model’s recall, accuracy, precision, and overall effectiveness. Basic measures, (accuracy, precision, recall), may obscure underlying truths. By looking more closely at the numbers that drive those ratios, we can understand more about where our models may be doing well or not.
We did not go over the F1 score or a Precision-Recall Curve. (Those are in the fuller metrics list.) Both of those tools help view and balance precision and recall trade-off issues. The Confusion Matrix lays out the specific driver numbers behind these insightful tools.
Confusion matrix layout (positive = fraud):

                      Predicted Positive                 Predicted Negative
Actual Positive       True Positive                      False Negative (Type II Error)
Actual Negative       False Positive (Type I Error)      True Negative
We predicted a positive instance, but it is really negative. (aka “false positive”)
We predicted a negative instance, but it is really positive. (aka “false negative”)
This simple example is for a binary classification. (We have “fraud” vs. “not fraud.”)
But we can also use the matrix for more than two classes; you just end up with a larger grid covering the possibilities, e.g., different types of fraud attacks.
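The matrix's counts can be tallied from paired actual/predicted labels in a few lines of Python. The six label pairs below are made-up stand-ins, not data from the fraud scenario:

```python
# Building confusion-matrix counts from paired labels ("fraud" = positive).
# These six label pairs are illustrative stand-ins, not real data.
actual    = ["fraud", "fraud", "legit", "legit", "fraud", "legit"]
predicted = ["fraud", "legit", "fraud", "legit", "fraud", "legit"]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == "fraud" and p == "fraud")
fn = sum(1 for a, p in pairs if a == "fraud" and p == "legit")
fp = sum(1 for a, p in pairs if a == "legit" and p == "fraud")
tn = sum(1 for a, p in pairs if a == "legit" and p == "legit")

print("                 Predicted fraud   Predicted legit")
print(f"Actual fraud     TP = {tp}            FN = {fn}")
print(f"Actual legit     FP = {fp}            TN = {tn}")
```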
BLEU (Bilingual Evaluation Understudy)
Evaluates the accuracy of machine-generated translations by comparing them to one or more reference translations
FOCUS ON BLEU
Large Language Models
In some ways, it’s like “Precision” in that it looks at how many of the generated words properly appear in the reference output.
As with most of these scores, there are trade-offs, e.g., semantic similarity. If we say “the plane is in the sky” and “the plane is flying,” those should be similar, whereas “the plane is in the hangar” means something totally different. A summary that gets this wrong might still score well, because all these phrases share n-grams; a BLEU score alone would not catch the error.
Recall that an n-gram is a sequence of n items; in our case here, words. (a bigram is 2 words, etc.) Here, we compare words from the predicted sentence with target sentences. Words matching target sentences are considered correct.
The simple version is computed as the fraction of candidate words that appear in the reference. This is only the beginning though. For details, see the references within the more detailed course reference list of metrics.
We can use BLEU to test effectiveness for tasks like text summaries, translation, speech recognition, image captioning and similar.
Scoring is on a 0 to 1 scale (often reported as 0 to 100), with 1 being perfect. Higher scores suggest the translation, for example, is closer to the reference translation. On the 100-point scale, roughly 0-10 is poor and 50+ is good.
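To make the n-gram idea concrete, here is a minimal sketch of the clipped (modified) unigram precision that BLEU builds on. Full BLEU also combines higher-order n-grams and a brevity penalty, and the `unigram_precision` function name is ours, so treat this as illustrative only:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped (modified) unigram precision, a building block of BLEU.

    Each candidate word is matched at most as many times as it
    appears in the reference ("clipping").
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return matched / sum(cand_counts.values())

score = unigram_precision("the plane is flying", "the plane is in the sky")
print(score)  # 0.75: "the", "plane", "is" match; "flying" does not
```

Note how the semantically similar "flying" earns no credit while the shared function words do; this is exactly the trade-off described above.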
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Evaluates the quality of text summaries by measuring overlap between candidate summary and reference summaries in terms of n-grams, sequences, and word pairs.
FOCUS ON ROUGE
Large Language Models
ROUGE scores serve as an indicator of similarity based on shared words, either in the form of n-grams or word sequences. The score ranges from 0 to 1, where a higher score indicates a greater similarity. This gives insight into how well the automated summary captures relevant information.
"Gisting" refers to the process of extracting the main ideas or the essence of a text, rather than focusing on all the details.
There are several ROUGE scores:
ROUGE-1: Unigram (single word) overlap
ROUGE-2: Bigram (two-word sequence) overlap
ROUGE-L: Longest common subsequence
ROUGE-W: Weighted LCS (emphasizes longer sequences)
ROUGE-S: Skip-bigram (non-consecutive word pairs) overlap
To get scores we need Recall and Precision
We will look at an example in code soon.
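As a sketch, a set-based ROUGE-1 can be written in a few lines of Python. Real ROUGE implementations count token multiplicities and handle stemming, and the `rouge1` function name is ours, so this is a simplification:

```python
def rouge1(candidate: str, reference: str) -> dict:
    """Set-based ROUGE-1: unigram overlap between candidate and reference.

    Simplified: distinct words only (real ROUGE counts multiplicities).
    """
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    recall = overlap / len(ref)      # how much of the reference is covered
    precision = overlap / len(cand)  # how much of the candidate is on-target
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge1("the cat sat", "the cat sat on the mat"))
```

For example, the call above gives recall 0.6 (3 of the 5 distinct reference words are covered) and precision 1.0 (every candidate word appears in the reference).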
BUT WAIT… THERE’S MORE
EVALUATION METRICS
REAL WORLD PRACTICE
COLAB FREE PYTHON CODING NOTEBOOKS
Colab, or "Colaboratory", allows you to write and execute Python in your browser. Free and easy to play with, especially if you’re not a programmer and don’t want to install and manage development platform tools on your PC, but do want to try some things.
WHAT
You can plug in copied or AI-generated code to try out ML models, as well as look at basic analytics.
WHY
ACCESS POINTS
colab.research.google.com
Getting Started with Google Colab: A Beginner's Guide
Google Colab Tutorial for Beginners (YouTube)

EXAMPLE FILES WE’LL BE LOOKING AT LATER
HUGGINGFACE.CO AI COMMUNITY TOOLS, MODELS, DATA
Hugging Face is a community focused on Natural Language Processing (NLP) and artificial intelligence (AI). There are thousands of free models and data sets.
WHAT
You can plug in copied or AI-generated code to try out ML models, as well as look at basic analytics. It’s an especially good place to focus on NLP and transformers, which are the currently hot areas of AI activity.
WHY
ACCESS POINTS
HuggingFace.co
Hugging Face 101: A Tutorial for Absolute Beginners!
An Introduction to Using Transformers and Hugging Face
Getting Started With Hugging Face in 15 Minutes | Transformers, Pipeline, Tokenizer, Models (YouTube)
PERFORMING EXPERIMENTS
IF YOU WANT TO TRY your own experiments using Colab and Hugging Face, you’ll need a small amount of setup. Colab comes with your free Google/Gmail account, and you can sign up with Hugging Face for free as well. Ask your favorite AI to write code for you if you like.
Classification Metrics
Classification on News Articles with Accuracy, Precision, Recall, and F1 Scores
COLAB EXAMPLE PROJECTS
Language Model Test
Language Model Test with PENN dataset with metrics for Perplexity, Accuracy, and BLEU score.
SUMMARY OF WHAT TESTS DO…
In many techniques, we’re trying to fit a curve to a function; something that best represents the relationship between input variables (features) and the output (target or predictions), based on observed data.
This is generally the case for supervised learning regression and neural networks.
Checking the fit of a function to a curve
Models like decision trees, rule-based models, clustering, ensemble, reinforcement learning and similar are not so much about curve fit. In these cases, we try to evaluate how effectively data is partitioned or optimized or makes decisions.
Checking how well they achieve purpose
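As a toy illustration of the curve-fitting idea above, here is an ordinary least-squares line fit on made-up data, using the closed-form slope/intercept formulas (a real workflow would use numpy or scikit-learn):

```python
# Ordinary least-squares fit of a line y = slope * x + intercept
# on made-up points that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(f"y = {slope:.2f}x + {intercept:.2f}")  # close to the true 2x + 1
```

Evaluation metrics for such models (e.g., mean squared error) then measure how far the fitted curve's predictions sit from the observed data.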
EVALUATION METRICS
OBSERVABILITY PLATFORMS
OBSERVABILITY
As in the past, the “battle of the checkboxes” of features will repeat itself in this category.
Most of these tools will have some collection of basic must-have common features and then differentiate by types of metrics, automated options, and more.
We are still early in the evolution of these tools. We’ll see consolidation and build-out, or acquisitions by existing analytics companies.
As with all such tools, product, dev, and finance teams will have to assess the ROI on such tools. Continuous monitoring of production products will become mission critical as these products become more embedded in core offerings.
An AI/ML observability platform provides comprehensive monitoring, debugging, and analysis tools to ensure the reliability, performance, and transparency of machine learning models in production.
FIRST WE HAD BI, WEB AND LATER MOBILE & SOCIAL…
THEN IOT, WEB3 / BLOCKCHAIN…
AND NOW, WE ADD ML / AI OBSERVABILITY PLATFORMS
Dozens, hundreds, of others
Dozens, hundreds, of others
Dozens of others, and growing
LET’S END NEAR WHERE WE STARTED
YOU might not own this task. Maybe it’s a direct report. Or it’s being managed by another department.
But YOU had better make sure the box gets checked.
As a Product Manager, you are most likely responsible for business outcomes. Your day-to-day may often be about features and tactical issues. In the end though, success and failure rests on outcomes.
Making sure your ML / AI analytics are effective towards your goals will increasingly be part of your world.
Thank You for Attending
this Session!