
Presentation Abstracts (R Greiner)

      Brief Biography                            Recent Presentations

Research Topics:

Tutorials (not research material... ):

Teaching Material


RESEARCH PRESENTATIONS

An Effective Way to Estimate an Individual's Survival Distribution

Context: The "survival prediction" task requires learning a model that can estimate the time until an event will happen for an instance; this differs from standard regression problems, as the training survival dataset may include many "censored instances", which specify only a lower bound on that instance's true survival time. This framework is surprisingly common, as it includes many real-world situations, such as estimating the time until a customer defaults on a loan, until a game player advances to the next level, until a mechanical device breaks, or until a customer churns. [Note this framework allows that "true time" to be effectively infinite for some instances -- ie, some players may never advance, and some customers might not default, etc.] This presentation focuses on the most common situation: estimating the time until a patient dies.

An accurate estimate of a patient’s survival time can help determine the appropriate treatment and care of that patient.  Some common approaches to survival analysis estimate a patient’s risk score; others estimate a patient’s 5-year survival probability, or a population’s survival distribution; however, none of these provides a way to estimate an individual’s expected survival time.  This motivates an alternative class of tools that can learn models that estimate a subject’s survival probability at each time – ie, an individual survival distribution (ISD) – from which one can then estimate that subject’s expected survival time. After describing such ISD models and explaining how they differ from standard models, this presentation discusses standard ways to evaluate such models, then motivates and defines a novel approach, D-Calibration, which determines whether a model's probability estimates are meaningful. We also discuss how these measures differ, and use them to evaluate several ISD prediction tools over a range of real-world survival data sets – demonstrating, in particular, that one tool, MTLR, provides survival estimates that are helpful for patients, clinicians and researchers.
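
For concreteness, here is a minimal Python sketch (not the MTLR implementation, and with invented numbers) of the last step above: reading off an estimated expected survival time from one patient's discrete individual survival curve, by taking the area under the curve over the observed time horizon.

```python
import numpy as np

def expected_survival_time(times, surv_probs):
    """Area under a discrete individual survival curve S(t), which approximates
    the (restricted) mean survival time over the observed time horizon."""
    times = np.asarray(times, dtype=float)
    surv_probs = np.asarray(surv_probs, dtype=float)
    widths = np.diff(times)
    avg_heights = (surv_probs[:-1] + surv_probs[1:]) / 2.0   # trapezoidal rule
    return float(np.sum(widths * avg_heights))

# Hypothetical ISD for one patient: survival probabilities at 0, 12, ..., 60 months.
times = [0, 12, 24, 36, 48, 60]
surv  = [1.00, 0.85, 0.60, 0.40, 0.25, 0.15]
print(f"Estimated (restricted) mean survival: {expected_survival_time(times, surv):.1f} months")
```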

Joint work with Chun-Nam Yu, Humza Haider,

Slides (2021.01.28)  

Video (45m) from 2021 AAAI Symposium: Survival Prediction - aca

Other relevant material:  Survival Prediction Summary (2021)


Which Liver Patients to Waitlist for a New Liver:
Motivating A Novel Survival Prediction Model, and Evaluation Measure

Deciding which patient should be waitlisted for a liver transplant should depend on “utility” -- the patient’s chance of long-term survival with the graft.  However, most survival models produce only “risk scores”, which are “discriminative” (they can compare predicted outcomes between patients) but do not measure the desired characteristic: the utility for a single patient.  This motivated us to develop and use a novel type of predictor, which can produce an “Individual Survival Distribution” for each patient.

This presentation first overviews standard survival analysis models, discussing what each can (and cannot) do, to motivate our approach.  We then discuss the issue of evaluating these models, leading to a novel evaluation method (D-calibration).  Finally, we show that this approach works effectively, leading to a deployed system that helps hepatologists make this critical decision for their patients.

Joint work with Humza Haider, Bret Hoehn, Max Ulrich + Hepatitis Team

Video (2020.06.25)  [Full 78m]

Webapp

Article

(This is a short, focused version of the above material: [Individual Survival Prediction])


Survival Prediction Tutorial

An accurate estimate of a patient’s survival time can help determine the appropriate treatment and care of that patient.  Some common approaches to survival analysis estimate a patient’s risk score; others estimate a patient’s 5-year survival probability, or a population’s survival distribution; however, none of these provides a way to estimate an individual’s expected survival time.  This 1.5-hour tutorial first motivates, then summarizes the standard techniques used in survival analysis – including survival curves (Kaplan-Meier curves) and hazard functions (Cox Proportional Hazards) – as well as the class of Individual Survival Distributions (ISDs, including MTLR): models that estimate a subject’s survival probability at each time, from which one can then estimate that subject’s expected survival time.  Next, this presentation discusses standard ways to evaluate such models (C-index, 1-Calibration, IBS), then motivates and defines novel approaches: versions of L1-loss (that can accommodate censored data) and 'D-Calibration', which determines whether a model's probability estimates are meaningful. We also discuss how these measures differ, and use them to evaluate several ISD prediction tools over a range of real-world survival data sets – demonstrating, in particular, that one tool, MTLR, provides survival estimates that are helpful for patients, clinicians and researchers.
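
As a concrete (and deliberately simplified) illustration of D-Calibration, the sketch below checks only uncensored subjects: if a model is D-calibrated, the predicted survival probabilities S_i(t_i) at the observed event times should be roughly uniform on [0, 1]. The full definition also apportions censored instances across the probability bins; the data, time grid and bin count here are invented.

```python
import numpy as np
from scipy.stats import chisquare

def d_calibration_uncensored(event_times, surv_curves, time_grid, n_bins=10):
    """Crude D-Calibration check, restricted to *uncensored* subjects.

    For each subject i with observed event time t_i, look up the model's
    predicted survival probability S_i(t_i).  If the model is D-calibrated,
    these values should be roughly uniform on [0, 1].
    surv_curves has shape (n_subjects, len(time_grid)), holding each S_i(t).
    """
    probs = []
    for t_i, curve in zip(event_times, surv_curves):
        idx = np.searchsorted(time_grid, t_i, side="right") - 1   # step-function lookup
        probs.append(curve[max(idx, 0)])
    counts, _ = np.histogram(probs, bins=n_bins, range=(0.0, 1.0))
    stat, p_value = chisquare(counts)      # Pearson chi-square vs a uniform split
    return counts, p_value

# Hypothetical check: events drawn from the very model being evaluated,
# so the test should (usually) not reject uniformity.
rng = np.random.default_rng(1)
time_grid = np.linspace(0, 100, 101)
curves = np.exp(-time_grid / 30.0)[None, :].repeat(200, axis=0)   # S_i(t) = exp(-t/30) for everyone
event_times = rng.exponential(30.0, size=200)
counts, p = d_calibration_uncensored(event_times, curves, time_grid)
print("bin counts:", counts, " chi-square p-value: %.2f" % p)
```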

(Delivered at OxML’2021 [2021.08.12] – with N Kumar)

Slides Survival-Prediction-Tutorial-2021.pdf


Budgeted Learning of Effective Classifiers

Researchers often use clinical trials to collect the data needed to evaluate some hypothesis, or to produce a classifier. During this process, they have to pay the cost of performing each test. Many studies will run a comprehensive battery of tests on each subject, for as many subjects as their budget will allow -- ie, "round robin" (RR). We consider a more general model, where the researcher can sequentially decide which single test to perform on which specific individual, again subject to spending only the available funds. Our goal here is to use these funds most effectively, to collect the data that allows us to learn the most accurate classifier.

We first explore the simplified "coins version" of this task. After observing that this task is NP-hard, we consider a range of heuristic algorithms, both standard and novel, and observe that our "biased robin" approach is both efficient and much more effective than most other approaches, including the standard RR approach. We then apply these ideas to learning a naive Bayes classifier, and see similar behavior. Finally, we consider the most realistic model, where both the researcher gathering data to build the classifier, and the user (eg, physician) applying this classifier to an instance (patient), must pay for the features used --- eg, the researcher has $10,000 to acquire the feature values needed to produce an optimal $30/patient classifier. Again, we see that our novel approaches are almost always much more effective than the standard RR model.
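
As a rough illustration of the coins version, here is a hedged sketch of the biased-robin idea as described above -- keep spending on the current coin while it keeps succeeding, move to the next coin after a failure -- followed by a simple Laplace-smoothed estimate of each coin's bias; the budget, biases and post-hoc estimator are invented for this example, not taken from the paper.

```python
import random

def biased_robin(coin_probs, budget, rng=random.Random(0)):
    """Spend `budget` flips with a biased-robin policy: keep flipping the
    current coin after a success (heads); move to the next coin after a
    failure (tails).  Returns per-coin (heads, flips) counts."""
    n = len(coin_probs)
    heads, flips = [0] * n, [0] * n
    i = 0
    for _ in range(budget):
        flips[i] += 1
        if rng.random() < coin_probs[i]:    # success: stay on this coin
            heads[i] += 1
        else:                               # failure: move on
            i = (i + 1) % n
    return heads, flips

# Toy run: 5 coins with unknown biases, and a budget of 60 flips.
true_probs = [0.2, 0.5, 0.8, 0.4, 0.6]
heads, flips = biased_robin(true_probs, budget=60)
estimates = [(h + 1) / (f + 2) for h, f in zip(heads, flips)]   # Laplace-smoothed bias estimates
best = max(range(len(estimates)), key=lambda j: estimates[j])
print("flips per coin:", flips)
print("estimated best coin:", best, "(estimate %.2f)" % estimates[best])
```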

Joint work with Aloak Kapoor, Dan Lizotte and Omid Madani.

 

Webpage


Introduction to Active Learning

This is basically a Tutorial on Active Learning …

Slides (2021): Intro-Active-2021.pdf


Explaining the Gene Signature Anomaly:
Estimating the Overlap of Two Ranked Lists

Recent advances in high-throughput technologies, such as genome-wide SNP analysis and microarray gene expression profiling, have led to a multitude of ranked lists, where the features (SNPs, genes) are sorted based on their individual correlation with a phenotype. Many studies will then return the top k features as "signatures" for the phenotypes. Multiple reviews have shown, however, that different studies typically produce very different signatures, even when based on subsampling from a single dataset; this is the "gene signature anomaly".

This paper formally investigates the overlap of the top-ranked features in two lists whose elements are ranked by their respective Pearson correlation coefficients with the phenotype outcome. We show that our model is able to accurately predict the expected overlap between two ranked lists, based on reasonable assumptions. This finding explains why the overlap in a pair of gene signatures (for the same phenotype) should be relatively small, for purely statistical reasons, given today's typical sample size.
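
The analytic model itself is in the paper; purely as an illustration of the anomaly, the following sketch simulates two independent "studies" drawn from the same synthetic population, ranks features by their Pearson correlation with the phenotype in each, and reports how few features the two top-k signatures share. All feature counts, sample sizes and effect sizes below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_per_study, k = 5000, 100, 50

# Synthetic "population": a few hundred weakly informative features, the rest pure noise.
true_effect = np.zeros(n_features)
true_effect[:200] = 0.3

def top_k_signature(n_samples):
    """Simulate one 'study': sample data, rank features by |Pearson r| with y, return the top k."""
    X = rng.standard_normal((n_samples, n_features))
    y = X @ true_effect + rng.standard_normal(n_samples) * 5.0
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return set(np.argsort(-np.abs(r))[:k])

sig_a = top_k_signature(n_per_study)
sig_b = top_k_signature(n_per_study)
print(f"overlap of the two top-{k} signatures: {len(sig_a & sig_b)} features")
```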

Joint work with Babak Damavandi, Chun-Nam Yu

Webpage


WebIC: An Effective "Complete-Web" Recommender System

Many web recommendation systems direct users to webpages, from a single website, that other similar users have visited. By contrast, our WebIC web recommendation system is designed to locate "information content (IC) pages" --- pages the current user needs to see to complete her task --- from essentially anywhere on the web. WebIC first extracts the "browsing properties" of each word encountered in the user's current click-stream --- eg, how often each word appears in the title of a page in this sequence, or in the "anchor" of a link that was followed, etc. It then uses a user- and site-independent model, learned from a set of annotated web logs acquired in a user study, to determine which of these words is likely to appear in an IC page. We discuss how to use these IC-words to find IC-pages, and demonstrate empirically that this browsing-based approach works effectively.
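
The sketch below is a toy, hypothetical version of the "browsing properties" idea: for each word seen in a short click-stream, count how often it appears in page titles, in the anchor text of followed links, and in page bodies. The actual WebIC feature set is richer; this only illustrates the representation.

```python
from collections import defaultdict

def browsing_properties(click_stream):
    """For each word seen in a click-stream, count a few 'browsing properties':
    how often it appeared in a page title, in the anchor text of the followed
    link, and in the page body.  (Hypothetical, simplified feature set.)
    Each click is a dict with 'title', 'anchor', and 'body' strings."""
    props = defaultdict(lambda: {"in_title": 0, "in_anchor": 0, "in_body": 0})
    for click in click_stream:
        for field, key in [("title", "in_title"), ("anchor", "in_anchor"), ("body", "in_body")]:
            for word in click.get(field, "").lower().split():
                props[word][key] += 1
    return dict(props)

# Hypothetical two-page click-stream.
stream = [
    {"title": "Survival analysis overview", "anchor": "survival models", "body": "models for censored data"},
    {"title": "Censored data methods",      "anchor": "censored data",   "body": "survival models and censoring"},
]
for word, counts in sorted(browsing_properties(stream).items()):
    print(word, counts)
```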

Joint work with Tingshao Zhu, Gerald Haeubl and Bob Price

Webpage


Proteome Analyst:
A web-based tool for Predicting Properties of Proteins

Proteome Analyst (PA) is a publicly available, high-throughput, web-based system for predicting various properties of each protein in an entire proteome. Using machine-learned classifiers, PA can predict, for example, the GeneQuiz general function and Gene Ontology (GO) molecular function of a protein. In addition, PA is one of the most accurate and most comprehensive systems for predicting subcellular localization, the location within a cell where a protein performs its main function. Two other capabilities of PA are notable. First, PA can create a custom classifier to predict a new property, without requiring any programming, based on labeled training data (i.e. a set of examples, each with the correct classification label) provided by a user. PA has been used to create custom classifiers for potassium-ion channel proteins and other general function ontologies. Second, PA provides a sophisticated explanation feature that shows why one prediction is chosen over another. The PA system produces a linear classifier, which is amenable to a graphical and interactive approach to explanations for its predictions; transparent predictions increase the user's confidence in, and understanding of, PA. Finally, we also present a similar technique that predicts which proteins will participate in a known (signaling) pathway, and how.
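
To illustrate why a linear classifier lends itself to this kind of transparent explanation, here is a small hypothetical sketch (not PA's actual classifier, weights or features): it scores each class as a weighted sum of the protein's features, then lists the features that contribute most to the winning class.

```python
import numpy as np

def explain_linear_prediction(x, weights, biases, feature_names, class_names, top_n=3):
    """For a linear classifier (score_c = w_c . x + b_c), show which features
    contribute most to the winning class -- a toy version of a transparent
    explanation.  All names and weights below are invented."""
    scores = weights @ x + biases
    winner = int(np.argmax(scores))
    contributions = weights[winner] * x            # per-feature contribution to the winning score
    order = np.argsort(-np.abs(contributions))[:top_n]
    print(f"predicted class: {class_names[winner]} (score {scores[winner]:.2f})")
    for i in order:
        print(f"  {feature_names[i]:>12s}: contributes {contributions[i]:+.2f}")

# Hypothetical 4-feature, 2-class example (eg, presence of annotation keywords).
feature_names = ["membrane", "kinase", "nuclear", "transport"]
class_names = ["cytoplasm", "nucleus"]
x = np.array([1.0, 0.0, 1.0, 0.0])                 # keywords present for this protein
weights = np.array([[ 0.2, 0.5, -0.8, 0.1],        # cytoplasm
                    [-0.1, 0.0,  1.2, 0.0]])       # nucleus
biases = np.array([0.1, -0.2])
explain_linear_prediction(x, weights, biases, feature_names, class_names)
```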

Joint work with Paul Lu, Duane Szafron, David S. Wishart, Alona Fyshe, Brandon Pearcy, Brett Poulin, Roman Eisner, Danny Ngo, Nicholas Lamb, Jordan Patterson, Kurt McMillan, Kevin Jewell

Webpage


Using Machine Learning to Predict
Protein Roles in Signaling Pathways

In general, each cell signaling pathway involves many proteins, each with one or more specific roles. As they are essential components of cell activity, it is important to understand how these proteins work -- and in particular, to determine which of a species' proteins participate in each role. Experimentally determining this mapping of proteins to roles is difficult and time consuming. Fortunately, many pathways are similar across species, so we may be able to use known pathway information of one species to understand the corresponding pathway of another.

We present an automatic approach, Predict Signaling Pathway (PSP), that uses the signaling pathways in well-studied species to predict the roles of proteins in less-studied species. We use a machine learning approach to create a predictor that achieves a generalization F-measure of 78.2% when applied to 11 different pathways across 14 different species. We also show that our approach is very effective at predicting roles in pathways that have not yet been completely studied experimentally.

Joint work with Babak Bostan, Paul Lu, Duane Szafron

Webpage


A Structural Extension to Logistic Regression:
Discriminative Parameter Learning of Belief Net Classifiers

Bayesian belief nets (BNs) are often used for classification tasks --- typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function --- viz., likelihood, rather than classification accuracy --- typically by first learning an appropriate graphical structure, then finding the maximal likelihood parameters for that structure. As these parameters may not maximize the classification accuracy, "discriminative learners" follow the alternative approach of seeking the parameters that maximize conditional likelihood (CL), over the distribution of instances the BN will have to classify. This presentation first formally specifies this task, and shows how it extends standard logistic regression. After analyzing its inherent sample and computational complexity, we present a general algorithm for this task, ELR, that applies to arbitrary BN structures and works effectively even when given incomplete training data. We present empirical evidence that ELR produces better classifiers than are produced by the standard "generative" algorithms in a variety of situations, especially in common situations where the given BN-structure is incorrect.
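
The sketch below is not ELR itself (which handles arbitrary BN structures and incomplete data); it is a minimal illustration, for the special case of a naive Bayes structure over a few binary features, of what "choose the CPTable parameters that maximize conditional likelihood" means, using a generic numerical optimizer on synthetic data.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

# Synthetic data: 2 classes, 3 binary features, naive Bayes structure C -> X_j.
rng = np.random.default_rng(0)
n, d = 200, 3
C = rng.integers(0, 2, size=n)
X = (rng.random((n, d)) < np.where(C[:, None] == 1, 0.7, 0.3)).astype(int)

def neg_conditional_loglik(theta):
    """- sum_i log P(c_i | x_i), with CPTables parameterized by unconstrained logits."""
    class_logits, feat_logits = theta[:2], theta[2:].reshape(2, d)
    log_prior = log_softmax(class_logits)                    # log P(c)
    log_p1 = -np.logaddexp(0.0, -feat_logits)                # log P(X_j = 1 | c)
    log_p0 = -np.logaddexp(0.0,  feat_logits)                # log P(X_j = 0 | c)
    joint = log_prior[None, :] + X @ log_p1.T + (1 - X) @ log_p0.T      # (n, 2)
    log_cond = joint - np.logaddexp(joint[:, 0], joint[:, 1])[:, None]  # log P(c | x)
    return -log_cond[np.arange(n), C].sum()

res = minimize(neg_conditional_loglik, x0=np.zeros(2 + 2 * d), method="L-BFGS-B")
print("maximized conditional log-likelihood:", -res.fun)
```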

Joint work with Wei Zhou, Xiaoyuan Su and Bin Shen

Webpage

Slides: 2008


Quantifying the Uncertainty of a Belief Net Response

A Bayesian Belief Network (BN) models a joint distribution over a set of n variables, using a DAG structure to represent the immediate dependencies between the variables, and a set of parameters (aka "CPTables") to represent the local conditional probabilities of a node, given each assignment to its parents. In many situations, these parameters are themselves random variables --- this may reflect the uncertainty of the domain expert, or may come from a training sample used to estimate the parameter values. The distribution over these "CPTable variables" induces a distribution over the response the BN will return to any "What is Pr(Q=q | E=e)?" query. This paper investigates properties of this response: showing first that it is asymptotically normal, then providing, in closed form, its mean and asymptotic variance. We then present an effective general algorithm for computing this variance, which has the same complexity as simply computing (the mean value of) the response itself --- ie, O(n·2^w), where w is the effective tree width. Finally, we provide empirical evidence that a Beta approximation works much better than the normal distribution, especially for small sample sizes, and that our algorithm works effectively in practice, over a range of belief net structures, sample sizes and queries.
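
The closed-form mean/variance and the efficient algorithm are the paper's contribution; as a simple illustration of the quantity being studied, the sketch below samples the CPTable entries of a tiny two-node net from (invented) Beta distributions and simulates the induced distribution of the query response Pr(A=1 | B=1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny belief net A -> B, with Beta-distributed uncertainty over each CPTable entry
# (eg, from pseudo-counts based on a small training sample).  Numbers are invented.
theta_A  = rng.beta(6, 4, size=20_000)    # P(A=1)
theta_B1 = rng.beta(8, 2, size=20_000)    # P(B=1 | A=1)
theta_B0 = rng.beta(3, 7, size=20_000)    # P(B=1 | A=0)

# For each sampled parameter setting, the query response Pr(A=1 | B=1) via Bayes rule.
response = (theta_A * theta_B1) / (theta_A * theta_B1 + (1 - theta_A) * theta_B0)

print(f"mean response  : {response.mean():.3f}")
print(f"std of response: {response.std():.3f}")
# The paper computes this mean and (asymptotic) variance in closed form, with the
# same complexity as answering the query itself; this sketch merely simulates it.
```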

Joint work with Tim Van Allen, Ajit Singh and Peter Hooper.

Webpage



TUTORIALS


ML4MD

Towards Patient-Specific Treatment:
Medical Applications of Machine Learning

An effective patient-specific treatment model identifies which treatment has the best chance of success for each individual patient, based on all available information about that patient.  Here, we consider the task of learning such models – in general, a learned combination of many features that collectively predict the outcome – from a labeled dataset describing earlier patients, and their outcomes.  This presentation introduces the relevant ideas using real-world medical examples, and explains how this machine-learning approach differs from the task of finding individual biomarkers.

Joint work with various colleagues in the Cross Cancer Institute, the UofA Medical School and UofA Computing Science, including M Pasdar, J Mackey, S Damaraju, K Graham (and others of the PolyomX project), D Wishart, V Baracos, and A Murtha (and others of the Brain Tumor Analyst Project), M Brown (+ S Dursun, A Greenshaw), E Ryan, N Kneteman, A Montana-Loza, A Andres.

 = = = = = =

Slides (2019.11)  (2019.05)

Video  (2020.01: 60m)  (2013) (2019.11: 45m  60m)

Abstracts: Longer abstract, … and other version


View from the Front:
Working with Medical Colleagues to
Produce Effective Predictor Systems

With today’s excitement over the many successful applications of machine learning to medicine, an increasing number of medical researchers and clinicians are starting collaborations with machine learning researchers.  This is a great opportunity to produce useful results, for a wide variety of important tasks.  There are, however, many subtle issues in designing and evaluating these performance systems -- e.g., issues arising from the differences between standard biostatistics and supervised machine learning, and between emulation and objective optimization; issues related to the focus on the performance task; and issues related to selection bias (especially for prognosis); etc. This presentation provides the information that I wish I had known when I started -- first using simple examples to identify and characterize these distinctions, then, where appropriate, suggesting solutions. We hope this information will raise awareness of these varied issues and approaches, which will facilitate many future effective collaborations.

Videos: Science Circle (Amii) (2020.04)    AI Seminar (2018.08.03); Science Circle (2019.08.27 - glare on screen)  

Slides:   2019.10 [need to request permission]    2018.08.17 (from 2018 MUCMD Conference (Stanford) ) 


Learning Models that Predict Objective, Actionable Labels

Many medical researchers want a tool that “does what a top medical clinician does, but does it better”.  This presentation explores this goal. This requires first defining what “better” means, leading to the idea of outcomes that are “objective”, and then to ones that are actionable, with a meaningful evaluation measure. We will discuss some of the subtle issues in this exploration – what “objective” means, the role of the (perhaps personalized) evaluation function, multi-step actions, counterfactual issues, distributional evaluations, etc.  Collectively, this analysis argues that we should learn models whose outcome labels are objective and actionable, as that will lead to tools that are useful and cost-effective.

2023.07: Video [ Slides ]


Tutorial: Introduction to Machine Learning

Machine Learning (ML) basically involves finding patterns in data. This presentation focuses on patterns that can lead to effective classifiers. It first presents a number of real-world deployed applications based on ML, and motivates why ML is essential for such tasks. It then overviews three simple examples of learnable classifiers (linear separators, artificial neural nets and decision trees), and quickly suggests some of the important statistical foundations that underlie this field.
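
As a minimal hands-on companion (not taken from the tutorial slides), the following scikit-learn sketch trains two of the classifier families mentioned above -- a linear separator and a decision tree -- on a synthetic labeled dataset and reports their held-out accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy labeled dataset standing in for a real application.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("linear separator (logistic regression)", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    model.fit(X_train, y_train)                    # learn a classifier from labeled examples
    print(f"{name}: held-out accuracy = {model.score(X_test, y_test):.2f}")
```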

Slides 

Cmput466 notes


Tutorial:  Evaluation and Cross-Validation

Many people are confused about evaluating machine learning tools (and also the resulting machine learnED models), and about the ways to use Cross-Validation here – both for setting parameters and selecting features, and for evaluating the resulting model, etc.  This presentation attempts to clarify these (sometimes subtle) issues.

(Note this was originally prepared for my 2022 “AI Capstone” class.)
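
One concrete way to keep the two uses of cross-validation separate is "nested" cross-validation: an inner loop chooses the hyper-parameters, and an outer loop evaluates the whole tuning-plus-training procedure on data it never touched. The scikit-learn sketch below, with an invented SVM example, illustrates the general point; it is not material from the presentation itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner CV: choose the hyper-parameter (here, the SVM regularization constant C).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer CV: evaluate the *whole procedure* (including the tuning) on held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested-CV accuracy: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))

# Common mistake: tune C on all the data, then report the inner CV score as "the"
# performance -- that re-uses the same folds for both selection and evaluation.
```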

2023: https://drive.google.com/file/d/1E6_tMvRKmEA5oIbThhdVhMcHEL3MziC1/view?usp=share_link 

PDF: https://drive.google.com/file/d/1Dn7ZXjaQPShbC6AqPcX1SgNsgFBLKMra/view?usp=share_link 


Tutorial: Introduction to Bayesian Belief Nets

Many tasks require building and using a model --- eg, to relate a patient's disease state with the possible symptoms, underlying causes, and effects of various treatments. These relationships are often probabilistic; eg, a disease will often, but not always, manifest certain symptoms. Bayesian Belief Nets (BNs), which provide a succinct way to represent such probabilistic models, are in routine use for a wide range of applications, including medicine, bioinformatics, document classification, image processing and decision support systems. This presentation provides a quick overview of BNs: first motivating BNs in general, then describing how BNs exploit "independencies", and finally (if time permits) suggesting ways to learn a BN from data.
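
As a tiny illustration of how a BN's independencies make inference simple, the sketch below hand-codes a hypothetical three-node net (Disease -> Symptom, Disease -> TestResult, with Symptom and TestResult conditionally independent given Disease) and answers a diagnostic query by direct enumeration; all probabilities are invented.

```python
# Tiny hypothetical belief net:  Disease -> Symptom,  Disease -> TestResult.
p_disease = 0.01                                  # P(D=1)
p_symptom = {1: 0.80, 0: 0.10}                    # P(S=1 | D)
p_test    = {1: 0.90, 0: 0.05}                    # P(T=1 | D)

def posterior_disease(symptom, test):
    """P(D=1 | S=symptom, T=test) by enumerating D, using the BN factorization
    P(D, S, T) = P(D) * P(S | D) * P(T | D)."""
    def joint(d):
        prior = p_disease if d else 1 - p_disease
        ps = p_symptom[d] if symptom else 1 - p_symptom[d]
        pt = p_test[d]    if test    else 1 - p_test[d]
        return prior * ps * pt
    return joint(1) / (joint(1) + joint(0))

print(f"P(disease | symptom present, test positive) = {posterior_disease(1, 1):.3f}")
print(f"P(disease | symptom present, test negative) = {posterior_disease(1, 0):.3f}")
```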

Video Presentation (2018)

Slides: (Intro)  (Topics) [need to request permission]

Belief Net (PGM) website



TEACHING MATERIAL


How to Prepare and Deliver Platform Presentations (+ Posters)

This presentation focuses on preparing, and delivering, platform presentations: describing what material to present and how to show this material effectively, and then offering some ideas relevant to the presentation itself, both before and during the talk. The main idea is to view the presentation as telling a well-structured story that is relevant (and helpful) to the intended audience.

[ If time permits, we will also discuss how to prepare and present posters; we will see that many of the same ideas apply here, as well. ]

wrt Platform Presentation

wrt Poster Presentation [Slides]


How to Write Effectively 

It is important to have great ideas.  But unless and until you communicate them, no one will know those ideas, nor be able to use those results.  This presentation discusses how to write effectively, especially for research publications.  It first summarizes issues related to the content – what material to include: eg, how this relates to your underlying claims, and to your intended target audience.  It notes the importance of telling a structured story, and of including simple examples and motivation, etc.  It then discusses form – how to state that material: eg, how to structure the document, and issues related to flow, language, and consistency.  It then gives some minor nuances, about formality, verbs, re-writing, etc.  Throughout, we provide simple examples to illustrate the points. We hope this will guide writers to produce effective, informative documents.

(Note this material was designed for Cmput 469 [2022], and so has some references to other parts of that course.)

Slides (2022)

Video (~70m)



Note: In July 2023, I updated the video for

  Learning Models that Predict Objective, Actionable Labels

Please watch the video at the following URL instead:

https://drive.google.com/file/d/1OAyucXPHr2zl5bMDVdXXN_S14zigdr1-/view

