Plenary Talks
Presenter | Talk Title | Abstract
Emmanuel Candès Conformal Prediction in 2022
Conformal inference methods are becoming all the rage in academia and industry alike. In a nutshell, these methods deliver exact prediction intervals for future observations without making any distributional assumption whatsoever other than having iid, and more generally, exchangeable data. This talk will review the basic principles underlying conformal inference and survey some major contributions that have occurred in the last 2-3 years or so. We will discuss enhanced conformity scores applicable to quantitative as well as categorical labels. We will also survey novel methods which deal with situations where the distribution of observations can shift drastically — think of finance or economics where market behavior can change over time in response to new legislation or major world events, or public health where changes occur because of geography and/or policies. All along, we shall illustrate the methods with examples including the prediction of election results or COVID-19 case trajectories.
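For readers unfamiliar with the basic principle, here is a minimal Python sketch of split conformal prediction for regression under exchangeability; the model, data and miscoverage level are illustrative placeholders, and none of the enhanced scores or shift-robust methods surveyed in the talk are shown.

    # Minimal split-conformal sketch (illustrative; assumes exchangeable data).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

    # Split into a proper training set and a calibration set.
    X_tr, y_tr = X[:250], y[:250]
    X_cal, y_cal = X[250:], y[250:]
    model = LinearRegression().fit(X_tr, y_tr)

    # Conformity scores on the calibration set: absolute residuals.
    scores = np.abs(y_cal - model.predict(X_cal))

    # Finite-sample-valid calibration quantile for miscoverage level alpha.
    alpha = 0.1
    n_cal = len(scores)
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))   # order statistic to use
    q = np.sort(scores)[k - 1]

    # Prediction interval for a new point x_new.
    x_new = rng.normal(size=(1, 3))
    pred = model.predict(x_new)[0]
    print("90% prediction interval:", (pred - q, pred + q))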
Guido Imbens Multiple Randomization Designs
In this talk I will discuss a new class of experimental designs, Multiple Randomization Designs. In a classical randomized controlled trial (RCT), or A/B test, a randomly selected subset of a population of units (e.g., individuals, plots of land, or experiences) is assigned to a treatment (treatment A), and the remainder of the population is assigned to the control treatment (treatment B). The difference in average outcome by treatment group is an estimate of the average effect of the treatment. However, motivating this talk, the setting for modern experiments is often different, with the outcomes and treatment assignments indexed by multiple populations. For example, outcomes may be indexed by buyers and sellers, by content creators and subscribers, by drivers and riders, or by travelers and airlines and travel agents, with treatments potentially varying across these indices. Spillovers or interference can arise from interactions between units across populations. For example, sellers' behavior may depend on buyers' treatment assignment, or vice versa. This can invalidate the simple comparison of means as an estimator for the average effect of the treatment in classical RCTs. I discuss new experimental designs for settings in which multiple populations interact. I show how these designs allow us to study questions about interference that cannot be answered by classical randomized experiments. Finally, I discuss new statistical methods for analyzing these Multiple Randomization Designs.
Susan Murphy Inference for Longitudinal Data After Adaptive Sampling
Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follows a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the use of adaptive sampling algorithms more widely in practice. In this work, we provide statistical inference for the common Z-estimator based on adaptively sampled data. The inference is valid even when observations are non-stationary and highly dependent over time, and allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work in designing the Oralytics oral health clinical trial, in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.
Sylvia Richardson Scaling up Bayesian Modeling and Computation for real-world biomedical and public health applications
The fast expansion of biomedical data resources is underpinning advances in medical research. However, it has brought a number of challenges for Bayesian inferential approaches. The “large n” data setting, such as encountered in the analysis of large cohorts or electronic health records, often creates computational bottlenecks, precluding model search endeavours. The “large p” setting, inherent to the modelling of high-dimensional data arising, for example, from the development of precision medicine strategies and new techniques to probe biomolecular mechanisms, can make joint analysis unreliable or intractable. Public health emergencies, like the Covid-19 pandemic, have shown the value of performing data synthesis at pace to carry out disease tracking. These varied contexts call for combining Bayesian hierarchical modelling with scalable approximate algorithms capable of producing accurate and robust inferences. In this talk, I will first discuss the adaptation of the divide-and-conquer approaches for large n to the inferential context of model choice and of mixture models – an adaptation which goes beyond the well-established divide-and-conquer approaches developed for posterior inference on a chosen model with a fixed number of parameters. I will next introduce some current analysis needs in biomedicine and discuss modelling and computational strategies for implementing joint regression modelling of a large number p of features and responses, and for joint network analyses. In both cases, information is borrowed through suitable hierarchical formulations. If time permits, I will end with a brief discussion of the challenges to conventional statistical and data science practice brought into focus by health surveillance in the recent pandemic.
Invited Talks
Presenter | Talk Title | Abstract
Abhik Ghosh Robust Sure Independence Screening for Non-polynomial dimensional Generalized Linear Models
We consider the problem of variable screening in ultra-high dimensional generalized linear models (GLMs) of non-polynomial orders. Since the popular SIS approach is extremely unstable in the presence of contamination and noise, we discuss a new robust screening procedure based on the minimum density power divergence estimator (MDPDE) of the marginal regression coefficients. Our proposed screening procedure performs well under pure and contaminated data scenarios. We provide a theoretical motivation for the use of marginal MDPDEs for variable screening from both population as well as sample aspects; in particular, we prove that the marginal MDPDEs are uniformly consistent leading to the sure screening property of our proposed algorithm. Finally, we propose an appropriate MDPDE based extension for robust conditional screening in GLMs along with the derivation of its sure screening property. The required assumptions are further verified for some important examples of GLMs, namely linear, logistic and Poisson regressions. Our proposed methods are illustrated through extensive numerical studies along with an interesting real data application.
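The following Python sketch shows the generic sure independence screening step, ranking features by a marginal statistic and keeping the top d; the talk replaces the naive marginal correlation used here with robust marginal MDPDE fits, which are not reproduced.

    # Generic SIS sketch: rank features by a marginal statistic and keep the
    # top d. The robust procedure of the talk would replace the absolute
    # marginal correlation below with marginal MDPDE-based measures.
    import numpy as np

    def sis_screen(X, y, d):
        # X: (n, p) design, y: (n,) response, d: number of features to keep.
        Xc = (X - X.mean(0)) / X.std(0)
        yc = (y - y.mean()) / y.std()
        marginal_stat = np.abs(Xc.T @ yc) / len(y)   # |marginal correlation|
        return np.argsort(marginal_stat)[::-1][:d]   # retained feature indices

    rng = np.random.default_rng(1)
    n, p = 200, 5000
    X = rng.normal(size=(n, p))
    y = 2 * X[:, 0] - 3 * X[:, 7] + rng.normal(size=n)
    print(sis_screen(X, y, d=20))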
Alan Agresti A Historical Overview of Textbook Presentations of Statistical Science
We discuss the evolution in the presentation of statistical science in textbooks during the first half of the twentieth century, as the field became better defined by advances due to R. A. Fisher and Jerzy Neyman. An early influential book with 14 editions was authored by G. Udny Yule. Methods books authored by Fisher and George Snedecor showed scientists how to implement Fisher's advances. Later books from the World War 2 era authored by Maurice Kendall, Samuel Wilks, and Harald Cramer had stronger emphasis on the theoretical foundations. The Bayesian approach emerged somewhat later in textbooks, influenced strongly by books by Harold Jeffreys and Leonard Savage. We conclude by discussing the future of textbooks on the foundations of statistical science in the emerging, ever-broader, era of data science. Details are in a recent article in the Brazilian Journal of Probability and Statistics, available at www.stat.ufl.edu/~aa/articles/Agresti_BJPS.pdf
Alejandra Avalos Pacheco Cross-study Factor Regression for Heterogeneous Datasets
Data-integration of multiple studies can be key to understanding and gaining knowledge in statistical research. However, such data present both biological and artifactual sources of variation, also known as covariate effects. Covariate effects can be complex, leading to systematic biases. In this talk I will present novel sparse latent factor regression (FR) and cross-study factor regression (CSFR) models to integrate such heterogeneous data. The FR model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of covariate effects. CSFR models are extensions of FR that enable us to jointly obtain a covariance structure that models the group-specific covariances in addition to the common component, learning covariate effects from the observed variables, such as the demographic information. I will discuss the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. Our approaches provide a flexible methodology for sparse factor regression which is not limited to data with covariate effects. I will present several examples, with a focus on bioinformatics applications. We show the usefulness of our methods in two main tasks: (1) to give a visual representation of the latent factors of the data, i.e. an unsupervised dimension reduction task, and (2) to provide (i) a supervised survival analysis, using the factors obtained in our method as predictors for the cancer genomic data; and (ii) a dietary pattern analysis, associating each factor with a measure of overall diet quality related to cardiometabolic disease risk for a Hispanic community health nutritional-data study.
Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for covariate effects can result in unreliable inference.
Alessandra Mattei Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs
The Brazil Bolsa Familia program is a conditional cash transfer program aimed at reducing short-term poverty through direct cash transfers and at fighting long-term poverty by increasing human capital among poor Brazilian people. Eligibility for Bolsa Familia benefits depends on a cutoff rule, which classifies the Bolsa Familia study as a regression discontinuity (RD) design. Extracting causal information from RD studies is challenging. Following Li et al (2015) and Branson and Mealli (2019), we formally describe the Bolsa Familia RD design as a local randomized experiment within the potential outcome approach. Under this framework, causal effects can be identified and estimated on a subpopulation where a local overlap assumption, a local SUTVA and a local ignorability assumption hold. We first discuss the potential advantages of this framework, in settings where assumptions are judged plausible, over local regression methods based on continuity assumptions; these advantages concern the definition of the causal estimands, the design and the analysis of the study, and the interpretation and generalizability of the results. A critical issue of this local randomization approach is how to choose subpopulations for which we can draw valid causal inference. We propose to use a Bayesian model-based finite mixture approach to clustering to classify observations into subpopulations where the RD assumptions hold and do not hold on the basis of the observed data. This approach has important advantages: a) it allows us to account for the uncertainty in the subpopulation membership, which is typically neglected; b) it does not impose any constraint on the shape of the subpopulation; c) it is scalable to high-dimensional settings; d) it allows us to account for rare outcomes; and e) it is robust to a certain degree of manipulation/selection of the running variable. We apply our proposed approach to assess causal effects of the Bolsa Familia program on leprosy incidence in 2009, for Brazilian households who registered in the Brazilian National Registry for Social Programs in 2007-2008 for the first time. Our approach allows us to deal with the rare outcome and the large sample of 152,602 households.
Alessandro Rinaldo Sequential change point detection for networks
We study the change point detection task in settings in which we are presented with a stream of independent networks on a fixed node set whose distributions are piece-wise constant over time. Our goal is to determine, at each time point when a new observation is acquired, whether the data collected so far have provided sufficient evidence to infer that the underlying distribution has changed at the present time or in the near past. For sequences of Bernoulli networks, we develop novel, polynomial time CUSUM-based procedures and derive high-probability bounds on the corresponding detection delays with an explicit dependence on the network size, the entrywise and rank sparsity and the magnitude of the change. We complement our analysis with minimax lower bounds, which we show are realized by an NP-hard procedure. We also consider more general sequences of multilayer random dot product networks. We develop novel change point algorithms based on tensors and study their theoretical performance. We demonstrate the effectiveness of our methodologies on real-life examples.
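A heavily simplified Python sketch of CUSUM-type monitoring for a stream of adjacency matrices is given below; the actual procedures exploit entrywise and rank sparsity and calibrated thresholds, whereas here a plain Frobenius-norm statistic and an arbitrary threshold are used for illustration only.

    # Simplified CUSUM-type monitoring for a stream of adjacency matrices.
    import numpy as np

    def cusum_stat(A_list, s):
        # CUSUM contrast between the average network before and after time s.
        t = len(A_list)
        before = np.mean(A_list[:s], axis=0)
        after = np.mean(A_list[s:], axis=0)
        return np.sqrt(s * (t - s) / t) * np.linalg.norm(before - after)

    def monitor(stream, threshold):
        A_list = []
        for t, A in enumerate(stream, start=1):
            A_list.append(A)
            if t < 4:
                continue
            stat = max(cusum_stat(A_list, s) for s in range(2, t - 1))
            if stat > threshold:           # sufficient evidence of a change
                return t
        return None

    rng = np.random.default_rng(2)
    n = 30
    stream = ([rng.binomial(1, 0.1, (n, n)) for _ in range(25)]
              + [rng.binomial(1, 0.4, (n, n)) for _ in range(25)])
    # The threshold would be calibrated in practice; 12.0 is illustrative.
    print("declared at time:", monitor(stream, threshold=12.0))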
Alex Luedtke Efficient Estimation of the Maximal Association between Multiple Predictors and a Survival Outcome
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves construction of semi-parametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support that this asymptotic guarantee is indicative of the performance of the test at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
Alex Petersen Fréchet Single Index Models for Object Response Regression
With the increasing prominence of non-Euclidean data objects, statisticians must develop appropriate statistical tools for their analysis. For regression models with predictors in $R^p$ and response variables being situated in a metric space, conditional Fréchet means can be used to define the Fréchet regression function. Global and local Fréchet methods have recently been developed for modeling and estimating this regression function as extensions of multiple and local linear regression, respectively. In this presentation, this line of methodology is expanded to include the Fréchet Single Index (FSI) model, in which the Fréchet regression function only depends on a scalar projection of the underlying multivariate predictor. Estimation is performed by combining local Fréchet regression with $M$-estimation to estimate the coefficient vector and the underlying regression function, and these estimators are shown to be consistent. The method is illustrated by simulations for response objects on the surface of the unit sphere and through an analysis of human mortality data in which lifetable data are represented by distributions of age-at-death, viewed as elements of Wasserstein space.
Ali Shojaie Learning Directed Acyclic Graphs From Partial Orderings
Directed acyclic graphs (DAGs) are commonly used to model causal relationships among random variables. The problem of estimation of DAGs is both computationally and statistically challenging, and in general, the direction of edges may not be estimable from observational data alone. However, given a causal ordering of the variables, the problem can be solved efficiently, even in high dimensions. In this paper, we consider an intermediate problem, where only a partial causal ordering of variables is available. We discuss a general estimation procedure for discovering DAGs with arbitrary structure from partial orderings. We also present efficient estimation algorithms for two popular classes of high-dimensional sparse directed acyclic graphs, namely linear and additive structural equation models.
Alicia Carriquiry Data Science in Forensic Practice: The Good, the Ugly, and the Truly Awful
Data that arise in the forensic evaluation of evidence are often non-standard and include images, voice recordings, and biological samples with multiple donors. Yet, forensic scientists typically rely on subjective visual assessments or simple statistical approaches. We propose the use of algorithmic approaches to extract quantitative features from images that can then be used to fit statistical models or fed into classification or other types of algorithms. As illustration, we discuss examples in the forensic evaluation of footwear prints, of handwritten documents and of fired bullets, time permitting.
Anindya Bhadra Graphical Evidence
Marginal likelihood, also known as model evidence, is a fundamental quantity in Bayesian statistics. It is used for model selection using Bayes factors or for empirical Bayes tuning of prior hyper-parameters. Yet, the calculation of evidence has remained a longstanding open problem in Gaussian graphical models. Currently, the only feasible solutions that exist are for special cases such as the Wishart or G-Wishart, in moderate dimensions. We present an application of Chib's technique that is applicable to a very broad class of priors under mild requirements. Specifically, the requirements are: (a) the priors on the diagonal terms on the precision matrix can be written as gamma or scale mixtures of gamma random variables and (b) those on the off-diagonal terms can be represented as normal or scale mixtures of normal. This includes structured priors such as the Wishart or G-Wishart, and more recently introduced element-wise priors, such as the Bayesian graphical lasso and the graphical horseshoe. Among these, the true marginal is known in an analytically closed form for Wishart, providing a useful validation of our approach. For the general setting of the other three, and several more priors satisfying conditions (a) and (b) above, the calculation of evidence has remained an open question that this article resolves under a unifying framework.
Anqi Zhao No Star Is Good News: A Unified Look at Rerandomization Based on P-Values from Covariate Balance Tests
RCTs balance all covariates on average and provide the gold standard for estimating treatment effects. Chance imbalances however exist more or less in realized treatment allocations, subjecting subsequent inference to possibly large variability. Modern scientific publications require the reporting of covariate balance tables with not only covariate means by treatment group but also the associated p-values from significance tests of their differences. The practical need to avoid small p-values renders balance check and rerandomization by hypothesis testing an attractive tool for improving covariate balance in RCTs. We examine a variety of potentially useful schemes for rerandomization based on p-values (ReP) from covariate balance tests, and demonstrate their impact on subsequent inference. The main findings are twofold. First, the estimator from the fully interacted regression is asymptotically the most efficient under all ReP schemes examined, and permits convenient regression-assisted inference identical to that under complete randomization. Second, ReP improves not only covariate balance but also the efficiency of the estimators from the unadjusted and additive regressions.
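As a minimal illustration of the ReP idea (not the specific schemes examined in the talk), the Python sketch below redraws a complete randomization until every covariate balance test returns a p-value above a cutoff, i.e. until the balance table shows no stars; the cutoff, covariates and tests are placeholders.

    # Rerandomization based on p-values (ReP) sketch: redraw the assignment
    # until all covariate balance tests give p-values above a cutoff.
    import numpy as np
    from scipy import stats

    def rep_randomize(X, n_treat, cutoff=0.3, max_draws=10000, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        for _ in range(max_draws):
            z = np.zeros(n, dtype=bool)
            z[rng.choice(n, n_treat, replace=False)] = True
            pvals = [stats.ttest_ind(X[z, j], X[~z, j]).pvalue
                     for j in range(X.shape[1])]
            if min(pvals) > cutoff:        # acceptable balance: keep this draw
                return z, pvals
        raise RuntimeError("no acceptable assignment found")

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))          # 5 baseline covariates
    z, pvals = rep_randomize(X, n_treat=50)
    print(np.round(pvals, 2))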
Antonietta Mira Bayesian estimation of data intrinsic dimensions
With the advent of big data, it is increasingly common to deal with cases where data lie in a high-dimensional space, and little is known a priori about their distribution. Quite often, however, this distribution has support on a subspace (manifold) whose dimension, called the intrinsic dimension (ID) of the data, is much lower than the dimensionality of the embedding space. Under very weak assumptions on the data generating mechanism, the nearest-neighbor (NN) distances among points follow distributions that depend parametrically on the ID. Facco et al. (Scientific Reports, 2017) leveraged this, developing an ID estimator (TWO-NN) based on the ratio of distances between the first two NN of each data point. This result was then extended to ratios of distances between consecutive neighboring points and, further, to ratios of distances between NN of generic orders, deriving alternative estimators (GRIDE) more robust to noise in the data.
We also extended TWO-NN to the case where the ID is not constant within the data, i.e., the distribution has support on the union of several manifolds with different IDs. This situation may trivially occur if data sets with heterogeneous IDs are merged, but, as we reveal, it also happens quite naturally in data from diverse disciplines.
In this case, the ratios follow a simple mixture distribution. Within a Bayesian framework, we can robustly estimate the IDs of the manifolds and assign each data point to one of the manifolds. In many real-world datasets, we find widely heterogeneous collections of IDs corresponding to variations in core properties. For example, folded vs. unfolded configurations in a protein molecular dynamics trajectory, active vs. non-active regions in brain imaging data, cases vs. controls in gene expression data, firms with different financial risk in company balance sheets, and Covid-19 data in countries implementing various non-pharmaceutical interventions.
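A minimal Python sketch of the TWO-NN estimator of Facco et al. (2017), on which the talk builds, is given below; the GRIDE extension and the Bayesian mixture treatment of heterogeneous IDs are not shown, and the example data are synthetic placeholders.

    # TWO-NN sketch: the ratios mu_i = r_2(i) / r_1(i) of second- to
    # first-nearest-neighbour distances follow a Pareto(1, d) law, so the
    # intrinsic dimension d can be estimated by maximum likelihood.
    import numpy as np
    from scipy.spatial import cKDTree

    def two_nn_id(X):
        tree = cKDTree(X)
        dist, _ = tree.query(X, k=3)          # self, 1st and 2nd neighbours
        mu = dist[:, 2] / dist[:, 1]
        return len(mu) / np.sum(np.log(mu))   # MLE of the intrinsic dimension

    # Example: a 2-d latent structure embedded linearly in 10 dimensions.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(2000, 2))
    A = rng.normal(size=(2, 10))
    X = latent @ A
    print("estimated ID:", round(two_nn_id(X), 2))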
Antonio D'Ambrosio Clustering of individuals in non-metric unfolding
In the framework of preference rankings, the interest can lie in clustering individuals and/or items, in order to reduce the complexity of the preference spaces for easier interpretation of the collected data. A technique to perform clustering of individuals or items within the non-metric unfolding framework is presented.
Arantxa Urdangarin Alleviating spatial confounding in disease mapping: assessment of recent methods in terms of fixed effect estimates
Spatial confounding has received attention in recent years, but there is no consensus on a general definition. Typically, the term spatial confounding refers to the change in fixed effect estimates that may occur when spatially correlated random effects collinear with the covariate are included in the model. Arguably, the most extended method for dealing with spatial confounding is restricted spatial regression, which consists in restricting the spatial random effects to the orthogonal complement of the fixed effects and hence preserving the estimates obtained in a regression model without spatial random effects. Other methods have been proposed to alleviate spatial confounding in spatial linear (and generalized linear) regression models, such as the spatial+ model or transformed Gaussian Markov random fields, but it is still unclear if they provide correct estimates of the fixed effects. Given the controversy about the definition of spatial confounding and the various ideas underlying the proposals, the main objective of this work is to assess how well each of the methods estimates the fixed effects when there is additional spatially structured variability unexplained by the observed covariates. For this aim we first analyse three data sets to see that the distinct methods lead to different estimates. We then simulate several scenarios under different data generating mechanisms including one observed covariate and additional variability, and fit the considered models to see how well they recover the true value of the fixed effect coefficient. The simulation study reflects that the spatial+ approach provides fixed effect estimates close to the true value in most of the simulated scenarios. Interestingly, differences in risk estimation among the proposals are very small.

Keywords: restricted regression; spatial regression models; spatial+; transformed Gaussian Markov random fields.
Arne Bathke Statistical Evidence from Small Samples or Personalized Interventions
We present findings from two projects where currently available statistical methods are at their limits. One deals with digital health studies where participants are receiving personalized interventions. Which study designs are appropriate for such data? Can fundamental statistical concepts for evaluating interventions be adapted for these situations?
The second project considers rare disease data. Motivated by an Epidermolysis Bullosa simplex trial with an ordinal outcome variable in a cross-over study, we have tried to perform a neutral comparison between different parametric and nonparametric statistical methods applicable in such designs. Specifically, we have considered methods arising from the generalized pairwise comparison (GPC) framework, and rank-based approaches as implemented in the R package nparLD. A major question is how to actually perform a comparison that is neutral.
Arnoldo Frigessi The Split-Sequential-ABC for real-time inference of complex metapopulation models
Stochastic compartmental metapopulation models have been very useful for situation awareness, forecasting and scenario simulation during the Covid-19 pandemic. Informed by multiple sources of data, typically the incidence of test-confirmed positive cases and hospital admissions caused by Covid-19, they allow us to estimate parameters which describe the strength of transmission of the virus in the population, the latent number of infected individuals and several parameters of the likelihood. The number of parameters and the dimension of the latent variables grow in time and when regional estimates are needed. Inference is typically based on sequential Approximate Bayesian Computation (ABC). In order to be useful for the management of the pandemic, inference has to be performed in useful time, typically within one to two days. We developed a new sequential ABC, the split-seqABC, where we split the inferential task repeatedly in time and space, to deliver reliable results in useful time, while aiming to maintain a realistic quantification of posterior uncertainty. We show how our method has been used during the Covid-19 pandemic in Norway. This is joint work of the Oslo Covid-19 modelling team at the National Institute of Public Health, Norsk Regnesentral, Telenor and the University of Oslo.
Aude Sportisse Informative labels in Semi-Supervised Learning
In semi-supervised learning, we have access to features but the outcome variable is missing for a part of the data. In real life, although the amount of data available is often huge, labeling the data is costly and time-consuming. It is particularly true for image data sets: images are available in large quantities on image banks but they are most of the time unlabeled. It is therefore necessary to ask experts to label them. In this context, people are more inclined to label images of some classes which are easy to recognize. The unlabeled data are thus informative missing values, because the unavailability of the labels depends on their values themselves. Typically, the goal of semi-supervised learning is to learn predictive models using all the data (labeled and unlabeled ones). However, classical methods lead to biased estimates if the missing values are informative. We aim at designing new semi-supervised algorithms that handle informative missing labels.
Aurore Delaigle Estimating the Distribution of Episodically Consumed Foods Measured with Error
Dietary data collected from 24-hour dietary recalls are observed with significant measurement errors. In the nonparametric curve estimation literature, a lot of effort has been devoted to designing methods that are consistent under contamination by noise. However, some foods such as alcohol or fruits are consumed only episodically, and may not be consumed during the day when the 24-hour recall is administered. Existing nonparametric methods cannot deal with those so-called excess zeros. We present new estimators of the distribution of such episodically consumed food data measured with errors.
Axel Munk Transport Dependency: Optimal Transport Based Dependency Measures
Finding meaningful ways to determine the dependency between two random variables 𝜉 and 𝜁 is a timeless statistical endeavor with vast practical relevance. In recent years, several concepts that aim to extend classical means (such as the Pearson correlation or rank-based coefficients like Spearman’s 𝜌) to more general spaces have been introduced and popularized, a well-known example being the distance correlation. In this talk, we propose and study an alternative framework for measuring statistical dependency, the transport dependency 𝜏 ≥ 0 (TD), which relies on the notion of optimal transport and is applicable in general Polish spaces. It can be estimated via the corresponding empirical measure, is versatile and adaptable to various scenarios by proper choices of the cost function, and intrinsically respects metric properties of the ground spaces. Based on sharp upper bounds, we exploit three distinct dependency coefficients with values in [0, 1], each of which emphasizes different functional relations: these transport correlations attain the value 1 if and only if 𝜁 = 𝜑(𝜉), where 𝜑 is a) a Lipschitz function, b) a measurable function, c) a multiple of an isometry.

Besides a conceptual discussion of transport dependency, we address numerical issues and its ability to adapt automatically to the potentially low intrinsic dimension of the ground space. Monte Carlo results suggest that TD is a robust quantity that efficiently discerns dependency structure from noise for data sets with complex internal metric geometry. The use of TD for inferential tasks is illustrated for independence testing on a data set of trees from cancer genetics.
Barbara Bodinier Stability selection for the identification of multi-omics markers of lung cancer
Omics technologies provide an agnostic view of individual molecular profiles, which constitutes a valuable source of information for the characterisation of the embodiment of external exposures and its subsequent effects on health. Regularised approaches have been instrumental for the analysis of these high dimensional datasets but may generate unstable results. In stability selection, results obtained by applying a (regularised) feature selection model on multiple subsamples are aggregated to enhance the reliability of the findings. The difficult choice of hyper-parameters in stability selection is still hampering its use in practice. We propose here to calibrate hyper-parameters by maximising a novel score measuring model stability by a negative likelihood under the null hypothesis that all features have the same selection probability. Furthermore, we show that the technical heterogeneity that arises in multi-omics integration can be corrected for using platform-specific parameters. The validity of the proposed approaches is investigated in a simulation study and its applicability is illustrated with real multi-omics data to identify molecular markers of the risk of lung cancer. Models are developed in a structural causal modelling framework, which enables the identification of potential molecular mediators of the effect of tobacco smoking on lung cancer risk. Proposed approaches have been implemented in the R package sharp, available on CRAN.
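For illustration, here is a minimal Python sketch of the generic stability selection step (subsample, refit a regularised model, record selection proportions); the calibration of penalty and threshold via the stability score proposed in the talk, and the multi-omics corrections, belong to the R package sharp rather than this sketch, and the fixed values below are placeholders.

    # Generic stability-selection sketch with a lasso as the base selector.
    import numpy as np
    from sklearn.linear_model import Lasso

    def selection_proportions(X, y, alpha=0.2, n_subsamples=100, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(n_subsamples):
            idx = rng.choice(n, size=n // 2, replace=False)   # 50% subsample
            fit = Lasso(alpha=alpha).fit(X[idx], y[idx])
            counts += fit.coef_ != 0
        return counts / n_subsamples          # per-feature selection proportion

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 50))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
    props = selection_proportions(X, y)
    print(np.where(props > 0.9)[0])            # stably selected features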
Betsy Ogburn Disentangling confounding and dependence in spatial statistics
Nonsense associations can arise when an exposure and an outcome of interest exhibit similar patterns of dependence. Confounding is present when potential outcomes are not independent of treatment. This talk will describe how understanding the connection between these two phenomena leads to insights about "spatial confounding" and to new methods for causal inference with spatial data.
Birgitte Freiesleben De Blasio Modelling for operational support in Norway during the pandemic
During the COVID-19 pandemic, mathematical modelling has been influential in informing policy decisions. We have developed a modelling pipeline providing situational awareness, short-term prognoses and scenario modelling to assist the Norwegian health authorities with response and preparedness planning.
Our models' unique feature is the focus on regional disease dynamics, centred on a stochastic municipality-level metapopulation model for estimating the epidemiological situation and an individual-based model to study the effects of interventions, including vaccination. The models are informed with data from the Norwegian emergency registry and real-time mobile phone mobility data. The models support the Norwegian COVID-19 strategy, which is locally based. The geographically specific approach is highly relevant because of the significant differences in the local disease burden observed in Norway, where, in particular, Oslo and the surrounding regions have had sustained high infection levels.
In my talk, I will give an overview of the modelling pipeline. I will present our situational awareness model and document that our regional model has enhanced three-week predictive performance compared to alternative models. Then I will describe our modelling of the regional prioritisation of vaccines in Norway. This work was part of the knowledge base for the government's decision to adopt a geographical prioritisation of vaccines in early 2021. Finally, I will talk about our experience with operational modelling, our collaboration with policymakers and the need for advancing communications to improve the use of models during a crisis.
Bo Li Climate Model Validation with respect to Extremes
Climate models use systems of partial differential equations to describe the temporal evolution of climate, oceans, atmosphere, ice, and land-use processes across a spatial domain. Scientists rely on climate models to study why the Earth’s climate is changing and how it might change in the future, as well as to study the dynamics of different climate factors. An interesting question is how we should evaluate whether a climate model simulates the Earth’s real climate. Many existing methods for comparing two climate fields shed light on climate model validation. However, they are not tailored for comparing spatial extremes fields, and the learnings obtained from their applications to climate model evaluation should not be directly extended to climate extremes. The large variation inherent in extreme values makes the evaluation of climate extremes more challenging than that for the mean and dependency structure. We propose a new multiple testing approach to evaluate the extreme behavior of climate model simulations in terms of extreme value distribution and return levels. Our method can identify the regions where the simulated extremes are different from reality, and this will provide climate scientists with insights on how to improve climate models.
Bodhisattva Sen A Kernel Measure of Dissimilarity between M Distributions
We propose a measure between 0 and 1 to quantify how different M distributions (defined on a general metric space) are, which we call the kernel measure of multi-sample dissimilarity (KMD). The population KMD is 0 if and only if all the M distributions are the same, and 1 if and only if all the distributions are mutually singular; moreover, any value between 0 and 1 conveys an idea of how different these distributions are. The sample version of KMD, based on independent observations from the M distributions, can be computed in near linear time, is consistent, and has an asymptotically normal distribution. We develop a test for the equality of the M distributions based on KMD which is shown to be consistent against all alternatives where at least two distributions are not equal. We develop CLTs to study the power behavior of this test, both under fixed and shrinking alternatives. This yields a complete asymptotic characterization of the power of the test as well as its detection threshold.
Brian Reich Modeling Extremal Streamflow using Deep Learning Approximations and a Flexible Spatial Process
Quantifying changes in the probability and magnitude of extreme flooding events is key to mitigating their impacts. While hydrodynamic data are inherently spatially dependent, traditional spatial models such as Gaussian processes are poorly suited for modeling extreme events. Spatial extreme value models with more realistic tail dependence characteristics are under active development. They are theoretically justified, but give intractable likelihoods, making computation challenging for small datasets and prohibitive for continental-scale studies. We propose a process mixture model which specifies spatial dependence in extreme values as a convex combination of a Gaussian process and a max-stable process, yielding desirable tail dependence properties but intractable likelihoods. To address this, we employ a unique computational strategy where a feed-forward neural network is embedded in a density regression model to approximate the conditional distribution at one spatial location given a set of neighbors. We then use this univariate density function to approximate the joint likelihood for all locations by way of a Vecchia approximation. The process mixture model is used to analyze changes in annual maximum streamflow within the US over the last 50 years, and is able to detect areas which show increases in extreme streamflow over time.
Byeong Park Additive regression with parametric help
We discuss a way of improving local linear additive regression when the response variable takes values in a general separable Hilbert space. The new method reduces the constant factor of the leading bias of the local linear smooth backfitting estimator while retaining the same first-order variance. Our methodology covers the case of non-additive regression function as well as additive. We present relevant theory in this flexible framework and demonstrate the benefit of the proposed technique via a real data application.
Chengyong Tang Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff-assisted variable selection
The knockoff-based multiple testing setup of Barber and Candès (2015) for variable selection in multiple regression where sample size is as large as the number of explanatory variables is considered. The method of Benjamini and Hochberg (1995) based on ordinary least squares estimates of the regression coefficients is adjusted to this setup, transforming it to a valid p-value based false discovery rate controlling method not relying on any specific correlation structure of the explanatory variables. Simulations and real data applications show that our proposed method that is agnostic to π0, the proportion of unimportant explanatory variables, and a data-adaptive version of it that uses an estimate of π0 are powerful competitors of the false discovery rate controlling method in Barber and Candès (2015). This is joint work with Dr Sanat K. Sarkar.
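A minimal Python sketch of the classical Benjamini-Hochberg step-up procedure, the building block that the talk adjusts for knockoff-assisted variable selection; the p-values below are simulated placeholders.

    # Benjamini-Hochberg step-up procedure on a vector of p-values.
    import numpy as np

    def benjamini_hochberg(pvals, q=0.1):
        pvals = np.asarray(pvals)
        m = len(pvals)
        order = np.argsort(pvals)
        thresholds = q * np.arange(1, m + 1) / m
        below = np.nonzero(pvals[order] <= thresholds)[0]
        if below.size == 0:
            return np.array([], dtype=int)       # no rejections
        k = below.max()                          # largest i with p_(i) <= q i/m
        return order[: k + 1]                    # indices of rejected hypotheses

    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(0, 0.001, 5), rng.uniform(0, 1, 95)])
    print(sorted(benjamini_hochberg(pvals, q=0.1)))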
Chiara Sabatti Machine learning and genetics
The beginnings of Statistics and Genetics as modern academic disciplines are intertwined and, indeed, these two sciences grew together. Staple statistical methods such as maximum likelihood parameter estimation and correlation are particularly well motivated in genetics.
Recent years have seen the flourishing of machine learning, applied in many scientific endeavors. At the same time, the size and availability of genetic data has increased substantially. Are machine learning tools being used to mine genetic datasets? We will discuss some challenges and highlight approaches that facilitate the deployment of modern data mining tools to the analysis of genetics data.
Chiung-Yu Huang Improved semiparametric estimation of the proportional rate model with recurrent event data
Owing to its robustness properties, marginal interpretations, and ease of implementation, the pseudo-partial likelihood method proposed in the seminal papers of Pepe and Cai (1993) and Lin et al. (2000) has become the default approach for analyzing recurrent event data with Cox-type proportional rate models. However, the construction of the pseudo-partial score function ignores the dependency among recurrent events and thus can be inefficient. An attempt to investigate the asymptotic efficiency of weighted pseudo-partial likelihood estimation found that the optimal weight function involves the unknown variance-covariance process of the recurrent event process and may not have closed-form expression. Thus, instead of deriving the optimal weights, we propose to combine a system of pre-specified weighted pseudo-partial score equations via the generalized method of moments and empirical likelihood estimation. We show that a substantial efficiency gain can be easily achieved without imposing additional model assumptions. More importantly, the proposed estimation procedures can be implemented with existing software. Theoretical and numerical analyses show that the empirical likelihood estimator is more appealing than the generalized method of moments estimator.
Chris Holmes Causal Predictive Inference and Target Trial Emulation
We consider causal inference from observational data as a missing data problem arising from a hypothetical population-scale randomized trial matched to the observational study. This links a target trial protocol with a corresponding generative predictive model for inference, providing a self-contained framework for communication of causal assumptions and statistical uncertainty on causal treatment effects, without the need for counterfactuals. Conventional causal assumptions map to intuitive conditions on the transportability of predictive models across populations and conditions. We demonstrate our approach in an application studying the effects of maternal smoking on birthweights using extensions of Bayesian additive regression trees and inverse probability weighting.
Claire Boyer Handling missing values in linear regression
One of the big ironies of data science is that the more data we have, the more missing data are likely to appear. After discussing the various issues presented by missing data in everyday machine learning, we will present different ways to tackle them for different purposes: indeed, the strategy will not be the same if one wants to perform model estimation or prediction. We will present some insights and works trying to address the previous issues, particularly in the case of linear regression with missing covariates. This is based on joint work with Alexis Ayme, Aymeric Dieuleveut, Julie Josse, Erwan Scornet and Aude Sportisse.
Claudia Czado Copula based state space models
We propose a new class of state space models which are nonlinear and non-Gaussian in both the state and observation equations. More precisely, we assume that the observation equation and the state equation are defined by copula families. Inference is performed within the Bayesian framework, using the Hamiltonian Monte Carlo method. Simulation studies show that the proposed copula-based approach is extremely flexible, since it is able to describe a wide range of dependence structures and, at the same time, allows us to deal with missing data. Two applications involving air pollution measurements are given.

References:

Kreuzer, Alexander, Dalla Valle, Luciana and Czado, Claudia. (2022). A Bayesian non‐linear state space copula model for air pollution in Beijing. Journal of the Royal Statistical Society: Series C (Applied Statistics). 71. 10.1111/rssc.12548.

Kreuzer, Alexander, Dalla Valle, Luciana and Czado, Claudia. (2019). Bayesian multivariate nonlinear state space copula models. arXiv preprint arXiv:1911.00448.
Claudia Kirch Functional change point detection for fMRI data
Functional magnetic resonance imaging (fMRI) is now a well-established technique for studying the brain. However, in many situations, such as when data are acquired in a resting state, the statistical analyses depend crucially on stationarity, which could easily be violated. We introduce tests for the detection of deviations from this assumption by making use of change point alternatives, where changes in the mean as well as the covariance structure of functional time series are considered. Because of the very high dimensionality of the data, an approach based on a general covariance structure is not feasible, so computations are conducted by making use of a multidimensional separable functional covariance structure. Using the developed methods, a large study of resting state fMRI data is conducted to determine whether the subjects undertaking the resting scan have nonstationarities present in their time courses. It is found that a sizeable proportion of the subjects studied are not stationary.
Cristina Mollica Mixtures of Mallows models with Spearman distance for clustering partial rankings
The class of Mallows models with Spearman distance has been barely explored in the literature on ranking data, from both an analytical and a computational point of view. One of the main reasons lies in the fact that, unlike the Kendall, the Hamming or the Cayley metrics, the Spearman distance cannot be decomposed into the sum of independent terms, which would lead to a closed-form solution of the normalizing constant and to convenient simplifications in the estimation procedure. Moreover, with decomposable metrics, more straightforward extensions of the Mallows model can be obtained to handle partial rankings. Nevertheless, the Spearman distance induces a parametric ranking distribution which parallels the multivariate normal model on the ranking space and admits an analytical solution for the MLE of the modal consensus ranking. Motivated by this theoretical argument, we extend the Mallows model with Spearman distance to the finite mixture context. Through a data augmentation strategy, we develop a fast and accurate EM algorithm to perform inference on samples of partially ranked sequences drawn from heterogeneous populations and possibly affected by different types of censoring. Additionally, in order to address the inferential issues related to the analysis with a large number of ranked items, this work introduces an approximation of the Spearman distance distribution. The utility of our proposals to recognise meaningful clusters in ranking data is illustrated through an application to real-world data.
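For illustration, a short Python sketch of the Spearman distance between rankings and of the closed-form consensus for a single homogeneous Mallows-Spearman component (obtained by ranking the item-wise average ranks); the finite mixture, the EM algorithm and the treatment of partial rankings discussed in the talk are not sketched.

    # Spearman distance and closed-form consensus for full rankings.
    import numpy as np
    from scipy.stats import rankdata

    def spearman_distance(r1, r2):
        # Sum of squared rank differences between two full rankings.
        return np.sum((np.asarray(r1) - np.asarray(r2)) ** 2)

    def consensus_ranking(R):
        # R: (N, n) matrix of full rankings (rank assigned to each item).
        # The MLE of the consensus ranks items by their average rank.
        return rankdata(R.mean(axis=0), method="ordinal")

    R = np.array([[1, 2, 3, 4],
                  [2, 1, 3, 4],
                  [1, 3, 2, 4]])
    print("consensus:", consensus_ranking(R))
    print("distance to first ranking:",
          spearman_distance(R[0], consensus_ranking(R)))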
Cristina Rueda Mathematical and Statistical modelling using the FMM approach: The case of the Electrocardiogram
Oscillatory systems arise in different biological and medical fields. Mathematical and statistical approaches are fundamental to deal with these processes. A single oscillation is mathematically represented in the circular space as a circular signal. The FMM (Frequency Modulated Möbius) approach is a novel approach to study these signals that models the phase in the circular space using Möbius transforms. Although little known, as it has only recently been developed, it solves a variety of exciting questions with real data; some of them, such as the decomposition of the signal into components and their multiple uses, are of general application, while others are specific. Among the exciting specific applications is the analysis of electrocardiograms. Electrocardiograms are mostly used as diagnostic tools, since an irregularity in any of those measurements could indicate a heart condition. However, interpreting the signals is not easy, even for trained physicians. The FMMecg model separately characterizes the five fundamental waves of a heartbeat. It does so by generating parameters describing the wave shape of a heartbeat, in a similar way to what a physician would do manually. Diagnostic results are then calculated automatically from those data.
Cun-Hui Zhang Scaled Cp and its optimality in the Lasso path
In sparse linear regression, the Lasso estimator requires a proper penalty to achieve the optimal rate in prediction error. Theory suggests that a proper penalty is proportional to the noise level of the regression model, which is usually unknown and often treated as a nuisance parameter in theoretical studies. The scaled Lasso eliminates the dependence of the unknown noise level in its scale-free penalty via an alternating minimization scheme. It essentially reduces the tuning parameter to a constant factor within a narrow band. Stein's unbiased risk estimation or SURE is a common criterion to select an estimator with minimal prediction error among a collection of candidates. We propose a SURE-tuned scaled Lasso method to fine-tune the constant factor in the scale-free penalty by SURE criterion. We prove an oracle inequality for the proposed estimator, which provides a theoretical guarantee that up to a higher-order term, our method achieves the minimal prediction error within an interval of penalty levels. Simulation studies under broad settings demonstrate its good performance in supporting our theory.
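A minimal Python sketch of the scaled-lasso alternating scheme described above (a lasso step with penalty proportional to the current noise estimate, then a noise update from the residuals); the SURE-based fine-tuning of the constant factor proposed in the talk is not shown, and the data and lambda0 are placeholders.

    # Scaled-lasso sketch: alternate a lasso fit (penalty sigma * lambda0)
    # with an update of the noise level sigma from the residuals.
    import numpy as np
    from sklearn.linear_model import Lasso

    def scaled_lasso(X, y, lambda0, n_iter=20):
        n = len(y)
        sigma = np.std(y)                         # initial noise estimate
        for _ in range(n_iter):
            fit = Lasso(alpha=sigma * lambda0).fit(X, y)
            resid = y - fit.predict(X)
            sigma = np.sqrt(np.sum(resid ** 2) / n)
        return fit.coef_, sigma

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))
    beta = np.zeros(100); beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.normal(size=200)
    coef, sigma_hat = scaled_lasso(X, y, lambda0=np.sqrt(2 * np.log(100) / 200))
    print("estimated noise level:", round(sigma_hat, 2))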
Daniel Peña A Testing Approach to Clustering Scalar Time Series
Clustering scalar time series can be carried out using their univariate properties and hierarchical methods, especially when the dynamic structure of the series is of interest. Two major issues in clustering analysis are to detect the existence of multiple clusters and to determine their number, if they exist. In this paper we propose a new test statistic for detecting the existence of multiple clusters in a time-series data set and a new procedure to determine their number when clusters exist. The proposed method is based on the jumps, i.e., the increments, in the heights of the dendrogram when a hierarchical clustering is applied to the data. We use parametric bootstraps to obtain a reference distribution of the test statistics and propose an iterative procedure to find the number of clusters. The clusters found are internally homogeneous according to the test statistics used in the analysis. The performance of the proposed procedure in finite samples is investigated by Monte Carlo simulations and illustrated by some empirical examples. Comparisons with some existing methods for selecting the number of clusters are also investigated.
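A rough Python sketch of the dendrogram-jump idea: the test statistic below is the largest increment between consecutive merge heights, and a crude parametric bootstrap under a one-cluster Gaussian null provides a reference distribution; the features, null model and iterative procedure of the actual method are simplified placeholders.

    # Dendrogram-jump test statistic with a toy parametric bootstrap.
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def max_dendrogram_jump(features):
        heights = linkage(features, method="average")[:, 2]   # merge heights
        return np.max(np.diff(heights))

    rng = np.random.default_rng(0)
    # Toy "features" of 40 series (e.g. estimated AR coefficients): two groups.
    features = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                          rng.normal(1.0, 0.1, size=(20, 2))])
    observed = max_dendrogram_jump(features)

    # Bootstrap reference distribution under a single-cluster Gaussian null.
    null_stats = [max_dendrogram_jump(rng.normal(features.mean(0),
                                                 features.std(0),
                                                 features.shape))
                  for _ in range(200)]
    print("bootstrap p-value:", np.mean(np.array(null_stats) >= observed))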
Daniele Durante Concentration of discrepancy-based ABC via Rademacher complexity
Classical implementations of approximate Bayesian computation (ABC) employ summary statistics to measure the discrepancy between the observed data and the synthetic samples generated from each proposed value of the parameter of interest. However, finding effective summaries is challenging for most of the complex models for which ABC is required. This issue has motivated a growing literature on summary-free versions of ABC that leverage the discrepancy between the empirical distributions of the observed and synthetic data, rather than focusing on selected summaries. The effectiveness of these solutions has led to an increasing interest in the properties of the corresponding ABC posteriors, with a focus on concentration in asymptotic regimes. Although recent contributions have made key advancements, current theory mostly relies on existence arguments which are not immediate to verify and often yield bounds that are not readily interpretable, thus limiting the methodological implications of theoretical results. In this talk, I will address such aspects by developing a novel unified and constructive framework, based on the concept of Rademacher complexity, to study concentration of ABC posteriors within the general class of integral probability semimetrics (IPS), which includes routinely-implemented discrepancies such as the Wasserstein distance and MMD, and naturally extends classical summary-based ABC. For rejection ABC based on the IPS class, I will prove that the theoretical properties of the ABC posterior in terms of concentration directly relate to the asymptotic behavior of the Rademacher complexity of the class of functions associated with each discrepancy. This result yields a novel understanding of the practical performance of ABC with specific discrepancies, as shown also in empirical studies, and allows one to develop new theory guiding ABC calibration.
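For concreteness, a minimal Python sketch of summary-free rejection ABC with a one-dimensional Wasserstein discrepancy, the kind of scheme whose posterior concentration the talk studies; the model, prior and tolerance are illustrative placeholders.

    # Summary-free rejection ABC with a 1-d Wasserstein discrepancy.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    y_obs = rng.normal(loc=2.0, scale=1.0, size=200)     # observed data

    def simulate(theta, n=200):
        # Toy generative model: N(theta, 1).
        return rng.normal(loc=theta, scale=1.0, size=n)

    accepted = []
    for _ in range(20000):
        theta = rng.uniform(-5, 5)                       # prior draw
        # Keep the draw if the empirical distributions are close enough.
        if wasserstein_distance(y_obs, simulate(theta)) < 0.2:
            accepted.append(theta)

    print("ABC posterior mean:", round(np.mean(accepted), 2),
          "based on", len(accepted), "accepted draws")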
Davide Risso Structure learning of graphical models for count data, with applications to single-cell RNA sequencing
The problem of estimating the structure of a graph from observed data is of growing interest in the context of high-throughput genomic data, and single-cell RNA sequencing in particular. These, however, are challenging applications, since the data consist of high-dimensional counts with high variance and over-abundance of zeros. Here, we present a general framework for learning the structure of a graph from single-cell RNA-seq data, based on the zero-inflated negative binomial distribution. We demonstrate with simulations that our approach is able to retrieve the structure of a graph in a variety of settings and we show the utility of the approach on real data.
Davy Paindaveine L_p inference for multivariate location based on data-based simplices
The fundamental problem of estimating the location of a $d$-variate probability measure under an $L_p$ loss function is considered. The naive estimator, that minimizes the usual empirical $L_p$ risk, has a known asymptotic behaviour but suffers from several deficiencies for $p\neq 2$, the most important one being the lack of equivariance under general affine transformations. We introduce a collection of $L_p$ location estimators that minimize the size of suitable $\ell$-dimensional data-based simplices. For $\ell=1$, these estimators reduce to the naive ones, whereas, for $\ell=d$, they are equivariant under affine transformations. The proposed class contains in particular the celebrated spatial median and Oja median. Under very mild assumptions, we derive an explicit Bahadur representation result for each estimator in the class and establish asymptotic normality. Under a centro-symmetry assumption, we also introduce companion tests for the problem of testing the null hypothesis that the location $\mu$ of the underlying probability measure coincides with a given location $\mu_0$. We compute asymptotic powers of these tests under contiguous local alternatives, which reveals that asymptotic relative efficiencies with respect to traditional parametric Gaussian procedures for hypothesis testing coincide with those obtained for point estimation. Monte Carlo exercises confirm our asymptotic results.
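Since the proposed class contains the spatial median (the naive $L_1$ case with $\ell=1$), a minimal Python sketch of computing the spatial median by Weiszfeld's fixed-point iteration is given below; the simplex-based estimators and the companion tests of the talk are not reproduced.

    # Spatial median via Weiszfeld's fixed-point iteration.
    import numpy as np

    def spatial_median(X, n_iter=200, tol=1e-8):
        m = X.mean(axis=0)                       # starting value
        for _ in range(n_iter):
            d = np.linalg.norm(X - m, axis=1)
            d = np.where(d < tol, tol, d)        # guard against division by zero
            w = 1.0 / d
            m_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(m_new - m) < tol:
                break
            m = m_new
        return m

    rng = np.random.default_rng(0)
    X = rng.standard_t(df=2, size=(500, 3)) + np.array([1.0, -1.0, 0.5])
    print(np.round(spatial_median(X), 2))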
Dennis Shen Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data
A central goal in social science is to evaluate the causal effect of a policy. One dominant approach is through panel data analysis in which the behaviors of multiple units are observed over time. The information across time and space motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthetic controls), which exploits cross-sectional patterns. Conventional wisdom states that the two approaches are fundamentally different. We establish this position to be partly false for estimation but generally true for inference. In particular, we prove that both approaches yield identical point estimates under several standard settings. For the same point estimate, however, each approach quantifies uncertainty with respect to a distinct estimand. The confidence interval developed for one estimand may have incorrect coverage for another. This emphasizes that the source of randomness that researchers assume has direct implications for the accuracy of inference.
50
Dimitris Politis Scalable subsampling: computation, aggregation and inference
Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat{\theta}$ in order to conduct nonparametric inference such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era where the standard, full-resample-size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size b can be computationally challenging with both b and the sample size n being very large. In the paper at hand, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective—and computationally feasible—distribution estimation via subsampling. Further, we show how the same set of subsamples can be used to yield a procedure for subsampling aggregation—also known as subagging—that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as compared to $\hat{\theta}$. The paper is concluded by showing how to conduct inference, e.g., confidence intervals, based on the scalable subagging estimator instead of the original $\hat{\theta}$.
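The following toy Python sketch (ours, not the paper's implementation) illustrates the idea of using non-random, non-overlapping subsamples of size b both for subagging and for a rough normal-approximation confidence interval; the statistic (the median), the block size and the data-generating model are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
n, b = 100_000, 2_000                 # sample size and subsample size
x = rng.exponential(scale=2.0, size=n)
stat = np.median                      # statistic of interest

# Non-random subsamples: consecutive, non-overlapping blocks of size b.
blocks = x[: (n // b) * b].reshape(-1, b)
theta_blocks = np.array([stat(block) for block in blocks])

# Subagging: aggregate the block statistics.
theta_subag = theta_blocks.mean()

# Rough interval based on the spread of the block statistics.
theta_full = stat(x)
se = theta_blocks.std(ddof=1) / np.sqrt(len(theta_blocks))
print(f"full-sample median {theta_full:.4f}, subagged {theta_subag:.4f}")
print(f"approx. 95% CI via subagging: "
      f"[{theta_subag - 1.96*se:.4f}, {theta_subag + 1.96*se:.4f}]")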
51
Dmitry Arkhangelsky Randomization-based Inference for Synthetic Control Estimators
We analyze the properties of synthetic control (SC) type estimators in settings with many treated units, assuming that the treatment assignment is randomized but the probabilities are unknown. Exploiting duality, we interpret the SC optimization problem as a balancing estimator for the unknown probabilities. We analyze the properties of this procedure, assuming that the randomization probabilities are based on past outcomes and unobserved heterogeneity. In the regime where the dependence on the unobserved heterogeneity is limited, we show that the estimator is asymptotically normal but biased. We then quantify this bias under various outcome models.
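For intuition on the optimization problem mentioned above, synthetic control weights can be written as the solution of a simplex-constrained balancing problem; the Python sketch below (ours, on simulated data, using a generic SLSQP solver) shows that optimization in its simplest form and is not the authors' estimator.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T0, J = 30, 15                        # pre-treatment periods, control units
Y_control = rng.normal(size=(T0, J)).cumsum(axis=0)   # control outcomes
y_treated = Y_control[:, :3].mean(axis=1) + rng.normal(scale=0.1, size=T0)

def gap(w):
    # Pre-treatment imbalance between treated unit and weighted controls.
    return np.sum((y_treated - Y_control @ w) ** 2)

cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)   # weights sum to one
bnds = [(0.0, 1.0)] * J                                    # non-negative weights
res = minimize(gap, np.full(J, 1.0 / J), method="SLSQP",
               bounds=bnds, constraints=cons)
print("SC weights (rounded):", np.round(res.x, 2))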
52
Domenico Marinucci Asymptotics for Spherical Functional Autoregressions
We investigate a class of spherical functional autoregressive processes, and we discuss the estimation of the corresponding autoregressive kernels. In particular, we first establish a consistency result (in sup and mean-square norm), then a quantitative central limit theorem (in Wasserstein distance), and finally a weak convergence result, under more restrictive regularity conditions. Our results are validated by a small numerical investigation.
53
Dominik Rothenhausler Calibrated inference: statistical inference that accounts for both sampling uncertainty and distributional uncertainty
During data analysis, analysts often have to make seemingly arbitrary decisions. For example, during data pre-processing there are various options for dealing with outliers or inferring missing data. Similarly, many specifications and methods can be reasonable to address a certain domain question. This may be seen as a hindrance to reliable inference, since conclusions can change depending on the analyst's choices.
In this talk, I argue that this situation is an opportunity to construct confidence intervals that account not only for sampling uncertainty but also for some type of distributional uncertainty. Distributional uncertainty is closely related to other issues in data analysis, ranging from dependence between observations to selection bias and confounding.
54
Dungang Liu Model diagnostics of discrete data regression: a unifying framework using functional residuals
Model diagnostics is an indispensable component of regression analysis, yet it is not well addressed in standard textbooks on generalized linear models. The lack of exposition is attributed to the fact that when outcome data are discrete, classical methods (e.g., Pearson/deviance residual analysis and goodness-of-fit tests) have limited utility in model diagnostics and treatment. This paper establishes a novel framework for model diagnostics of discrete data regression. Unlike the literature defining a single-valued quantity as the residual, we propose to use a function as a vehicle to retain the residual information. In the presence of discreteness, we show that such a functional residual is appropriate for summarizing the residual randomness that cannot be captured by the structural part of the model. We establish its theoretical properties, which lead to the innovation of new diagnostic tools including the functional-residual-vs-covariate plot and the Function-to-Function (Fn-Fn) plot. Our numerical studies demonstrate that the use of these tools can reveal a variety of model misspecifications, such as not properly including a higher-order term, an explanatory variable, an interaction effect, a dispersion parameter, or a zero-inflation component. The functional residual yields, as a byproduct, Liu-Zhang's surrogate residual mainly developed for cumulative link models for ordinal data (Liu and Zhang, 2018, JASA). As a general notion, it considerably broadens the diagnostic scope as it applies to virtually all parametric models for binary, ordinal and count data, all in a unified diagnostic scheme.
55
Eardi Lila Interpretable discriminant analysis for functional data supported on random non-linear domains
In this talk, we will present a novel framework for the classification of functional data supported on non-linear, and possibly random, manifold domains. The motivating application is the identification of subjects with Alzheimer's disease from their cortical surface geometry and associated cortical thickness map. The proposed model is based upon a reformulation of the classification problem into a regularized multivariate functional linear regression model. This allows us to adopt a direct approach to the estimation of the most discriminant direction while controlling for its complexity through appropriate differential regularization. We will also present theoretical results on the out-of-sample prediction error of the proposed model. The application of the proposed method to a pooled dataset from the Alzheimer's Disease Neuroimaging Initiative and the Parkinson's Progression Markers Initiative reveals discriminant directions that capture both cortical geometric and thickness predictive features of Alzheimer's Disease that are consistent with the existing neuroscience literature.
56
Efstathia Bura Ensemble Conditional Variance Estimation for Sufficient Dimension Reduction
Conditional Variance Estimation (CVE) and Ensemble Conditional Variance Estimation (ECVE) are novel sufficient dimension reduction (SDR) methods in regressions with continuous response and predictors. CVE applies to additive error regressions with continuous predictors and link function, and ECVE to general non-additive error regression models. Both operate under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. They are semiparametric forward regression model-based exhaustive sufficient dimension reduction estimation methods that are shown to be consistent under mild assumptions.
CVE outperforms mean average variance estimation (MAVE) and ECVE outperforms central subspace mean average variance estimation (csMAVE), their main competitors, under several simulation settings and in benchmark data set analyses.
57
Eli Ben-Michael Policy learning with asymmetric utilities
Data-driven decision making plays an important role even in high stakes settings like medicine and public policy. Learning optimal policies from observed data requires a careful formulation of the utility function whose expected value is maximized across a population. Although researchers typically use utilities that depend on observed outcomes alone, in many settings the decision maker's utility function is more properly characterized by the joint set of potential outcomes under all actions. For example, the Hippocratic principle to ``do no harm'' implies that the cost of causing death to a patient who would otherwise survive without treatment is greater than the cost of forgoing life-saving treatment. We consider optimal policy learning with asymmetric utility functions of this form. We show that asymmetric utilities lead to an unidentifiable social welfare function, and so we first partially identify it. Drawing on statistical decision theory, we then derive minimax decision rules by minimizing the maximum regret relative to alternative policies. We show that one can learn minimax decision rules from observed data by solving intermediate classification problems. We also establish that the finite sample regret of this procedure is bounded by the mis-classification rate of these intermediate classifiers. We apply this conceptual framework and methodology to the decision about whether or not to use right heart catheterization for patients with possible pulmonary hypertension.
58
Elizabeth Willow Eisenhower A flexible movement model for partially migrating species
We propose a flexible model for a partially migrating species, which we demonstrate using yearly paths for golden eagles (Aquila chrysaetos). Our model relies on a smoothly time-varying potential surface defined by a number of attractors. We compare our proposed approach using varying coefficients to a latent-state model, which we define differently for migrating, dispersing, and local individuals. While latent-state models are more common in the existing animal movement literature, varying coefficient models have various benefits including the ability to fit a wide range of movement strategies without the need for major model adjustments. We compare simulations from the models for three individuals to illustrate the ability of our model to better describe movement behavior for specific movement strategies. We also demonstrate the flexibility of our model by fitting several individuals whose movement behavior is less stereotypical.
59
Elynn Chen Reinforcement Learning with Heterogeneous Data: Estimation and Inference
Reinforcement Learning (RL) has the promise of providing data-driven support for decision-making in a wide range of problems in healthcare, education, business, and other domains. Classical RL methods focus on the mean of the total return and, thus, may provide misleading results in the setting of the heterogeneous population that commonly underlies large-scale datasets. We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity. We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class. Our auto-clustered algorithms can automatically detect and identify homogeneous subpopulations while estimating the Q function and the optimal policy for each subpopulation. We establish convergence rates and construct confidence intervals for the estimators obtained by the ACPE and ACPI. We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset. The latter analysis shows evidence of value heterogeneity and confirms the advantages of our new method.
60
Emiko Dupont Spatial confounding and spatial+
Spatial confounding is an issue that can arise when regression models for spatially varying data are used for effect estimation. Such models include spatial random effects to account for the spatial correlation structure in the data. But as spatial random effects are not independent of spatially dependent covariates, they can interfere with the covariate effect estimates and make them unreliable. Traditional methods for dealing with this problem restrict spatial effects to the orthogonal complement of the covariates; however, recent results show that this approach can be problematic. Spatial+ is a novel method for dealing with spatial confounding when the covariate of interest is spatially dependent but not fully determined by spatial location. Theoretical analysis of estimates as well as simulations show that bias, in this case, arises as a direct result of spatial smoothing and, moreover, that it can be avoided by a simple adjustment to the model matrix in the spatial regression model.
61
Emilie Eliseussen Rank-based Bayesian mixture modeling with covariate information
Preference data, often in the form of (partial) rankings or pair comparisons, are frequently encountered and used to estimate individual behaviors (of users, also called “assessors”) in several areas, such as marketing and politics. Interestingly, rank-based models have recently been proposed as a useful tool for data integration as well. Combining ranking or preference data with covariate information about the assessors can potentially lead to a better understanding of the assessors' behaviors and preferences. The Mallows model is one of the most popular rank-based models, as it adapts well to different types of ranking and preference data, and the previously proposed Bayesian Mallows Model (BMM) offers a flexible computational framework that also allows capturing the heterogeneity across assessors via a finite mixture. However, BMM currently does not allow including covariate information on the assessors. We develop a Bayesian Mallows-based finite mixture model that performs clustering while also accounting for assessor-related covariates. The proposed method is rooted in Product Partition models (PPMx), and therefore named BMMx, as it a priori favors the aggregation of assessors into a cluster when their covariates are similar, where their similarity is measured according to an augmented model. We investigate the performance of BMMx, and compare it to alternative approaches, in both simulation studies and real-data examples.
62
Emily Hector Distributed Inference for Spatial Extremes Modeling in High Dimensions
Extreme environmental events frequently exhibit spatial and temporal dependence. These data are often modeled using max stable processes that are computationally prohibitive to fit for as few as a dozen observations. We propose a spatial partitioning approach based on local modeling of subsets of the spatial domain that delivers computationally and statistically efficient inference. The proposed distributed approach is extended to estimate spatially varying coefficient models to deliver computationally efficient modeling of spatial variation in marginal parameters. We illustrate the flexibility of our approach through simulations and the analysis of streamflow data from the U.S. Geological Survey.
63
Erwan Scornet What's a good imputation to predict with missing values?
Statistical literature suggests that constant imputation should not be used to handle missing data, as it distorts the true data distribution. However, when the aim is to predict in the presence of missing data, methods that first impute as well as possible and then learn on the completed data to predict the outcome are commonly employed. Yet, this widespread practice has no theoretical grounding.
In this talk, we will see that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling.
Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically, and therefore constant imputation might be sufficient in various settings. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations.
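A small simulation in the spirit of this result, using constant imputation followed by a flexible learner, is sketched below in Python; the data-generating process, the missingness rate and the random-forest learner are illustrative assumptions of ours.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n, d = 5_000, 5
X = rng.normal(size=(n, d))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + X[:, 2] + rng.normal(scale=0.3, size=n)

# Missing-completely-at-random entries in the covariates.
mask = rng.random(X.shape) < 0.3
X_miss = np.where(mask, np.nan, X)

# Impute-then-regress with a *constant* imputation (here 0),
# appending the missingness pattern as extra features.
X_imp = np.nan_to_num(X_miss, nan=0.0)
X_aug = np.hstack([X_imp, mask.astype(float)])

tr, te = slice(0, 4_000), slice(4_000, n)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_aug[tr], y[tr])
mse = np.mean((model.predict(X_aug[te]) - y[te]) ** 2)
print(f"test MSE with constant imputation + flexible learner: {mse:.3f}")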
64
Eunice Carrasquinha Network-based regularization: an application to melanoma cancer
Melanoma is the principal cause of death of all skin diseases, and its incidence is increasing faster than any other type of cancer. A successful treatment depends on early detection, as the metastatic form is resistant to therapies. Gene expression data are increasingly being used to establish a diagnosis and optimize treatment of oncological patients. In this work, we propose the analysis of gene expression data from metastatic melanoma as a tool to obtain potential genes that could be important targets for new therapies and treatment. However, the high-dimensional nature of the data brings many constraints, for which several approaches have been considered, with regularization techniques at the forefront of research. Additionally, the network structure of gene expression data has fostered the development of network-based regularization techniques to convey data into a low-dimensional and interpretable level.
In this work, the classical elastic net and two recently proposed network-based methods, HubCox and OrphanCox, are applied to high-dimensional gene expression data to model survival data.
The melanoma transcriptomic dataset obtained from The Cancer Genome Atlas (TCGA) is used, considering patients' RNA-seq measurements as covariates.
The application of sparsity-inducing techniques to the SKCM dataset enabled the selection of relevant genes (CIITA, HLA-DQB1 and HLA-DQA1) over a range of parameters evaluated. Comparable results were obtained for the elastic net and the network-based OrphanCox regarding model performance and genes selected.
65
Fabrizio Ruggeri An adversarial risk analysis framework for the software release problem
A major issue in software engineering is the decision of when to release a software product to the market. This problem is complex due to, among other things, the uncertainty surrounding the software quality and its faults, the various costs involved, and the presence of competitors. A general adversarial risk analysis framework is proposed to support a software developer in deciding when to release a product and showcased with an example.
66
Falco Bargagli Stoffi Causal Rule Ensemble: An Ensemble Learning Approach for Interpretable Discovery of Heterogeneous Subgroups
In health and social sciences, it is critically important to identify subgroups of the study population where a treatment has notable heterogeneity in the causal effects. Recently, data-driven discovery of heterogeneous effects via decision tree methods has been proposed. Despite its high interpretability, single tree discovery of heterogeneous effects tends to overfit the training data, and to find an oversimplified representation of treatment heterogeneity. To accommodate these shortcomings, we propose a new Causal Rule Ensemble (CRE) method that discovers heterogeneous subgroups through an ensemble-of-trees approach. CRE can a) uncover complex heterogeneity patterns; b) control for finite sample familywise error rate in the subgroups discovery; and c) avoid overfitting. The discovered subgroups are defined in terms of interpretable decision rules and are referred to as causal rules. We employ a two-stage CATE estimator for the discovered causal rules and provide theoretical guarantees. We also propose a new sensitivity analysis to unmeasured confounding bias for the estimated CATEs. Via simulations, we show that the CRE method has competitive discovery and estimation performance when compared to state-of-the-art techniques.
67
Faming Liang Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network
Sufficient dimension reduction is a powerful tool to extract core information hidden in the high-dimensional data and has potentially many important applications in machine learning tasks. However, the existing nonlinear sufficient dimension reduction methods often lack the scalability necessary for dealing with large-scale data. We propose a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for large-scale data. The proposed stochastic neural network is trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm, whose convergence is rigorously studied in the paper as well. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.
68
Federico Camerlenghi Normalized random measures with interacting atoms for Bayesian mixtures
The seminal work of Ferguson (1973), who introduced the Dirichlet process, has spurred the definition and investigation of more general classes of Bayesian nonparametric priors, with the aim of increasing flexibility while maintaining analytical tractability. Among the numerous generalizations, a very large class of random probability measures was introduced by Regazzini et al. (2003): the class of normalized random measures with independent increments (NRMIs).
NRMIs are random probability measures with almost surely discrete realizations, defined through the specification of two ingredients: i) a sequence of unnormalized weights, which are the jumps of a Lévy process on the positive real line; ii) a sequence of i.i.d. random atoms from a common base measure.
This construction is appealing from a mathematical standpoint because analytical tractability is preserved; however, NRMIs do not allow interaction among atoms, which are assumed to be independent and identically distributed. In some applied frameworks the i.i.d. assumption is too restrictive; for instance, NRMIs lead to poor performance in model-based clustering when they are used as mixing measures in mixture models. To overcome this limitation, we introduce a new class of normalized random measures with atoms' interaction. In our construction the atoms come from a finite point process, which is marked with i.i.d. positive weights. A new class of random probability measures is then obtained by normalization. The desired interaction among atoms is induced by a suitable choice of the law of the point process, which can create a repulsive or attractive behaviour.
By means of Palm calculus, we are able to characterize the marginal, predictive and posterior distributions for the proposed model. We specialize all our results for several choices of the finite point process, i.e., in the Poisson, Determinantal, Gibbs and Shot-Noise Cox cases.
Finally we discuss the use of the proposed process as a mixing measure in mixture models. Here we show that our theoretical findings allow us to develop computational procedures suited to obtain well-separated clusters and reliable density estimates.
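Schematically, and in our own notation rather than necessarily that of the talk, both constructions normalize a discrete random measure,
\[
\tilde{P}(\cdot) \;=\; \sum_{j \ge 1} \frac{J_j}{\sum_{k \ge 1} J_k}\, \delta_{\theta_j}(\cdot),
\]
where in the NRMI case the jumps $(J_j)_{j\ge 1}$ are those of a Lévy process on the positive real line and the atoms $\theta_j$ are i.i.d. from a base measure, whereas in the proposed class the atoms $(\theta_j)$ form a finite point process (Poisson, Determinantal, Gibbs or Shot-Noise Cox) marked with i.i.d. positive weights $J_j$, so that the law of the point process induces repulsion or attraction among the atoms.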
69
Francesca Dominici Methods for Causal Inference to Estimate an Exposure Response Function
In this talk, I will provide an overview of data science methods, including methods for Bayesian analysis, causal inference, and machine learning, to inform environmental policy. This is based on my work analyzing a data platform of unprecedented size and representativeness. The platform includes more than 500 million observations on the health experience of over 95% of the US population older than 65 years old linked to air pollution exposure and several confounders.
70
Gautam Kamath The Role of Bias in Private Estimation
Differential privacy provides a rigorous framework for privacy-preserving statistical estimation. However, the standard private estimators provide biased estimates of the underlying population parameters, due to operations involving data clipping. We explore the role of bias in private estimation, showing that in many natural cases, private estimation necessarily incurs bias. We also show tradeoffs between bias and mean squared error for private estimators, and give private algorithms for unbiased estimation in restricted settings.
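A toy Python illustration (ours) of the clipping-induced bias discussed above, using the standard clipped-mean mechanism with Laplace noise; the clipping bound, privacy budget and data distribution are arbitrary choices.

import numpy as np

rng = np.random.default_rng(5)
n, eps, C = 2_000, 1.0, 1.0          # sample size, privacy budget, clip bound
true_mean = 1.5

def private_mean(x):
    # Clip each observation to [-C, C], average, then add Laplace noise
    # scaled to the sensitivity 2C / (n * eps).
    clipped = np.clip(x, -C, C)
    return clipped.mean() + rng.laplace(scale=2 * C / (len(x) * eps))

estimates = [private_mean(rng.normal(true_mean, 1.0, size=n)) for _ in range(500)]
print(f"true mean {true_mean:.2f}, average private estimate {np.mean(estimates):.2f} "
      "(the gap is the clipping bias)")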
71
Genevera Allen GRAPH QUILTING: GRAPHICAL MODEL SELECTION FROM PARTIALLY OBSERVED INTERACTIONS
Graphical model estimation and selection is a seemingly impossible task when several pairs of variables are never jointly observed. Recovering the edges of a graph in such settings requires one to infer conditional dependencies between variables for which the empirical marginal dependence, or covariance, does not exist. This unexplored statistical problem arises in neuroimaging, for example, where due to technology limitations, it is impossible to jointly observe the activities of all neurons simultaneously. We call this statistical challenge the “Graph Quilting” problem. We study this problem for Gaussian graphical models, where we show that missingness of parts of the covariance matrix translates into an unidentifiable precision matrix which specifies the graph. Nonetheless, we show that under mild conditions, it is possible to correctly identify edges connecting the observed pairs of nodes. Additionally, we show that we can recover a minimal superset of edges connecting variables that are never jointly observed. Thus, we show that one can infer conditional relationships even when marginal relationships are missing, a surprising result! To accomplish this, we propose an L1-regularized partially observed likelihood graph estimator and establish its high-dimensional rates of convergence for the Graph Quilting problem. We illustrate our approach using synthetic data, as well as for learning functional connectivity from in vivo calcium imaging data of ten thousand neurons in the mouse visual cortex.
72
George Michailidis The Bayesian Nested Lasso for Mixed Frequency Regression Models
In this talk, we discuss the problem of modeling and analysis of time series data that evolve at different frequencies (e.g., quarterly-monthly). We focus on forecasting a single variable measured at a low frequency based on a regression model that includes past lags of the response variable and other high- and low-frequency predictors and their lagged values. We develop the Bayesian Nested Lasso (BNL), which leads to principled selection of the lag of the predictors, reduces the effective number of model parameters through sparsity induced by the lasso component, and incorporates desirable decay patterns over time lags in the magnitude of the corresponding regression coefficients. Further, it is easy to obtain samples from the posterior distribution due to the closed-form expressions for the conditional distributions of the model parameters. Theoretical properties of the method are established, and numerical results obtained from synthetic and macroeconomic data illustrate the good performance of the proposed Bayesian framework in parameter selection and estimation, and in the key task of GDP forecasting.
73
Georgia Papadogeorgou Addressing selection bias in cluster randomized experiments
In pragmatic cluster randomized experiments, units are often recruited after the random cluster assignment. This can lead to post-randomization selection bias, inducing systematic differences in baseline characteristics of the recruited patients between intervention and control arms. In such situations there are two different causal estimands of average treatment effects, one on the overall population and one on the recruited population, which require different data and strategies to identify. We specify the conditions under which cluster randomization implies individual randomization. We show that under the assumption of ignorable recruitment, the average treatment effect on the recruited population can be consistently estimated from the recruited sample, via either regression adjustment or inverse probability weighting. While the average treatment effect on the overall population is generally not identifiable from the observed sample alone, a meaningful weighted estimand on the overall population can be consistently estimated via applying a simple weighting scheme to the recruited sample. This estimand corresponds to the subpopulation of units who would be recruited into the study regardless of the assignment.
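For intuition only, the Python sketch below (ours) simulates post-randomization recruitment that depends on a covariate and on the assigned arm, and contrasts the naive recruited-sample difference in means with an inverse-probability-weighted estimate; the recruitment probabilities are assumed known here, which is a simplification relative to the setting of the talk.

import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)                 # baseline covariate
z = rng.binomial(1, 0.5, size=n)       # randomized treatment arm
# Recruitment depends on the covariate and, post-randomization, on the arm.
p_recruit = 1 / (1 + np.exp(-(0.2 + 0.8 * x * (1 + z))))
r = rng.binomial(1, p_recruit)         # recruited indicator
y = 1.0 + 0.5 * x + 1.0 * z + rng.normal(size=n)   # outcome, true effect = 1

# Naive recruited-sample difference in means (subject to selection bias).
naive = y[(r == 1) & (z == 1)].mean() - y[(r == 1) & (z == 0)].mean()

# Weighting the recruited sample by inverse recruitment probabilities.
w = r / p_recruit
ipw = (np.sum(w * z * y) / np.sum(w * z)
       - np.sum(w * (1 - z) * y) / np.sum(w * (1 - z)))
print(f"naive recruited-sample estimate {naive:.3f}, weighted estimate {ipw:.3f} (truth 1.0)")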
74
Gérard Biau Scaling ResNets in the large-depth regime
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
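A quick numerical illustration of the three regimes (ours, untrained networks only): the Python sketch below iterates the residual recursion with i.i.d. Gaussian initializations and a ReLU activation and reports the growth of $\log_{10}\|h_L\|$; the width, depth and activation are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
d, L = 64, 1_000                       # width and depth
relu = lambda z: np.maximum(z, 0.0)

def log10_final_norm(alpha):
    # Residual recursion h_{l+1} = h_l + alpha * V_l relu(W_l h_l) with
    # i.i.d. N(0, 1/d) weights; renormalize each step (the map is positively
    # homogeneous) and accumulate log10 of the norm to avoid overflow.
    h = rng.normal(size=d)
    log10_norm = np.log10(np.linalg.norm(h))
    h = h / np.linalg.norm(h)
    for _ in range(L):
        W = rng.normal(scale=1 / np.sqrt(d), size=(d, d))
        V = rng.normal(scale=1 / np.sqrt(d), size=(d, d))
        h = h + alpha * V @ relu(W @ h)
        s = np.linalg.norm(h)
        log10_norm += np.log10(s)
        h = h / s
    return log10_norm

for label, alpha in [("alpha_L = 1        ", 1.0),
                     ("alpha_L = 1/sqrt(L)", 1 / np.sqrt(L)),
                     ("alpha_L = 1/L      ", 1.0 / L)]:
    # Expect: explosion, a non-trivial but stable regime, and near-identity.
    print(f"{label} -> log10 ||h_L|| = {log10_final_norm(alpha):.2f}")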
75
Germain Van Bever Flexible Hilbertian Additive Regression with Small Errors-in-Variables
In this talk, we present a new framework of additive regression modelling for data in very generic settings. More precisely, we tackle the problem of estimating component functions of additive models where the regressors and/or response variable belong to general Hilbert spaces and can be imperfectly observed. By this, we mean that some variables can be either measured incompletely or with errors. Smooth backfitting methods are used to estimate consistently the component functions and we provide explicit rates of convergence. We amply illustrate our methodology in various settings, including the functional, Riemannian and Hilbertian settings.
76
Hal Stern Statistics and the Fair Administration of Justice: Assessing Bloodstain Pattern Evidence
Statistics has emerged as a critical topic in ongoing discussions regarding the use of science to assess forensic evidence. A 2009 U.S. National Academies report on forensic science and a subsequent 2016 report by the U.S. President’s Council of Advisers on Science and Technology raised questions about the scientific underpinnings for the analysis of a number of types of forensic evidence. Misapplication of forensic science has been identified as a contributing factor in nearly half of 362 cases in which DNA helped exonerate wrongly-convicted individuals. For these reasons there has been an increased focus on evaluating the ways in which evidence is analyzed, interpreted and reported with an eye towards providing more scientifically justified methods. This talk provides some background on forensic statistics and demonstrates approaches to inference for bloodstain pattern evidence. Contributions include a novel approach to representing the bloodstain patterns and the application of a Dirichlet Process Mixture Model for assessing the likelihood of observing a given pattern under different causal mechanisms.
77
Han Xiao Multi-linear Tensor Autoregressive Models
Contemporary time series analysis has seen more and more tensor-type data from many fields. In the first part of the talk, we propose a multi-linear autoregressive model for tensor-valued time series (TenAR). Compared with the traditional VAR approach, the TenAR model preserves the tensor structure and has advantages in terms of interpretability, dimension reduction and computation. We propose alternating algorithms to obtain the LSE and MLE. The performance of the models and methods is demonstrated by theoretical studies and simulated and real examples.
In the second part of the talk, we consider the TenAR model under co-integration. We investigate the MLE under a separable error covariance structure and provide asymptotic results for the co-integration vectors, as well as the projection onto the co-integration space.
78
Hans Mueller Network Dynamics via Fréchet Regression
Fréchet regression implements conditional Fréchet means when predictors are scalars or vectors and responses are random objects, i.e., lie in a separable metric space (Petersen & Müller 2019). We show that this approach can be applied to network data when each network in its entirety is viewed as a random object and it is of interest to quantify how networks change in dependence on covariates. To this end, we represent networks by their graph Laplacians and the challenge then is to characterize the space of graph Laplacians so as to justify the application of Fréchet regression. Ultimately this characterization leads to asymptotic rates of convergence by applying empirical process theory. This network regression approach is illustrated with resting-state fMRI data in neuroimaging and New York taxi records.
79
Helene Charlotte Wiese Rytgaard Continuous-time targeted learning
Targeted learning (TMLE) provides a general framework for semiparametric efficient substitution estimation of causal parameters that combines machine learning with rigorous statistical inference. The TMLE estimation procedure updates initial estimators for nuisance parameters to provide nonparametric inference for the causal parameter based on the efficient influence function. In this talk I will introduce the continuous-time TMLE, a generalization of the targeted learning methodology for estimation of effects of time-dependent interventions in longitudinal data settings where interventions, covariates and outcomes can occur at subject-specific points in time.
80
Henry Lu Statistical Learning for AI Assisted Clinics
This study discusses the co-development of AI-assisted clinics with Taipei Veterans General Hospital. The design of computer-assisted diagnosis systems that apply deep learning techniques to multiple modalities of medical images is discussed for specific clinical applications. Related issues concerning the integration of statistical models, computational algorithms and domain knowledge are investigated. Current developments are summarized and potential future studies are discussed.
81
Hernando Ombao Exploring General Dependence in a Brain Network
Brain physiological and cognitive processes over the entire network are complex. A full understanding of brain activity requires careful study of its multi-scale spatial-temporal organization, from small volumes of neurons to communities of regions of interest, and from transient events to long-term temporal dynamics. Motivated by these challenges, we will explore some characterizations of dependence between components of a multivariate time series and then apply these to the study of brain functional connectivity. There is no single measure of dependence that can capture all facets of brain connectivity. Here, we shall explore dependence between band-specific oscillations as well as scale-specific subprocesses of a locally stationary wavelet process. We developed a method that explores potential interactions between oscillations in multivariate time series using these subprocesses. This is potentially interesting for brain scientists because functional brain networks are associated with cognitive function, and mental and neurological diseases. This method is used to study differences in functional brain connectivity between healthy children and children diagnosed with attention deficit hyperactivity disorder. This is joint work with Paolo Redondo, Haibo Wu, Sultan Malik and Marco Pinto of the Biostatistics Group at KAUST.
82
Holger Rootzen Multivariate Peaks over Thresholds modelling: influenza prediction and anomaly detection
Extreme value statistics based on the Multivariate Generalized Pareto (MGP) distribution is in rapid development and already helps mitigate risks in global warming, civil engineering, and finance. In these models, a high threshold is chosen for each component in a random vector; the difference between the component values and the thresholds – the peaks – are computed; and the vector is considered extreme and included in the modelling with the MGP distribution if one or more of the peaks are positive. I will describe the use of these methods in two new areas: prediction of the risk that an ongoing influenza epidemic will be exceptionally severe, and real-time detection of anomalous and dangerous influenza epidemics and of anomalous and fraudulent credit card transactions.
83
Hongyu Zhao Leveraging large biobank data across populations to improve disease risk predictions using genetic information
Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we have developed a statistical method called SDPRX that can effectively integrate genome-wide association study (GWAS) summary statistics from different populations. SDPRX characterizes the joint distribution of the effect sizes of a variant in two populations as being both null, population-specific, or shared with correlation. It automatically adjusts for linkage disequilibrium differences between populations. Through simulations and applications to seven traits, we compared the prediction performance of SDPRX with three other methods: PRS-CSx, LDpred2 and XPASS. LDpred2 is a single-population method that takes non-EUR GWAS summary statistics as input, while SDPRX, PRS-CSx and XPASS are multi-discovery methods that jointly integrate GWAS summary statistics from multiple populations. We show that SDPRX outperforms the other cross-population prediction methods in non-European populations, with prediction accuracy on average 22% better than PRS-CSx, 33% better than LDpred2, and 39% better than XPASS across the quantitative and binary traits considered.
84
Iavor Bojinov Population Interference In Panel Experiments
The phenomenon of population interference, where a treatment assigned to one experimental unit affects another experimental unit's outcome, has received considerable attention in standard randomized experiments. The complications produced by population interference in this setting are now readily recognized, and partial remedies are well known. Much less understood is the impact of population interference in panel experiments where treatment is sequentially randomized in the population, and the outcomes are observed at each time step. This paper proposes a general framework for studying population interference in panel experiments and presents new finite population estimation and inference results. Our findings suggest that, under mild assumptions, the addition of a temporal dimension to an experiment alleviates some of the challenges of population interference for certain estimands. In contrast, we show that the presence of carryover effects --- that is, when past treatments may affect future outcomes --- exacerbates the problem. Revisiting the special case of standard experiments with population interference, we prove a central limit theorem under weaker conditions than previous results in the literature and highlight the trade-off between flexibility in the design and the interference structure.
85
Imke Mayer Causal survival analysis from theory to practice
Causal survival analysis consists of estimating the effect of a treatment on time-to-event outcome(s). We focus on estimating the restricted mean survival time (RMST), the average survival time from baseline to a pre-specified time, between treated and control groups on right-censored data from an observational study.
After stating the identifiability conditions, we review and compare different causal estimation methods, both parametric and non-parametric, which require modeling of the propensity score, the survival outcome and the censoring. We illustrate these methods on observational clinical data to answer a medical question about the effect of transfusion on one-year mortality for patients in intensive care. We discuss the interpretability of the findings of this study from a methodological point of view and explain the methodological challenges for causal survival analysis raised by this study (missing values, selection of the study population, selection of the adjustment variables), with the aim of guiding future practitioners.
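To fix ideas, the Python sketch below (ours) estimates an RMST difference on simulated right-censored data by integrating Kaplan-Meier curves up to a horizon tau; it ignores confounding and covariate-dependent censoring, which are exactly the complications that the methods discussed in the talk address.

import numpy as np

rng = np.random.default_rng(8)

def km_rmst(time, event, tau):
    # Kaplan-Meier estimate of S(t), then RMST = integral of S(t) on [0, tau].
    order = np.argsort(time)
    t, d = time[order], event[order]
    at_risk = np.arange(len(t), 0, -1)
    surv = np.cumprod(1.0 - d / at_risk)
    grid = np.clip(np.concatenate(([0.0], t, [tau])), 0, tau)
    s = np.concatenate(([1.0], surv, [surv[-1]]))
    return np.sum(np.diff(grid) * s[:-1])   # step-function integration

# Simulated right-censored data with a true treatment benefit.
n, tau = 2_000, 12.0
z = rng.binomial(1, 0.5, size=n)
t_event = rng.exponential(scale=np.where(z == 1, 10.0, 6.0))
t_cens = rng.exponential(scale=15.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(float)

rmst_diff = (km_rmst(time[z == 1], event[z == 1], tau)
             - km_rmst(time[z == 0], event[z == 0], tau))
print(f"unadjusted RMST difference up to tau={tau}: {rmst_diff:.2f}")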
86
Ivette Gomes Reliable Ways to Measure Risks of Rare Events
Among the great variety of alternative methodologies available for the management of risks of extreme events, and for stationary sequences from a model F with a heavy right tail, i.e. a positive extreme value index (EVI), the value at risk (VaR) and the conditional tail expectation (CTE) will be under discussion. For these Pareto-type models, the classical EVI-estimators are the Hill (H) estimators, which naturally lead to associated H VaR- and CTE-estimators. Since H can be replaced by any consistent EVI-estimator, improvements in the performance of the H CTE-estimators, through the use of reliable EVI-estimators based on different generalised means, are now suggested and studied, both asymptotically and for finite samples.
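For concreteness, the classical Hill, Weissman-type VaR and heavy-tail CTE estimators referred to above can be written in a few lines; the Python sketch below (ours) uses a simulated Pareto sample and an arbitrary choice of k, and does not implement the generalised-means refinements proposed in the talk.

import numpy as np

rng = np.random.default_rng(9)
n, k, p = 5_000, 200, 0.001            # sample size, top order statistics, tail probability
x = rng.pareto(a=2.0, size=n) + 1.0    # Pareto(2): true EVI = 1/2

xs = np.sort(x)
top = xs[-k:]                          # k largest observations
u = xs[-k - 1]                         # (k+1)-th largest, used as the threshold
hill = np.mean(np.log(top) - np.log(u))        # Hill EVI estimator
var_hat = u * (k / (n * p)) ** hill            # Weissman-type VaR estimator
cte_hat = var_hat / (1.0 - hill) if hill < 1 else np.inf  # heavy-tail CTE approximation

print(f"Hill EVI estimate {hill:.3f} (true 0.5)")
print(f"estimated VaR at level {1 - p:.3%}: {var_hat:.1f} "
      f"(true {(1 / p) ** 0.5:.1f}); CTE approximation: {cte_hat:.1f}")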
87
Jaakko Pere On extreme behavior of functional observations
In this work we consider extreme value theory in functional settings. In functional settings, one needs to specify in what sense a function is viewed as extreme. This can be done by considering some mapping that can be used to measure the typicality/extremity of the observation. For example, one can consider mapping each function to its maximum value. We then consider the univariate extreme value index estimators on the set of the mapped values. These values can be used, for example, in comparing the extreme behavior of two sets of functions.
88
Jacob Bien Selective Inference for Hierarchical Clustering
Although statistics textbooks emphasize the importance of forming a hypothesis before looking at a data set, in practice it is quite common for data analysts to "double dip." That is, they first explore a data set to formulate some hypotheses and then they want to know whether what they have found is "real." For example, after running a clustering method on some data, a data analyst looking at two of the clusters might want to know whether their means are "truly" different from each other. Applying a standard two-sample test in such a setting will lead to a grossly inflated Type I error rate. We develop a selective inference approach to help answer this question while properly accounting for clustering having been performed on the data.
89
Jan Hannig Generalized Fiducial Inference on Differentiable Manifolds
We consider the problem of defining a general fiducial density on an implicitly defined differentiable manifold. Our proposed density extends the Generalized Fiducial Distribution (GFD) of Hannig et al. (2016). The resulting GFD formula is obtained by projecting the Jacobian differential in the ambient space onto the tangent space of the manifold. To circumvent the need for an intractable marginal integral calculation, we use two Monte Carlo algorithms that can efficiently explore a constrained parameter space and adapt them for use with the Constrained GFD. To demonstrate the new GFD formula we consider a number of simple examples. We also apply this methodology to a density estimation problem using splines and to the estimation of a Gaussian precision matrix with some known zero entries. Finally, we discuss how the manifold point of view could contribute to the philosophical understanding of fiducial distributions.
90
Jana Jureckova Utilization of Choquet Capacities in Statistical Inference
In statistical decision making we often work with systems of probability distributions rather than with individual ones. The problem is then, e.g., testing hypotheses on the system, comparing several systems with each other, or numerically characterizing functionals of the whole system, among others. A suitable Choquet capacity dominating the system enables us to characterize the system as a whole. It provides more flexibility in the decision, among other things by relaxing additivity to sub-additivity. We shall illustrate the applications of capacities in testing, in the characterization of risk measures, in describing the distance between systems, and elsewhere.
91
Jane-Ling Wang Deep Learning for Partially Linear Cox Model
While deep learning approaches to survival data have demonstrated empirical success in applications, most of these methods are difficult to interpret and mathematical understanding of them is lacking. This paper studies the partially linear Cox model, where the nonlinear component of the model is implemented using a deep neural network. The proposed approach is flexible and able to circumvent the curse of dimensionality, yet it facilitates interpretability of the effects of treatment covariates on survival. We establish asymptotic theories of maximum partial likelihood estimators and show that our nonparametric deep neural network estimator achieves the minimax optimal rate of convergence (up to a poly-logarithmic factor). Moreover, we prove that the corresponding finite-dimensional estimator for treatment covariate effects is $\sqrt{n}$-consistent, asymptotically normal, and attains semiparametric efficiency. Extensive simulation studies and analyses of two real survival datasets show the proposed estimator produces confidence intervals with superior coverage as well as survival time predictions with superior concordance to actual survival times.
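In symbols (our notation, not necessarily that of the paper), the partially linear Cox model specifies the conditional hazard as
\[
\lambda(t \mid Z, X) \;=\; \lambda_0(t) \exp\{\beta^{\top} Z + g(X)\},
\]
where $Z$ are the treatment covariates entering linearly with parameter $\beta$, $g$ is the unknown nonlinear function estimated here by a deep neural network, and $\lambda_0$ is an unspecified baseline hazard.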
92
Jean Feng Efficient nonparametric statistical inference on population feature importance
In predictive modeling applications, it is often of interest to determine the relative contribution of subsets of features in explaining the variability of an outcome. It is useful to consider this variable importance as a function of the unknown, underlying data-generating mechanism rather than the specific predictive algorithm used to fit the data. By connecting ideas in nonparametric variable importance to machine learning, we provide a method for efficient estimation of variable importance when building a predictive model using neural networks. In particular, we show how a single augmented neural network with multi-task learning can simultaneously estimate the importance of many feature subsets, improving on previous procedures for estimating importance. Furthermore, we extend these ideas to define Shapley Population Variable Importance Measure (SPVIM) and develop a computationally efficient procedure for its statistical inference. Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we develop a statistically efficient estimator that randomly samples only O(n) feature subsets given n observations. Our procedure has good finite-sample performance in simulations and we illustrate its application to an in-hospital mortality prediction task.
93
Jelena Bradic Fair Policy Targeting
One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities across sensitive attributes such as age, gender, or race. This paper addresses the question of the design of fair and efficient treatment allocation rules. We adopt the non-maleficence perspective of first do no harm: we select the fairest allocation within the Pareto frontier. We cast the optimization into a mixed-integer linear program formulation, which can be solved using off-the-shelf algorithms. We derive regret bounds on the unfairness of the estimated policy function and small sample guarantees on the Pareto frontier under general notions of fairness. Finally, we illustrate our method using an application from education economics.
94
Jessica Utts Data Science Ethics: Issues for Statisticians
As sources of data become more plentiful and massive data sets are easy to acquire, new ethical issues arise involving data quality and privacy, and the analysis, interpretation and dissemination of data-driven decisions. There are numerous anecdotes involving abuses of complex data analyses and algorithms, and the impact they have had on society. What can statisticians do to help enhance data science ethics in practice? This talk will address some of these issues, and provide examples and resources.
95
Jianqing Fan Structural Deep Learning in Conditional Asset Pricing
We develop new financial economics theory guided structural nonparametric methods for estimating conditional asset pricing models using deep neural networks, by employing time-varying conditional information on alphas and betas carried by firm-specific characteristics. Contrary to many applications of neural networks in economics, we can open the “black box” of machine learning predictions by incorporating financial economics theory into the learning, and provide an economic interpretation of the successful predictions obtained from neural networks, by decomposing the neural predictors as risk-related and mispricing components. Our estimation method starts with period-by-period cross-sectional deep learning, followed by local PCAs to capture time-varying features such as latent factors of the model. We formally establish the asymptotic theory of the structural deep-learning estimators, which apply to both in-sample fit and out-of-sample predictions. We also illustrate the “double-descent-risk” phenomena associated with over-parametrized predictions, which justifies the use of over-fitting machine learning methods.
96
Jiayang Sun Interpretable Learning for model, transformation, and variable selection
Feature selection is critical for developing drug targets or understanding reproductive success at high altitudes. However, selected features depend on the model assumption used for feature selection. Determining variable transformations to make the model more realistic or interpretable is not trivial in the case of many features or variables. This talk presents our advance toward a semi-parametric learning pipeline to study feature, transformation, and model selection in a “triathlon.” We introduce a concept of necessity-sufficiency guarantee, open up dialogues for paradigm changes, provide our learning procedure, and illustrate its performance and applications.
97
Jin Zhou Longitudinal and survival modeling of biobank-scale electronic health record (EHR) and Omics data
The availability of vast amounts of longitudinal data from electronic health records (EHR) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within-individual variability. We propose a scalable method, the within-subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time-varying and time-invariant predictors on within-subject variance. It is robust against misspecification of the conditional distribution of responses or the distribution of random effects. It shows performance similar to that of correctly specified likelihood methods but is $10^3 \sim 10^5$ times faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care. This is joint work with Chris German (UCLA), Janet Sinsheimer (UCLA), and Hua Zhou (UCLA).
98
Jinchi Lv SIMPLE-RP: Group Network Inference with Non-Sharp Nulls and Weak Signals
Large-scale network inference with uncertainty quantification has important applications in natural, social, and medical sciences. The recent work of Fan, Fan, Han and Lv (2022) introduced a general framework of statistical inference on membership profiles in large networks (SIMPLE) for testing the sharp null hypothesis that a pair of given nodes share the same membership profiles. In real applications, there are often groups of nodes under investigation that may share similar membership profiles in the presence of relatively weaker signals than the setting considered in SIMPLE. To address these practical challenges, in this paper we suggest the method of SIMPLE with random pairing (SIMPLE-RP) for testing the non-sharp null hypothesis that a group of given nodes share similar (not necessarily identical) membership profiles under weaker signals. Utilizing the idea of uniform random pairing, we construct our test as the maximum of the SIMPLE tests for subsampled node pairs from the group. This technique significantly reduces the correlation among individual SIMPLE tests while largely maintaining their power, enabling delicate analysis of the asymptotic distributions of the SIMPLE-RP test. Under the mixed membership models without degree heterogeneity, we establish a simple-to-use asymptotic null Gumbel distribution for the SIMPLE-RP test and a formal power analysis. We further extend the SIMPLE-RP method and theoretical properties to the degree-corrected mixed membership models to accommodate the practical issue of degree heterogeneity. These new theoretical developments are empowered by a general asymptotic theory of spiked eigenvectors for random matrices with weak spikes built in our work. Our theoretical results and the practical advantages of the newly suggested method are demonstrated through several simulation and real data examples. This is joint work with Jianqing Fan, Yingying Fan and Fan Yang.
99
Jing Wang SEMIPARAMETRIC ESTIMATION OF NON-IGNORABLE MISSINGNESS WITH REFRESHMENT SAMPLE
Missing data commonly arise in longitudinal data analysis and impose methodological challenges for unbiased estimation and statistical inference due to informative missingness. It is therefore crucial to correctly identify the missingness mechanism and appropriately incorporate it into the estimation and inference procedures. Traditional methods, such as complete case analysis and imputation methods, are designed to deal with missing data under the unverifiable assumptions of MCAR and MAR. In this paper, we focus on the identification and estimation of missingness parameters under the non-ignorable missingness assumption using refreshment samples from two-wave panel data. Specifically, we propose a full-likelihood approach when the joint distribution of the two-wave data belongs to a given family. If specification of the joint distribution is unavailable, we propose a semi-parametric method to estimate the attrition parameters by marginal density estimates with the additional refreshment sample. We derive asymptotic properties of the semi-parametric estimators and illustrate their performance with simulations. Inference based on bootstrapping is proposed and verified through simulations. A real data application is provided based on the Netherlands Mobility Panel study.
100
Jiwei Zhao Statistical Exploitation of Unlabeled Data in Semi-Supervised Learning under High Dimensionality
We consider the benefits of unlabeled data in the semi-supervised learning setting under high dimensionality, for parameter estimation and statistical inference. In particular, we address the following two important questions. First, can we use the labeled data as well as the unlabeled data to construct a semi-supervised estimator whose convergence rate is faster than that of the supervised estimator? Second, can we construct confidence intervals or hypothesis tests that are guaranteed to be more efficient or powerful than those based on the supervised estimator? We show that a semi-supervised estimator with a faster convergence rate exists under some conditions, and that the implementation of this optimal estimator requires a reasonably good estimate of the conditional mean function. For statistical inference, we mainly propose a safe approach that is guaranteed to be no worse than the supervised estimator in terms of statistical efficiency. Not surprisingly, if the conditional mean function is well estimated, our safe approach becomes semiparametrically efficient. After the theory development, I will also present some simulation results as well as a real data analysis.