PGM Class Project Ideas

Distributed Large-Scale MMSB with Petuum

The MMSB (Mixed Membership Stochastic Blockmodel) is a probabilistic graphical model for networks, which takes a directed network as input and outputs a K-dimensional “topic vector” or “mixed-membership vector” for each node. These MM vectors are often interpreted as overlapping community or network role memberships, in the same way that topic vectors in topic models are interpreted as overlapping topics. In fact, MMSB is mathematically and algorithmically very similar to Latent Dirichlet Allocation (for training topic models).

The original MMSB variational inference algorithm [1] can only scale to around 10K nodes. Stochastic Variational Inference (SVI) for MMSB [2,3] provides an algorithmic solution to scale MMSB up to larger networks. The idea is to subsample latent variables in such a way that the variational objective remains (asymptotically) unbiased, yet the number of latent variables touched per iteration is O(M) instead of O(N^2), where M is the number of edges and N is the number of nodes. Thus, SVI MMSB has O(M) complexity instead of O(N^2), and the current single-machine implementation (https://github.com/premgopalan/svinet) scales to 4M nodes and 1000 roles. However, the algorithm requires one week to run at this scale, and needs a machine with 128GB RAM -- which not everyone has access to.

The SAILING Lab has developed a distributed ML framework called Petuum [4], which enables ML algorithms to be converted into distributed versions for use in a compute cluster. Your task is to reimplement the a-MMSB SVI algorithm of [2,3] on Petuum, so as to scale it up to 10M nodes and beyond on a cluster of cheap machines (e.g. 8 cores, 16GB RAM). In addition to the immediate value of having a large-scale MMSB implementation that runs on low-to-medium spec clusters, this project also provides a distributed starting-point implementation for more complex variations of MMSB -- which should in turn spur further development of MM models.

Please note that a special requirement for this project is that the distributed MMSB code must be open-sourced as a toolkit for Petuum.

Contact Person: Qirong Ho (qho@cs.cmu.edu)

[1] E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 2008.

[2] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. NIPS, 2012.

[3] P. Gopalan and D. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 2013.

[4] http://www.petuum.org

Large-Scale L1-Regularized and L2-Regularized Logistic Regression

L1-regularized and L2-regularized logistic regression (LR) are two of the most widely used machine learning models in industry due to their simplicity and performance. In fact, these are the workhorses of many ad serving systems in major companies such as Google [1][2], Facebook [3] and Yahoo![4]. While proprietary systems for logistic regression abound, to the best of our knowledge, there is no open source implementation for large-scale regularized logistic regression. With many modern large-scale ML frameworks such as GraphLab, Petuum, Piccolo, or Spark, it is relatively easy to implement such a model on one of these systems. In this project, your task will be to fill this need.

We recommend using Petuum [5], an open source large-scale ML framework currently being developed at CMU. This system offers table interface based on a distributed key-value store backend that is pretty easy to use. Furthermore, we (the Petuum developers) are happy to offer technical help. We can also provide assistance in finding a public cluster on which to carry out large-scale experiments for this project.

Regarding the algorithm, conventional (stochastic) gradient descent generally gives very good performance for L2-regularized LR. For L1-regularized LR, you can try implementing the subgradient method (which is essentially the same as gradient descent in terms of implementation), since it is reported to have good performance on large data sets [6]. You may also investigate other methods such as coordinate descent and generalized gradient descent, but they may be somewhat challenging to parallelize or implement.

Contact Person: Dai Wei (wdai@cs.cmu.edu)

[1] M. Richardson. Predicting clicks: Estimating the click-through rate for new ads. WWW, 2007.

[2] H. B. McMahan et al. Ad click prediction: a view from the trenches. KDD, 2013.

[3] http://bickson.blogspot.com/2011/05/large-scale-logistic-regression.html

[4] D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. WWW, 2008.

[5] http://petuum.org/

[6] J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. KDD, 2009.

Large-Scale ADMM for Fancy Models (and Beyond)

The Alternating Direction Method of Multipliers (ADMM) is an optimization technique that can be used to parallelize large scale optimization problems. The tutorial in [1] provides a short introduction to the method. The basic idea is to break down the objective function by data instance or feature in order to form subproblems that can be solved independently, thus enabling the efficient optimization of problems with large sample sizes or feature dimensionality. The document in [2] provides the ADMM formulations for several models, including SVM, lasso, (non-overlapping) group lasso, sparse logistic regression, and generalized additive models.

This idea would make a great starting point for a number of projects. For example, you could attempt to derive ADMM for sparse additive models (SpAM), structured SVM, overlapping group lasso, etc. Another option would be to implement some existing formulation at large scale and do a data-mining type of project (see below). Students interested in theoretical work can also try to formulate ADMM by breaking the problem down along both feature dimensions and data instances in order to handle both big data and big model problems.

So far, there really hasn’t been much experimental work on applying ADMM to large scale problems, partly because system implementations are challenging. Recently, we have released a large scale ML framework called Petuum [3], which provides a simple table-based API and hides the challenges of distributed computation. The Petuum development team here can provide tech support for a project in this area if needed. We can also help you find a public cluster on which to carry out large scale experiments for your project. You may also consider using other large ML frameworks like GraphLab and Spark, but we can only provide very limited help for those systems.

Contact Person: Dai Wei (wdai@cs.cmu.edu)

[1] http://www.mit.edu/~medard/itmanet/pimeeting5/FLoWS_Talks/Boyd.pdf

[2] http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[3] http://petuum.org/

Large-Scale CRF/GP/Structured SVM with Bundle Methods

Bundle methods [1][2] are a generic class of method for solving many convex optimization problems. The idea is to use tangent lines (first order Taylor expansion) to iteratively form a lower bound the objective function and minimize with respect to that lower bound. This method lends itself naturally to parallelization, since each worker can explore different paths of lower bound and communicate (asynchronously) their current minima with each other. As far as I know this method has not been applied at large scale and many problems can be, in theory, efficiently solved by bundle methods. Examples include conditional random fields (CRF), Gaussian process (GP), structured SVM, regularized logistic regression, lasso, regularized topic model, and many other. These are kind of low-hanging fruits so definitely worth giving a shot.

Regarding to system implementation, you can try out a large scale ML framework we recently released called Petuum [3]. It provides a simple table-based API and hide the challenges of distributed computation. The Petuum dev team here can provide tech support if needed. We can also help you find public cluster to carry out large scale experiments. You may also consider using other large ML frameworks like GraphLab and Spark, but we can only provide very limited help for those systems.

Contact Person: Dai Wei (wdai@cs.cmu.edu)

[1] A. J. Smola, S. V. N. Vishwanathan, and Q. V. Le. Bundle methods for machine learning. NIPS, 2008.

[2] C. H. Teo, S. V. N. Vishwanthan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 2010.

[3] http://petuum.org/

Large-Scale Support Vector Machines

SVM is one of the most popular ML methods invented in the last two decades. Perhaps surprisingly, despite the amount of research into algorithms and theories, there’s simply no open-source large scale SVM implementation. Both ADMM and bundle method can solve SVM (see above), but you can also consider stochastic gradient descent [1] and stochastic dual coordinate ascent (SDCA) [2][3]. Again, for system implementation you can consider using Petuum [4], GraphLab, or Spark.

Contact Person: Dai Wei (wdai@cs.cmu.edu)

[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML, 2007.

[2] S. Shalev-Shwartz, and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. arXiv, 2012.

[3] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. ICML, 2008.

[4] http://petuum.org/

Influence in Social Media

In social media, the spread of themes and ideas along network links is called a cascade. These links can be explicit (as in the case of Facebook friends), or implicit (Twitter retweets/followers and up/down-voting of posts in forums). Non-statistical algorithms like NetInf [1] and ContTinEst [2] have been developed to study cascades, but there have been few graphical model approaches thus far.

Your task is to develop a graphical model to predict when a cascade will occur on this network, given knowledge of events such as "X people have retweeted Y", or "X people have up-voted post Y". As for the exact definition of a cascade, you may refer to the NetInf paper, or come up with your own. Also, you will need to decide how you will go about validating your model. We suggest either collecting real cascade data from your chosen website, or developing a simulator to generate artificial cascades.

Contact Person: Dai Wei (wdai@cs.cmu.edu)

[1] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, Inferring networks of diffusion and influence. KDD, 2010.

[2] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. NIPS, 2013.

Embarrassingly Parallel Variational Inference

Approximate methods for Bayesian inference, such as MCMC and variational inference, are very popular. There is a great interest in scaling these methods to work on very large datasets; however, most MCMC methods applied to N data points must perform O(N) operations per sample drawn. A recently proposed method [1] involves performing MCMC in an (embarrassingly) parallel way on subsets of data, and then combining the resulting samples to form samples from the full-data posterior. This allows for a great speedup over traditional MCMC methods.

In this project, your task is to extend this idea to the domain of variational inference. The idea would be to partition a dataset into subsets, perform variational inference on each (i.e. learn a variational approximation to each “subposterior”), and then combine the learned models to learn parameters of a variational approximation to the full-data posterior. The combination might be similar to the density product proposed by [1], [2], and [3]. You should aim to both prove theoretically that this combination will yield a correct variational approximation to the full-data posterior, and provide similar empirical results on a large-scale dataset.

Contact Person: Willie Neiswanger (willie@cs.cmu.edu)

[1] W. Neiswanger, C. Wang, and E. P. Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv, 2013.

[2] X. Wang and D. B. Dunson. Parallel MCMC via Weierstrass sampler. arXiv, 2013.

[3] S. L. Scott et al. Bayes and big data: The consensus monte carlo algorithm.

Stochastic Optimization Methods for MCMC

There exist classic methods for MCMC that are based on optimization procedures---these include a method called Langevin dynamics or Langevin Monte Carlo (LMC), which is based on gradient descent. Recently, extensions have been proposed to LMC, which extend it to the domain of stochastic optimization. In particular, the extension is a stochastic gradient descent procedure which (asymptotically) yields true samples from the posterior distribution. This means that every step in the MCMC algorithm only touches a subset of the full data (as opposed to touching the full dataset at each step), and therefore converges quickly

In this project, your task is to try and extend these methods to perform valid MCMC based on other optimization algorithms, such as:

Generalized or projected gradient MCMC methods (for sampling on a constrained space or manifold)
ADMM-style methods for MCMC (which might allow for good parallelization of MCMC).
Dual-space optimization methods (which might allow for optimization-based MCMC over non-smooth posteriors).

In general, you would need to have a clear motivation for trying to perform MCMC with a given optimization method, and the project should consist of either theoretical justification that samples are approximately from a posterior distribution, or empirical evidence of this. This would probably entail modifying the optimization methods, and adding certain noise in a manner similar to the papers referenced below.

Contact Person: Willie Neiswanger (willie@cs.cmu.edu)

[1] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. ICML, 2011.

[2] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. ICML, 2012.

[3] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. NIPS, 2013.

[4] M. A. Brubaker, M. Salzmann, and R. Urtasun. A family of MCMC methods on implicitly defined manifolds. AISTATS, 2012.

Communication-Free Parallel Supervised Topic Models

Approximate methods for Bayesian inference, such as MCMC and variational inference, are very popular. Often, people want to scale these methods to work on very large datasets. A recently proposed method involves performing MCMC in an (embarrassingly) parallel way on subsets of data, and then combining the resulting samples to form samples from the full-data posterior.

One issue with this line of work is the problem of quasi-ergodicity, which is the following problem: often, posteriors are very multimodal, and samplers only aim to estimate the posterior within one of the modes. This is particularly true for large, latent-variable models such as LDA, and is known as quasi-ergodicity. Quasi-ergodicity is a problem in parallel sampling because different machines maybe become stuck in different modes. This can arise very easily, as there is (for example) a mode for each permutation of topic labels.

In this project, your task is to bypass this problem through the use of a supervised topic model. In a supervised topic model, or sLDA, each document is associated with a label. In addition to sampling the usual parameters of a topic model, one also learns (samples) parameters of a regression vector for prediction of the label associated with a given document. This model can be applied to perform prediction for future documents. In order to get around the problem of quasi-ergodicity, we’d like to perform the following: split the data into subsets, and learn the sLDA model on each subset. Afterwards, when given a new document, perform prediction using each sub-model (in parallel), and then combine the predictions in some way. Your goal would be to evaluate, either theoretically or empirically, whether a procedure such as this could be made to yield results that give a good approximation to the original sLDA model.

Contact Person: Willie Neiswanger (willie@cs.cmu.edu)

[1] D. Blei and J. D. McAuliffe. Supervised topic models. NIPS, 2007.

[2] W. Neiswanger, C. Wang, and E. P. Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv, 2013.

Joint Learning of Expression Quantitative Trait Associations and Gene Network Structure

Two well-studied problems in computational biology are eQTL mapping, which aims to identify associations between genetic variants such as single nucleotide polymorphisms (SNPs) and gene expression signatures, and gene network inference, whose goal is to to estimate all pairwise associations (edges) among a set of genes (nodes) in a network using expression data. Many different methods have been applied to each of these problems. One technique for eQTL mapping that has recently gained popularity is regularized multi-task regression, in which the SNPs are treated as inputs and the genes are treated as outputs [1]. A popular approach for performing gene network inference is to construct a Markov network model and use the graphical lasso or another optimization algorithm to estimate the conditional independence relationships among the genes [2].

In this project, your goal is to develop a novel method for jointly performing eQTL mapping (discovering associations between SNPs and genes) and gene network inference (learning pairwise relationships among the genes) while sharing information between the two. Although there is some existing work on this problem, such as [3], you should use the tools from this class to come up with a new approach and evaluate your performance on both synthetic and real-world datasets.

Contact Person: Micol Marchetti-Bowick (micolmb@cs.cmu.edu)

[1] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 2009.

[2] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.

[3] K.-A. Song and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. AISTATS, 2012.

Efficient Structure Learning of Mixed Graphical Models for SNP-Gene Network Estimation

In [1], Song and Kim introduce a method for simultaneously estimating SNP-gene and gene-gene associations by recasting a multi-task regularization problem -- in which we can imagine that the SNPs are the inputs and the genes are the outputs -- as a structure learning problem for an undirected Gaussian graphical model (GGM). Note that a GGM is just a Markov random field in which the joint distribution of the variables is multivariate Gaussian and the conditional distribution of each variable given its neighbors is univariate Gaussian. In this case, structure learning can be performed by maximizing the penalized log likelihood of the model in order to estimate a sparse inverse covariance matrix, which encodes conditional independencies. This works because the conditional independence structure among the variables is exactly equivalent to the edge structure in the graphical model, due to the local Markov property of MRFs. However, the drawback of this approach when applied to the biological setting in which SNPs are the inputs (X) and genes are the outputs (Y) is that SNP genotypes are actually discrete-valued and it is therefore incorrect to model them as Normally distributed variables.

In this project, your goal will be to extend the model proposed in [1] to the setting where the inputs are discrete and the outputs are continuous. In particular, we recommend performing structure learning using a “mixed” undirected graphical model such as the one proposed in [2] or another model that you design. Furthermore, since we are mainly interested the SNP-gene and gene-gene interactions, but not the SNP-SNP interactions, you only need to learn part of the structure. One interesting part of this project is to figure out whether you can cast this structure learning task as a convex optimization problem.

Contact Person: Micol Marchetti-Bowick (micolmb@cs.cmu.edu)

[1] K.-A. Song and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. AISTATS, 2012.

[2] J. Lee and T. Hastie. Structure learning of mixed graphical models. AISTATS, 2013.

Detection of Disease-Related Genetic Variants

In biology, it is well known that specific genetic variants in the human genome can lead to increased susceptibility to certain diseases. For example, there may be a particular variant that increases an individual’s likelihood of getting Alzheimer’s disease later in life. Unfortunately, only a small number of such variants have been identified and catalogued. Most of these have been discovered by performing genome-wide association studies (GWAS), in which a set of genomic loci (such as single nucleotide polymorphisms, or SNPs) from across the entire genome are tested for significant association with the disease.

In this project, your goal will be to develop a method for discovering new genetic variants which are associated with a disease of your choice. As a starting point, you are encouraged to implement a well-studied baseline algorithm such as logistic regression with L1 regularization [1]. After that, your task will be to improve the baseline method by designing a more sophisticated model which might incorporate biological structure as prior knowledge (e.g. information about known gene functions or pathways) or leverage data from multiple sources (e.g. SNPs + gene expression). You should compare the performance of your method against the baseline on both synthetic and real-world datasets.

Contact Person: Micol Marchetti-Bowick (micolmb@cs.cmu.edu)

[1] T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 2009.

Inferring Gene Networks with Prior Knowledge

The problem of estimating gene regulatory networks has attracted a large amount of attention in both biology and machine learning over the past several years. One common approach that has been used for inferring gene networks is to model the joint distribution of the genes as a zero-mean multivariate Gaussian and then learn a sparse estimate of the inverse covariance matrix (also called the precision matrix or concentration matrix) of the distribution, which encodes the conditional independence structure among the variables. This technique can also be viewed as structure learning for an undirected Gaussian graphical model, where each gene is a variable in the model and the non-zero entries of the estimated precision matrix represent edges in the graph [1]. In addition, many other models and techniques for gene network inference have been explored [2].

In this project, your goal will be to develop a method that estimates these networks more accurately by incorporating prior biological knowledge into an existing model, or possibly into a completely new model that you design yourself. There are many online databases, e.g. [3], that you can use as resources for collecting this prior knowledge, which can come in the form of known gene functions, signaling pathways, protein-protein interactions, transcription factor binding sites, or anything else that you find useful. You can be creative about how to incorporate this prior knowledge into your model, but one suggestion might be to include a structured sparsity penalty in the estimation of the precision matrix of a Gaussian graphical model. You should apply your method to both synthetic and real-world datasets, and evaluate its performance against the baseline model that doesn’t incorporate any prior knowledge.

Contact Person: Micol Marchetti-Bowick (micolmb@cs.cmu.edu)

[1] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 2008.

[2] A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 2004.

[3] The Gene Ontology, http://www.geneontology.org/

A Topic Model of Genetic Mutations in Cancer

Topic models, first introduced in [1], are an extremely popular class of probabilistic graphical models that have been applied in a broad variety of settings since their inception. The goal of these models is to take a set of documents (which can be summarizes as word frequency counts) and use them to learn a compact set of topics that describe the content of the documents, where each topic parameterized by a distribution over words in a vocabulary. The model simultaneously infers a distribution over topics for each document (e.g. 60% computer science, 30% linguistics, 10% math) and the specific topic assignments of every word in the document.

The goal of this project will be to build a topic model for learning a set of genetic “motifs” that together describe the mutated genetic landscape of cancer. To do this, you should find a dataset that provides DNA sequence information for a series of tumors from multiple cancer types (breast cancer, pancreatic cancer, etc.) using an open-access database such as [2]. Preprocess this data by compiling a set of mutation frequency counts for the mutated genes in each tumor. Using this data, you can treat the genes as words and the tumor samples as documents, and apply a topic model to learn “topics” in this setting. If you prefer, you may also extend or alter the traditional topic model to better fit this application. Afterward, you should perform a biological analysis of the resulting “topics” or “motifs” to determine whether they capture interesting structure in the data. For example, it’s possible that some of these topics could be highly correlated with clinical phenotypes such as tumor size or cancer type.

Contact Person: Micol Marchetti-Bowick (micolmb@cs.cmu.edu)

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.

[3] The Cancer Genome Atlas, http://cancergenome.nih.gov/

Learning of Dependence Between Discrete Random Variables

Gaussian graphical models (GGMs) are commonly used to represent conditional

dependencies between observed random variables [1,2]. When data are not Gaussian, one can use a copula transformation to separate inference on marginal distributions from modeling the dependence structure [3]. The extended rank likelihood approach was developed in [4] to allow copula models to be applied to problems of learning dependencies between discrete random variables. [5] employes the approach to learn copula Gaussian graphical models from discrete data. [6] develops constrained Hamiltonian Markov chain Monte Carlo method for maximizing the rank likelihood.

Your task will be to further explore this line of work. There are a couple of possibilities. First, notice that the procedures of [4,6] are not inherently Bayesian, however, they use a sampling procedures (either Gibbs sampling or Monte Carlo method) to approximate a high-dimensional integral. You could develop an expectation maximization procedure for the problem that would scale to much higher-dimensional problems. Another possibility is to incorporate [6] into a framework for estimation of structure of a Gaussian graphical model. Finally, you could explore extended rank likelihood in the context of learning multi-attribute networks [7] from discrete data.

Contact Person: Mladen Kolar (mladen.kolar@chicagobooth.edu)

[1] N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.

[2] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.

[3] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High Dimensional Semiparametric Gaussian Copula Graphical Models. Ann. Stat., 40(4):2293–2326, 2012.

[4] P.D. Hoff. Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat., 1(1):265–283, 2007.

[5] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Ann. Appl. Stat., 5(2A):605-1125, 2011.

[6] http://arxiv.org/abs/1306.2685

[7] http://arxiv.org/abs/1210.7665

Large Scale Structure Learning of Conditional Gaussian Graphical Models

Gaussian graphical models (GGMs) are commonly used to represent relationships between objects in a complex system. One of the fundamental questions is how to estimate the structure of the graphical model that represents these relationships? It is known that the inverse of the covariance matrix, known as the precision matrix, encodes the structure of the relationships through the pattern on non-zero elements, that is, two objects are connected if the corresponding element in the precision matrix is not equal to zero. Therefore, given observations on the objects in the complex system, one estimates a sparse precision matrix in order to uncover relationships between objects. This can be done using a neighborhood selection procedure [1] or graphical lasso [2]. Unfortunately, GGM are not flexible enough to capture relationships that change over time and depend on environmental conditions. [3] develops a semiparametric class of models that are capable of capturing relationships between objects that change over time.

The approach taken in [3] can be roughly described as a sliding window approach in which data points that are close in time are used together to estimate the structure locally. One potential drawback of this approach is that the estimated structure at various time points can be different. In some applications, either due to limited sample size or due to system constraints, one would like the graph structure, or equivalently the pattern of zeros in the precision matrix, to be fixed, while still allowing the values of the non-zero elements to change.

In this project, you will develop a novel method for learning the fixed structure of a graphical model, but allowing the parameters of the model to change. This could be done by modifying the graphical lasso procedure of [2], by allowing the element of the precision matrix to be smooth functions of time. For example, one could assume that each element of the precision matrix belongs to a reproducing kernel Hilbert space (RKHS) and use an appropriate RKHS norm as a the penalty that simultaneously encourages sparsity and smoothness. An alternative approach to the problem would be to extend the approach of [1]. Again, one would develop an appropriate penalty that would fix the graph structure, but allow for change in the parameters. In order to effectively solve the resulting optimization procedures, you would develop a special ADMM solver that could scale to big problem instances [4,5].

If executed well, this project has a high chance of getting accepted to a number of top machine learning and statistical journals, including JMLR and Biometrika.

Contact Person: Mladen Kolar (mladen.kolar@chicagobooth.edu)

[1] N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.

[2] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.

[3] M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating Time-varying networks. Ann. Appl. Stat., 4(1):94–123, 2010.

[4] http://www.mit.edu/~medard/itmanet/pimeeting5/FLoWS_Talks/Boyd.pdf

[5] http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf