STAD68 Syllabus - Fall 2016

STAD68: Advanced Machine Learning and Data Mining

Prof Daniel M. Roy

daniel.roy@utoronto.ca (please include “STAD68” in your email’s subject line or body)

Office hours: Wednesdays 10-11am and 3pm-4pm, or by appointment.

Teaching assistant: Yang Guan Jian (Tommy) Guo

All questions about homework grading should be directed to the teaching assistant.

Lectures: Wednesdays 12pm-3pm in IC 328.

There are twelve lectures. The first lecture is Wednesday, September 7, and the last is Wednesday, November 30. There is no lecture on October 12 due to reading week. The lecture on Wednesday, October 5 will be moved to a date chosen on the first day of class.

Statistical aspects of supervised learning: regression, regularization methods, parametric and nonparametric classification methods, including Gaussian processes for regression and support vector machines for classification, model averaging, model selection, and mixture models for unsupervised learning. Some advanced methods will include Bayesian networks and graphical models.

Prerequisites: CSCC11 (Intro to Machine Learning), STAC58 (Statistical Inference), and STAC67 (Regression Analysis).

Each student’s grade in the course will be based on:

- homework assignments (30%);
- in-class quizzes and participation (10%);
- a mid-term exam (30%), in class on October 19th;
- a final project (30%), due November 30th at 11:59pm, which includes:
  - a proposal (worth 10% of the final project mark), due November 2nd at 11:59pm;
  - a draft report (worth 10% of the final project mark), due November 16th at 11:59pm.

Homework and project submissions should be made electronically via Blackboard.

This course will be an undergraduate version of a machine learning course developed by Peter Orbanz (Columbia University). The tentative list of topics is below:

- Introduction
- Review of basic concepts: Maximum likelihood, Gaussian distributions, etc.
- Classification basics: Loss functions, naive Bayes, linear classifiers
- Support vector machines, convex optimization
- Kernels; model selection and cross validation
- Ensemble methods: Boosting, bagging, random forests
- Regression: Linear regression, regularization, ridge regression
- Linear algebra review, high-dimensional and sparse regression
- Dimension reduction, data visualization, principal component analysis
- Clustering, mixture models and EM algorithms
- Information theory; Text analysis
- Markov models, PageRank
- Hidden Markov models, speech recognition
- Bayesian models
- Sampling algorithms and MCMC
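As a small taste of the regression topics above, here is a minimal sketch of ridge regression in numpy. The synthetic data, the true weight vector, and the regularization strength `lam` are illustrative choices, not course material.

```python
import numpy as np

# Synthetic linear-regression data with a known weight vector.
rng = np.random.RandomState(0)
n, d = 100, 5
X = rng.randn(n, d)
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X.dot(w_true) + 0.1 * rng.randn(n)

def ridge(X, y, lam):
    """Closed-form ridge solution: w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + lam * np.eye(d), X.T.dot(y))

w_ols = ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
w_reg = ridge(X, y, 10.0)   # lam > 0 shrinks the weights toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_ols))  # prints: True
```

Increasing `lam` trades a little bias for lower variance, which is the point of regularization on the syllabus above.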

Homework assignments will be an important component of the learning experience.

Students will be responsible for a final project, in which they apply a machine learning algorithm to data and analyze the results. The project is organized into several stages to help students plan and succeed.

Proposal: Students will select a topic from a list or propose their own. The proposal should include a link to the data to be used for training/testing, a description of the machine learning task to be carried out, a description of the techniques that will be applied/compared, a description of the performance metrics that will be used to evaluate the techniques, and an outline of the experiments that will be run. The instructor will provide feedback on proposals.

Draft: Students will perform some preliminary experiments. The purpose of this step is to get every technique in the proposal up and running and to make a simple comparison between the techniques on synthetic data. Students should produce a plot/figure/table that compares the performance achieved according to the metrics proposed. Based on the preliminary findings, the student should then revise their experimental plans, if necessary.
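For concreteness, the preliminary synthetic-data comparison described above might look like the following sketch. The two baseline classifiers, the Gaussian synthetic data, and accuracy as the metric are all illustrative assumptions; your own draft should use the techniques and metrics from your proposal.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic binary classification data: two Gaussian clusters.
n = 200
X0 = rng.randn(n, 2) + np.array([-1.5, 0.0])
X1 = rng.randn(n, 2) + np.array([+1.5, 0.0])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# Shuffle, then split into train and test sets.
idx = rng.permutation(2 * n)
X, y = X[idx], y[idx]
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

def nearest_centroid(X_tr, y_tr, X_te):
    """Predict the class whose training mean is closest."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return (d1 < d0).astype(int)

def one_nearest_neighbour(X_tr, y_tr, X_te):
    """Predict the label of the single closest training point."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[d.argmin(axis=1)]

# Report the chosen metric (here: test accuracy) for each technique.
for name, clf in [("nearest centroid", nearest_centroid),
                  ("1-NN", one_nearest_neighbour)]:
    acc = (clf(X_tr, y_tr, X_te) == y_te).mean()
    print("%-16s accuracy: %.3f" % (name, acc))
```

A table or plot built from numbers like these is all the draft stage asks for; the point is that every technique runs end to end before you commit to the full experiments.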

Final report: Students will write a 3-5 page report (1.25" margins, 11pt Times New Roman or Computer Modern) describing the problem they are tackling, the data available to solve the task, and the approach they took, and documenting the experiments that evaluate the merits of that approach.

Students with diverse learning styles and needs are welcome in this course. Please feel free to approach me or Accessibility Services so we can assist you in achieving academic success. If you have a disability and have not registered with Accessibility Services, please visit the Accessibility Services website at http://www.accessibility.utoronto.ca for information on how to register.

Assignments are to be done by each student individually. You may discuss assignments in general terms with other students, but the work you hand in should be your own. In particular, you should not leave any discussion with someone else with any written notes (either paper or electronic). You may not use any resources/aids other than the book and Wikipedia unless the instructions explicitly say so. If you are not certain whether a resource is allowed, email the instructor.

Assignments are due by 11:59pm (Eastern time) on the date marked. Late assignments will be penalized 10% of the available marks per 24 hours up to a maximum of 72 hours. Beyond this, no extensions will be granted on homework assignments, except in the case of an official Student Medical Certificate or a written (not emailed) request submitted at least one week before the due date and approved by the instructor. Please plan ahead.

The late policy on the final project is

There is no required textbook.

There are many other excellent resources for learning about machine learning. The following list contains just a few:

- The Elements of Statistical Learning. Hastie, Tibshirani, and Friedman. Available free at http://statweb.stanford.edu/~tibs/ElemStatLearn/. This book covers most of the first half of the course; the books below are useful references throughout, but especially on topics in the second half of the course.
- Information Theory, Inference, and Learning Algorithms. David J. MacKay. Available free at http://www.inference.phy.cam.ac.uk/mackay/itila/.
- Pattern Recognition and Machine Learning. Chris Bishop. See http://research.microsoft.com/en-us/um/people/cmbishop/prml/.
- Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. MIT Press, 2012.
- Bayesian Reasoning and Machine Learning. David Barber. Cambridge University Press, 2012.
- Pattern Classification. Richard O. Duda, Peter E. Hart, and David G. Stork. Wiley, 2001.
- Convex Optimization. Stephen Boyd and Lieven Vandenberghe. Cambridge University Press, 2004. Available free at http://stanford.edu/~boyd/cvxbook/.

Programming assignments can be completed in any language you like, although the assignment handouts and supporting files will generally be provided in only one language (usually R, MATLAB, or Python). To be an effective machine learner/data scientist, you must know how to program and manipulate data! Being able to work efficiently with the key machine learning languages is necessary in industry and in applied graduate work.

I highly recommend downloading the Anaconda distribution of Python 2.7 and using Jupyter Notebooks to use Python interactively. Anaconda comes with all the standard packages used in machine learning (numpy, scipy, matplotlib, etc.).
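As a quick smoke test of a fresh install, a short script like the following can be run in a Jupyter notebook or from the command line. It uses only numpy, runs under both Python 2.7 and Python 3, and the particular numbers are arbitrary.

```python
import numpy as np

# Smoke test: fit a least-squares line to noisy synthetic data.
rng = np.random.RandomState(42)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.randn(50)

A = np.column_stack([x, np.ones_like(x)])            # design matrix [x, 1]
slope, intercept = np.linalg.solve(A.T.dot(A), A.T.dot(y))
print("slope ~ %.2f, intercept ~ %.2f" % (slope, intercept))
```

If the recovered slope and intercept are close to the true values 2 and 1, the numerical stack is working.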

- Download Anaconda (get the Python 2.7 version!): https://www.continuum.io/downloads
- Download Jupyter Notebook (you'll need Python first, though!): http://jupyter.readthedocs.io/en/latest/install.html

The R language (http://www.R-project.org) is free software and widely used in statistics, though much less so in machine learning. All public lab machines have R installed. (See http://cran.r-project.org/doc/manuals/R-intro.pdf for an introduction; there are many other resources online.) If you have access to a machine where you can install software, you might consider RStudio (http://www.rstudio.com/products/rstudio/), an integrated development environment (IDE) that provides many useful features.

The MATLAB language is proprietary, but Murphy's textbook has many examples in MATLAB, both in the book and online at https://github.com/probml/pmtk3 and https://code.google.com/p/pmtksupport/.

MATLAB 2013 is available in the public Windows labs, although you will need a version with the Statistics Toolbox installed to run the examples from Murphy's book.

Other core machine learning languages include C/C++, Scala, and Julia.

I will use a mixture of Blackboard and the course website, http://danroy.org/teaching/2016/STAD68, to post material.