STAD68: Advanced Machine Learning and Data Mining

Instructor

Prof. Daniel M. Roy

daniel.roy@utoronto.ca (please include “STAD68” in your email’s subject line or body)

Office hours: Mondays 3–4pm in IC 462, or by appointment. Changes will be announced.

Lectures

Mondays 7–10pm in IC 320.

There are twelve lectures: The first lecture is January 5. The last lecture is March 30. There is no lecture on February 16.

Overview

Statistical aspects of supervised learning: regression, regularization methods, parametric and nonparametric classification methods, including Gaussian processes for regression and support vector machines for classification, model averaging, model selection, and mixture models for unsupervised learning. Some advanced methods will include Bayesian networks and graphical models.

Prerequisites

STAC58 and STAC67.

Grading

Each student’s grade in the course will be based on:

Structure of the Course

The following is a tentative outline of the material we will cover:

Accessibility

Students with diverse learning styles and needs are welcome in this course. Please feel free to approach me or Accessibility Services so we can assist you in achieving academic success in this course. If you have not registered with the Accessibility Services and have a disability, please visit the Accessibility Services website at http://www.accessibility.utoronto.ca for information on how to register.

Advance notice, especially for exams, is always welcome because it allows me to prepare better. Feel free to approach me in person before or after class, or at my office. If you plan to come to my office, please email me first to make sure I’ll be available.

Policy on collaboration

Assignments are to be done by each student individually. You may discuss them in general terms with other students, but the work you hand in should be your own. In particular, you should not take away any written notes (paper or electronic) from a discussion with someone else. You may not use any resources/aids other than the book and Wikipedia. If you are not certain whether a resource is allowed, email the instructor.

Class participation

To obtain full marks for class participation, students should remain attentive, ask for clarification when necessary, and offer answers to questions posed by the instructor during class. Students can also participate by sharing their (clearly handwritten or typeset) notes with their classmates (via Blackboard). When programming assignments come with supporting code in one language, say R, class participation credit will be available to anyone who translates the supporting code (but not the answers, of course!) into another language, such as MATLAB or Python.

Policy on Late Work

Assignments are due by 6:50pm on the date marked. Late assignments will be penalized 10% of the available marks per 24 hours up to a maximum of 72 hours. Beyond this, no extensions will be granted on homework assignments, except in the case of an official Student Medical Certificate or a written (not emailed) request submitted at least one week before the due date and approved by the instructor. Please plan ahead.
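As a rough illustration of the arithmetic in the late policy, here is a minimal sketch in Python. It rests on two assumptions that the policy text does not state: that each started 24-hour period counts in full, and that work submitted more than 72 hours late is not accepted at all. The function name late_penalty_percent is hypothetical, and the instructor's reading of the policy takes precedence over this sketch.

```python
import math
from typing import Optional


def late_penalty_percent(hours_late: float) -> Optional[int]:
    """Percentage of the available marks deducted for a late submission.

    ASSUMPTION (not stated in the policy): a started 24-hour period
    counts as a full period, so 24.5 hours late is treated as two periods.
    ASSUMPTION (not stated in the policy): beyond 72 hours the work is
    not accepted, which is signalled here by returning None.
    """
    if hours_late <= 0:
        return 0  # on time: no penalty
    if hours_late > 72:
        return None  # past the 72-hour window
    periods = math.ceil(hours_late / 24)  # 1, 2, or 3 periods
    return 10 * periods  # 10% of the available marks per period


for hours in (0, 1, 24.5, 72, 73):
    print(hours, late_penalty_percent(hours))
```

Under these assumptions, an assignment handed in 25 hours after the deadline would lose 20% of the available marks.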

Marking

As a general rule, small matters of marking on assignments (apparent errors, questions about evaluation criteria, etc.) should be taken first to the marker (via email). More significant issues, or unresolved matters on assignments, are appropriate to take to the professor. Matters of marking on exams should be taken to the professor.

Required Textbook and other Resources

The required textbook is the 4th printing of

   Kevin P. Murphy (2012), Machine Learning: A Probabilistic Perspective, MIT Press.

I have requested that a copy be put on 3-hour reserve in the library and that the bookstore order copies for purchase.

The printing # can be determined from the copyright page: Look for a sequence 10 9 8 7 6 … k. The number k is the printing. You might be able to find the book cheaper by googling around, but be mindful of the printing #. If you get an earlier printing, you’ll have to make (extensive) use of the online errata at:

   http://www.cs.ubc.ca/~murphyk/MLbook/errata.html

Last I checked, there were 250 typos and ~10 significant errors in the first printing. The book should be available from the campus bookstore by the second week.
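The rule for reading the printing number off the copyright page (the sequence 10 9 8 7 6 … k, where k is the printing) can be sketched in a few lines of Python. This is only an illustration: the function name printing_number and the example sequence are hypothetical, and the sketch assumes the sequence is typed in as plain digits separated by spaces.

```python
def printing_number(sequence: str) -> int:
    """Return the printing indicated by a copyright-page number sequence.

    The sequence counts down and stops at the printing number k, so k is
    the smallest number present.
    """
    numbers = [int(token) for token in sequence.split() if token.isdigit()]
    if not numbers:
        raise ValueError("no numbers found in the sequence")
    return min(numbers)


# Hypothetical example: a copy whose copyright page reads "10 9 8 7 6 5 4"
print(printing_number("10 9 8 7 6 5 4"))  # prints 4, i.e. the 4th printing
```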

Other Resources

There are many other excellent resources for learning about machine learning. The following list contains just a few:

Computing

Programming assignments can be completed in any language you like, although the assignment handouts and supporting files will generally only be provided in one language (usually R, MATLAB, or Python). To be an effective machine learner/data scientist, you must know how to program and manipulate data! Being able to work efficiently with all of the key machine learning languages is necessary in industry and in applied graduate work.

Language Resources

The R language (http://www.R-project.org) is free software. All public lab machines have R installed. (See

     http://cran.r-project.org/doc/manuals/R-intro.pdf

for an introduction. There are many other resources online.) If you have access to a machine where you can install software, you might consider RStudio (http://www.rstudio.com/products/rstudio/), an integrated development environment (IDE) that provides many useful features.

The MATLAB language is proprietary, but Murphy’s textbook has many MATLAB examples, both in the book and online at

      https://github.com/probml/pmtk3   and

      https://code.google.com/p/pmtksupport/ 

MATLAB 2013 is available in the public Windows labs, although you will need a version that has the Statistics Toolbox installed to run the examples from Murphy’s book.

Other core machine learning languages include Python and C/C++. Some up-and-coming languages that have gained a lot of interest within machine learning include Scala and Julia.

Blackboard and Course Webpage

I will use a mixture of Blackboard and the course website

     http://danroy.org/teaching/2015/STAD68H3/ 

to post material.