STAD68: Advanced Machine Learning and Data Mining


Prof Daniel M. Roy (please include “STAD68” in your email’s subject line or body)

Office hours: Wednesdays 10am-11am and 3pm-4pm, or by appointment.

Teaching Assistant

Yang Guan Jian (Tommy) Guo 

All questions about homework grading should be directed to the teaching assistant.


Wednesdays 12pm-3pm in IC 328.

There are twelve lectures. The first lecture is on Wednesday, September 7, and the last is on Wednesday, November 30. There is no lecture on October 12 due to reading week. The lecture on Wednesday, October 5 will be moved to a date chosen on the first day of class.


Statistical aspects of supervised learning: regression, regularization methods, parametric and nonparametric classification methods, including Gaussian processes for regression and support vector machines for classification, model averaging, model selection, and mixture models for unsupervised learning. Some advanced methods will include Bayesian networks and graphical models.


CSCC11 (Intro to Machine Learning), STAC58 (Statistical Inference) and STAC67 (Regression Analysis).


Each student’s grade in the course will be based on:

Homework and project submissions should be made electronically via Blackboard.

Structure of the Course

This course will be an undergraduate version of a machine learning course developed by Peter Orbanz (Columbia University). The tentative list of topics is below:

Homework assignments will be an important component of the learning experience.

Final Project

Students will be responsible for a final project, wherein they apply some machine learning algorithm to data and analyze the results. The final project is organized into several stages in order to help you plan and succeed.

Proposal: Students will select a topic from a list or propose their own.  The proposal should include a link to the data to be used for training/testing, a description of the machine learning task to be carried out, a description of the techniques that will be applied/compared, a description of the performance metrics that will be used to evaluate the techniques, and an outline of the experiments that will be run.  The instructor will provide feedback on proposals.

Draft: Students will perform some preliminary experiments.  The purpose of this step is to get every technique in the proposal up and running and to make a simple comparison between the techniques on synthetic data. Students should produce a plot/figure/table that compares the performance achieved according to the metrics proposed. Based on the preliminary findings, the student should then revise their experimental plans, if necessary.
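As an illustration of the kind of comparison expected at the draft stage, here is a minimal sketch in Python (standard library only). Everything in it is an assumption made for the example rather than a requirement: the synthetic data (two 2-D Gaussian classes), the two techniques (nearest centroid and 1-nearest-neighbour), the metric (test accuracy), and all function names are hypothetical stand-ins for whatever your proposal specifies.

```python
import random


def make_blobs(n_per_class, rng):
    """Draw 2-D points from two unit-variance Gaussians centred at (0, 0) and (2, 2)."""
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (2.0, 2.0)]):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(data)
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit_nearest_centroid(train):
    """Return a classifier that predicts the class whose mean is closest."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        centroids[label] = (sum(p[0] for p in pts) / len(pts),
                            sum(p[1] for p in pts) / len(pts))
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))


def fit_one_nn(train):
    """Return a classifier that copies the label of the closest training point."""
    return lambda x: min(train, key=lambda item: sq_dist(x, item[0]))[1]


def accuracy(classifier, test):
    """Fraction of test points whose predicted label matches the true label."""
    return sum(classifier(x) == y for x, y in test) / len(test)


def compare(seed=0, n_train=100, n_test=200):
    """Fit each method on synthetic training data and score it on held-out data."""
    rng = random.Random(seed)  # fixed seed so the comparison is reproducible
    train = make_blobs(n_train, rng)
    test = make_blobs(n_test, rng)
    methods = {"nearest centroid": fit_nearest_centroid, "1-NN": fit_one_nn}
    return {name: accuracy(fit(train), test) for name, fit in methods.items()}


if __name__ == "__main__":
    results = compare()
    print(f"{'method':<18}{'test accuracy':>14}")
    for name, acc in results.items():
        print(f"{name:<18}{acc:>14.3f}")
```

The point of the sketch is the shape of the experiment: the same training and test split for every technique, a fixed random seed, and a single table keyed by the proposed metric.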

Final report: Students will write a 3-5 page report (1.25" margins, 11pt font in Times New Roman or Computer Modern) that describes the problem they are tackling, the data available to them for the task, and the approach they took, and that documents experiments evaluating the merit of that approach.


Students with diverse learning styles and needs are welcome in this course. Please feel free to approach me or Accessibility Services so we can assist you in achieving academic success in this course. If you have a disability and have not registered with Accessibility Services, please visit the Accessibility Services website for information on how to register.

Advance notice, especially for exams, is always welcome because it allows me to prepare better. Feel free to approach me in person either before or after class, or at my office. In the latter case, please send me an email to make sure I’ll be available.

Policy on collaboration

Assignments are to be done by each student individually. You may discuss assignments in general terms with other students, but the work you hand in should be your own. In particular, you should not leave any discussion with someone else with any written notes (either paper or electronic). You may not use any resources/aids other than the book and Wikipedia unless the instructions explicitly say so. If you are not certain whether a resource is allowed, email the instructor.

Class participation

In order to obtain full marks for class participation, students should remain attentive, ask for clarification when necessary, and offer answers to questions posed by the instructor during class. Students can also boost their participation mark by sharing their (clearly handwritten or typeset) notes with their classmates (send them to the instructor and they will be made available via Blackboard).

When programming assignments come with supporting code in one language, say R, class participation credit will be available to anyone who translates the supporting code (but not the answers, of course!) into another language, such as MATLAB or Python.

Policy on Late Work

Assignments are due by 11:59pm (Eastern time) on the date marked. Late assignments will be penalized 10% of the available marks per 24 hours up to a maximum of 72 hours. Beyond this, no extensions will be granted on homework assignments, except in the case of an official Student Medical Certificate or a written (not emailed) request submitted at least one week before the due date and approved by the instructor. Please plan ahead.
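For concreteness, the homework penalty arithmetic can be sketched as below. This is one possible reading, not an official rule: it assumes that each started 24-hour period costs 10% of the available marks and that work more than 72 hours late is not accepted. The function name and the use of None for "not accepted" are hypothetical choices made for the example.

```python
import math


def late_penalty_percent(hours_late):
    """Percent of available marks deducted, or None if past the 72-hour window.

    Assumes every started 24-hour period counts as a full period.
    """
    if hours_late <= 0:
        return 0  # on time: no penalty
    if hours_late > 72:
        return None  # beyond the maximum: not accepted without an approved extension
    return 10 * math.ceil(hours_late / 24)


if __name__ == "__main__":
    for hours in (0, 1, 24, 25, 72, 73):
        print(hours, late_penalty_percent(hours))
```

Under these assumptions, an assignment submitted 25 hours late loses 20% of the available marks.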

The late policy on the final project is


As a general rule, matters of marking on assignments (apparent errors, questions about evaluation criteria, etc.) should be taken first to the TA (via email). More significant issues, or unresolved matters on assignments, are appropriate to take to the instructor. Matters of marking on exams should be taken to the instructor.

Required Textbook and other Resources

There is no required textbook.

Other Resources

There are many other excellent resources for learning about machine learning. The following list contains just a few:

This book covers most of the first half of the course.  The books below are useful references throughout, but especially on topics in the second half of the course.


Programming assignments can be completed in any language you like, although the assignment handouts and supporting files will generally only be provided in one language (usually R, MATLAB, or Python). To be an effective machine learner/data scientist, you must know how to program and manipulate data! Being able to work efficiently with all of the key machine learning languages is necessary in industry and in applied graduate work.

Programming Language Resources

I highly recommend downloading the Anaconda distribution of Python 2.7 and using Jupyter Notebooks to work with Python interactively. Anaconda comes with all the standard packages used in machine learning (numpy, scipy, matplotlib, etc.).

The R language is free software and widely used in statistics, though much less so in machine learning. All public lab machines have R installed. There are many introductions and other resources online. If you have access to a machine where you can install software, you might consider RStudio, which is an integrated development environment (IDE) that provides many useful features.

The MATLAB language is proprietary, but Murphy’s textbook has many examples in MATLAB, both in the book and online.

MATLAB 2013 is available in the public Windows labs, although you will need a version that has the Statistics Toolbox installed to run the examples from Murphy’s book.

Other core machine learning languages include C/C++, Scala, and Julia.

Blackboard and Course Webpage

I will use a mixture of Blackboard and the course website to post material.