CSE594: Probability and Statistics for Data Science: Syllabus (public-facing)

Probability and Statistics for Data Science

Stony Brook University
CSE594 - Spring 2016

Tuesdays and Thursdays, 4:00 - 5:20
201 Frey Hall

Instructor:         H. Andrew Schwartz

      office hours: Mon. 3-4p, Thurs. 5:30-6:30p

      office:        NCS 255

      email:        has@cs.stonybrook.edu

Teaching Assistant: Amoghavarsha Suresh

      office hours: Thurs 10 - 11am        

      office:        OldCS 2203

      email:         amsuresh@cs.stonybrook.edu

Description:  This course will cover core concepts of probability theory and an assortment of standard statistical techniques. It is intended to give computer scientists a broad range of theory, principles, and techniques to answer questions from data. For example:

Students are expected to have an elementary knowledge of statistics, linear algebra, and probability theory, and to be proficient in the Python programming language.

Recommended Textbooks:

These books are excellent references. However, we will not follow any book exactly, as we hope to cover a little more breadth of techniques (perhaps at the expense of depth) as well as some techniques often used in interdisciplinary data science which are not covered in these books.

Grading Proportions:

40% 3 exams - problem solving, short answer, essay, pseudocoding

30% 3 individual assignments - problem solving, short answer, coding

20% 1 team project - coding (prediction, clustering, and multi-hypothesis testing)

10% in the news - presentation and discussion participation
(subject to change with advance notice)

Current Scale:

D: [45,55), C-: [55,58), C: [58,64), C+: [64,67), B-: [67,70), B: [70,75), B+: [75,78), A-: [78,81), A: [81,100]
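As a rough illustration of how the proportions and the current scale combine, the sketch below computes a weighted course score and looks up the letter grade. The function names and the 0-100 scale for each component are assumptions made for this example, not an official grade calculator. The scale lists no letter below 45, so the sketch returns None there.

```python
# Weights from "Grading Proportions" (subject to change with advance notice).
WEIGHTS = {"exams": 0.40, "assignments": 0.30, "project": 0.20, "news": 0.10}

# Lower bounds from "Current Scale", highest first.
SCALE = [(81, "A"), (78, "A-"), (75, "B+"), (70, "B"), (67, "B-"),
         (64, "C+"), (58, "C"), (55, "C-"), (45, "D")]


def course_score(component_scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Return the letter whose interval contains score, or None below 45."""
    for lower_bound, letter in SCALE:
        if score >= lower_bound:
            return letter
    return None  # the scale above does not list a letter for scores under 45
```

For instance, component scores of 80 (exams), 90 (assignments), 70 (project), and 100 (in the news) give 0.40*80 + 0.30*90 + 0.20*70 + 0.10*100 = 83, which falls in the A range.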

        

In the news.  Each student will be responsible for a 5-minute presentation discussing the probability and statistics used in a research study or data-science-related technology that has appeared in the news within the last year. The presentation should describe the techniques employed and whether the conclusions drawn in the news article are accurate. Students will sign up for a presentation slot during the first 2 weeks. The level of detail / sophistication should be commensurate with the material covered at that point in the course (i.e., later presentations will be expected to demonstrate a more thorough understanding and critical view).

Required Programming Language: Python 2.7. Unless otherwise specified, all programming assignments must be completed using Python. Frequently used packages: pandas, numpy, scipy, sklearn, scikit-stats.

Late Assignment Policy. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed on an assignment submitted less than 24 hours late, and a 25% penalty on one submitted between 24 and 48 hours late. Any assignment submitted more than 48 hours after the deadline will earn a 0.
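The late policy can be sketched as a small function. This is an illustration only: the function name is hypothetical, and it assumes the penalty is taken as a percentage of the score the assignment would otherwise have earned, which the policy text does not spell out. An assignment exactly 24 or exactly 48 hours late is treated as falling in the 25% bracket.

```python
def late_score(raw_score, hours_late):
    """Apply the late penalty to raw_score (assumption: penalty scales the earned score)."""
    if hours_late <= 0:
        return raw_score          # on time: no penalty
    if hours_late < 24:
        return raw_score * 0.90   # less than 24 hours late: 10% penalty
    if hours_late <= 48:
        return raw_score * 0.75   # between 24 and 48 hours late: 25% penalty
    return 0.0                    # more than 48 hours late: earns a 0
```

Under these assumptions, an assignment that would have earned 80 points receives 60 if it arrives 30 hours after the deadline.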

Academic Honesty. Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers, via any form of communication, or copying portions of answers from websites or other media is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible to other students. At a minimum, all students involved in any cheating or scholarly misconduct will receive a 0 for the assignment and be reported to the graduate program director, which will likely lead to further consequences.

 

Schedule and Topics

week beginning

topics

deadlines

I. Foundations -- Probability Theory

1/25

probability review: independence, conditional probability, random variables, example application of probability, Markov chain

2/1

python-pandas, Bayes rule, chain/product rule, conditional independence, histograms, probability density functions

Assignment 1 Given (Tue.)

2/8

probability mass/density/cumulative functions, expectation, maximum likelihood estimation (MLE), kernel density estimation, matrix linear algebra

In the News starts

2/15

data science applications of probability & statistics, review

Assignment 1 Due (Tue.)

II.  Discovery -- Quantitative Research Methods (Statistics)

2/22

hypothesis testing

Exam 1 (Tue, 2/23)

2/29

resampling methods, multi-test correction

Assignment 2 Given (Tue. -- moved to next Tue)

3/7

linear regression and correlation, logistic regression, chi-square, discrete random variable comparisons

Assignment 2 Given (Tue.)

3/14

Spring Recess (No Classes)

3/21

hierarchical linear models, ecological fallacy, mediation, moderation (interaction), review

Assignment 2 Due (Tue., moved to Wed.)

III. Optimization -- Clustering and Prediction (Machine Learning)

3/28

penalized regression and classification (regularization)

Exam 2 (Tue., 3/29)

4/4

clustering, dimensionality reduction, factor analysis

4/11

Bayesian inference, naive Bayes

Assignment 3 Given (Mon.)

4/18

probabilistic modeling, maximum a posteriori (MAP)

Assignment 3 Due (Tues., moved to Thurs.)

4/25

time series: autocorrelation, ARIMA, change point detection

5/2

information theory, review

Exam 3 Alternative Date (5/7, 4:00-5:20)

5/10

Finals Begin (No Classes)

Team Proj Due (Thur, 5/12)

5/16

Exam 3 (5/16, 2:15-3:45pm)