Probability and Statistics for Data Science
Stony Brook University
CSE594 - Spring 2016
Tuesdays and Thursdays, 4:00 - 5:20
201 Frey Hall
Instructor: H. Andrew Schwartz | office hours: Mon. 3-4p, Thurs. 5:30-6:30p | office: NCS 255 | email: has@cs.stonybrook.edu
Teaching Assistant: Amoghavarsha Suresh | office hours: Thurs. 10-11am | office: OldCS 2203 | email: amsuresh@cs.stonybrook.edu
Description: This course will cover core concepts of probability theory and an assortment of standard statistical techniques. It is intended to give computer scientists a broad range of theory, principles, and techniques to answer questions from data. For example:
Students are expected to have an elementary knowledge of statistics, linear algebra, and probability theory, and to be proficient in the Python programming language.
Recommended Textbooks:
These books are excellent references. However, we will not follow any book exactly, as we hope to cover a little more breadth of techniques (perhaps at the expense of depth) as well as some techniques often used in interdisciplinary data science which are not covered in these books.
Grading Proportions:
40% 3 exams - problem solving, short answer, essay, pseudocoding
30% 3 individual assignments - problem solving, short answer, coding
20% 1 team project - coding (prediction, clustering, and multi-hypothesis testing)
10% in the news - presentation and discussion participation
(subject to change with advance notice)
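As a rough illustration of how the proportions above combine, the sketch below computes a weighted course score. It assumes each component has already been reduced to a single 0-100 score; the component names and the `course_score` helper are hypothetical and not part of any official grading tool.

```python
# Weights taken from the grading proportions listed above.
# The dictionary keys are illustrative names, not official ones.
WEIGHTS = {
    "exams": 0.40,
    "assignments": 0.30,
    "team_project": 0.20,
    "in_the_news": 0.10,
}


def course_score(component_scores):
    """Return the weighted course score.

    Assumes component_scores maps each key in WEIGHTS to a 0-100 score.
    """
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


example = {"exams": 80, "assignments": 90, "team_project": 70, "in_the_news": 100}
print(course_score(example))
```

How scores within a component (for example, the three exams) are averaged is not specified here, so that step is left out.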
Current Scale:
D: [45,55), C-: [55,58), C: [58,64), C+: [64,67), B-: [67,70), B: [70,75), B+: [75,78), A-: [78,81), A: [81,100)
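The half-open intervals in the scale can be read as a simple lookup. The following is a minimal sketch, not an official grading script: the `letter_grade` helper is hypothetical, scores below 45 return `None` because the scale lists no letter for them, and a score of exactly 100 is assumed to count as an A even though the last interval is written as half-open.

```python
from bisect import bisect_right

# Lower bound of each interval in the scale above, in increasing order.
LOWER_BOUNDS = [45, 55, 58, 64, 67, 70, 75, 78, 81]
LETTERS = ["D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def letter_grade(score):
    """Map a numeric course score to a letter using the current scale."""
    # Count how many lower bounds are <= score; that selects the interval.
    idx = bisect_right(LOWER_BOUNDS, score)
    if idx == 0:
        return None  # below 45: no letter is listed in the scale
    return LETTERS[idx - 1]


for s in (44.9, 45, 57.9, 58, 80.99, 81):
    print(s, letter_grade(s))
```

Because each interval includes its lower bound and excludes its upper bound, a score of exactly 58 maps to C rather than C-.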
In the news. Each student will be responsible for a 5-minute presentation discussing the probability and statistics used in a research study or data science related technology that appeared in the news within the last year. The presentation should describe the techniques employed and whether the conclusions drawn in the news article are accurate. Students will sign up for a presentation slot during the first 2 weeks. The level of detail and sophistication should be commensurate with the material covered at that point in the course (i.e., later presentations will be expected to demonstrate a more thorough understanding and critical view).
Required Programming Language: Python 2.7. Unless otherwise specified, all programming assignments must be completed using Python. Frequently used packages: pandas, numpy, scipy, sklearn, scikit-stats.
Late Assignment Policy. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed if it is less than 24 hours late, while a 25% penalty will be assessed if it is between 24 and 48 hours late. Any assignments submitted after 48 hours from the deadline will earn a 0.
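The policy above can be sketched as a small function. This is only an illustration under stated assumptions: the `late_score` helper is hypothetical, the penalty is assumed to be a fraction of the earned score (rather than percentage points of the maximum), and submissions exactly 24 or exactly 48 hours late are assumed to fall in the 25% tier.

```python
def late_score(raw_score, hours_late):
    """Apply the late policy to a raw assignment score.

    Assumptions (not stated in the policy): the penalty scales the earned
    score, and the 24- and 48-hour boundaries belong to the 25% tier.
    """
    if hours_late <= 0:
        return raw_score          # on time: no penalty
    if hours_late < 24:
        return raw_score * 0.90   # less than 24 hours late: 10% penalty
    if hours_late <= 48:
        return raw_score * 0.75   # 24 to 48 hours late: 25% penalty
    return 0.0                    # more than 48 hours late: no credit


for hours in (0, 5, 24, 48, 49):
    print(hours, late_score(80, hours))
```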
Academic Honesty. Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers via any form of communication, or copying portions of answers from websites or other media, is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible to other students. At a minimum, all students involved in any cheating or scholarly misconduct will receive a 0 for the assignment and be reported to the graduate program director, which will likely bring further consequences.
Schedule and Topics
week beginning | topics | deadlines |
I. Foundations -- Probability Theory | ||
1/25 | probability review: independence, conditional probability, random variables, example application of probability, Markov chain | |
2/1 | python-pandas, Bayes rule, chain/product rule, conditional independence, histograms, probability density functions | Assignment 1 Given (Tue.) |
2/8 | probability mass/density/cumulative functions, expectation, maximum likelihood estimation (MLE), kernel density estimation, matrix linear algebra | in the News starts |
2/15 | data science applications of probability & statistics, review | Assignment 1 Due (Tue.) |
II. Discovery -- Quantitative Research Methods (Statistics) | ||
2/22 | hypothesis testing | Exam 1 (Tue, 2/23) |
2/29 | resampling methods, multi-test correction | Assignment 2 Given (Tue. -- moved to next Tue) |
3/7 | linear regression and correlation, logistic regression, chi-square, discrete random variable comparisons | Assignment 2 Given (Tue.) |
3/14 | Spring Recess (No Classes) | |
3/21 | hierarchical linear models, ecological fallacy, mediation, moderation (interaction), review | Assignment 2 Due (Tue. -- moved to Wed.) |
III. Optimization -- Clustering and Prediction (Machine Learning) | ||
3/28 | penalized regression and classification (regularization) | Exam 2 (Tue., 3/29) |
4/4 | clustering, dimensionality reduction, factor analysis | |
4/11 | Bayesian inference, naive Bayes | Assignment 3 Given (Mon.) |
4/18 | probabilistic modeling, maximum a posteriori (MAP) | Assignment 3 Due (Tues. -- moved to Thurs.) |
4/25 | time series: autocorrelation, ARIMA, change point detection | |
5/2 | information theory, review | Exam 3 Alternative Date |
5/10 | Finals Begin (No Classes) | Team Proj Due (Thur, 5/12) |
5/16 | | Exam 3 (5/16, 2:15-3:45pm) |