Big Data Analytics

Stony Brook University
CSE545 - Fall 2017

Tuesdays and Thursdays, 4:00 - 5:20
(Old) CS 2120

Instructor:         H. Andrew Schwartz

      office hours: Tu: 5:30-6:30p, We: 2-3

      office:        NCS 255


Teaching Assistant: Youngseo Son

      office hours: Th: 5:30-6:30p

      office:        NCS 242


* Piazza is the primary place to ask questions. Please email only if the question is personal or you are sure no one else in the class will be interested in the answer (even then, you can send a private post in Piazza).

This course will cover concepts, algorithms and standard tools used to analyze “Big Data”: MapReduce, Spark, graph analytics, text analytics, and streaming algorithms, over modern distributed analysis platforms (e.g. Hadoop). The course will have a large project component, incorporating analyses over large real world data sets.


Mining of Massive Datasets v2.1 (MMDS)
Leskovec, Rajaraman, and Ullman (required; available at

Advanced Analytics with Spark
Ryza, Laserson, Owen, and Wills (required)

Hands-On Machine Learning with Scikit-Learn & TensorFlow (HOML)
Géron

The course will predominantly follow MMDS, from which you will have required readings. Unfortunately, MMDS does not cover Spark, so we will supplement it with the Advanced Analytics with Spark book, which also describes some applications we will cover. We will also go over distributed deep learning this semester, which is covered in the Géron book.


34% Exams (2)

25% Team project

30% Individual assignments (2)

11% Data Topic Presentation and Participation
(subject to change with advance notice)
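
As a rough illustration of how these weights combine, here is a minimal sketch in Python. It assumes each component is first reduced to a single 0-100 score (for example, by averaging the two exams equally, which the syllabus does not specify); the names `WEIGHTS` and `course_score` are hypothetical and are not part of any course tooling.

```python
# Component weights in percent, copied from the breakdown above.
WEIGHTS = {
    "exams": 34,
    "team_project": 25,
    "assignments": 30,
    "presentation_participation": 11,
}


def course_score(component_scores):
    """Combine per-component scores (each 0-100) into one course score (0-100).

    Assumption: every component has already been reduced to a single
    percentage; how items within a component are averaged is not stated.
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise KeyError("missing components: " + ", ".join(sorted(missing)))
    # Weights are whole percents, so divide by 100 once at the end.
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100


example = {
    "exams": 80,
    "team_project": 90,
    "assignments": 85,
    "presentation_participation": 100,
}
print(course_score(example))  # 86.2
```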

Grading Scale:

F: [0, 50), D: [50, 66), C-: [66,70), C: [70,76), C+: [76,78), B-: [78,80), B: [80,86), B+: [86,88), A-: [88,90), A: [90,100]

The scale above is intended to be fixed assuming the mean grade of the course is approx. 85 to 90. However, should the coursework reflect a much lower grade, a curve may be applied to raise the letter grades. The intention with having a published scale is so students always know where they stand in the course.  
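
To see where a score falls on the published scale, the intervals can be read as a sorted list of lower cutoffs. The following is only a sketch: it assumes no rounding is applied before the lookup and that no curve is in effect, and the names `CUTOFFS`, `LETTERS`, and `letter_grade` are made up for illustration.

```python
from bisect import bisect_right

# Lower bounds of every grade above F, taken from the published scale.
CUTOFFS = [50, 66, 70, 76, 78, 80, 86, 88, 90]
LETTERS = ["F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def letter_grade(score):
    """Map a course score in [0, 100] to a letter on the uncurved scale."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_right counts the cutoffs that are <= score, so a score sitting
    # exactly on a cutoff lands in the higher interval, matching [low, high).
    return LETTERS[bisect_right(CUTOFFS, score)]


print(letter_grade(86.2))  # B+
print(letter_grade(90))    # A
```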

Topic of Applications: This year we will be adopting a common topical focus on Big Data for the Sustainable Development Goals. The UN outlined 17 goals that were generally agreed upon by participating nations as ideal for the future world. The goals will appear in examples we use in class, and the final team project must address one of the goals -- either measuring progress toward it or seeking to improve progress.

Exams. Exams will take place in class. No calculators or other materials are permitted on one’s desk unless otherwise specified. Exams will last approximately 70 minutes and include questions whereby one demonstrates familiarity with the material through (1) problem solving, (2) short answer essays, (3) coding specific algorithms, and (4) true/false statements. Material covered on the exam may include anything from class or in the readings. Lecture slides are intended as an aid for the material covered in class; they are not a complete replacement for good note-taking in class or for doing the readings.

Team Project. The final project of the course is a large team project. Teams of 3 to 4 students must pick a project related to solving or measuring a sustainable development goal. There will be a sign-up per goal, with at most 2 teams allowed per goal. The projects will utilize at least 1 data pipeline and at least 3 other concepts from the course to perform large data analyses, and include: (1) a brief proposal presentation, (2) analysis code, (3) a report, and (4) a final presentation (during the scheduled final exam period).

Individual Assignments. Both assignments will involve programming and allow one to use some of the concepts and algorithms learned from class lectures.

Data-Topic Presentation. Each student must give a 4-minute presentation covering either:

Option 1: An available dataset related to one of the sustainable development goals.

Option 2: A big data technology (e.g. Hive, GraphX, etc.)

Criteria for each of these will be provided.

Each topic may only be presented by one student -- a signup sheet will be made available Thursday, 8/31, and presentations will begin the third week, 9/12. A goal of these presentations is to practice thinking critically and carrying on a dialog in Big Data work. After all talks are given, there will be a 5-minute Q&A session. Part of each student’s grade will be based on participation in this session, as tracked by the TA.

Sign up and more information.


Late Assignments. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed on assignments less than 24 hours late, and a 25% penalty on assignments between 24 and 48 hours late. Any assignment submitted more than 48 hours after the deadline will earn a 0.
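
The late policy can be sketched as a small function. This is an illustration under stated assumptions, not an official grading script: it assumes the penalty is taken as a fraction of the score earned, that a submission exactly 24 hours late falls in the 25% tier, and that one exactly 48 hours late is still accepted. The name `late_multiplier` is hypothetical.

```python
def late_multiplier(hours_late):
    """Return the fraction of the earned score kept after the late penalty.

    Assumptions (the policy text does not settle these edge cases):
    exactly 24 hours late counts as the 25% tier, and exactly 48 hours
    late is still accepted.
    """
    if hours_late <= 0:
        return 1.0   # on time: no penalty
    if hours_late < 24:
        return 0.90  # less than 24 hours late: 10% penalty
    if hours_late <= 48:
        return 0.75  # between 24 and 48 hours late: 25% penalty
    return 0.0       # more than 48 hours late: earns a 0


# A submission scored 88 and turned in 30 hours late keeps 75% of its score.
print(88 * late_multiplier(30))  # 66.0
```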

Required Programming Language: We will use Python 3.5+ as the default language during class. Java and Scala are also permissible for assignments, but examples in class and solutions to assignments are only guaranteed to be in Python. Acceptable libraries will be listed for each assignment.

Computing Servers. You will be provided with access to an Amazon Web Services cluster. As a backup, you may also receive access to a cluster at Stony Brook University. These machines are only to be used for class assignments.

Academic Honesty.

Copying work: Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers, via any form of communication, or copying portions of answers from websites or other media is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible to other students.

Plagiarism: Plagiarism is defined as presenting someone else’s writing or work as if it were your own. Technically, copying work is plagiarizing, but a few more notes are made with regard to writing here. Information learned from books, websites, research papers, or any other source should either be (a) written in one’s own words and include a citation, or (b) quoted and include a citation. Although option (b) is not plagiarism, excessive use of quotations will result in a lower grade, as it demonstrates less critical thinking and goes against the purpose of the assignment. Cornell University has a wonderful webpage that further defines plagiarism and includes exercises to determine if something is plagiarism.

Consequences: At a minimum, all students involved in copying work, plagiarism, or any other cheating or scholarly misconduct will receive a 0 for the assignment and be reported to the graduate program director, which may come with further consequences.

Schedule and Topics

Topics                                             | Reading Assignment | Assignments, Exams

I. Data Frameworks
What is Big Data? Preliminaries.                   | MMDS Ch. 1         |
Labor Day, MapReduce                               | MMDS Chs. 2,       |
MapReduce, Spark                                   | *Spark Ch. 2       | A1 released
                                                   | MMDS Chs. 4        | D-T Pres. Start
Streaming Algorithms                               | *Spark Ch. 3-4     | A1 Due (Th, 9/28)

II. Big Data Algorithms and Analysis
Streaming, Exam                                    | blog post          | Exam 1 (Th, 10/5)
Similarity Search                                  | MMDS Chs. 3, 5     |
Link Analysis                                      | MMDS Chs. 6.2, 7   |
Probability and Statistics preliminaries           | MMDS Ch. 11        | Project Teams Due (10/24)
Supervised Statistics: Scalable Linear Modeling    | MMDS Ch. 9         |
Unsupervised: Clustering, Dimensionality Reduction | MMDS Ch. 12        | A2 Released

III. Project
Project Proposals                                  | MMDS Ch. 10        | Proposals Due (Tu)
Recommendation Systems; Thanksgiving Break         | *HOML Ch. 9 - 10   | A2 Due
TensorFlow, Distributed Deep Learning              | *HOML Ch. 11 - 12  | Last D-T Pres.
Exam, Team Work / Guest Speaker                    |                    | Exam 2 (Tu, 12/5)

Finals Week: Team presentations or posters will be during our allotted final exam period: Monday, 12/18, 2:15 - 5:00pm
(the presentation *is* your final exam)

Final Exam = Team Project Pres.

* Includes materials outside of the MMDS textbook.

This schedule is subject to change with advance notice.