Big Data Analytics

Stony Brook University
CSE545 - Spring 2023

Monday and Wednesday, 4:25 - 5:45
Location: New Computer Science 120 (old) CS 2120

Instructor: H. Andrew Schwartz
  office hours: Mo 2:30-3:30p
  office: NCS 255
  email: has@cs.stonybrook.edu*

Teaching Assistants: 

Peter Geiss, pgeiss@cs.stonybrook.edu
  hours: Tu: 4:30-5:30pm; Wed: 2:30-3:30p

Vishwas Bommakanti, vbommakanti@cs.stonybrook.edu
  hours: We: 11:00a-12:00pm; Th: 11:00a-12:00pm

TA Office Location: (old) CS 2126

Course Website: http://www3.cs.stonybrook.edu/~has/CSE545/ + Brightspace

*Please only email if the question is personal or you are sure no one else in the class will be interested in the answer (even then, you can send a private message in the course forum).
Course Discussion Forum: The primary place to ask questions is the class forum. We will use Brightspace forums to keep the number of course websites to a minimum. However, should the forum prove insufficient for the purposes of the course, we may switch to Piazza.


This course will cover concepts, algorithms and standard tools used to analyze “Big Data”: MapReduce, Spark, graph analytics, text analytics, and streaming algorithms, over modern distributed analysis platforms (e.g. Hadoop). The course will have a large project component, incorporating analyses over large real world data sets.

Textbooks:

Required:

Mining of Massive Datasets v3.0 (“MMDS”)
                -Leskovec, Rajaraman, and Ullman (available at http://www.mmds.org/#ver30)

Recommended:

Advanced Analytics with Spark
                -Ryza, Laserson, Owen, and Wills

Hands-On Machine Learning with Scikit-Learn & TensorFlow  (HOML)
                -Geron

The course will predominantly follow MMDS, from which you will have required readings. Unfortunately, MMDS does not cover Spark so we will supplement with online materials and the recommended materials for those that wish for extra coverage. We will also go over distributed deep learning this semester, which is partially covered in the Geron book.

Coursework:

46% Exams (2)

30% Individual assignments (3)

24% Team project (1)

Grading Scale:

F: [0, 50), D: [50, 66), C-: [66,70), C: [70,76), C+: [76,78), B-: [78,80), B: [80,86), B+: [86,88), A-: [88,90), A: [90,)

The scale above is fixed. However, should the median grade on an exam or assignment fall below 85, a curve may be applied to that particular exam or assignment. If the median grade is 85 or above, no negative curve will be applied (it’s possible for the majority of the class to get greater than B+ if all work hard). Between the coursework percentages and the scale above, you can calculate where you stand at any stage of the course.
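As a rough illustration of that calculation, here is a minimal Python sketch. It assumes that the items within a category count equally (for example, each of the two exams is half of the 46%) and that categories with no grades yet are left out and the remaining weights renormalized. Neither assumption is stated in this syllabus, and the function names are hypothetical. Any curve is ignored.

```python
from bisect import bisect_right

# Category weights from the Coursework section.
WEIGHTS = {"exams": 0.46, "assignments": 0.30, "project": 0.24}

# Lower bounds of each letter above F, from the Grading Scale.
CUTOFFS = [50, 66, 70, 76, 78, 80, 86, 88, 90]
LETTERS = ["F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def weighted_score(scores):
    """Return the weighted percentage over the categories graded so far.

    `scores` maps a category name to a list of percentages (0-100).
    Assumption: items within a category are weighted equally, and
    categories without grades are skipped and the weights renormalized.
    """
    total = 0.0
    weight_used = 0.0
    for category, weight in WEIGHTS.items():
        marks = scores.get(category, [])
        if marks:
            total += weight * sum(marks) / len(marks)
            weight_used += weight
    if weight_used == 0:
        raise ValueError("no graded work yet")
    return total / weight_used


def letter_grade(score):
    """Map a percentage to a letter using the half-open intervals above."""
    return LETTERS[bisect_right(CUTOFFS, score)]


if __name__ == "__main__":
    so_far = {"exams": [80, 90], "assignments": [100, 90, 95], "project": [88]}
    score = weighted_score(so_far)
    print(round(score, 2), letter_grade(score))
```

Because each interval is closed on the left, a score of exactly 90 maps to A and a score of exactly 50 maps to D.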

Exams. Exams will take place in class. No calculators or other materials are permitted on one’s desk unless otherwise specified. Exams will last approximately 70 minutes and include questions whereby one demonstrates familiarity with the material through (1) problem solving, (2) short answer essays, (3) coding specific algorithms, and (4) true/false statements. Material covered on the exam may include anything from class or in the readings. Lecture slides are intended as an aid for the material covered in class; they are not a complete replacement for good note-taking in class or for doing the readings.

Assignments. There will be three prescribed projects. They will involve presenting individual work that applies concepts and algorithms learned in class to implement parts of big data systems, solve a problem, or perform analyses over real-world data.

Team Project. The final project of the course is a large team project. Teams of 3 to 4 students must pick a project related to a Sustainable Development Goal: either (a) measuring progress toward the goal, or (b) providing technology that could help, in part, reach the goal (see "Topic of Applications" below). There will be a sign-up per goal, with at most 2 teams allowed per goal. The projects will utilize at least 2 data workflow systems and at least a total of 2 other concepts from the course to perform large data analyses, and include: (1) a brief proposal/check-in presentation, (2) analysis code, (3) a report, and (4) a final presentation (during the scheduled final exam). Examples and previous projects either measured progress toward a goal or sought to improve progress toward it. Common types of data include social media from regions, satellite imagery, news and websites from regions, and graphs of social networks. A few additional topics may also be added.

Policies

Late Assignments. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed if it is less than 24 hours late, while a 25% penalty will be assessed if it is between 24 and 48 hours late. Any assignments submitted after 48 hours from the deadline will earn a 0. Questions submitted within 48 hours of a deadline may not get a response before the deadline.
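The late policy above can be sketched as a small Python function. This is only an illustration: it assumes the penalty is taken as a fraction of the score earned (rather than as points off the maximum), and it treats a submission exactly 48 hours late as still accepted. Both are interpretations, not statements from this syllabus, and the function names are hypothetical.

```python
def late_penalty(hours_late):
    """Return the fraction of the score lost for a given lateness in hours."""
    if hours_late <= 0:
        return 0.0   # on time
    if hours_late < 24:
        return 0.10  # less than 24 hours late
    if hours_late <= 48:
        return 0.25  # between 24 and 48 hours late
    return 1.0       # more than 48 hours late earns a 0


def adjusted_score(raw_score, hours_late):
    """Apply the penalty multiplicatively (an assumption, see above)."""
    return raw_score * (1.0 - late_penalty(hours_late))


if __name__ == "__main__":
    for hours in (0, 5, 30, 49):
        print(hours, adjusted_score(90, hours))
```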

Lecture Recordings. This is classified as an in-person class. If you miss a class, your best source of missed content is notes and discussion with another student who attended the class. Use the class forum if you do not have such a peer, and the instructor or a TA will attempt to connect you. Questions about material are also welcome on the forum or during office hours. As a secondary reference, the instructor intends, but cannot guarantee, to provide audio and slide recordings.

Prerequisites. This course assumes students have undergraduate-level knowledge from a modern computer science curriculum. There are no graduate prerequisite classes for this course, but it is recommended that students first take one of (1) Machine Learning, (2) Data Science Fundamentals, or (3) Prob and Stat for Data Science if they are not very familiar with the following undergraduate topics:

  • Handling data in Python (e.g. numpy, scipy)
  • Matrix linear algebra and linear regression
  • Probability theory
  • Hashing (hash functions, tables, hashmaps/dictionaries)
  • Algorithm theory and runtime complexity

Required Programming Language: We will use Python 3.8+ as the default language during class. Acceptable libraries will be listed for each assignment.

Computing Servers. You may be provided with access to either an Amazon Web Services or Google Cloud cluster, or a cluster at Stony Brook University. These machines are only to be used for class assignments.

Academic Honesty.

Copying work: Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers, via any form of communication, or copying portions of answers from websites or other media is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible to other students.

Plagiarism: Plagiarism is defined as presenting someone else’s writing or work without attribution as if it were your own. Copying any work, such as code, is plagiarizing. With regard to writing, information learned from books, websites, research papers, or any other source should either be (a) written in one’s own words and include a citation, or (b) quoted and include a citation. Although option (b) is not plagiarism, excessive use of quotations will result in a lower grade as it demonstrates less critical thinking and goes against the purpose of the assignment. Cornell University has a wonderful webpage further defining plagiarism, including exercises to determine if something is plagiarism: http://plagiarism.arts.cornell.edu .

Consequences: At a minimum, all students involved in copying work, plagiarism, cheating, or scholarly misconduct will receive a 0 for the assignment/exam and be reported to the graduate program director, which may come with further consequences.

Academic Integrity. Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Faculty are required to report any suspected instances of academic dishonesty to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity/index.html

Student Accessibility Support Center.

If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact the Student Accessibility Support Center, Stony Brook Union Suite 107, (631) 632-6748, or at sasc@stonybrook.edu. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.

Critical Incident Management.

Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Student Conduct and Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment or inhibits students' ability to learn. Faculty in the HSC Schools and the School of Medicine are required to follow their school-specific procedures. Further information about most academic matters can be found in the Undergraduate Bulletin, the Undergraduate Class Schedule, and the Faculty-Employee Handbook.

Schedule and Topics

I. Data/Workflow Frameworks (Pipelines)

Week starting 1/25
  Topics: What is Big Data? Preliminaries.
  Reading Assignment: MMDS 1.0-1.5

Week starting 1/30
  Topics: Streaming Algorithms
  Reading Assignment: MMDS 4.0 - 4.4

Week starting 2/6
  Topics: Hadoop File System, MapReduce
  Reading Assignment: MMDS 2.0-2.3
  Assignments, Exams: A1 Release

Week starting 2/13
  Topics: Spark
  Reading Assignment: MMDS 2.4 - 2.5, RDD Paper

Week starting 2/20
  Topics: Introduction to Deep Learning Frameworks
  Reading Assignment: TorchWFFundamentals; 13.0 - 13.1
  Assignments, Exams: A1 Due

Week starting 2/27
  Topics: Deep Learning; Workflow systems (exam 1) review

II. Big Data Algorithms and Analysis

Week starting 3/6
  Topics: Exam 1; Similarity Search; Statistics-Prob Preliminaries
  Reading Assignment: MMDS 3.0 - 3.5; 13.6; ProbReview
  Assignments, Exams: Exam 1 (Mo. 3/6); A2 Release

Week starting 3/13
  Spring Recess

Week starting 3/20
  Topics: Regression to Deep Learning; RNNs
  Reading Assignment: MMDS 13.2 - 13.3; 13.5
  Assignments, Exams: A2 Due

Week starting 3/27
  Topics: Self-Supervision; Transformers-Dialog
  Reading Assignment: BERT, TransformerReviewsVideos
  Assignments, Exams: A3 Release

Week starting 4/3
  Topics: Large Scale Hypothesis Testing; Link Analysis
  Reading Assignment: MMDS 12.1; Orloff and Bloom: NHST, McDonald2015
  Assignments, Exams: A3 Release; Project Teams Formation

Week starting 4/10
  Topics: Recommendation Systems
  Reading Assignment: MMDS 5.0 - 5.2; MMDS 9.0 - 9.4
  Assignments, Exams: A3 Due; Project Teams Due (We. 4/12)

Week starting 4/17
  Topics: Time Series / Longitudinal Analyses; Exam 2
  Assignments, Exams: Exam 2 (Wed. 4/19); A3 Due (Sat.)

III. Project

Week starting 4/24
  Topics: Research Ethics, Applied Big Data Analytics; Team Project Proposals
  Assignments, Exams: Team Projects; Team Project Proposals Due (Sat.)

Week starting 5/1
  Topics: Research Ethics, Applied Big Data Analytics; Team Project Proposals
  Assignments, Exams: Team Projects

5/16
  Topics: Finals Week: Team presentations or posters will be during our allotted final exam period: 5/16 from 2:15 to 5:00pm (the presentation *is* your final exam)
  Assignments, Exams: Final Exam is the Team Project Presentation. 5/16: 2:15 - 5:00pm

This schedule is subject to change with advance notice.