Big Data Analytics
Stony Brook University
CSE545 - Spring 2023
Monday and Wednesday, 4:25 - 5:45
Location: New Computer Science 120 (old) CS 2120
Instructor: H. Andrew Schwartz
Office hours: Mo 2:30-3:30p
Office: NCS 255
Email: has@cs.stonybrook.edu*

Teaching Assistants:
Peter Geiss, pgeiss@cs.stonybrook.edu
Vishwas Bommakanti, vbommakanti@cs.stonybrook.edu
TA Office Location: (old) CS 2126
Course Website: http://www3.cs.stonybrook.edu/~has/CSE545/ + Brightspace

*Please only email if the question is personal or you are sure no one else in the class will be interested in the answer (even then, you can send a private message in the course forum).
Textbooks:
Required: Mining of Massive Datasets v3.0 (“MMDS”)
Recommended: Advanced Analytics with Spark; Hands-On Machine Learning with Scikit-Learn & TensorFlow (HOML)

The course will predominantly follow MMDS, from which you will have required readings. Unfortunately, MMDS does not cover Spark, so we will supplement with online materials and the recommended books for those who wish for extra coverage. We will also go over distributed deep learning this semester, which is partially covered in the Géron book (HOML).

Coursework:
46% Exams (2)
30% Individual assignments (3)
24% Team project (1)

Grading Scale: F: [0, 50), D: [50, 66), C-: [66, 70), C: [70, 76), C+: [76, 78), B-: [78, 80), B: [80, 86), B+: [86, 88), A-: [88, 90), A: [90, ∞)

The scale above is fixed. However, should an exam or assignment have a median grade lower than 85, a curve may be applied to that particular exam or assignment. If the median grade is higher, no negative curve will be applied (it’s possible for the majority of the class to get greater than B+ if all work hard). Between the coursework percentages and scale above, you can calculate where you stand at any stage of the course.

Exams. Exams will take place in class. No calculators or other materials are permitted on one’s desk unless otherwise specified. Exams will last approximately 70 minutes and include questions whereby one demonstrates familiarity with the material through (1) problem solving, (2) short answer essays, (3) coding specific algorithms, and (4) true/false statements. Material covered on the exam may include anything from class or in the readings. Lecture slides are intended as an aid for the material covered in class; they are not a complete replacement for good note-taking in class or for doing the readings.

Assignments.
There will be three prescribed projects. They will involve presenting individual work, applying concepts and algorithms learned from class to implement parts of big data systems, solve a problem, or perform analyses over real world data.

Team Project. The final project of the course is a large team project. Teams of 3 to 4 students must pick a project related to a Sustainable Development Goal, either (a) measuring progress toward the goal, or (b) providing technology that could help, in part, reach the goal (see "Topic of Applications" below). There will be a sign-up per goal, with at most 2 teams allowed per goal. The projects will utilize at least 2 data workflow systems and at least a total of 2 other concepts from the course to perform large data analyses, and include: (1) a brief proposal/check-in presentation, (2) analysis code, (3) a report, and (4) a final presentation (during the scheduled final exam). Examples and previous projects either measure progress toward a goal or somehow seek to improve progress. Common types of data include social media from regions, satellite imagery, news and websites from regions, and graphs of social networks. A few additional topics may also be added.

Policies

Late Assignments. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed if an assignment is less than 24 hours late, while a 25% penalty will be assessed if it is between 24 and 48 hours late. Any assignment submitted more than 48 hours after the deadline will earn a 0. Questions submitted within 48 hours of a deadline may not get a response before the deadline.

Lecture Recordings. This is classified as an in-person class. If you miss a class, your best source of missed content should be notes and discussion with another student who attended the class. Use the class forum if you do not have such a peer, and the instructor or TA will attempt to connect you. Questions about material are also welcome on the forum or during office hours.
As a secondary reference, the instructor intends, but cannot guarantee, to provide audio and slide recordings.

Prerequisites. This course assumes students have undergraduate-level knowledge from a modern computer science curriculum. There are no graduate prerequisite classes for this course, but it is recommended that students take either (1) Machine Learning, (2) Data Science Fundamentals, or (3) Prob and Stat for Data Science first, if they are not very familiar with the following undergraduate topics:
Required Programming Language: We will use Python 3.8+ as the default language during class. Acceptable libraries will be listed for each assignment.

Computing Servers. You may be provided with access to either an Amazon Web Services or Google Cloud cluster, or a cluster at Stony Brook University. These machines are only to be used for class assignments.

Academic Honesty.

Copying work: Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers, via any form of communication, or copying portions of answers from websites or other media is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible by other students.

Plagiarism: Plagiarism is defined as presenting someone else’s writing or work without attribution as if it were your own. Copying any work, such as code, is plagiarizing. With regard to writing, information learned from books, websites, research papers, or any other source should either be (a) written in one’s own words and include a citation, or (b) quoted and include a citation. Although option (b) is not plagiarism, excessive use of quotations will result in a lower grade, as it demonstrates less critical thinking and goes against the purpose of the assignment. Cornell University has a wonderful webpage further defining plagiarism and including exercises to determine if something is plagiarism: http://plagiarism.arts.cornell.edu .

Consequences: At a minimum, all students involved in copying work, plagiarism, cheating, or scholarly misconduct will receive a 0 for the assignment/exam and be reported to the graduate program director, which may come with further consequences.

Academic Integrity. Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong.
Faculty are required to report any suspected instances of academic dishonesty to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity/index.html

Student Accessibility Support Center. If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact the Student Accessibility Support Center, Stony Brook Union Suite 107, (631) 632-6748, or at sasc@stonybrook.edu. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.

Critical Incident Management. Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Student Conduct and Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn. Faculty in the HSC Schools and the School of Medicine are required to follow their school-specific procedures. Further information about most academic matters can be found in the Undergraduate Bulletin, the Undergraduate Class Schedule, and the Faculty-Employee Handbook.
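The coursework weights, grading scale, and late-assignment penalties stated above can be combined to estimate where you stand. The following is a minimal sketch in Python, not an official grade calculator. It assumes that items within a category (for example, the two exams) count equally, that all scores are percentages from 0 to 100, and that a submission exactly 48 hours late still receives the 25% penalty; none of these details is specified in the syllabus. The function names are hypothetical, and any curve applied to an individual exam or assignment is not modeled.

```python
from bisect import bisect_right

# Category weights from the Coursework section.
WEIGHTS = {"exams": 0.46, "assignments": 0.30, "project": 0.24}

# Lower bounds of the half-open intervals in the grading scale,
# paired with the letter each interval maps to.
CUTOFFS = [0, 50, 66, 70, 76, 78, 80, 86, 88, 90]
LETTERS = ["F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def apply_late_penalty(score, hours_late):
    """Return the score after the late penalty.

    Assumption: exactly 48 hours late is treated as "between 24 and 48".
    """
    if hours_late <= 0:
        return float(score)
    if hours_late < 24:
        return score * 0.90  # 10% penalty
    if hours_late <= 48:
        return score * 0.75  # 25% penalty
    return 0.0  # more than 48 hours late earns a 0


def course_score(scores):
    """Weighted course percentage.

    `scores` maps each category name to a non-empty list of percentages.
    Assumption: items within a category are weighted equally.
    """
    total = 0.0
    for category, weight in WEIGHTS.items():
        items = scores[category]
        total += weight * sum(items) / len(items)
    return total


def letter_grade(score):
    """Map a course percentage to a letter using the fixed scale."""
    if score < 0:
        raise ValueError("score must be non-negative")
    return LETTERS[bisect_right(CUTOFFS, score) - 1]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {
        "exams": [80, 90],
        "assignments": [90, 100, apply_late_penalty(95, hours_late=0)],
        "project": [88],
    }
    total = course_score(example)
    print(f"{total:.2f} -> {letter_grade(total)}")
```

Because every interval in the scale is closed on the left, a binary search over the lower bounds is enough to pick the letter, and a score exactly on a boundary (such as 90) receives the higher grade.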
Week starting | Topics | Reading Assignment | Assignments, Exams |
I. Data/Workflow Frameworks (Pipelines) | |||
1/25 | What is Big Data? Preliminaries. | MMDS 1.0-1.5 | |
1/30 | Streaming Algorithms | MMDS 4.0 - 4.4 | |
2/6 | Hadoop File System, MapReduce | MMDS 2.0-2.3 | A1 Release |
2/13 | Spark | MMDS 2.4 - 2.5 | |
2/20 | Introduction to Deep Learning Frameworks | TorchWFFundamentals; 13.0 - 13.1 | A1 Due |
2/27 | Deep Learning; Workflow systems (exam 1) review | ||
II. Big Data Algorithms and Analysis | |||
3/6 | Exam 1 | MMDS 3.0 - 3.5; 13.6; ProbReview | Exam 1 (Mo. 3/6) A2 Release |
3/13 | Spring Recess | ||
3/20 | Regression to Deep Learning; RNNs | MMDS 13.2 - 13.3; 13.5 | A2 Due |
3/27 | Self-Supervision; Transformers-Dialog | | A3 Release |
4/3 | Large Scale Hypothesis Testing; Link Analysis | MMDS 12.1 | A3 Release Project Teams Formation |
4/10 | Recommendation Systems | MMDS 5.0 - 5.2 | A3 Due |
4/17 | Time Series / Longitudinal Analyses; Exam 2 | | Exam 2 (Wed. 4/19) A3 Due (Sat.) |
III. Project | |||
4/24 | Research Ethics, Applied Big Data Analytics | Team Projects | Team Project Proposals Due (Sat.) |
5/1 | Research Ethics, Applied Big Data Analytics Team Project Proposals | Team Projects | |
5/16 | Finals Week | | Final Exam is the Team Project Presentation, 5/16: 2:15 - 5:00pm |
This schedule is subject to change with advance notice.