Natural Language Processing

Stony Brook University
CSE354 - Spring 2021

Mondays and Wednesdays, 4:25 - 5:20
Location: Online (Zoom)

Instructor:         H. Andrew Schwartz

      office hours: Mo: 6:15 - 7:15p, We: 3:00 - 3:30

      office:        NCS 255 (and Zoom)

      email:        has@cs.stonybrook.edu*

Teaching Assistants: 

  Siddharth Mangalik
    office hours: Tu, We: 6:15 - 7:15

    email: smangalik@cs.stonybrook.edu

Office hours will be held on Zoom (see the link in Blackboard under "Zoom").

Course website: http://www3.cs.stonybrook.edu/~has/CSE354/

*Course communication: Piazza https://piazza.com/stonybrook/spring2021/cse354

As humans, we process language quite effortlessly, so why do our devices still recognize only basic phrases and so often misunderstand us? Getting computers to understand natural language is one of the grand challenges of AI, and its pursuit has produced methods that power key technologies of the modern digital world, such as web search, translation, and personal assistants (e.g. Siri, Alexa). This course will introduce the algorithms and statistical techniques used for natural language processing, covering syntax (identifying structure), semantics (uncovering meaning), and applications (e.g. sentiment analysis, machine translation, and human language analysis). Students will be introduced to techniques in modern machine learning that power state-of-the-art NLP: deep learning (recurrent neural networks, transformers) as well as discriminative learning (ridge regression, support vector machines). The course will have a substantial project component, giving students first-hand experience developing language processing software for useful real-world problems.

Course Materials

Speech and Language Processing (3rd ed. draft), by Dan Jurafsky and James H. Martin

Other research papers and tutorials will be listed in the schedule.

Coursework

40% Class review quizzes

36% Individual assignments (4)

10% NLP in the World Presentation and Participation

14% Final Exam

(subject to change with advance notice)
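As a rough illustration of how the percentages above combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and combined as a plain weighted sum; the function name and the example scores are hypothetical, not part of the official grading procedure.

```python
# Component weights from the Coursework list (subject to change with advance notice).
WEIGHTS = {
    "quizzes": 0.40,
    "assignments": 0.36,
    "presentation": 0.10,
    "final_exam": 0.14,
}


def overall_score(scores):
    """Weighted course total, assuming every component is on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {"quizzes": 90, "assignments": 85, "presentation": 100, "final_exam": 80}
print(round(overall_score(example), 2))  # 87.8
```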

Grading Scale:

A: [90,100], A-: [88,90), B+: [86,88), B: [80,86), B-: [78,80),
C+: [76,78), C: [70,76), C-: [68,70), D+: [66, 68), D: [60, 66), F: [0, 60)

The scale above is fixed. However, curves may be applied to coursework to raise the median grade to approximately 80 to 86. There will not be a negative curve, so the median grade may end up greater than 86. The published scale is intended to let students always know where they stand in the course.
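The interval notation in the scale can be read as a simple lookup: a score earns the first letter whose lower bound it meets. The sketch below assumes the score is compared as-is, with no rounding beforehand (the syllabus does not say whether rounding is applied); the function name is hypothetical.

```python
# Lower bounds taken from the published scale, highest first.
GRADE_FLOORS = [
    (90, "A"), (88, "A-"), (86, "B+"), (80, "B"), (78, "B-"),
    (76, "C+"), (70, "C"), (68, "C-"), (66, "D+"), (60, "D"),
]


def letter_grade(score):
    """Map a 0-100 score to a letter grade (assumption: no rounding)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


print(letter_grade(87.8))   # B+
print(letter_grade(89.99))  # A-
```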

Class Review Quizzes. Starting the second week, after every class, a quiz will be posted on Blackboard, available for 1.5 days, containing between 2 and 5 questions. The questions are intended to review the material from the class, provide feedback to the instructor on class progress, and establish a large portion of the overall grade assessment given the online nature of the course. We will go over the answers at the beginning of the next class. The questions will be either (a) problem solving, (b) multiple choice (including true/false), (c) filling in pseudocode for algorithms, or (d) short-answer. The 3 lowest scoring quizzes will be dropped from the overall quiz average used in grading.
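A minimal sketch of the drop-the-lowest rule follows. It assumes every quiz counts equally and that a missed quiz is recorded as 0, neither of which the syllabus states explicitly; the function name and scores are hypothetical.

```python
def quiz_average(scores, n_dropped=3):
    """Mean of the quiz scores after dropping the n_dropped lowest ones."""
    if len(scores) <= n_dropped:
        raise ValueError("need more quizzes than the number dropped")
    kept = sorted(scores)[n_dropped:]  # sorted ascending, so the lowest come first
    return sum(kept) / len(kept)


# Hypothetical quiz scores (percent); the two zeros stand for missed quizzes.
scores = [100, 80, 0, 60, 90, 0, 70, 100]
print(quiz_average(scores))  # 88.0
```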

Individual Assignments. Four individual assignments will involve programming and allow one to practice using concepts and implementing algorithms learned from class lectures and the book. Assignments must be in Python (v3.6+) with NumPy (>= 1.14), PyTorch (v1.4), or scikit-learn (v0.24.1). Each assignment will list other potential libraries; the general policy is not to use any library that implements the algorithm being asked for in the assignment. Some assignments may also have a written portion.

NLP in the World Presentation. Students will give a 5-minute presentation covering natural language processing being used “in the real world”. Topics should be linked to a news or blog article (not directly a research article, although a news or blog article may discuss a research article). A signup link will be posted here during the third week.

Final Exam. The final exam will last approximately 90 minutes and include questions whereby one demonstrates familiarity with the course material. The final exam will be similar in format to quizzes, containing a combination of (a) problem solving, (b) short answer essays, (c) coding specific algorithms, and (d) multiple choice. Material covered on the exam may include anything from class or in the readings. Lecture slides are intended as an aid for the material covered in class; they are not a complete replacement for good note-taking in class or for doing the readings. The current plan is to hold the exam online over a timed session. This plan is subject to change, with ample advance notice, given the changing nature of the options available.

Policies

Late Assignments. Assignments will be accepted up to 48 hours late. A 10% penalty will be assessed if it is less than 24 hours late, and a 25% penalty will be assessed if it is between 24 and 48 hours late. Any assignments submitted after 48 hours from the deadline will earn a 0.
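The late policy can be sketched as a small function. Two points are assumptions here because the policy text leaves them open: a submission exactly 24 hours late is placed in the 25% tier, and the penalty is taken as a fraction of the earned score. The function names are hypothetical.

```python
def late_penalty_fraction(hours_late):
    """Fraction of the score lost for a submission hours_late past the deadline."""
    if hours_late <= 0:
        return 0.0   # on time
    if hours_late < 24:
        return 0.10  # less than 24 hours late
    if hours_late <= 48:
        return 0.25  # between 24 and 48 hours late (assumption: 24 exactly lands here)
    return 1.0       # more than 48 hours late earns a 0


def penalized_score(raw_score, hours_late):
    """Score after the late penalty (assumption: penalty scales the earned score)."""
    return raw_score * (1 - late_penalty_fraction(hours_late))


for hours in (0, 5, 30, 49):
    print(hours, penalized_score(80, hours))
```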

Required Programming Language: We will use Python 3.8 as the default language during class. Acceptable libraries will be listed for each assignment.

Academic Honesty.

Sharing answers and copying work: Students are welcome and encouraged to converse about assignment problems and concepts. However, sharing answers with classmates, via any form of communication, or copying portions of answers from websites or other media is strictly prohibited. You are responsible both for not looking at another’s answers or code and for making sure your own answers and code are not accessible to other students.

Plagiarism: Plagiarism is defined as presenting someone else’s writing or work without attribution, as if it were your own. Copying any work, such as code, without attribution is plagiarizing. With regard to writing, information learned from books, websites, research papers, or any other source should either be (a) written in one’s own words and include a citation, or (b) quoted and include a citation. Although option (b) is not plagiarism, excessive use of quotations will result in a lower grade as it demonstrates less critical thinking and goes against the purpose of the assignment. If you’re unsure whether something is plagiarism, ask. Here is also a helpful website with exercises to determine if something is plagiarism: http://plagiarism.arts.cornell.edu.

Grading Policy. At a minimum, all students involved in copying work, sharing answers, plagiarism, or any other cheating or scholarly misconduct will receive a 0 for the assignment and be reported to the Academic Judiciary as noted below, which may come with further consequences.

Academic Integrity Statement. Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Faculty is required to report any suspected instances of academic dishonesty to the Academic Judiciary. Faculty in the Health Sciences Center (School of Health Technology & Management, Nursing, Social Welfare, Dental Medicine) and School of Medicine are required to follow their school-specific procedures. For more comprehensive information on academic integrity, including categories of academic dishonesty please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity/index.html

Student Accessibility Support Center.

If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact the Student Accessibility Support Center, 128 ECC Building, (631) 632-6748, or via e-mail at:  sasc@stonybrook.edu. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.

Critical Incident Management.

Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of University Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn. Faculty in the HSC Schools and the School of Medicine are required to follow their school-specific procedures. Further information about most academic matters can be found in the Undergraduate Bulletin, the Undergraduate Class Schedule, and the Faculty-Employee Handbook.

Remote Attendance.

Students are expected to attend every class, report for examinations and submit major graded coursework as scheduled. If a student is unable to attend lecture(s), report for any exams or complete major graded coursework as scheduled due to extenuating circumstances, the student must contact the instructor as soon as possible.  Students may be requested to provide documentation to support their absence and/or may be referred to the Student Support Team for assistance. Students will be provided reasonable accommodations for missed exams, assignments or projects due to significant illness, tragedy or other personal emergencies. In the instance of missed lectures or labs, the student is responsible for seeking notes from classmates and reviewing posted slides and lecture video.  Please note, all students must follow Stony Brook, local, state and Centers for Disease Control and Prevention (CDC) guidelines to reduce the risk of transmission of COVID. For questions or more information click here.


Schedule and Topics

| Week Beginning | Topics | Reading Assignment | Assignments |
|---|---|---|---|
| I. Words and Documents (Feature-based NLP) | | | |
| 2/1 | Introduction to NLP; Regular Expressions; Tokenization; Probability | ambiguous phrases; SLP 2.0 - 2.4 | |
| 2/8 | Lexica; Supervised Classification: Logistic Regression; Sentiment | SLP 4.0; 5.0 - 5.4 | Quizzes Start; A1 Released |
| 2/15 | Evaluating Classifiers; Regularization | SLP 4.7 - 4.9; 5.5 - 5.6 | |
| 2/22 | Word Sense Disambiguation; Vector Models (Embeddings); Dimensionality Reduction-based Embeddings | SLP 18.0 - 18.5; 6.0 - 6.3; Historical Notes | A1 Due |
| 3/1 | Word2vec; Topic Models | SLP 6.8 - 6.10 | A2 Released |
| 3/8 | Differential Language Analysis; NLP in the World | SLP 20.5 - 20.7 | |
| 3/15 | Human-Centered NLP; Bias and Ethics in NLP | Lynn_etAl; HovySpruit; SLP 6.11; Shah etAl | A2 Due |
| II. Sequences (Deep Learning for NLP) | | | |
| 3/22 | Syntactic Tasks: POS Tagging, Dependency Parsing; Introduction to Language Modeling | SLP 8.0 - 8.2; 14.1; 14.4 | A3 Released |
| 3/29 | Language Modeling; Introduction to Neural Networks | SLP 3.0 - 3.4; 7.0 - 7.1 | |
| 4/5 | Recurrent Neural Networks (RNNs); LSTM-based RNNs | SLP 7.2 - 7.5; 9.0 - 9.2 | A3 Due |
| 4/12 | GRU-based RNNs; Document and Sequence Classification: Named Entities; Attention Mechanism | SLP 8.3; 9.3 - 9.6 | A3 Due |
| 4/19 | Transformer-based Language Models | HuggingFace; MaskedLM; text classification | A4 Released |
| 4/26 | Transfer Learning and Fine-Tuning; NLP in the World | BERT Collab; Seq2Seq | |
| 5/3 | TBD: NLP Applications -- Translation, Speech Recognition, Question Answering, ChatBots, Information Retrieval | WMTTask; SLP 26.0 - 26.3 | A4 Due |
| Tue, 5/18 | The final exam will be during SBU's scheduled final exam period: Tuesday, 5/18, from 2:15 - 5:00pm | | Exam |

This schedule is subject to change with advance notice.

*Updated, 3/15/2021: Dependency parsing and POS tagging moved to week 3/22