1 of 44

RAU NLP

WEEK 2

2 of 44

Course Info + Contact

rau-nlp.github.io

goo.gl/74SKKF

goo.gl/BTEoBn

goo.gl/wBhydG #nlp

3 of 44

Who is missing?

Who did not join the mailing list, Telegram group or Slack channel?

4 of 44

Unix machines?

Linux or Mac

5 of 44

Python experience?

6 of 44

ML for NLP?

char-level RNN, word2vec, seq2seq

7 of 44

Other feedback?

How can we improve the lecture, website or communication?

8 of 44

HOMEWORK

9 of 44

norvig.com/spell-correct.html

10 of 44

How does it work?

Is it rules-based or is it learning?
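To ground the discussion, here is a condensed sketch in the spirit of the linked essay. It is not Norvig's full program: it only looks one edit away, and the word counts in WORDS are made up for illustration (the essay derives its counts from a large text file). Candidate generation follows fixed rules, while the ranking depends on counts taken from data.

```python
from collections import Counter

# Toy word counts, invented for this example; a real run would count
# words in a large corpus.
WORDS = Counter({"spelling": 5, "spewing": 1, "the": 50, "correct": 3})


def edits1(word):
    """Return every string that is one edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the word counts."""
    return {w for w in words if w in WORDS}


def correction(word):
    """Pick the most frequent known candidate, or give the word back."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])


print(correction("speling"))
print(correction("teh"))
```

Changing the counts in WORDS changes which candidate wins, which is one way to think about the rules-versus-learning question.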

11 of 44

How to improve it?

What are the pluses and minuses of Norvig’s approach?

12 of 44

Who is Norvig?

What is his contribution to Google? To natural language processing?

13 of 44

Is language structured?

14 of 44

THE STRUCTURE OF LANGUAGE

Raw: text

Sequences of chars, sequences of tokens, ...

Lexical: words

Language models, Zipfian distribution, stems and lemmata, n-grams...

Syntactic: phrases

Part-of-speech tags, syntax trees...
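The raw and lexical levels above can be sketched in a few lines of plain Python. The sentence, the whitespace tokenizer and the helper name ngrams are assumptions made for illustration; real tokenizers handle punctuation and casing, and a Zipfian shape only becomes visible on a much larger corpus.

```python
from collections import Counter

# Made-up example text.
text = "the cat sat on the mat and the dog sat on the log"

# Raw level: the same text as a sequence of characters or of tokens.
chars = list(text)
tokens = text.split()  # naive whitespace tokenization


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Lexical level: bigrams and word frequencies.
bigrams = ngrams(tokens, 2)
counts = Counter(tokens)

# Rank words by frequency; on a large corpus, frequency falls off
# roughly in proportion to 1 / rank (the Zipfian distribution).
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```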

15 of 44

THE STRUCTURE OF LANGUAGE

Semantic: meaning

Sentiment, intents, ontologies...

Discourse: ...

???

16 of 44

How can we represent language numerically?

17 of 44

REPRESENTATIONS

Audio

Spectrogram

[dense]

Images

Pixels

[dense]

Text

Word/sent/doc vec

[sparse]

[0, 0, 0.2, 0, 0, 0, 0.4, 0, 0, 0, 0.1, 0, ...]

If we use one-hot encoding to make every word a class...
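A minimal sketch of one-hot encoding, assuming a toy vocabulary built from a single made-up sentence. The names vocab, index, one_hot and bag are chosen for this example. With a realistic vocabulary of tens of thousands of words, each vector would still contain a single 1, which is where the sparsity comes from.

```python
# Made-up sentence; the vocabulary is just its distinct words.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


# Summing the one-hot vectors gives a bag-of-words document vector.
bag = [0] * len(vocab)
for token in tokens:
    for i, value in enumerate(one_hot(token)):
        bag[i] += value

print(vocab)
print(one_hot("sat"))
print(bag)
```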

18 of 44

REPRESENTATIONS

Audio

Spectrogram

Images

Pixels

Text

Word/sent/doc vec

19 of 44

Why not?

20 of 44

How is English different from other languages?

21 of 44

Is English fundamentally easier?

22 of 44

break

23 of 44

WHAT ARE EXAMPLES OF NLP PROBLEMS?

24 of 44

NLP Problems

Building blocks

Applications

25 of 44

How are we progressing?

26 of 44

Industry

Datasets

Libs + APIs

Funding + Respect

27 of 44

aiindex.org/2017-report.pdf

AI Index

39 of 44

Opportunity

Can we have an impact in NLP?

40 of 44

Which areas have high barriers to entry?

ML + 3 languages + 3 scripts

41 of 44

RESEARCH HORIZON

Sentence vectors

Syntactic and semantic, not just averaging word vecs

Context resolution

Co-reference resolution across sentences

Mixed modes

Text, images and reasoning for tasks
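To see why averaging word vectors falls short as a sentence representation, consider the sketch below. The two-dimensional vectors are invented for illustration and do not come from any trained model. Averaging ignores word order, so two sentences with opposite meanings receive the same vector.

```python
# Made-up 2-dimensional word vectors, for illustration only.
vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def average(tokens):
    """Average the word vectors of a tokenized sentence."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]


a = average("dog bites man".split())
b = average("man bites dog".split())

# Same words, different order, identical sentence vector.
print(a == b)
```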

42 of 44

HOMEWORK 2

43 of 44

Break a parser

1. Install spaCy

Install spaCy, the Python library, with pip (the package name is lowercase: pip install spacy), including the data file for English or the language of your choice

2. Break it

Find an example where the parse is incorrect

3. Send it

Send the string and a screenshot to the email list, and explain what went wrong.
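One possible way to inspect a parse from Python is sketched below. The model name en_core_web_sm and the example sentence are assumptions; substitute the data file you installed. The helper parse_rows is a name made up for this example and only reads standard token attributes (text, pos_, dep_, head).

```python
def parse_rows(doc):
    """Return (token, part of speech, dependency label, head) per token."""
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in doc]


if __name__ == "__main__":
    try:
        import spacy
        # Assumed model name; use whichever data file you installed.
        nlp = spacy.load("en_core_web_sm")
    except (ImportError, OSError):
        print("Install spaCy and a language data file first.")
    else:
        # A classic attachment ambiguity: who has the telescope?
        doc = nlp("I saw the man with the telescope.")
        for row in parse_rows(doc):
            print(row)
```

Reading the head column shows which word each token attaches to, which is usually the quickest way to spot an incorrect parse.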

44 of 44

OFFICE HOURS

SATURDAY @ ISTC