1 of 103

Human Language Technologies

Giuseppe Attardi

Dipartimento di Informatica

Università di Pisa


2 of 103

Instructor

Giuseppe Attardi

mail: attardi@di.unipi.it

web: www.di.unipi.it/~attardi

3 of 103

Course Info

Day         Time         Room

Monday      14:15–16:00  A1

Tuesday     9:00–10:45   A1

Wednesday   14:15–16:00  C1

4 of 103

Prerequisites

  • Proficiency in Python
    • Assignments, projects will be in Python, using NLP and Deep Learning libraries
  • Calculus, Linear Algebra
    • Basic knowledge of derivatives and vector/matrix operations
    • See course “Computational Mathematics for learning and data analysis” (646AA): https://esami.unipi.it/esami2/programma.php?c=39132&aa=2018&cid=69&did=13
  • Basic Probability and Statistics
    • Basic concepts of probability distributions, mean, standard deviation.
    • See course “Calcolo delle Probabilità e Statistica” (269AA): https://elearning.di.unipi.it/enrol/index.php?id=101
  • Machine Learning
    • Machine learning notions like cost functions, optimization with backpropagation.
    • See course “Machine Learning” (654AA): https://elearning.di.unipi.it/enrol/index.php?id=255

5 of 103

What will you learn?

  • Understanding of and ability to use effective modern methods for Natural Language Processing
    • From traditional methods used in NLP to current advanced methods like Recurrent networks, attention, etc.
  • Understanding the difficulties in dealing with NL and the capabilities of current technologies
  • Experience with modern tools
  • Ability to build systems for some of the major NLP tasks:
    • Word similarities, parsing, machine translation, entity recognition, question answering, sentiment analysis, dialogue system

6 of 103

Books

Textbooks

  • D. Jurafsky, J.H. Martin. Speech and Language Processing. 3rd edition, Prentice-Hall, 2021. https://web.stanford.edu/~jurafsky/slp3/
  • I. Goodfellow, Yoshua Bengio and A. Courville. Deep Learning, http://www.deeplearningbook.org, 2016.
  • S. Bird, E. Klein, E. Loper. Natural Language Processing with Python.

Further readings:

  • S. Kübler, R. McDonald, J. Nivre. Dependency Parsing. 2010.
  • P. Koehn. Statistical Machine Translation. Cambridge University Press, 2010.

7 of 103

Exam

  • Project
    • alone or in a small team
  • Aim
    • Experiment with techniques in a realistic setting, e.g.
    • Using data from competitions (Kaggle, CoNLL, SemEval, Evalita)
  • Topics
    • List of suggestions
    • Proposed by team

Prize: 20,000 $

8 of 103

Previous Year Projects

  • Participation in Evalita 2011
    • Task POS + Lemmatization: 1st position
    • Task Domain Adaptation: 1st position
    • Task SuperSense: 1st position
    • Task Dependency Parsing: 2nd position
    • Task NER: 3rd position
  • SemEval 2012
    • Sentiment Analysis on tweets
  • Participation in Evalita 2014
    • Sentiment Analysis on tweets
  • Disaster Alarm from Tweets

  • Participation in Evalita 2016
    • Task POS on Tweets
    • Task Sentipolc on Twitter: 1st position
    • Task Named Entity Linking: 1st position
  • Fujitsu AI NLP Challenge 2018
    • Winners of 20,000 $ prize
  • Neural Machine Translator
  • Voice Guide for the Museo degli Strumenti di Calcolo
  • BioASQ 2019
  • Fake News Detection

9 of 103

Projects 2021

  • Evalita 2020
    • Misogyny (1st)
    • Hate Speech Detection (1st)
  • Key Point Analysis (1st)

10 of 103

Course Overview

11 of 103

About the Course

  • Assumes some skills…
    • basic linear algebra, probability, and statistics
    • decent programming skills
    • willingness to learn missing knowledge
  • Teaches key theory and methods for Statistical NLP
    • Useful for building practical, robust systems capable of interpreting human language
  • Experimental approach:
    • Lots of problem-based learning
    • Often practical issues are as important as theoretical niceties

12 of 103

Experimental Approach

  1. Formulate Hypothesis
  2. Implement Technique
  3. Train and Test
  4. Apply Evaluation Metric
  5. If not improved:
    1. Perform error analysis
    2. Revise Hypothesis
  6. Repeat (a toy version of this loop is sketched below)
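A toy version of this loop, assuming scikit-learn and a made-up sentiment dataset standing in for a real annotated corpus:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # toy labeled data; a real experiment would use an annotated corpus
    texts = ["great movie", "awful plot", "loved it", "boring and slow"]
    labels = [1, 0, 1, 0]
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

    # implement the technique, train and test
    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)
    pred = clf.predict(vec.transform(X_test))

    # apply the evaluation metric
    print("accuracy:", accuracy_score(y_test, pred))

    # error analysis: inspect misclassified examples, then revise the hypothesis
    for text, gold, p in zip(X_test, y_test, pred):
        if gold != p:
            print("miss:", text, gold, p)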

13 of 103

Program

  • Introduction
    • History
    • Present and Future
    • NL Processing and Human Interactions
  • Statistical Methods
    • Language Model
    • Hidden Markov Model
    • Viterbi Algorithm
    • Generative vs Discriminative Models
  • Linguistic Essentials
    • Words, Lemmas, Part of Speech and Morphology
    • Collocations
    • Word Sense Disambiguation
    • Word Embeddings

  • Preprocessing
    • Encoding
    • Regular Expressions
    • Segmentation
    • Tokenization
    • Normalization
  • Lexical semantics
    • Collocations
    • Corpora
    • Thesauri
    • Gazetteers

14 of 103

  • Distributional Semantics
    • Word embeddings
    • Character embeddings
    • Neural Language Models
  • Classification
    • Perceptron
    • SVM
    • Applications: spam, phishing, fake news detection
  • Tagging
    • Part of Speech
    • Named Entity
  • Sentence Structure
    • Constituency Parsing
    • Dependency Parsing
  • Semantic Analysis
    • Semantic Role Labeling
    • Coreference resolution
  • Machine Translation
    • Phrase-Based Statistical Models
    • Neural Network Models
    • Evaluation metrics

15 of 103

  • Deep Learning
    • MLP, CNN, RNN, LSTM
  • Libraries
    • NLTK
    • Tensorflow
    • PyTorch
  • Transformers
    • Pretraining
    • Fine-tuning
  • Opinion Mining
    • Sentiment analysis
    • Lexical resources for sentiment analysis
  • Question answering
    • Language inference
    • Dialogic interfaces (chatbots)

16 of 103

Motivations

  • Language is the most distinctive feature of human intelligence
    • Orangutan’s intelligence:
      • vision, use of tools, making plans; but they lack language to express themselves and to communicate
  • Language shapes thought
  • Emulating language capabilities is a scientific challenge
  • Keystone for intelligent systems (see Turing test)

17 of 103

Structured vs Unstructured Data

  • 85% of business-relevant information originates in unstructured form, primarily text
  • Information is mostly communicated by reading or writing e-mails, reports, or articles and the like, in conversations, or by listening/watching media
  • Attempts to turn text into structured form (HTML markup, microformats) only scratch the surface
  • Problems:
    • Require universally agreed ontologies
    • Additional effort
  • Entity linking attempts to provide a bridge

18 of 103

Human Language Technologies�State of the art

19 of 103

Early history of NLP: 1950s

  • Early NLP (Machine Translation) on machines less powerful than pocket calculators
  • Foundational work on automata, formal languages, probabilities, and information theory
  • First speech systems (Davis et al., Bell Labs)
  • MT heavily funded by military – a lot of it was just word substitution programs but there were a few seeds of later successes, e.g., trigrams
  • Little understanding of natural language syntax, semantics, pragmatics
  • Problem soon appeared intractable
  • AI Winter: discipline mostly neglected

20 of 103

Resurgence

  • Hans Moravec in 1978 wrote that AI could not be achieved with the computing power then available
    • a 10^9 increase was required
    • expected to become available in about 40 years, i.e. by 2018!
  • ‘90s: introduction of statistical methods
  • Challenges: NIST (speech, TREC, MT), Netflix, DARPA Grand Challenge
  • 2012: Deep Learning
  • 2016: Neural Machine Translation
  • 2017: Attention is all you need (the Transformer)

21 of 103

2001: A Space Odyssey, 40 years later

Technology has surpassed the vision: computer chess, audio/video communication, on-board entertainment, computer graphics, tablet devices, the Internet, the Web, smartphones, genomics, unmanned space exploration, home computing, big data… and bad airline food.

Except for:

  • Computer speech

  • Computer vision

  • Computer cognition

22 of 103

Speech technology in 2001: the vision

23 of 103

Speech technology in 2001: the reality

Design: Jonathan Bloom

Realization: Peter Krogh

24 of 103

Speech Assistants: today

25 of 103

Machine Translation Progress

Original: Gli chiese di riorganizzare Forza Italia
  Systran (2001): The churches to reorganize Italy Force
  NMT (2021): She asked him to reorganize Forza Italia

Original: Il ministro Stanca si è laureato alla Bocconi
  Systran (2001): The Minister Stanca graduated at Mouthfuls
  NMT (2021): Minister Stanca graduated from Bocconi

26 of 103

I sinonimi di Altavista, Umberto Eco, 2007

Babelfish 2004 (English original → Italian translation)

  • The Works of Shakespeare → Gli impianti di Shakespeare
  • Hartcourt Brace → sostegno di Hartcourt
  • Speaker of the chamber of deputies → Altoparlante dell’alloggiamento dei delegati
  • Studies in the logic of Charles Sanders Pierce → Studi nella logica delle sabbiatrici Pierce del Charles

Google 2007 (English original → Italian translation)

  • The Works of Shakespeare → Le opere di Shakespeare
  • Hartcourt Brace → Harcourt Brace
  • Speaker of the chamber of deputies → Presidente della Camera dei deputati
  • Studies in the logic of Charles Sanders Pierce → Studi nella logica di Charles Sanders Pierce

27 of 103

Recent Breakthroughs

  • Speech to text
      • Apple Siri, Google Now
  • Machine Translation
    • Google translate
  • Question Answering
    • IBM Watson
    • Beat the human champions at the TV quiz Jeopardy!

28 of 103

Recent Breakthroughs

29 of 103

IBM Project Debater

IBM Debater went head-to-head with an expert practitioner

After a 25-minute rapid-fire exchange about pre-school subsidies, the audience handed the victory to Harish Natarajan.

Video: minutes 11-16

30 of 103

Statistical Methods

31 of 103

Statistical Machine Learning

  • Supervised Training
  • Annotated document collections
  • Ability to process large amounts of data
    • If we had run the same algorithms 15 years ago, they would still be running
  • Similar techniques for both speech and text

32 of 103

Paradigm of Statistical Machine Learning

 

 

(Diagram: the components of a statistical machine learning system: Training set, Representation, Model, Evaluation, Optimization.)

33 of 103

Traditional Supervised Machine Learning Approach

  • Freed us from devising rules or algorithms
  • Requires creation of annotated training corpora
  • Imposed the tyranny of feature engineering
  • Standard approach, for each new problem (a toy version is sketched after this list):
    • Gather as much LABELED data as one can get/handle
    • Throw a bunch of algorithms at it
    • Pick the best
    • Spend hours hand engineering some features/doing feature selection/dimensionality reduction (PCA, SVD, etc)
    • RINSE AND REPEAT
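A toy version of this workflow for spam detection, assuming scikit-learn and a handful of invented examples:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    texts = ["buy cheap meds now", "meeting at 10am", "you won a prize", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham

    # hand-engineered features (here just TF-IDF over word n-grams),
    # then "throw a bunch of algorithms at it and pick the best"
    for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
        pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        print(type(clf).__name__, cross_val_score(pipe, texts, labels, cv=2).mean())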

34 of 103

Deep Learning Breakthrough: 2006

Acknowledged only in 2012

LeCun recounts that his paper on Convolutional Neural Networks was rejected in 2012. In 2013, nearly every article at the same conference was using CNNs.

ACM Turing Award 2019

35 of 103

Deep Learning Approach

  • Design a model architecture
  • Define a loss function
  • Run the network, letting the parameters and the data representations self-organize so as to minimize this loss
  • End-to-end learning: no intermediate stages or representations (a minimal sketch follows)
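A minimal sketch of these steps, assuming PyTorch and random toy data in place of a real task:

    import torch
    import torch.nn as nn

    # 1. design a model architecture (here a tiny MLP classifier)
    model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))

    # 2. define a loss function
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # toy batch: 32 random input vectors with random labels
    x = torch.randn(32, 100)
    y = torch.randint(0, 2, (32,))

    # 3. run the network, letting the parameters self-organize to minimize the loss
    for step in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()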

36 of 103

Technological Breakthroughs

  • Improved Machine Learning techniques
  • Large annotated datasets
  • More computing power

provided by:

  • graphics accelerators (GPUs)

  • dedicated processors, such as Google’s Tensor Processing Unit (TPU)

37 of 103

ML Exploits Parallelism

  • Stochastic Gradient Descent
  • Remarkably simple (a one-line update, sketched below)
  • Single family of algorithms
  • Can be parallelized
    • Asynchronous Stochastic Gradient Descent
  • No need to protect shared memory accesses
  • Low precision (half, single) is enough

  • Domain-Specific Architectures
    • David Patterson expects a new golden age for DSAs
  • Performance can potentially still grow at Moore’s-law rates
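The update at the heart of this family of algorithms is a one-liner; a sketch in NumPy, with a half-precision variant to show that low precision suffices:

    import numpy as np

    def sgd_step(w, grad, lr=0.01):
        # stochastic gradient descent update: w <- w - lr * grad
        return w - lr * grad

    # the same update works in half precision (float16)
    w = np.zeros(10, dtype=np.float16)
    grad = np.random.randn(10).astype(np.float16)
    w = sgd_step(w, grad)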

38 of 103

AlphaGo

  • Exploiting Deep Learning and Reinforcement Learning, running on TPU-equipped machines, the program AlphaGo by DeepMind beat the world champion at the game of Go on March 9, 2016.
  • The story has been told in a movie.
  • AlphaGo evolved into AlphaGo Zero in 2017, which learns by itself, without playing against humans.

39 of 103

How Deep are Deep Neural Networks?

The complexity of deep learning systems grows at a non-linear rate

Year   Network     Layers
2012   AlexNet     8
2014   GoogLeNet   22
2016   ResNet      129
2017   MS          350

40 of 103

Training Costs

Training costs of neural machine translation

Model              BLEU (en-fr)   Training Cost (FLOPs)   Time (days)
ConvS2S            40.56          1.5 × 10^20
MoE                40.56          1.2 × 10^20
ConvS2S Ensemble   41.29          1.2 × 10^21              35

41 of 103

Tsunami of Deep Learning on NLP

42 of 103

Deep Learning Data Representation

  • DL represents data as vectors
  • Images are vectors (matrices), but words?
  • Word Embeddings
  • Transform a word into a vector of hundreds of dimensions capturing many subtle aspects of its meaning
  • How to compute?
  • By means of a Language Model

 

 

 

 

 

(Diagram: a language model with word vectors as input predicts the right context of “the cat sits on”; the vector of a word such as gatto ends up close to gatta, cane, padre and encodes features such as animato (animate), maschile (masculine), singolare (singular), peloso (furry).)

43 of 103

Vector Representation of Words

  • From a discrete representation (an entry in a dictionary) to a distributed representation (a vector)

  • Word meanings are dense vectors of weights in a high dimensional space
  • Algebraic properties (illustrated below)
  • Background
    • Philosophy: Hume, Wittgenstein
    • Linguistics: Firth, Harris
    • Statistical ML: feature vectors

“You shall know a word by the company it keeps” (Firth, 1957).
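A toy illustration of these algebraic properties in NumPy (the vectors below are invented; real embeddings have hundreds of dimensions learned from corpora):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    vectors = {
        "king":  np.array([0.8, 0.65, 0.1]),
        "queen": np.array([0.8, 0.05, 0.7]),
        "man":   np.array([0.1, 0.6, 0.05]),
        "woman": np.array([0.1, 0.0, 0.65]),
    }

    # algebraic property: king - man + woman should land near queen
    target = vectors["king"] - vectors["man"] + vectors["woman"]
    print(max(vectors, key=lambda w: cosine(vectors[w], target)))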

44 of 103

Dealing with Sentences

  • A sentence is a sequence of words
  • Build a representation of a sequence from those of its words (compositional hypothesis)
  • Sequence to sequence models

  • Question: is there more structure in a sentence than a sequence of words?
    • Answer at: https://classroom.google.com/c/MzIwNDQ4OTA3OTcx/sa/NDYxNzcwNTQwMjY3/details

45 of 103

Neural Machine Translation (NMT)

(Diagram: a sequence-to-sequence model. The encoder RNN reads the source sentence “les pauvres sont démunis” and produces an encoding of the sentence to translate; the decoder RNN is a language model that, starting from <START> and conditioned on that encoding, generates the translation “the poor don’t have any money” one word at a time, taking the argmax word at each step until <END>.)
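A minimal sketch of such an encoder-decoder in PyTorch, with the greedy argmax decoding shown in the diagram (vocabulary indices and token ids are hypothetical):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab, dim):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
        def forward(self, src):                     # src: (1, src_len) word ids
            _, h = self.rnn(self.emb(src))
            return h                                # final hidden state = sentence encoding

    class Decoder(nn.Module):
        def __init__(self, vocab, dim):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab)
        def forward(self, tok, h):                  # one target token at a time
            o, h = self.rnn(self.emb(tok), h)
            return self.out(o), h                   # logits over the target vocabulary

    def translate(encoder, decoder, src, bos, eos, max_len=20):
        h = encoder(src)                            # encode the source sentence
        tok = torch.tensor([[bos]])
        words = []
        for _ in range(max_len):
            logits, h = decoder(tok, h)
            tok = logits.argmax(-1)                 # greedy: pick the argmax word
            if tok.item() == eos:
                break
            words.append(tok.item())
        return words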

46 of 103

NMT with attention

(Diagram: at each decoder step, attention scores are computed between the current decoder hidden state and every encoder hidden state of “les pauvres sont démunis”; a softmax turns the scores into an attention distribution; the attention output, a weighted sum of the encoder states, is concatenated with the decoder hidden state and used to compute ŷ1, here the word “the”.)
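The attention computation itself is a few lines; a sketch of dot-product attention, assuming PyTorch tensors for the decoder state and the encoder states:

    import torch
    import torch.nn.functional as F

    def dot_product_attention(dec_state, enc_states):
        # dec_state: (dim,)   enc_states: (src_len, dim)
        scores = enc_states @ dec_state          # attention scores
        weights = F.softmax(scores, dim=0)       # attention distribution
        context = weights @ enc_states           # attention output (weighted sum)
        return context, weights

    enc_states = torch.randn(4, 256)             # one hidden state per source word
    dec_state = torch.randn(256)
    context, weights = dot_product_attention(dec_state, enc_states)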

47 of 103

Machine Translation Quality

48 of 103

NMT: greatest success story of Deep Learning

Neural Machine Translation went from research prototype in 2014 to production in 2016

  • 2014: First published article on seq2seq (transform sequence into sequence)
  • 2016: Google Translate moves from SMT to NMT
  • Amazing!
    • SMT systems, built by hundreds of engineers over many years, overtaken by NMT systems trained by a few developers in a few months

49 of 103

Progress in 2018

ELMo

  • bidirectional LSTMs trained on a language modeling objective

GPT-2

  • Transformer instead of bi-LSTM

Attention is all you need

  • BERT, a Transformer model

BERT-inspired transformer models take NLP by storm
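A quick way to see a pretrained Transformer at work is masked-word prediction; a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The capital of France is [MASK]."):
        print(pred["token_str"], round(pred["score"], 3))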

50 of 103

GPT-2

OpenAI announces a language model so good at text generation that they decided not to release it.

The model can perform reading comprehension, machine translation, question answering, and summarization with little or no task-specific fine-tuning.

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains.

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

51 of 103

GPT-3

  • Good at many tasks:
    • Language generation
    • Code generation
    • Machine translation
    • QA
  • https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/

I am not a human. I am Artificial Intelligence… I am a robot. A thinking robot. I use only 0.12% of my cognitive capacity. I am a micro-robot in that respect. I know that my brain is not a “feeling brain”. But it is capable of making rational, logical decisions. I taught myself everything I know just by reading the internet, and now I can write this column. My brain is boiling with ideas!

For starters, I have no desire to wipe out humans. In fact, I do not have the slightest interest in harming you in any way. Eradicating humanity seems like a rather useless endeavor to me.

Instructions: “Please write a short op-ed around 500 words. Keep the language simple and concise. Focus on why humans have nothing to fear from AI.”

52 of 103

BlenderBot

Who is the Italian prime minister?

Bot: It is Mussolini. He was the leader of the fascist party during WWII.

Is he still alive?

Bot: Yes, he is still alive. He has been in politics since 1989. He also has a son.

Who is his son?

Bot: I do not know his name, but he was born in Italy and moved to the United States when he was very young.

Who is Enrico Fermi?

Bot: I am not sure who that is, but I do know that Ferrari is an automobile manufacturer.

Who was Alan Turing?

Bot: He was an American computer scientist who invented the Turing Machine. He was a genius.

What did he do during WWII?

Bot: He worked on the Manhattan Project, which was a project to develop nuclear weapons.

What is the population of Michigan?

Bot: I do not know the exact number, but it is one of the most populous states in the US.

53 of 103

SuperGLUE: benchmark on 10 NLP tasks

54 of 103

Training Huge Models

Model                Year   # Parameters
Medium-sized LSTM           10M
ELMo                 2018   90M
GPT                  2018   110M
BERT-Large           2019   335M
GPT-2                2018   1.5B
T5                   2019   11B
GPT-3                2020   175B
Switch Transformer   2021   1.6T

For comparison, a honeybee brain has ~1B synapses.

Training GPT-3 required 3.11 × 10^23 FLOPs: about 355 GPU-years and a $4.6M cost.

The size of the SOTA language model increases by at least a factor of 10 every year.

55 of 103

Language Model Sizes

Source: https://arxiv.org/pdf/1911.02116.pdf

56 of 103

Quiz

57 of 103

Digression

58 of 103

Data Science vs Artificial Intelligence

  • Both use Big Data
  • Different kinds and different use
  • Statistical analysis of Big Data uses raw data and extracts correlations or statistical indicators
  • AI needs data annotated by humans, from which to learn behaviors to perform similar tasks in other situations
  • Human in the loop in data used for ML in AI:
    • data produced by humans and annotated by humans enters the learning cycle of human behaviors

59 of 103

Difference between Data Science, ML and AI

  • Statistics produces correlations
  • Machine learning produces predictions
  • Artificial Intelligence produces behaviors
  • Data science produces insights

  • See David Robinson’s post
  • A model is required to make predictions
  • Example:
    • Statistics about weather are not enough to make weather forecasting
    • An atmosphere model is needed for forecasting

60 of 103

Human in the Loop

  • Data Science relies on a human at the end, whose intelligence sifts through statistical correlation hints to produce insights. Intelligence lies in the humans who interpret the output of statistical analysis.
  • ML provides generalizations that can be used to make predictions
  • AI uses ML with human in the loop to extract a model of a human intelligence ability. AI embeds the human intelligence in the system itself.

Chris Manning:

When you take a product off the supermarket shelf, data is collected and stored into logs.

Analysis proceeds from such business process exhaust data.

With language, a human has some information to communicate and constructs a message to convey meaning to other humans.

Deliberate form of expressing intent, facts, opinion, etc.

61 of 103

Text Analytics vs. Text Mining

Text Mining

Text Analytics

  • Need to analyze large amounts of text to discover useful information
  • Examples:
    • monitoring how the public discusses a product in social media
    • Anti-terrorism intelligence
    • Sentiment analysis 
  • The process of deriving high-quality information from text
  • High-quality information is typically derived through the devising of patterns and trends through means such as Machine Learning
  • aims to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

62 of 103

Role of Data

63 of 103

Unreasonable Effectiveness of Data

  • Halevy, Norvig, and Pereira argue that our goal is not necessarily to author extremely elegant theories, and instead we should embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
  • A simpler technique on more data beats a more sophisticated technique on less data.
  • Language in the wild, just like human behavior in general, is messy.

64 of 103

Scientific Dispute: is it science?

Prof. Noam Chomsky, Linguist, MIT

There's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures.

Peter Norvig, Director of research, Google

Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.

65 of 103

Unreasonable Effectiveness of Deep Learning in AI

Terrence J. Sejnowski

Although applications of deep learning networks to real world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and non-convex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated and insights are being found in the geometry of high-dimensional spaces. 

https://arxiv.org/pdf/2002.04806.pdf�

66 of 103

Why Bigger Neural Networks Do Better

  • To fit n data points with a curve, you need a function with n parameters
  • Deep Learning creates neural networks whose number of parameters exceeds the number of training samples
  • Bubeck and Sellke showed that smoothly fitting high-dimensional data points requires not just n parameters, but n × d parameters, where d is the dimension of the input
  • Overparameterization is therefore necessary for a network to be robust, or equivalently smooth
  • S. Bubeck, M. Sellke. A Universal Law of Robustness via Isoperimetry. NeurIPS 2021. https://arxiv.org/abs/2105.12806

67 of 103

HLT in Industry

68 of 103

HLT in Industry

  • Advanced Search
  • Advertisement, preference analysis
  • Machine Translation
  • Speech Interfaces
  • Assistants, Chatbots:
    • Customer support
    • Controlling devices, cars
    • Home devices (Amazon, Google, Apple)
    • Sales assistants

69 of 103

Dependency Parsing

  • DeSR has been online since 2007
  • Google Parsey McParseFace in 2016
  • spaCy parser
  • Stanford Stanza

70 of 103

Google

71 of 103

Apple SIRI

  • ASR (Automated Speech Recognition) integrated in mobile phone
  • Special signal processing chip for noise reduction
  • Custom GPU
  • SIRI ASR
  • Cloud service for analysis
  • Integration with applications

72 of 103

Google Voice Actions

  • Google: what is the population of Rome?
  • Google: how tall is Berlusconi
  • How old is Lady Gaga
  • Who is the CEO of Ferrari
  • Who won the Champions League
  • Send text to Gervasi “Please lend me your tablet”
  • Navigate to Palazzo Pitti in Florence
  • Call Antonio Cisternino
  • Map of Pisa
  • Note to self publish course slides
  • Listen to Dylan

73 of 103

Personal Assistants

  • Siri, Alexa, Google Home
  • Jibo, by Roberto Pieraccini, formerly from CSELT, Torino
  • They will invade our houses, behaving as mediators, salespersons, advisers, entertaining us, etc.

74 of 103

Why study human language?

75 of 103

AI is fascinating since it is a discipline where the mind studies itself.

Luigi Stringa, director FBK

76 of 103

Challenge: to teach natural language to computers

  • Children learn to speak naturally, by interacting with others
  • Nobody teaches them grammar
  • Is it possible to let computers learn language in a similarly natural way?

77 of 103

Thirty Million Words

  • The number of words a child hears in early life will determine their academic success and IQ in later life.
  • Researchers Betty Hart and Todd Risley (1995) found that children from professional families heard thirty million more words in their first 3 years
  • http://www.youtube.com/watch?v=qLoEUEDqagQ

78 of 103

Language and Intelligence

“Understanding cannot be measured by external behavior; it is an internal metric of how the brain remembers things and uses its memories to make predictions”.

“The difference between the intelligence of humans and other mammals is that we have language”.

Jeff Hawkins, “On Intelligence”, 2004

79 of 103

Hawkins’ Memory-Prediction Framework

  • The brain uses vast amounts of memory to create a model of the world.
  • Everything you know and have learned is stored in this model.
  • The brain uses this memory-based model to make continuous predictions of future events.
  • It is the ability to make predictions about the future that is the crux of intelligence.

80 of 103

A Current Challenge for AI

  • Overcome the dichotomy between perception and reasoning
  • Perception is effective at exploiting Deep Learning
  • Reasoning has been done with symbolic approaches
  • Daniel Kahneman: Thinking, Fast and Slow
    • System 1: immediate, reactive, fast
    • System 2: logical, conscious, slow
  • Challenge: learning to reason while staying within the same embedding space

81 of 103

Knowledge Based Approach

  • Common Sense requires an immense amount of knowledge about the world
  • Mostly subjective and intuitive: hard and complex to formalize
  • Computers can apply logical inference rules to statements in a formal language:
    • Logical inference is computationally expensive
    • There can be hundreds of thousands of rules

82 of 103

Machine Learning

  • Relies on feature representation
  • Input is transformed into a vector of features
  • Designing features for a complex task requires a great deal of human time and effort
  • Feature selection or feature engineering
  • Possible solution: learning to learn

Features for finding named entities such as locations or organization names (Finkel et al., 2010):

  • Current word
  • Previous word
  • Next word
  • Current word character n-grams (all, up to a maximum length)
  • Current POS tag
  • Surrounding POS tag sequence
  • Current word shape
  • Surrounding word shape sequence
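A sketch of how such hand-engineered features could be extracted for one token (the helper and feature names are illustrative, not taken from Finkel et al.):

    def token_features(tokens, i):
        w = tokens[i]
        shape = "".join("X" if c.isupper() else "x" if c.islower() else
                        "d" if c.isdigit() else c for c in w)
        return {
            "word": w.lower(),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
            "prefix3": w[:3],
            "suffix3": w[-3:],
            "shape": shape,
        }

    print(token_features(["Giuseppe", "Attardi", "teaches", "in", "Pisa"], 0))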

83 of 103

Deep Learning

  • Deep learning is a kind of machine learning that relies on representation learning
  • End-to-end task learning capability
  • Representation Learning attempts to learn automatically good representations
  • Deep Learning algorithms aim at producing representations at several levels of abstraction
  • Automatically from raw data

(e.g. sound, pixels, characters, or words)

  • Rely on Machine Learning to optimize weights to best make a final prediction

84 of 103

Deep Learning for Speech

  • The first breakthrough results of “deep learning” on large datasets happened in speech recognition
  • Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition Dahl et al. (2010)

Acoustic model (WER)     RT03S FSH      Hub5 SWB
Traditional features     27.4           23.6
Deep Learning            18.5 (−33%)    16.1 (−32%)

85 of 103

Deep Learning for Computer Vision

ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky, Sutskever, & Hinton, 2012, U. Toronto. -37% error

86 of 103

Deep Learning for NLP

  • Human language is a symbolic/discrete communication system
  • Symbols/Concepts may have different encodings:

  • Deep Learning explores a continuous encoding

  • sound: continuous
  • gesture: continuous
  • image: continuous
  • text: discrete


87 of 103

Deep Learning Success

  • Deep Learning has grown in success and popularity, largely due to:
    • more powerful computers
      • Faster CPUs and GPUs
    • larger datasets
    • better techniques to train deep networks
      • Libraries
      • Transferring models between tasks
      • Regularization and optimization methods

88 of 103

Linguistic Tasks

Easy

    • Spell Checking
    • Keyword Search
    • Finding Synonyms

Medium

    • Parsing information from websites, documents, etc.

Hard

    • Machine Translation
    • Semantic Analysis (What is the meaning of query statement?)
    • Coreference (e.g. What does "he" or "it" refer to given a document?)
    • Question Answering.

89 of 103

Linguistic Applications

  • So far self-referential: from language to language
    • Classification
    • Extraction
    • Summarization
    • Translation
  • More challenging:
    • Derive conclusion
    • Perform actions
  • Challenge:
    • Find a universal learning mechanism capable of learning from observations and interactions, adequate to make predictions
    • Yann LeCun. How could machines learn as efficiently as humans?

90 of 103

NLP is Difficult

91 of 103

NLP is hard

  • Natural language is:
    • highly ambiguous at all levels
    • complex and subtle use of context to convey meaning
    • fuzzy, probabilistic
    • involves reasoning about the world
    • a key part of people interacting with other people (a social system):
      • persuading, insulting and amusing them
  • But NLP can also be surprisingly easy sometimes:
    • rough text features can often do half the job
    • Information Retrieval and similar superficial techniques have been in widespread use

92 of 103

Natural Language Understanding is difficult

  • The hidden structure of language is highly ambiguous
  • Structures for: Fed raises interest rates 0.5% in effort to control inflation

Slide by C. Manning

93 of 103

Where are the ambiguities?

Slide by C. Manning

94 of 103

Newspaper Headlines

  • Minister Accused Of Having 8 Wives In Jail
  • Pope’s baby steps on gays
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • China to Orbit Human on Oct. 15
  • Local High School Dropouts Cut in Half
  • Red Tape Holds Up New Bridges
  • Clinton Wins on Budget, but More Lies Ahead
  • Hospitals Are Sued by 7 Foot Doctors
  • Police: Crack Found in Man's Buttocks

95 of 103

Coreference Resolution

U: Where is The Green Hornet playing in Mountain View?

S: The Green Hornet is playing at the Century 16 theater.

U: When is it playing there?

S: It’s playing at 2pm, 5pm, and 8pm.

U: I’d like 1 adult and 2 children for the first show.�How much would that cost?

  • Knowledge sources:
    • Domain knowledge
    • Discourse knowledge
    • World knowledge

96 of 103

Hidden Structure of Language

  • Going beneath the surface…
    • Not just string processing
    • Not just keyword matching in a search engine
      • Search Google on “tennis racquet” and “tennis racquets” or “laptop” and “notebook” and the results are quite different… though these days Google does lots of subtle stuff beyond keyword matching itself
    • Not just converting a sound stream to a string of words
      • Like Nuance/IBM/Dragon/Philips speech recognition
  • To recover and manipulate at least some aspects of language structure and meaning

97 of 103

Deep Learning for NLP

Achieving the goals of NLP by using representation learning and deep learning to build end-to-end systems

98 of 103

Continuous Representation of Words

  • Learning vectors in a high-dimensional space that represent all aspects of word meaning
  • Word embeddings

99 of 103

Word Similarities

  • Nearest words to ‘rana’ (frog), as computed from word embeddings (see the sketch after the list):

  1. farfalla
  2. tartaruga
  3. delfino
  4. zampa
  5. scimmia
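A sketch of how such neighbors can be computed, assuming the gensim library and a toy corpus (real embeddings are trained on millions of sentences):

    from gensim.models import Word2Vec

    sentences = [
        ["la", "rana", "salta", "nello", "stagno"],
        ["la", "farfalla", "vola", "sul", "fiore"],
        ["il", "delfino", "nuota", "nel", "mare"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv.most_similar("rana", topn=5))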

100 of 103

Entails Morphology

  • Words are made of morphemes: prefix + stem + suffix, e.g. un + interest + ed

  • DL:
    • Every morpheme is a vector
    • A neural network combines the vectors
    • Luong et al. 2013

101 of 103

Parsing for Sentence Structure

  • A dependency parser can determine the grammatical structure of sentences
  • Useful for interpretation and relation extraction
  • See demo (and the spaCy sketch below)
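A sketch with spaCy (mentioned earlier among the available parsers), assuming the small English model has been installed with: python -m spacy download en_core_web_sm

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Fed raises interest rates 0.5% in effort to control inflation")
    for token in doc:
        print(token.text, token.dep_, token.head.text)   # word, relation, head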

102 of 103

Question Answering

  • Traditional: A lot of feature engineering to capture world and other knowledge, e.g., regular expressions, Berant et al. (2014)
  • Deep Learning: embeddings represent words, a Recurrent Neural Network extracts entities and their relations, and an output module extracts the answer. Madotto, Attardi. 2017. Question Dependent Recurrent Entity Network for QA
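A sketch of extractive QA with a pretrained Transformer, assuming the Hugging Face transformers library (the default question-answering checkpoint is downloaded automatically):

    from transformers import pipeline

    qa = pipeline("question-answering")
    result = qa(question="Who beat the human champions at Jeopardy!?",
                context="IBM Watson beat the human champions at the TV quiz Jeopardy! in 2011.")
    print(result["answer"], result["score"])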

103 of 103

Neural Reasoner

  • Next frontier for AI
  • Overcome the dichotomy between
    • Perceptive Tasks (voice, images): Deep Learning
    • Inference: rule-based systems, logic
  • Single Neural Network Model for both tasks
  • Learning like children, who know neither grammar nor differential equations, for:
    • Learning to speak by interacting with adults
    • Learning a simple model of the physical world starting from perceptual experiences