1 of 52

Course Overview

Human Language Technologies

Dipartimento di Informatica

Giuseppe Attardi

Università di Pisa

2 of 52

About the Course

  • Assumes some skills…
    • basic linear algebra, probability, and statistics
    • decent programming skills
    • willingness to learn missing knowledge
  • Teaches key theory and methods for Statistical NLP
    • Useful for building practical, robust systems capable of interpreting human language
  • Experimental approach:
    • Lots of problem-based learning
    • Often practical issues are as important as theoretical niceties

3 of 52

Experimental Approach

  1. Formulate Hypothesis
  2. Implement Technique
  3. Train and Test
  4. Apply Evaluation Metric
  5. If not improved:
    1. Perform error analysis
    2. Revise Hypothesis
  6. Repeat
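
A minimal sketch of this loop in Python, using scikit-learn (the dataset and the tf-idf + logistic regression hypothesis are illustrative assumptions):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # 1-2. Hypothesis: tf-idf features + a linear classifier; implement it.
    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)

    # 3. Train and test.
    clf.fit(vec.fit_transform(train.data), train.target)
    pred = clf.predict(vec.transform(test.data))

    # 4. Apply the evaluation metric; if not improved, inspect errors and revise.
    print("accuracy:", accuracy_score(test.target, pred))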

4 of 52

Topics

  1. Machine Learning and Neural Net Fundamentals
    1. Text Classification and ML Fundamentals
    2. Neural Network Basics and Toolkits
    3. Language Modeling and NN Training
  2. Sequence Models
    • Recurrent Networks
    • Sequence Labeling
    • Conditioned Generation
    • Attention
  3. Representation and Pre-training
    • Pre-training Methods
    • Multi-task Learning
    • Interpreting and Debugging NLP Models
  4. NLP Applications
    • Machine Reading QA
    • Dialog
    • Computational Social Science, Bias and Fairness
    • Information Extraction and Knowledge-based QA
  5. Natural Language Analysis
    • Word Segmentation and Morphology
    • Syntactic Parsing
    • Semantic Parsing
    • Discourse Structure and Analysis
  6. Advanced Learning Techniques

5 of 52

Program

  • Introduction
    • History
    • Present and Future
    • NL Processing and Human Interactions
  • Statistical Methods
    • Language Model
    • Hidden Markov Model
    • Viterbi Algorithm
    • Generative vs Discriminative Models
  • Linguistic Essentials
    • Words, Lemmas, Part of Speech and Morphology
    • Collocations
    • Word Sense Disambiguation
    • Word Embeddings

  • Preprocessing
    • Encoding
    • Regular Expressions
    • Segmentation
    • Tokenization
    • Normalization
  • Lexical semantics
    • Collocations
    • Corpora
    • Thesauri
    • Gazetteers

6 of 52

  • Distributional Semantics
    • Word embeddings
    • Character embeddings
    • Neural Language Models
  • Classification
    • Perceptron
    • SVM
    • Applications: spam, phishing, fake news detection
  • Tagging
    • Part of Speech
    • Named Entity
  • Sentence Structure
    • Constituency Parsing
    • Dependency Parsing
  • Semantic Analysis
    • Semantic Role Labeling
    • Coreference resolution
  • Machine Translation
    • Phrase-Based Statistical Models
    • Neural Network Models
    • Evaluation metrics

7 of 52

  • Deep Learning
    • MLP, CNN, RNN, LSTM
  • Libraries
    • NLTK
    • TensorFlow
    • PyTorch
  • Transformers and Attention
    • Pretrained Large Language Models
    • Fine-tuning
    • Prompt tuning
  • Opinion Mining
    • Sentiment analysis
    • Lexical resources for sentiment analysis
  • Question answering
    • Language inference
    • Dialogic interfaces (chatbots)

8 of 52

Digression

9 of 52

Data Science vs Artificial Intelligence

  • Both use Big Data
  • Different kinds and different use
  • Statistical analysis of Big Data uses raw data and extracts correlations or statistical indicators
  • AI needs data annotated by humans, from which it learns behaviors in order to perform similar tasks in new situations
  • Humans are in the loop of the data used for ML in AI:
    • data produced by humans and annotated by humans enters the cycle of learning human behaviors

10 of 52

Difference between Data Science, ML and AI

  • Statistics produces correlations
  • Machine learning produces predictions
  • Artificial Intelligence produces behaviors
  • Data science produces insights

  • See David Robinson’s post
  • A model is required to make predictions
  • Example:
    • Statistics about the weather are not enough for weather forecasting
    • An atmosphere model is needed for forecasting

11 of 52

Human in the Loop

  • Data Science relies on a human at the end of the pipeline, whose intelligence sifts through statistical correlation hints to produce insights. The intelligence lies in the humans who interpret the output of the statistical analysis.
  • ML provides generalizations that can be used to make predictions
  • AI uses ML with a human in the loop to extract a model of a human intelligence ability. AI embeds the human intelligence in the system itself.

Chris Manning:

When you take a product off the supermarket shelf, data is collected and stored in logs.

Analysis proceeds from such business process exhaust data.

With language, a human has some information to communicate and constructs a message to convey meaning to other humans.

It is a deliberate form of expressing intent, facts, opinions, etc.

12 of 52

Text Analytics vs. Text Mining

Text Mining

  • The process of deriving high-quality information from text
  • High-quality information is typically derived by devising patterns and trends through means such as Machine Learning
  • Aims to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods

Text Analytics

  • Need to analyze large amounts of text to discover useful information
  • Examples:
    • monitoring how the public discusses a product in social media
    • anti-terrorism intelligence
    • sentiment analysis

13 of 52

Role of Data

14 of 52

Unreasonable Effectiveness of Data

  • Halevy, Norvig, and Pereira argue that our goal is not necessarily to author extremely elegant theories, and instead we should embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
  • A simpler technique on more data beats a more sophisticated technique on less data.
  • Language in the wild, just like human behavior in general, is messy.

15 of 52

Scientific Dispute: is it science?

Prof. Noam Chomsky, Linguist, MIT

There's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures.

Peter Norvig, Director of research, Google

Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.

16 of 52

Unreasonable Effectiveness of Deep Learning in AI

Terrence J. Sejnowski

Although applications of deep learning networks to real world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and non-convex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated and insights are being found in the geometry of high-dimensional spaces. 

https://arxiv.org/pdf/2002.04806.pdf

17 of 52

Why Bigger Neural Networks Do Better

  • To fit n data points with a curve, you need a function with n parameters
  • Deep Learning creates neural networks that have a number of parameters more than the number of training samples
  • Bubeck and Sellke showed that smoothly fitting high-dimensional data points requires not just n parameters, but n × d parameters, where d is the dimension of the input
  • Overparameterization is necessary for a network to be robust or, equivalently, smooth
  • S. Bubeck, M. Sellke. A Universal Law of Robustness via Isoperimetry. Neurips 2021. https://arxiv.org/abs/2105.12806
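
A rough paraphrase of their law of robustness (a sketch of the statement; see the paper for the precise assumptions): any model f with p parameters that fits n generic data points in dimension d has Lipschitz constant

    \mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}

so achieving a smooth fit with \mathrm{Lip}(f) = O(1) requires p \gtrsim nd parameters.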

18 of 52

HLT in Industry

19 of 52

HLT in Industry

  • Advanced Search
  • Advertisement, preference analysis
  • Machine Translation
  • Speech Interfaces
  • Assistants, Chatbots:
    • Customer support
    • Controlling devices, cars
    • Home devices (Amazon, Google, Apple)
    • Sales assistants

20 of 52

Dependency Parsing

  • DeSR has been online since 2007
  • Google Parsey McParseFace in 2016
  • spaCy parser
  • Stanford Stanza
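
A minimal sketch of dependency parsing with spaCy (assuming spaCy and its small English model en_core_web_sm are installed):

    # Parse a sentence and print each token's syntactic head and relation.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Fed raises interest rates in effort to control inflation.")
    for token in doc:
        # every token depends on a head token, with a labeled relation
        print(token.text, token.dep_, token.head.text)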

21 of 52

Google

22 of 52

Apple SIRI

  • ASR (Automated Speech Recognition) integrated in mobile phone
  • Special signal processing chip for noise reduction
  • Custom GPU
  • SIRI ASR
  • Cloud service for analysis
  • Integration with applications

23 of 52

Google Voice Actions

  • Google: what is the population of Rome?
  • Google: how tall is Berlusconi
  • How old is Lady Gaga
  • Who is the CEO of Ferrari
  • Who won the Champions League
  • Send text to Gervasi “Please lend me your tablet”
  • Navigate to Palazzo Pitti in Florence
  • Call Antonio Cisternino
  • Map of Pisa
  • Note to self publish course slides
  • Listen to Dylan

24 of 52

Personal Assistants

  • Siri, Alexa, Google Home
  • Jibo, by Roberto Pieraccini, formerly from CSELT, Torino
  • They will invade our homes, acting as mediators, salespersons, advisers, entertainers, etc.

25 of 52

Why study human language?

26 of 52

AI is fascinating since it is a discipline where the mind studies itself.

Luigi Stringa, director of FBK

27 of 52

Challenge: to teach natural language to computers

  • Children learn to speak naturally, by interacting with others
  • Nobody teaches them grammar
  • Is it possible to let computers learn language in a similarly natural way?

28 of 52

Thirty Million Words

  • The number of words a child hears in early life will determine their academic success and IQ later in life.
  • Researchers Betty Hart and Todd Risley (1995) found that children from professional families heard thirty million more words in their first 3 years than children from families on welfare
  • http://www.youtube.com/watch?v=qLoEUEDqagQ

29 of 52

Language and Intelligence

“Understanding cannot be measured by external behavior; it is an internal metric of how the brain remembers things and uses its memories to make predictions”.

“The difference between the intelligence of humans and other mammals is that we have language”.

Jeff Hawkins, “On Intelligence”, 2004

30 of 52

Hawkins’ Memory-Prediction Framework

  • The brain uses vast amounts of memory to create a model of the world.
  • Everything you know and have learned is stored in this model.
  • The brain uses this memory-based model to make continuous predictions of future events.
  • It is the ability to make predictions about the future that is the crux of intelligence.

31 of 52

A Current Challenge for AI

  • Overcome the dichotomy between perception and reasoning
  • Perception is effective at exploiting Deep Learning
  • Reasoning has been done with symbolic approaches
  • Daniel Kahneman: Thinking, Fast and Slow
    • System 1: immediate, reactive, fast
    • System 2: logical, conscious, slow
    • Challenge: learning to reason while staying within the same embedding space

32 of 52

Knowledge Based Approach

  • Common Sense requires an immense amount of knowledge about the world
  • Mostly subjective and intuitive: hard and complex to formalize
  • Computers can apply logical inference rules to statements in a formal language:
    • Logical inference is computationally expensive
    • There may be hundreds of thousands of rules

33 of 52

Machine Learning

  • Relies on feature representation
  • Input is transformed into a vector of features
  • Designing features for a complex task requires a great deal of human time and effort
  • Feature selection or feature engineering
  • Possible solution: learning to learn

Features for finding named entities like locations or organization names (Finkel et al., 2010):

  • Current Word
  • Previous Word
  • Next Word
  • Current Word Character n-gram (all, length)
  • Current POS Tag
  • Surrounding POS Tag Sequence
  • Current Word Shape
  • Surrounding Word Shape Sequence
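
A hypothetical sketch of such hand-engineered features in Python (the functions and shape rules are illustrative, not Finkel et al.'s actual code):

    def word_shape(w):
        # map "Pisa" -> "Xxxx", "DNA" -> "XXX", "2024" -> "dddd"
        return "".join("X" if c.isupper() else
                       "x" if c.islower() else
                       "d" if c.isdigit() else c for c in w)

    def ner_features(tokens, pos_tags, i):
        # feature dictionary for the token at position i
        w = tokens[i]
        return {
            "word": w,
            "prev_word": tokens[i - 1] if i > 0 else "<S>",
            "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
            "char_ngrams": [w[j:j + 3] for j in range(len(w) - 2)],
            "length": len(w),
            "pos": pos_tags[i],
            "pos_context": " ".join(pos_tags[max(0, i - 1):i + 2]),
            "shape": word_shape(w),
            "shape_context": " ".join(word_shape(t)
                                      for t in tokens[max(0, i - 1):i + 2]),
        }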

34 of 52

Deep Learning

  • Deep Learning is a kind of machine learning that relies on representation learning
  • It offers end-to-end task learning capability
  • Representation Learning attempts to automatically learn good representations
  • Deep Learning algorithms aim at producing representations at several levels of abstraction, automatically from raw data (e.g. sound, pixels, characters, or words)
  • They rely on Machine Learning to optimize weights to best make a final prediction

35 of 52

Deep Learning for Speech

  • The first breakthrough results of “deep learning” on large datasets happened in speech recognition
  • Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, Dahl et al. (2010)

Acoustic model and word error rate (WER):

  Acoustic model         RT03S FSH    Hub5 SWB
  Traditional features   27.4         23.6
  Deep Learning          18.5 (−33%)  16.1 (−32%)

36 of 52

Deep Learning for Computer Vision

ImageNet Classification with Deep Convolutional Neural Networks, by Krizhevsky, Sutskever & Hinton (2012), U. Toronto: −37% error

37 of 52

Deep Learning for NLP

  • Human language is a symbolic/discrete communication system
  • Symbols/Concepts may have different encodings:

      sound     continuous
      gesture   continuous
      image     continuous
      text      discrete

  • Deep Learning explores a continuous encoding


38 of 52

Deep Learning Success

  • Deep Learning has grown in success and popularity, largely due to:
    • more powerful computers
      • Faster CPUs and GPUs
    • larger datasets
    • better techniques to train deep networks
      • Libraries
      • Transferring models between tasks
      • Regularization and optimization methods

39 of 52

Linguistic Tasks

Easy

    • Spell Checking
    • Keyword Search
    • Finding Synonyms

Medium

    • Parsing information from websites, documents, etc.

Hard

    • Machine Translation
    • Semantic Analysis (what is the meaning of a query statement?)
    • Coreference (e.g. what does "he" or "it" refer to in a given document?)
    • Question Answering

40 of 52

Linguistic Applications

  • So far mostly self-referential: from language to language
    • Classification
    • Extraction
    • Summarization
    • Translation
  • More challenging:
    • Derive conclusion
    • Perform actions
  • Challenge:
    • Find a universal learning mechanism capable of learning from observations and interactions, adequate to make predictions
    • Yann LeCun. How could machines learn as efficiently as humans?

41 of 52

NLP is Difficult

42 of 52

NLP is hard

  • Natural language is:
    • highly ambiguous at all levels
    • dependent on complex and subtle use of context to convey meaning
    • fuzzy and probabilistic
    • reliant on reasoning about the world
    • a key part of people interacting with other people (a social system):
      • persuading, insulting and amusing them
  • But NLP can also be surprisingly easy sometimes:
    • rough text features can often do half the job
    • Information Retrieval and similar superficial techniques have been in widespread use

43 of 52

Natural Language Understanding is difficult

  • The hidden structure of language is highly ambiguous
  • Structures for: Fed raises interest rates 0.5% in effort to control inflation

Slide by C. Manning

44 of 52

Where are the ambiguities?

Slide by C. Manning

45 of 52

Newspaper Headlines

  • Minister Accused Of Having 8 Wives In Jail
  • Pope’s baby steps on gays
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • China to Orbit Human on Oct. 15
  • Local High School Dropouts Cut in Half
  • Red Tape Holds Up New Bridges
  • Clinton Wins on Budget, but More Lies Ahead
  • Hospitals Are Sued by 7 Foot Doctors
  • Police: Crack Found in Man's Buttocks

46 of 52

Coreference Resolution

U: Where is The Green Hornet playing in Mountain View?

S: The Green Hornet is playing at the Century 16 theater.

U: When is it playing there?

S: It’s playing at 2pm, 5pm, and 8pm.

U: I’d like 1 adult and 2 children for the first show. How much would that cost?

  • Knowledge sources:
    • Domain knowledge
    • Discourse knowledge
    • World knowledge

47 of 52

Hidden Structure of Language

  • Going beneath the surface…
    • Not just string processing
    • Not just keyword matching in a search engine
      • Search Google on “tennis racquet” and “tennis racquets” or “laptop” and “notebook” and the results are quite different … though these days Google does lots of subtle stuff beyond keyword matching itself
    • Not just converting a sound stream to a string of words
      • Like Nuance/IBM/Dragon/Philips speech recognition
  • To recover and manipulate at least some aspects of language structure and meaning

48 of 52

Deep Learning for NLP

Achieving the goals of NLP by using representation learning and deep learning to build end-to-end systems

49 of 52

NLP Entails Morphology

  • Words are made of morphemes:

      prefix   stem       suffix
      un       interest   ed

  • DL approach (Luong et al. 2013):
    • every morpheme is a vector
    • a neural network combines the vectors, as in the sketch below
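
A toy sketch in PyTorch of this idea (the vocabulary, dimensions, and combiner network are illustrative assumptions, not Luong et al.'s exact model):

    import torch
    import torch.nn as nn

    morphemes = {"un": 0, "interest": 1, "ed": 2}
    dim = 50
    emb = nn.Embedding(len(morphemes), dim)              # one vector per morpheme
    combine = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def word_vector(parts):
        # fold morpheme vectors left to right: ((un + interest) + ed)
        v = emb(torch.tensor([morphemes[parts[0]]]))
        for p in parts[1:]:
            m = emb(torch.tensor([morphemes[p]]))
            v = combine(torch.cat([v, m], dim=1))
        return v

    print(word_vector(["un", "interest", "ed"]).shape)   # torch.Size([1, 50])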

50 of 52

Parsing for Sentence Structure

  • A dependency parser can determine the grammatical structure of sentences
  • Useful for interpretation and relation extraction
  • See demo

51 of 52

Question Answering

  • Traditional: a lot of feature engineering to capture world and other knowledge, e.g., regular expressions; Berant et al. (2014)
  • Deep Learning: embeddings represent words, a Recurrent Neural Network extracts entities and their relations, and an output module extracts the answer. Madotto, Attardi. 2017. Question Dependent Recurrent Entity Network for QA
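
A schematic PyTorch sketch of this recipe (embed, encode with a recurrent network, score candidate answers); sizes and inputs are toy assumptions, and this is a generic illustration rather than the Question Dependent Recurrent Entity Network itself:

    import torch
    import torch.nn as nn

    vocab, dim, n_answers = 1000, 64, 10       # illustrative sizes
    emb = nn.Embedding(vocab, dim)             # word embeddings
    rnn = nn.GRU(dim, dim, batch_first=True)   # recurrent encoder
    out = nn.Linear(dim, n_answers)            # output module

    story = torch.randint(vocab, (1, 30))      # toy token ids
    _, h = rnn(emb(story))                     # final hidden state
    answer_scores = out(h[-1])                 # one score per candidate answer
    print(answer_scores.shape)                 # torch.Size([1, 10])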

52 of 52

Neural Reasoner

  • Next frontier for AI
  • Overcome the dichotomy between
    • Perceptive Tasks (voice, images): Deep Learning
    • Inference: rule-based systems, logic
  • Single Neural Network Model for both tasks
  • Learning like children, who know neither grammar nor differential equations, yet manage:
    • learning to speak by interacting with adults
    • learning a simple model of the physical world starting from perceptual experiences