Natural Language Processing

UNIT – I: Introduction

  • Knowledge in Speech and Language Processing
  • Ambiguity
  • Models and Algorithms
  • Language, Thought and Understanding
  • History
  • Regular Expressions
  • Words
  • Corpora
  • Text Normalization
  • Minimum Edit Distance


INTRODUCTION

  • Natural Language Processing (NLP) is defined as the branch of Artificial Intelligence that provides computers with the capability of understanding text and spoken words in much the same way a human being can.
  • It combines machine learning, statistical, and deep learning models with computational linguistics, i.e. rule-based modeling of human language, to allow computers to process text and spoken words and to understand language, intent, and sentiment.
  • Humans communicate with each other using speech and text. The way that humans convey information to each other is called natural language. Every day, humans share a large quantity of information with each other in various languages as speech or text.


  • However, computers cannot interpret this natural-language data directly, as they communicate in 1s and 0s. The data produced is precious and can offer valuable insights. Hence, you need computers to be able to understand, emulate, and respond intelligently to human speech.
  • Natural Language Processing, or NLP, refers to the branch of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.
  • NLP combines the fields of linguistics and computer science to decipher language structure and rules, and to build models that can comprehend, break down, and extract significant details from text and speech.


Phases of NLP


Lexical Analysis

The first phase is lexical analysis/morphological processing. In this phase, sentences and paragraphs are broken into tokens.

• These tokens are the smallest units of text. The analyzer scans the entire source text and divides it into meaningful lexemes.

• For example, the sentence “He goes to college.” is divided into [ ‘He’ , ‘goes’ , ‘to’ , ‘college’ , ‘.’ ].

• There are five tokens in this sentence. A paragraph may also be divided into sentences.
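This tokenization step can be sketched in Python with a regular expression; the `tokenize` helper below is illustrative, not a standard API:

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (a minimal sketch)."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation mark, so the final "." becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("He goes to college."))  # ['He', 'goes', 'to', 'college', '.']
```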


Syntactic Analysis/Parsing

The second phase is syntactic analysis. In this phase, the sentence is checked for whether it is well formed or not.

• The arrangement of the words is studied, and the syntactic relationships between them are found. The sentence is checked for word order and grammar.

• For example, the sentence “Delhi goes to him” is rejected by the syntactic parser.


Semantic Analysis

The third phase is semantic analysis. In this phase, the sentence is checked for the literal meaning of each word and of their arrangement together.

• For example, the sentence “I ate hot ice cream” will be rejected by the semantic analyzer because it doesn’t make sense.

• Similarly, “colorless green idea” would be rejected by semantic analysis, because “colorless green” doesn’t make any sense.


Discourse Integration

  • The fourth phase is discourse integration.
  • In this phase, the impact of the sentences before a particular sentence and the effect of the current sentence on the upcoming sentences is determined.
  • For example, the word “that” in the sentence “He wanted that” depends upon the prior discourse context.

Pragmatic Analysis

  • The last phase of natural language processing is Pragmatic analysis.
  • Sometimes the discourse integration phase and pragmatic analysis phase are combined.
  • The actual effect of the text is discovered by applying the set of rules that characterize cooperative dialogues.
  • E.g., “close the window?” should be interpreted as a request instead of an order.


NLP Implementation

Below are popular methods used in Natural Language Processing:

  • Machine learning:
      • NLP procedures can be learned automatically during machine learning.
      • Learning automatically focuses on the most common cases, whereas rules written by hand are prone to human error and often miss cases their authors did not anticipate.
  • Statistical inference:
      • NLP can make use of statistical inference algorithms.
      • These help you produce models that are robust, e.g. to unfamiliar input containing words or structures not seen before.


How to Perform NLP?

  • Segmentation
  • Tokenizing
  • Removing Stop Words
  • Stemming
  • Lemmatization
  • Part of Speech Tagging
  • Named Entity Tagging


Segmentation

  • You first need to break the entire document down into its constituent sentences.
  • You can do this by segmenting the article at its punctuation marks, like full stops and commas.
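The segmentation step above can be sketched with a regular expression; `segment` is an illustrative helper, and a real segmenter must also handle abbreviations such as “Dr.”:

```python
import re

def segment(document):
    """Split a document into sentences at ., ! or ? (a minimal sketch)."""
    # Split at whitespace that follows sentence-final punctuation.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

print(segment("NLP is fun. It has many phases! Shall we begin?"))
# ['NLP is fun.', 'It has many phases!', 'Shall we begin?']
```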


Tokenizing

  • For the algorithm to understand these sentences, you need to get the words in a sentence and explain them individually to the algorithm.
  • So, you break down your sentence into its constituent words and store them. This is called tokenizing, and each word is called a token.


Removing Stop Words

  • You can make the learning process faster by getting rid of non-essential words, which add little meaning to our statement and are just there to make our statement sound more cohesive.
  • Words such as was, in, is, and, the, are called stop words and can be removed.
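A minimal sketch of stop-word removal, assuming a small hand-picked stop-word list (real lists, such as NLTK’s English list, contain well over a hundred entries):

```python
# A small illustrative stop-word list.
STOP_WORDS = {"was", "in", "is", "and", "the", "to", "a"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["He", "goes", "to", "college"]))  # ['He', 'goes', 'college']
```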


Stemming

  • It is the process of obtaining the word stem of a word. A word stem yields new words when affixes are added to it.
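A toy suffix-stripping stemmer to illustrate the idea (this is not the Porter algorithm, and it will over- and under-stem):

```python
def stem(word):
    """Strip a common suffix to approximate a word stem (toy version)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("jumping"), stem("jumped"), stem("jumps"))  # jump jump jump
```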


Lemmatization

  • The process of obtaining the root stem of a word.
  • The root stem is the base form of a word that is present in the dictionary and from which the word is derived.
  • You can also identify the base words for different words based on tense, mood, gender, etc.
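A minimal sketch of dictionary-based lemmatization; the tiny `LEMMAS` table is illustrative, whereas real lemmatizers (such as WordNet-based ones) use a full dictionary plus part-of-speech information:

```python
# A tiny hand-written lemma dictionary (illustrative only).
LEMMAS = {"went": "go", "goes": "go", "better": "good", "mice": "mouse", "ate": "eat"}

def lemmatize(word):
    """Look up the dictionary base form; fall back to the word itself."""
    return LEMMAS.get(word, word)

print(lemmatize("went"), lemmatize("mice"))  # go mouse
```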


Part of Speech Tagging

  • Now, you must explain the concept of nouns, verbs, articles, and other parts of speech to the machine by adding these tags to our words. This is called part-of-speech (POS) tagging.
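A toy lookup-based tagger to illustrate what POS tags look like; real taggers use statistical models that resolve ambiguity from context, and the `TAGS` table here is illustrative:

```python
# A toy dictionary-based part-of-speech tagger.
TAGS = {"he": "PRON", "goes": "VERB", "to": "ADP", "college": "NOUN", ".": "PUNCT"}

def pos_tag(tokens):
    """Attach a tag to each token; unknown words default to NOUN (a guess)."""
    return [(t, TAGS.get(t.lower(), "NOUN")) for t in tokens]

print(pos_tag(["He", "goes", "to", "college", "."]))
```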


Named Entity Tagging

  • Next, introduce your machine to pop culture references and everyday names by flagging names of movies, important personalities, locations, etc., that may occur in the document.
  • You do this by classifying the words into subcategories. This helps you find any keywords in a sentence. The subcategories include person, location, monetary value, quantity, organization, and movie.
  • After performing the preprocessing steps, you then give your resultant data to a machine learning algorithm like Naive Bayes, etc., to create your NLP application.
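A toy gazetteer (name-list) tagger to illustrate the idea; real NER systems use trained sequence models rather than plain lookup, and the `GAZETTEER` entries below are illustrative:

```python
# A toy gazetteer-based named-entity tagger.
GAZETTEER = {
    "paris": "LOCATION",
    "google": "ORGANIZATION",
    "alan turing": "PERSON",
}

def tag_entities(text):
    """Return every gazetteer entry found in the text, with its label."""
    lowered = text.lower()
    return sorted((name, label) for name, label in GAZETTEER.items()
                  if name in lowered)

print(tag_entities("Alan Turing worked at Google in Paris."))
```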


Applications of NLP


  • NLP is one of the ways that people have humanized machines and reduced the need for labor. It has led to the automation of speech-related tasks and human interaction.
  • Some applications of NLP include:
    • Translation Tools: Tools such as Google Translate, Amazon Translate, etc. translate sentences from one language to another using NLP.
    • Chatbots: Chatbots can be found on most websites and are a way for companies to deal with common queries quickly.


  • Virtual Assistants: Virtual Assistants like Siri, Cortana, Google Home, Alexa, etc can not only talk to you but understand commands given to them.
  • Targeted Advertising: Have you ever talked about a product or service or just googled something and then started seeing ads for it? This is called targeted advertising, and it helps generate tons of revenue for sellers as they can reach niche audiences at the right time.
  • Autocorrect: Autocorrect will automatically correct any spelling mistakes you make; beyond this, grammar checkers also come into the picture, helping you write flawlessly.
  • Information Retrieval & Web Search: Google, Yahoo, Bing, and other search engines base much of their technology, including machine translation, on NLP deep learning models. These allow algorithms to read the text on a webpage, interpret its meaning, and translate it to another language.
  • Grammar Correction: NLP techniques are widely used by word-processing software such as MS Word for spelling correction and grammar checking.


  • Question Answering: type in keywords to ask questions in natural language.
  • Text Summarization: the process of summarizing important information from a source to produce a shortened version.
  • Machine Translation: the use of computer applications to translate text or speech from one natural language to another.
  • With the help of NLP, future computers or machines will be able to learn from information online and apply it in the real world; however, a lot of work remains to be done in this regard.
  • Toolkits such as the Natural Language Toolkit (NLTK) are becoming more effective.
  • Combined with natural language generation, computers will become more capable of receiving and giving useful and resourceful information or data.


1.1 Knowledge in Speech and Language Processing

  • By speech and language processing, we have in mind those computational techniques that process spoken and written human language, as language.
  • What distinguishes language processing applications from other data processing systems is their use of knowledge of language.
  • Consider the Unix wc program
    • When used to count bytes and lines, wc is an ordinary data processing application.
    • However, when it is used to count the words in a file, it requires knowledge about what it means to be a word and thus becomes a language processing system.


  • wc is an extremely simple system with an extremely limited and impoverished knowledge of language.
  • Sophisticated conversational agents like HAL, machine translation systems, or robust question-answering systems require much broader and deeper knowledge of language.
  • HAL must be able to recognize words from an audio signal and to generate an audio signal from a sequence of words. These tasks of speech recognition and speech synthesis require knowledge about phonetics and phonology:
    • how words are pronounced in terms of sequences of sounds and how each of these sounds is realized acoustically.
  • Producing and recognizing these and other variations of individual words (e.g., recognizing that doors is plural) requires knowledge about morphology, the way words break down into component parts that carry meanings like singular versus plural.


  • Syntax: the knowledge needed to order and group words together

HAL, the pod bay door is open.

HAL, is the pod bay door open?

I’m I do, sorry that afraid Dave I’m can’t

(Dave , I’m sorry I’m afraid I can’t do that.)


1.2 Ambiguity

  • A perhaps surprising fact about these categories of linguistic knowledge is that most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels.
  • We say some input is ambiguous if multiple, alternative linguistic structures can be built for it.


  • Consider the spoken sentence I made her duck. Here are five different meanings this sentence could have, each of which exemplifies an ambiguity at some level:
    • I cooked waterfowl for her.
    • I cooked waterfowl belonging to her.
    • I created the (plaster?) duck she owns.
    • I caused her to quickly lower her head or body.
    • I waved my magic wand and turned her into undifferentiated waterfowl.


1.3 Models and Algorithms

  • Among the principal models are state machines; closely related to these are their declarative counterparts, formal rule systems. Among the more important ones we consider are regular grammars and regular relations, context-free grammars, and feature-augmented grammars.
  • State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.
  • The algorithms associated with both state machines and formal rule systems typically involve a search through a space of states representing hypotheses about an input.


Regular Expressions

  • Regular expressions are case sensitive.
  • They can be viewed as a way to specify:
    • Search patterns over text strings
    • The design of a particular kind of machine called a finite-state automaton (FSA)
    • These are really equivalent.


Use of Regular Expression in NLP

  • As in grep and Perl: simple but powerful tools for large-corpus analysis and ‘shallow’ processing
    • What word is most likely to begin a sentence?
    • What word is most likely to begin a question?
    • In your own email, are you more or less polite than the people you correspond with?
  • Regular expressions define regular languages or sets.


    • Regular Expression: Formula in algebraic notation for specifying a set of strings
    • String: Any sequence of alphanumeric characters
        • Letters, numbers, spaces, tabs, punctuation marks
    • Regular Expression Search
      • Pattern: specifying the set of strings we want to search for
      • Corpus: the texts we want to search through
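These ideas can be tried directly with Python’s `re` module; the small `corpus` string here is illustrative:

```python
import re

corpus = "The cat sat. the cats scattered."

# Regular expressions are case sensitive: /The/ and /the/ differ.
print(re.findall(r"The", corpus))      # ['The']
# A character class [Tt] matches either case.
print(re.findall(r"[Tt]he", corpus))   # ['The', 'the']
# ? makes the preceding character optional: cat or cats.
print(re.findall(r"cats?", corpus))    # ['cat', 'cats', 'cat']
```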


Substitutions

  • An important use of regular expressions is substitution.
  • For example, the substitution operator used in Unix tools such as sed: s/colour/color/
  • It is often useful to be able to refer to a particular subpart of the string matched by the first pattern.
  • For example, suppose we want to put angle brackets around all integers, e.g. changing “35 boxes” to “<35> boxes”. The syntax is s/([0-9]+)/<\1>/
  • The use of parentheses to store a pattern in memory is called a CAPTURE GROUP.
  • Substitutions and capture groups are very useful in implementing simple chatbots like ELIZA.
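The same substitution can be written with Python’s `re.sub`, where `\1` refers back to the text captured by the parentheses:

```python
import re

# Put angle brackets around every integer; the parentheses form a
# capture group and \1 refers back to what they matched.
result = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(result)  # the <35> boxes
```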


ELIZA substitutions using capture groups


Minimum Edit Distance
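Minimum edit distance is the smallest total cost of the insertions, deletions, and substitutions needed to turn one string into another; it can be computed with the standard dynamic-programming (Wagner-Fischer) algorithm. A sketch in Python:

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance with unit insertion/deletion costs and a
    configurable substitution cost, via dynamic programming."""
    n, m = len(source), len(target)
    # D[i][j] = cost of editing source[:i] into target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or copy
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

With unit costs the distance from “intention” to “execution” is 5; with substitution cost 2, as in Jurafsky and Martin’s formulation, it is 8.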
