Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Natural language processing is concerned with the development of computational models of aspects of human language processing.

Main reasons for NLP:

  1. To develop automated tools for language processing
  2. To gain a better understanding of human communication

Building computational models with human language processing abilities requires

  • Knowledge of how humans acquire, store, and process language.
  • Knowledge of the world and of language.

Two major approaches to NLP

  • Rationalist Approach: A significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance.
  • Empiricist Approach: The brain is able to perform association, pattern recognition, and generalization and, thus, the structures of natural language can be learned.

Linguistics is the scientific study of language. It deals with analysis of every aspect of language, as well as the methods for studying and modelling them.

Origins of NLP

Theoretical linguists identify rules that describe and restrict the structure of languages (grammar).

Theoretical linguistics mainly provides structural descriptions of natural language and its semantics.

Psycholinguistics explains how humans produce and comprehend natural language.

Psycholinguists are interested in the representation of linguistic structures as well as in the process by which these structures are produced.

Computational linguistics is concerned with the study of language using computational models of linguistic phenomena.

It deals with the application of linguistic theories and computational techniques for NLP.

Computational models may be broadly classified under

  • Knowledge driven
  • Data driven

Knowledge driven: rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules.

Data driven: presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. These systems require less human effort, and their performance depends on the quantity of the data.

People use seven interdependent levels to understand and extract meaning from a text or spoken words. In order to understand natural languages, it’s important to distinguish among them:

1- Phonetic or phonological level: deals with pronunciation

2- Morphological level: deals with the smallest parts of words that carry meaning (morphemes), including suffixes and prefixes.

3- Lexical level: deals with lexical meaning of a word.

4- Syntactic level: deals with grammar and structure of sentences.

5- Semantic level: deals with the meaning of words and sentences.

6- Discourse level: deals with the structure of different kinds of text.

7- Pragmatic level: deals with the knowledge that comes from the outside world, i.e., from outside the content of the document.

  1. Morphological Analysis: During morphological analysis, each individual word is analyzed. Non-word tokens such as punctuation are removed, and the remaining words are assigned categories.

For instance, consider the sentence "Ram’s iPhone cannot convert the video from .mkv to .mp4." Morphological analysis processes the sentence word by word. Here, Ram is a proper noun, the ’s in Ram’s is identified as a possessive suffix, and .mkv and .mp4 are identified as file extensions, which behave as adjectives in this example.

Each word is assigned a syntactic category. This is a very important step because the interpretation of prefixes and suffixes depends on the syntactic category of the word. For example, swims and swim’s are different: the suffix -s forms either a plural noun or a third-person singular verb, while ’s marks a possessive. If a prefix or suffix is interpreted incorrectly, the meaning of the sentence changes completely. Assigning a category to each word removes this uncertainty.
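The word-by-word categorization described above can be sketched in a few lines of Python. This is only an illustrative sketch: the regular expression, the category names, and the capital-letter heuristic for proper nouns are assumptions made for this example, not part of any standard analyzer.

```python
import re

# Hypothetical token pattern: file extensions (".mkv"), words,
# possessive suffixes ("'s"), and single punctuation marks.
TOKEN_PATTERN = re.compile(r"\.\w+|\w+|['’]s|[^\w\s]")


def categorize(token):
    """Assign an illustrative category to a single token."""
    if re.fullmatch(r"\.\w+", token):
        return "FILE_EXTENSION"
    if token in ("'s", "’s"):
        return "POSSESSIVE_SUFFIX"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCTUATION"
    if token[0].isupper():
        # Crude heuristic (assumption): capitalized words are proper nouns.
        return "PROPER_NOUN_CANDIDATE"
    return "WORD"


def analyze(sentence):
    """Tokenize, drop punctuation, and pair each remaining token with a category."""
    tagged = [(tok, categorize(tok)) for tok in TOKEN_PATTERN.findall(sentence)]
    return [(tok, cat) for tok, cat in tagged if cat != "PUNCTUATION"]


result = analyze("Ram's iPhone cannot convert the video from .mkv to .mp4.")
for token, category in result:
    print(f"{token:10} {category}")
```

A real morphological analyzer would rely on a lexicon and affix rules rather than surface patterns; the sketch only shows the shape of the output.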

2. Syntactic Analysis:

There are different rules for different languages, and violating these rules gives a syntax error. In this step the sentence is transformed into a structure that represents the relationships between the words. The syntax is the set of rules that sentences of the language have to follow; a given sentence may occasionally violate them. For example, “To the movies, we are going.” will give a syntax error. Syntactic analysis uses the results of morphological analysis to build a structural description of the sentence: the words, already divided into categories by the morphological process, are arranged into a defined structure. This process is called parsing. For example, the sentence “The cat chases the mouse in the garden” can be represented as a parse tree.

Here the sentence is broken down according to the categories and then described in a hierarchical structure with sentence units as nodes. The parse tree is built while syntax analysis runs; if any error arises, processing stops and a syntax error is reported. Parsing can be top-down or bottom-up.

    • Top-down: Starts with the start symbol and expands it according to the grammar rules until each of the terminals in the sentence is parsed.
    • Bottom-up: Starts with the sentence to be parsed and applies the rules backwards until the start symbol is reached.
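As a rough illustration of top-down parsing, the following Python sketch implements a small recursive-descent parser. The toy grammar (S → NP VP, NP → Det N [PP], VP → V NP, PP → P NP) and the lexicon are assumptions chosen for this example; attaching the prepositional phrase to the nearest noun phrase is only one possible analysis of the sentence.

```python
# Hypothetical lexicon mapping each word to a single category.
LEXICON = {
    "the": "Det", "cat": "N", "mouse": "N", "garden": "N",
    "chases": "V", "in": "P",
}


class ParseError(Exception):
    """Raised when the input violates the toy grammar (a 'syntax error')."""


def expect(tokens, pos, category):
    """Consume one token of the given category or raise ParseError."""
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == category:
        return (category, tokens[pos]), pos + 1
    found = tokens[pos] if pos < len(tokens) else "end of sentence"
    raise ParseError(f"expected {category}, found {found}")


def parse_np(tokens, pos):
    # NP -> Det N [PP]
    det, pos = expect(tokens, pos, "Det")
    noun, pos = expect(tokens, pos, "N")
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == "P":
        pp, pos = parse_pp(tokens, pos)
        return ("NP", det, noun, pp), pos
    return ("NP", det, noun), pos


def parse_pp(tokens, pos):
    # PP -> P NP
    prep, pos = expect(tokens, pos, "P")
    np, pos = parse_np(tokens, pos)
    return ("PP", prep, np), pos


def parse_vp(tokens, pos):
    # VP -> V NP
    verb, pos = expect(tokens, pos, "V")
    np, pos = parse_np(tokens, pos)
    return ("VP", verb, np), pos


def parse(sentence):
    """Start from the start symbol S and expand it top-down: S -> NP VP."""
    tokens = sentence.lower().split()
    np, pos = parse_np(tokens, 0)
    vp, pos = parse_vp(tokens, pos)
    if pos != len(tokens):
        raise ParseError(f"unexpected word: {tokens[pos]}")
    return ("S", np, vp)


print(parse("The cat chases the mouse in the garden"))
```

Input that does not fit the grammar, such as "To the movies we are going", raises `ParseError`, which mirrors the point where processing stops with a syntax error.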

  3. Semantic Analysis:

Semantic analysis deals with meaning. It assigns meaning to the structures built by the syntactic analyzer. Every syntactic structure and its objects are then mapped into the task domain. If the mapping is possible, the structure is accepted; if not, it is rejected. For example, “hot ice-cream” will give a semantic error. During semantic analysis two main operations are executed:

    • First, each word is mapped to the appropriate objects in the database. The dictionary meaning of every word is looked up. A word might have more than one meaning.
    • Second, the meanings of the individual words are combined to find a consistent relationship between the word structures. The process of determining the correct meaning is called lexical disambiguation. It is done by associating each word with its context.

The process described above can be used to determine the partial meaning of a sentence. However, semantics and syntax are two distinct concepts: a syntactically correct sentence may be semantically incorrect.

For example, “A rock smelled the colour nine.” is syntactically correct, as it obeys all the rules of English, but it is semantically incorrect. Semantic analysis verifies that a sentence abides by the rules of meaning and conveys correct information.
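One simple way to associate a word with its context is to compare the words around it with the dictionary definition of each candidate meaning and pick the meaning with the largest overlap (a simplified form of the Lesk approach). The sketch below assumes a tiny hand-written sense inventory for the word "bank" and a small stopword list; both are hypothetical and exist only for this illustration.

```python
# Hypothetical mini dictionary: word -> {sense label: definition}.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "sloping land beside a river or lake",
    }
}

# Hypothetical stopword list; these words carry little meaning for overlap.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "on", "in", "i"}


def content_words(text):
    """Lowercase the text, strip basic punctuation, and drop stopwords."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words - STOPWORDS


def disambiguate(word, sentence):
    """Return the sense whose definition shares the most words with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(context & content_words(definition))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "She sat on the bank of the river"))
print(disambiguate("bank", "I went to the bank to deposit money"))
```

Ties are resolved in favour of the first sense listed, which is an arbitrary choice in this sketch.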

Discourse Integration:

While processing a language, one major ambiguity that can arise is referential ambiguity: the ambiguity that occurs when the referent of a word cannot be determined. For example,

Ram won the race.

Mohan ate half of a pizza.

He liked it.

In the above example, “He” can be Ram or Mohan. This creates an ambiguity: the word “He” depends on both of the preceding sentences. Resolving such dependencies is known as discourse integration, which applies when the meaning of an individual sentence relies upon the sentences that come before it, as the third sentence does here. Hence the goal of this step is to remove referential ambiguity.

It requires the knowledge of the world.

Challenges of NLP

Factors that make NLP difficult:

Problems of representation and interpretation:

Natural language is highly ambiguous and vague, so it is quite difficult to embody all the sources of knowledge that humans use to process language.

Identifying the semantics of language.

Words alone do not make a sentence. Instead, it is the words as well as their syntactic and semantic relations that give meaning to a sentence.

Alas! They won.

New words are added continually, and existing words are introduced in new contexts. For example:

TV channels use 9/11 to refer to the terrorist attack on the World Trade Center.

The only way a machine can learn the meaning of a specific word in a message is by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by its co-occurring words.
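The idea that a word's context is given by the words occurring near it can be sketched as a co-occurrence count over a fixed window. The function name `context_counts`, the window size, and the example sentence are assumptions made for illustration.

```python
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count the words that occur within `window` positions of each `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


tokens = "the bank approved the loan while the river bank flooded".split()
print(context_counts(tokens, "bank", window=1))
```

With a window of one word on each side, the two occurrences of "bank" are surrounded by different neighbours ("approved" versus "river" and "flooded"), which is the kind of evidence a data-driven system can use to separate meanings.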

Idioms, metaphors, and ellipses make it more complex to identify the meaning of written text.

Idioms: a group of words established by usage as having a meaning not deducible from those of the individual words.

Example idiom: It's a piece of cake (meaning it's easy).

Metaphor: A metaphor is a figure of speech that describes an object or action in a way that isn't literally true, but helps explain an idea or make a comparison.

Example: Laughter is the music of the soul.

Ellipses: An ellipsis shows an omission, or leaving out, of a word or words in a quote. Ellipses are used to shorten the quote without changing the meaning.

For example:

  • "After school I went to her house, which was a few blocks away, and then came home."

Shorten the quote by replacing a few words with an ellipsis. Remember, the meaning of the quote should not change.

  • "After school I went to her house … and then came home."

We removed the words "which was a few blocks away" and replaced them with an ellipsis without changing the meaning of the original quote.

Quantifier scoping is another problem. The scope of quantifiers is often not clear, which poses a problem in automatic processing.

Example:

There are many things to do today.

We have a lot of time left, don’t worry.

Ambiguity of natural language is another difficulty:

As humans, we are aware of the context and current cultural knowledge, as well as of the language and traditions, and we use these to process meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.

There are various sources of ambiguity in natural language.

Ambiguity at word level (Lexical Ambiguity)

A word can be ambiguous: the same word may represent a noun or a verb.

Example: can, bunk, cat, etc.

Sentence Level Ambiguity (Structural Ambiguity)

Example: Stolen rifle found by the tree

A number of grammars have been proposed to describe the structure of sentences. However, there are an infinite number of ways to generate sentences, which makes writing grammar rules, and the grammar itself, extremely complex.

Language and Grammar

Automatic Processing of Language requires the rules and exceptions of a language to be explained to the computer.

  • Grammar defines the language
  • It consists of a set of rules that allows us to parse and generate sentences in a Language. These rules relate information to coding devices at the language level and not at the world knowledge level.

Main hurdle:

The constantly changing nature of languages and the presence of a large number of exceptions.

Efforts to provide specifications for language have led to many grammars:

  • Phrase Structure Grammar
  • Transformational Grammar
  • Lexical Functional Grammar
  • Generalized Phrase Structure Grammar
  • Dependency Grammar
  • Paninian Grammar
  • Tree-adjoining Grammar

Although many grammars were proposed, Transformational Grammar was identified as the better one.

  • Noam Chomsky proposed the Transformational Grammar and suggested that each sentence in a language has two levels of representation, namely a deep structure and surface structure.
  • Mapping of deep structure to surface structure is carried out by transformations.
  • Deep structure can be transformed in a number of ways to yield many different surface level representations.
  • Sentences with different surface level representations having the same meaning, share a common deep-level representation.

The term "transformational" refers to rules that change the structure of a sentence but not its meaning. This grammar is also called Transformational Generative Grammar.

English is an SVO (subject-verb-object) language.

Transformational grammar has three components:

  • Phrase structure grammar
  • Transformational rules
  • Morphophonemic rules: these rules match each sentence representation to a string of phonemes

Each of these components consists of a set of rules.

Phrase structure grammar consists of a set of rules that generate natural language sentences and assign a structural description to them.

  • Sentences that can be generated using these rules are termed grammatical.

Transformational rules are applied to the terminal string generated by phrase structure rules.

  • They can be used to transform one phrase marker into another phrase marker.
  • These rules are used to transform one surface representation into another (for example, an active sentence into a passive one).
  • The rule relating active and passive sentences (as given by Chomsky) is:

(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)

This rule says that if the input has the structure s1, it can be transformed into s2.

Transformational rules can be obligatory or optional.

  • Obligatory rules: ensure, for example, agreement in number between subject and verb.
  • Optional rules: modify the structure of the sentence while preserving its meaning.

Morphophonemic rules: match each sentence representation to a string of phonemes.

A phoneme, in linguistics, is the smallest unit of speech distinguishing one word (or word element) from another, as the element p in “tap,” which separates that word from “tab,” “tag,” and “tan.”

Consider the sentence:

The police will catch the snatcher.

Phrase structure rules assign this sentence the structure NP1 - Aux - V - NP2, to which the passive transformation rule applies:

(s1) NP1 - Aux - V - NP2 --> NP2 - Aux + be + en - V - by + NP1 (s2)

  • Application of phrase structure rules will assign the structure.
  • The passive transformation rule will convert the sentence into:
    the + snatcher + will + be + en + catch + by + the + police
  • Another transformational rule will then reorder ‘en + catch’ to ‘catch + en’, and subsequently one of the morphophonemic rules will convert ‘catch + en’ to ‘caught’.
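The three steps above (passive transformation, reordering of ‘en’ and the verb, and the morphophonemic rule) can be traced with a short Python sketch. The function names and the small table of irregular participles are assumptions made for illustration; the table stands in for the morphophonemic rules and is not a complete treatment of English morphology.

```python
# Hypothetical table standing in for morphophonemic rules (verb + en -> participle).
PAST_PARTICIPLE = {"catch": "caught", "eat": "eaten", "write": "written"}


def passive_transform(np1, aux, verb, np2):
    """NP1 - Aux - V - NP2  -->  NP2 - Aux + be + en - V - by + NP1"""
    return [np2, aux, "be", "en", verb, "by", np1]


def reorder_affix(symbols):
    """Move the first 'en' after the verb that follows it: en + V -> V + en."""
    out = list(symbols)
    for i in range(len(out) - 1):
        if out[i] == "en":
            out[i], out[i + 1] = out[i + 1], out[i]
            break
    return out


def apply_morphophonemic(symbols):
    """Replace each V + en pair with the participle form of V."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i + 1] == "en":
            verb = symbols[i]
            # Fallback for verbs missing from the table: regular "-ed" form.
            out.append(PAST_PARTICIPLE.get(verb, verb + "ed"))
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def passivize(np1, aux, verb, np2):
    """Apply the three steps in order and join the result into a sentence."""
    symbols = passive_transform(np1, aux, verb, np2)
    symbols = reorder_affix(symbols)
    symbols = apply_morphophonemic(symbols)
    return " ".join(symbols)


print(passivize("the police", "will", "catch", "the snatcher"))
# the snatcher will be caught by the police
```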

Processing Indian Languages

  • Unlike English, Indic scripts have a non-linear structure.
  • Unlike English, Indian languages have SOV as the default sentence structure.
  • Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing the meaning of the sentence.

मैं फल खाता हूँ। (main phaL khaaTaa huun.)

(S + O + V)

मैं खाता हूँ फल। (main khaaTaa huun phaL.)

(S +V+O)

  • Spelling standardization is more subtle in Hindi than in English.

(standardization rules for spelling)

  • Indian languages have a relatively rich set of morphological variants. (A morpheme is the minimum meaningful unit; English examples of morphological variants are start, starts, starting, started, etc.)

  • Indian languages make extensive and productive use of complex predicates (CPs).

A complex predicate is a combination of two lexical items.

The first and second lexical items of a complex predicate are called the polar and the vector respectively. The way these two lexical items come together to form a CP is quite interesting to examine. Consider the CP [paaTa maaDu] ‘teach lesson’. In this example, the first constituent [paaTa] ‘lesson’ is the polar and the second constituent [maaDu] ‘do’ is the vector.

  • Indian languages use postposition case markers instead of prepositions (compare the English preposition in "on the table").
  • Indian languages use verb complexes consisting of sequences of verbs (for example, gaa raha hai, rahi hai).

NLP Applications

The first application of NLP was machine translation; more recent progress includes information retrieval, information extraction, text summarization, etc.

Machine Translation: translation from one human language to another. It demands knowledge of the words, phrases, and grammars of the two languages involved, as well as world knowledge.

Speech Recognition: the process of mapping acoustic speech signals to a set of words.

Difficulty: wide variation in the pronunciation of words, homophones (dear and deer), and acoustic ambiguities (e.g., "in the rest" and "interest").

Speech Synthesis: automatic production of speech. Such systems can read out your mails over the telephone, or even read out a storybook for you.

Natural Language Interfaces to Databases: allow querying a structured database using natural language sentences.

Information Extraction:

It captures and outputs factual Information contained in a document.

It extracts structured information from unstructured and/or semi-structured machine-readable documents.

In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP).

Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video could be seen as information extraction.

Information Retrieval: An IR system assists users in finding the information they require, but it does not explicitly return the answers to their questions. It reports the existence and location of documents that might contain the required information. It is concerned with identifying the documents relevant to a user's query.

Example: Google Search

Question Answering: given a question and a set of documents, a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears. A question answering system differs from an information extraction system, but it benefits from having an information extraction system to identify entities in the text.

Text Summarization: deals with creation of summaries of documents and involves syntactic, semantic and discourse level processing of text.

Some Successful Early NLP Systems

ELIZA: an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum.

SysTran (System Translation): the first machine translation tool, developed in 1969 for Russian-English translation. SysTran provided the first online machine translation service, Babel Fish, which was used by AltaVista to handle translation requests from users.

TAUM METEO: a machine translation system used in Canada to translate weather reports from English into French.

SHRDLU (Winograd 1972): a natural language understanding system that simulates the actions of a robot in a blocks world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.

LUNAR (Woods 1977): a question answering system that answered questions about moon rocks.

Information Retrieval

  • Information refers to data; here we are concerned with text only. So, we consider words as the carriers of information and written text as a message encoded in natural language.
  • Retrieval refers to the process of accessing information from memory; it also requires the information to be processed and stored. Only information relevant to the need expressed in the form of a query is located.
  • Information retrieval deals with the organization, storage, retrieval, and evaluation of information relevant to the query.

Information retrieval deals with unstructured data. It is performed based on the content of the document rather than its structure.

Approaches for accessing large text collections can be broadly classified into two categories:

  1. Approaches that construct a topic hierarchy
  2. Approaches that rank the documents according to their relevance
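As a minimal sketch of the second category, the Python example below ranks documents by the cosine similarity between term-count vectors of the query and of each document. Cosine similarity over raw term counts is one common choice, used here as an assumption; the document collection and identifiers are invented for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return ids of documents with non-zero similarity, most similar first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id) for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical document collection.
documents = {
    "d1": "machine translation of natural language",
    "d2": "speech recognition maps speech signals to words",
    "d3": "natural language interfaces to databases",
}
print(rank("natural language translation", documents))
# ['d1', 'd3']
```

Because the representation is purely keyword-based, a query phrased with synonyms of the document terms would score zero, which is one of the limitations of keyword representation.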

Issues involved in the design and evaluation of IR Systems

  1. Representation of the document: most human knowledge is coded in natural language, which is difficult to use as a knowledge representation.
  2. Most retrieval systems are based on keyword representation, which has the following associated problems:

A. Polysemy: a lexeme with multiple meanings. Polysemy is the coexistence of many possible meanings for a word or phrase.

Example: He fixed his hair.

They fixed a date for the wedding.

B. Homonymy: ambiguity in which words that appear the same have unrelated meanings. Homonymy is the existence of two or more words having the same spelling or pronunciation but different meanings and origins. Such ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents.

Example: kneed and need, whole and hole, right and write

C. Synonymy: creates a problem when a document is indexed with one term and the query contains a different term, and the two terms share a common meaning.

D. Keyword representation ignores semantics and contextual information in the retrieval process.

E. Inappropriate characterization of queries by the user: the reason can be a lack of knowledge of the subject or even the inherent vagueness of natural language. The user may fail to include relevant terms in the query or may include irrelevant terms.

F. Matching the query representation with that of the document is another issue: selecting an appropriate similarity measure is crucial in the design of an IR system.

G. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely used measures of effectiveness.

H. The goal of IR is to search for documents in a manner relevant to the query, so understanding what constitutes relevance is an important issue.

I. The size of document collections and the varying needs of users also complicate text retrieval. Some users require answers of limited scope, while others require documents with a wider scope.
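Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The short Python sketch below computes both for a single query; the function name and the document identifiers are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of documents returned by the system
    relevant:  ids of documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.67
```

Returning more documents tends to raise recall and lower precision, which is why the two measures are usually reported together.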

Why NLP?

To design, implement and test systems that can process natural language for practical applications.

Practical Applications:

  • Sentiment Analysis
  • Query Completion/Auto correction
  • Word Prediction
  • Information Retrieval
  • Text Summarization
  • Spam Detection

Difficulties that we face while designing Algorithms for NLP

  1. Lexical Ambiguity: in a language, the same word can have different meanings; this is called lexical ambiguity.

Example: Rose rose to get a twig.

  2. Structural Ambiguity:

Example: The man saw the boy with the binoculars

Flying planes can be dangerous

Ambiguities:

Hospitals are sued by 7 foot doctors.

Stolen painting found by tree.

Teacher strikes idle kids.
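Structural ambiguity can be made concrete by counting how many parse trees a grammar assigns to a sentence. The Python sketch below uses a CKY-style chart with a small grammar in which a prepositional phrase may attach either to a verb phrase or to a noun phrase. The grammar rules and lexicon are assumptions chosen for this example only.

```python
from collections import defaultdict

# Hypothetical binary grammar rules: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]

# Hypothetical lexicon with one category per word.
LEXICON = {
    "the": "Det", "man": "N", "boy": "N", "binoculars": "N",
    "saw": "V", "with": "P",
}


def count_parses(sentence):
    """Count the distinct parse trees rooted in S for the sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[i][j][X] = number of ways category X derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1][LEXICON[word]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    ways = chart[i][k].get(left, 0) * chart[k][j].get(right, 0)
                    if ways:
                        chart[i][j][parent] += ways
    return chart[0][n].get("S", 0)


print(count_parses("The man saw the boy with the binoculars"))
# 2
```

The two parses correspond to the two readings: the man used the binoculars to see the boy, or the boy had the binoculars. Words missing from the lexicon raise a `KeyError` in this sketch.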

A "morpheme" is a short segment of language that meets three basic criteria:

1. It is a word or a part of a word that has meaning.

2. It cannot be divided into smaller meaningful segments without changing its meaning or leaving a meaningless remainder.

3. It has relatively the same stable meaning in different verbal environments.
