Natural Language Processing
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
Natural language processing is concerned with the development of computational models of aspects of human language processing.
Main reasons for NLP:
Building computational models with human language processing abilities requires
Two major approaches to NLP
Linguistics is the scientific study of language. It deals with analysis of every aspect of language, as well as the methods for studying and modelling them.
Origins of NLP
Theoretical linguists identify rules that describe and restrict the structure of languages (grammar).
Theoretical linguistics mainly provides a structural description of natural language and its semantics.
Psycholinguists explain how humans produce and comprehend natural language.
They are interested in the representation of linguistic structures as well as in the process by which these structures are produced.
Computational linguistics is concerned with the study of language using computational models of linguistic phenomena.
It deals with the application of linguistic theories and computational techniques for NLP.
Computational models may be broadly classified as knowledge driven or data driven:
Knowledge driven: rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules.
Data driven: presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort required is smaller, and the performance of these systems depends on the quantity of the data.
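The contrast between the two classes can be sketched with a toy part-of-speech tagger. This is only an illustration under stated assumptions: the rules, the tag names (DET, NOUN, VERB, ADV) and the three-sentence training set are all hypothetical, and a real data-driven system would need far more data.

```python
from collections import Counter, defaultdict

# Knowledge driven: rules written by hand (toy rules, not a real grammar).
def rule_based_tag(word):
    if word.lower() in {"the", "a", "an"}:
        return "DET"
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    return "NOUN"  # fallback when no rule applies

# Data driven: learn the most frequent tag of each word from tagged examples.
def train_most_frequent_tag(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def learned_tag(model, word, default="NOUN"):
    return model.get(word.lower(), default)

# Tiny hypothetical training set.
training_data = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
model = train_most_frequent_tag(training_data)

print(rule_based_tag("runs"))      # no hand-written rule covers "runs"
print(learned_tag(model, "runs"))  # the model saw "runs" tagged as VERB
```

The rule-based tagger is only as good as the rules someone wrote, while the learned tagger is only as good as the data it has seen.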
People use seven interdependent levels to understand and extract meaning from a text or spoken words. In order to understand natural languages, it’s important to distinguish among them:
1- Phonetic or phonological level: deals with pronunciation.
2- Morphological level: deals with the smallest parts of words that carry meaning, including suffixes and prefixes.
3- Lexical level: deals with lexical meaning of a word.
4- Syntactic level: deals with grammar and structure of sentences.
5- Semantic level: deals with the meaning of words and sentences.
6- Discourse level: deals with the structure of different kinds of text.
7- Pragmatic level: deals with the knowledge that comes from the outside world, i.e., from outside the content of the document.
For instance, consider the sentence: Ram’s iPhone cannot convert the video from .mkv to .mp4. In morphological analysis, the sentence is analyzed word by word. Here, Ram is a proper noun, the ’s in Ram’s is tagged as a possessive suffix, and .mkv and .mp4 are tagged as file extensions.
Each word is assigned a syntactic category. The file extensions present in the sentence are also identified; in this example they behave as adjectives. The possessive suffix is identified as well. This is a very important step, because the interpretation of prefixes and suffixes depends on the syntactic category of the word. For example, swims and swim’s are different: the suffix -s marks a plural noun or a third-person singular verb, while ’s marks a possessive. If a prefix or suffix is interpreted incorrectly, the meaning and understanding of the sentence change completely. The interpretation assigns a category to each word and thereby removes uncertainty about it.
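The word-by-word analysis of the example sentence can be sketched as follows. This is a minimal illustration, not a real morphological analyzer: the labels, the regular expression for file extensions, and the rule that a capitalized word before 's is a proper noun are all assumptions made for this one sentence.

```python
import re

# Assumed pattern for a file extension token such as ".mkv" or ".mp4".
FILE_EXTENSION = re.compile(r"^\.[A-Za-z0-9]+$")

def analyze(sentence):
    """Return (token, label) pairs using toy rules; labels are illustrative only."""
    results = []
    for token in sentence.rstrip(".").split():
        token = token.replace("’", "'")  # normalize the typographic apostrophe
        if FILE_EXTENSION.match(token):
            results.append((token, "file extension"))
        elif token.endswith("'s"):
            stem = token[:-2]
            results.append((stem, "proper noun" if stem[0].isupper() else "noun"))
            results.append(("'s", "possessive suffix"))
        else:
            results.append((token, "word"))
    return results

for token, label in analyze("Ram’s iPhone cannot convert the video from .mkv to .mp4."):
    print(f"{token:10} {label}")
```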
2. Syntactic Analysis:
Different languages have different rules, and violating these rules gives a syntax error. Here the sentence is transformed into a structure that represents the relationships between the words. These relationships might occasionally violate the rules. The syntax is the set of rules that the language has to follow. For example, “To the movies, we are going.” will give a syntax error. Syntactic analysis uses the results of morphological analysis to build a description of the sentence: the words, already divided into categories by the morphological process, are arranged into a defined structure. This process is called parsing. For example, the sentence “The cat chases the mouse in the garden” would be represented as a parse tree.
Here the sentence is broken down according to the categories and then described in a hierarchical structure whose nodes are sentence units. The parse tree is built while syntactic analysis runs; if an error arises, processing stops and a syntax error is reported. Parsing can be top-down or bottom-up.
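A top-down parse of the example sentence can be sketched with a small recursive-descent parser. The grammar (S -> NP VP, NP -> Det N, VP -> V NP [PP], PP -> P NP) and the lexicon are assumptions chosen to cover only this sentence, and attaching the prepositional phrase to the verb phrase is one of several possible analyses.

```python
# Toy lexicon: maps each known word to its syntactic category (an assumption).
LEXICON = {"the": "Det", "cat": "N", "mouse": "N", "garden": "N",
           "chases": "V", "in": "P"}

class ParseError(Exception):
    pass

def parse(sentence):
    tokens = sentence.lower().rstrip(".").split()
    pos = 0

    def expect(category):
        nonlocal pos
        if pos >= len(tokens) or LEXICON.get(tokens[pos]) != category:
            found = tokens[pos] if pos < len(tokens) else "end of sentence"
            raise ParseError(f"syntax error: expected {category}, found {found!r}")
        word = tokens[pos]
        pos += 1
        return (category, word)

    def noun_phrase():   # NP -> Det N
        return ("NP", expect("Det"), expect("N"))

    def prep_phrase():   # PP -> P NP
        return ("PP", expect("P"), noun_phrase())

    def verb_phrase():   # VP -> V NP [PP]
        children = [expect("V"), noun_phrase()]
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == "P":
            children.append(prep_phrase())
        return ("VP", *children)

    tree = ("S", noun_phrase(), verb_phrase())   # S -> NP VP
    if pos != len(tokens):
        raise ParseError(f"syntax error: unexpected {tokens[pos]!r}")
    return tree

def show(node, depth=0):
    """Print the tree with one node per line, indented by depth."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        print("  " * depth + f"{label}: {children[0]}")
    else:
        print("  " * depth + label)
        for child in children:
            show(child, depth + 1)

show(parse("The cat chases the mouse in the garden"))
```

With this toy grammar, parse("To the movies, we are going.") raises ParseError at the first word, which mirrors the behaviour described above where processing stops on a syntax error.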
Discourse Integration:
While processing a language, one major ambiguity that can arise is referential ambiguity: the ambiguity that arises when it cannot be determined what a word refers to. For example,
Ram won the race.
Mohan ate half of a pizza.
He liked it.
In the above example, “He” can be Ram or Mohan, which creates an ambiguity. The word “He” depends on both preceding sentences. This is known as discourse integration: an individual sentence relies on the sentences that come before it, as the third sentence does in the example above. The goal of this stage is therefore to remove referential ambiguity.
It requires the knowledge of the world.
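The referential ambiguity in the Ram and Mohan example can be made concrete with a small sketch that collects the candidate antecedents of “He”. The name list and the recency heuristic are assumptions for illustration; the heuristic is a common simplification and is not guaranteed to pick the right referent.

```python
MALE_NAMES = {"Ram", "Mohan"}   # hypothetical lexicon for this example

def candidate_antecedents(sentences, pronoun="He"):
    """List names from earlier sentences that the pronoun could refer to."""
    candidates = []
    for sentence in sentences:
        words = sentence.rstrip(".").split()
        if pronoun in words:
            break  # only sentences before the pronoun can supply antecedents
        candidates.extend(word for word in words if word in MALE_NAMES)
    return candidates

def most_recent(candidates):
    """Recency heuristic: prefer the most recently mentioned candidate."""
    return candidates[-1] if candidates else None

discourse = ["Ram won the race.", "Mohan ate half of a pizza.", "He liked it."]
candidates = candidate_antecedents(discourse)
print(candidates)               # more than one candidate means the reference is ambiguous
print(most_recent(candidates))  # the heuristic's guess, which may be wrong
```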
Challenges of NLP
Factors that make NLP difficult:
Problems of representation and interpretation:
Natural language is highly ambiguous and vague, so it is quite difficult to embody all the sources of knowledge that humans use to process language.
Identifying the semantics of language.
Words alone do not make a sentence. Instead, it is the words together with their syntactic and semantic relations that give meaning to a sentence.
Alas! They won.
New words are added continually and existing words are introduced in new contexts. Example:
TV channels use 9/11 to refer to the terrorist act on the World Trade Center.
The only way a machine can learn the meaning of a specific word in a message is by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by its co-occurring words.
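One simple way to represent the context of a word is to count the words that occur within a fixed window around it. The sketch below assumes a window of two words on each side and uses a made-up sentence; both are illustrative choices, not something prescribed by the notes.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count words occurring within `window` positions of each occurrence of target."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Hypothetical sentence in which "bank" appears in two different senses.
tokens = "she sat on the river bank and he went to the bank to deposit money".split()
print(context_counts(tokens, "bank"))
```

Neighbours such as "river" and "deposit" are the kind of evidence a system can use to tell the two senses of "bank" apart.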
Idioms, metaphors and ellipses add more complexity to identifying the meaning of written text.
Idioms: a group of words established by usage as having a meaning not deducible from those of the individual words.
Example idiom: It's a piece of cake (meaning it's easy).
Metaphor: a figure of speech that describes an object or action in a way that isn't literally true, but helps explain an idea or make a comparison.
Example: Laughter is the music of the soul.
Ellipses: Use an ellipsis to show an omission, or leaving out, of a word or words in a quote. Use ellipses to shorten the quote without changing the meaning.
For example:
Shorten the quote by replacing a few words with an ellipsis. Remember, the meaning of the quote should not change.
We removed the words "which was a few blocks away" and replaced them with an ellipsis without changing the meaning of the original quote.
Quantifier scoping is another problem. The scope of quantifiers is often unclear, which poses a problem for automatic processing.
Example:
There are many things to do today.
We have a lot of time left, don’t worry.
Ambiguity of natural language is another difficulty:
As humans, we are aware of the context and of current cultural knowledge, as well as of the language and its traditions, and we use these to process meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.
There are various sources of ambiguity in natural language:
Ambiguity at word level (lexical ambiguity)
A word can be ambiguous; for example, the same word may represent a noun or a verb.
Example: can, bunk, cat, etc.
Sentence-level ambiguity (structural ambiguity)
Example: Stolen rifle found by the tree
A number of grammars have been proposed to describe the structure of sentences. However, there are an infinite number of ways to generate sentences, which makes writing grammar rules, and the grammar itself, extremely complex.
Language and Grammar
Automatic processing of language requires the rules and exceptions of a language to be explained to the computer.
Main hurdle:
The constantly changing nature of languages and the presence of a large number of exceptions.
Efforts to provide specifications for language have led to many grammars.
Although many grammars were proposed, transformational grammar was regarded as the better one.
"Transformational" refers to changing the structure of a sentence but not its meaning. It is also called transformational generative grammar.
English is an SVO (subject-verb-object) language.
Transformational grammar has three components: phrase structure grammar, transformational rules, and morphophonemic rules.
Each of these components consists of a set of rules.
Phrase structure grammar consists of a set of rules that generate natural language sentences and assign a structural description to them.
Transformational rules are applied on the terminal string generated by phrase structure rules.
(s1) NP1-Aux-V-NP2-->NP2-Aux+be+en-V-by+NP1(s2).
This rule says that if the input has the structure s1, it can be transformed into the structure s2.
Transformational rules can be obligatory or optional.
Morphophonemic rules: match each sentence representation to a string of phonemes.
A phoneme, in linguistics, is the smallest unit of speech distinguishing one word (or word element) from another, as the element p in “tap,” which separates that word from “tab,” “tag,” and “tan.”
Consider the sentence:
The police will catch the snatcher
Structure obtained by applying phrase structure rule
(s1) NP1-Aux-V-NP2-->NP2-Aux+be+en-V-by+NP1(s2).
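Applying the passive rule to the example sentence can be sketched as below. The sketch assumes the sentence has already been segmented into NP1, Aux, V and NP2, and it uses a small hypothetical lookup table to realize en + V as a past participle, which stands in for the morphophonemic rules.

```python
# Hypothetical lookup for the "en" (past participle) form of a verb; a real
# system would use morphophonemic rules or a full lexicon instead.
PAST_PARTICIPLE = {"catch": "caught", "eat": "eaten", "see": "seen"}

def passivize(np1, aux, verb, np2):
    """Apply NP1-Aux-V-NP2 --> NP2-Aux+be+en-V-by+NP1 to a segmented sentence."""
    participle = PAST_PARTICIPLE[verb]   # en + V
    return [np2, aux, "be", participle, "by", np1]

s1 = ("the police", "will", "catch", "the snatcher")   # NP1, Aux, V, NP2
s2 = passivize(*s1)
sentence = " ".join(s2)
print(sentence[0].upper() + sentence[1:])   # The snatcher will be caught by the police
```

The structure changes from active to passive while the meaning of the sentence is preserved, which is what the rule is meant to capture.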
Processing Indian Languages
मैं फल खाता हूँ। (main phaL khaaTaa huun.)
(S + O + V)
मैं खाता हूँ फल। (main khaaTaa huun phaL.)
(S +V+O)
(standardization rules for spelling)
Complex predicates (CPs) are combinations of two lexical items.
The first and second lexical items of a complex predicate are called the polar and the vector respectively. The way these two lexical items come together to form a CP is quite interesting to examine. Consider the CP [paaTa maaDu] ‘teach lesson’. In this example, the first constituent [paaTa] ‘lesson’ is the polar and the second constituent [maaDu] ‘do’ is the vector.
NLP Applications
The first application of NLP was machine translation; more recent progress has been in information retrieval, information extraction, text summarization, etc.
Machine Translation: translation from one human language to another. It demands knowledge of the words, phrases and grammars of the two languages involved, as well as world knowledge.
Speech Recognition: Process of mapping acoustic speech signals to a set of words.
Difficulty: wide variation in the pronunciation of words, homophones (dear and deer) and acoustic ambiguities (e.g., “in the rest” and “interest”).
Speech Synthesis: automatic production of speech. Such systems can read out mails on the telephone, or even read out a storybook for you.
Natural Language Interfaces to Databases: allow querying a structured database using natural language sentences.
Information Extraction:
It captures and outputs factual Information contained in a document.
It extracts structured information from unstructured and/or semi-structured machine-readable documents.
In most cases this activity concerns processing human language texts by means of natural language processing (NLP).
Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video could be seen as information extraction.
Information Retrieval: An IR system assists users in finding the information they require, but it does not explicitly return answers to their questions. It reports the existence and location of documents that might contain the required information. It is concerned with identifying the documents relevant to a user's query.
Example: Google Search
Question Answering: given a question and a set of documents, a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears. A question answering system differs from an information extraction system, but benefits from having one to identify entities in the text.
Text Summarization: deals with creation of summaries of documents and involves syntactic, semantic and discourse level processing of text.
Some Successful Early NLP Systems
ELIZA: an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum.
SysTran (System Translation): the first machine translation tool, developed in 1969 for Russian-English translation. SysTran provided the first online machine translation service, called Babel Fish, which was used by AltaVista to handle translation requests from users.
TAUM METEO: a natural language generation system used in Canada to generate weather reports. It accepts daily weather data and generates weather reports in English and French.
SHRDLU (Winograd 1972): a natural language understanding system that simulates the actions of a robot in a blocks world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.
LUNAR (Woods 1977): a question answering system that answered questions about moon rocks.
Information Retrieval
Information retrieval deals with unstructured data. It is performed based on the content of the document rather than its structure.
Approaches for accessing large text collections can be broadly classified into 2 categories.
Issues involved in the design and evaluation of IR Systems
A. Polysemy: a lexeme with multiple meanings
Example: He fixed his hair.
They fixed a date for the wedding.
B. Homonymy: ambiguity in which words that sound or look the same have unrelated meanings, e.g., kneed/need, whole/hole, right/write
C. Synonymy: creates a problem when a document is indexed with one term and the query contains a different term, and the two terms share a common meaning.
D. It ignores semantics and contextual information in the retrieval process.
E. Inappropriate characterization of queries by the user: the reason can be a lack of knowledge of the subject or even the inherent vagueness of natural language. The user may fail to include relevant terms in the query or may include irrelevant terms.
F. Matching the query representation with that of the document is another issue: selecting an appropriate similarity measure is crucial in the design of an IR system.
G. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely used measures of effectiveness.
H. The goal of IR is to search for documents in a manner relevant to the query; understanding what constitutes relevance is an important issue.
I. The size of document collections and the varying needs of users also complicate text retrieval. Some users require answers of limited scope, while others require documents with a wider scope.
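Several of the issues above (synonymy, the choice of similarity measure, and evaluation by precision and recall) can be seen together in a small sketch. The three documents, the query, the relevance judgements and the use of Jaccard similarity over word sets are all assumptions made for illustration; real IR systems use richer representations and measures.

```python
def tokenize(text):
    return set(text.lower().split())

def jaccard(query_terms, doc_terms):
    """Similarity measure: shared terms divided by all distinct terms."""
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def retrieve(query, documents, threshold=0.0):
    """Return ids of documents whose similarity to the query exceeds the threshold."""
    q = tokenize(query)
    return {doc_id for doc_id, text in documents.items()
            if jaccard(q, tokenize(text)) > threshold}

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0   # retrieved docs that are relevant
    recall = hits / len(relevant) if relevant else 0.0        # relevant docs that were retrieved
    return precision, recall

# Hypothetical collection and relevance judgements for the query "car sale".
documents = {
    "d1": "cheap car for sale",
    "d2": "used automobile dealer",   # relevant, but uses a synonym of "car"
    "d3": "car racing news",          # shares a term, but is not relevant
}
relevant = {"d1", "d2"}

retrieved = retrieve("car sale", documents)
print(sorted(retrieved))
print(precision_recall(retrieved, relevant))
```

Here d2 is missed because it is indexed under "automobile" rather than "car", which lowers recall, while d3 is retrieved on a shared term alone, which lowers precision.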
Why NLP?
To design, implement and test systems that can process natural language for practical applications.
Practical Applications:
Difficulties that we face while designing Algorithms for NLP
Example: Rose rose to get a twig.
Example: The man saw the boy with the binoculars
Flying planes can be dangerous
Ambiguities:
Hospitals are sued by 7 foot doctors.
Stolen painting found by tree.
Teacher strikes idle kids.
A "morpheme" is a short segment of language that meets three basic criteria:
1. It is a word or a part of a word that has meaning.
2. It cannot be divided into smaller meaningful segments without changing its meaning or leaving a meaningless remainder.
3. It has relatively the same stable meaning in different verbal environments.