1 of 47

Lexical Analysis

Dr. Noman Islam

2 of 47

Introduction

  • In this chapter we show how to construct a lexical analyzer
  • To implement a lexical analyzer by hand, it helps to start with a diagram or other description for the lexemes of each token. We can then write code to identify each occurrence of each lexeme on the input and to return information about the token identified.
  • We can also produce a lexical analyzer automatically by specifying the lexeme patterns to a lexical-analyzer generator and compiling those patterns into code that functions as a lexical analyzer.

3 of 47

Role of lexical analyzer

  • As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.
  • The stream of tokens is sent to the parser for syntax analysis
  • When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
  • Commonly, the interaction is implemented by having the parser call the lexical analyzer.
  • The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.

4 of 47

5 of 47

Other tasks of lexical analyzer

  • Stripping out comments and whitespace
  • Correlating error messages generated by the compiler with the source program. It can associate a line number with each error message.
  • Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

6 of 47

Why a separate phase

  • Simplicity of design is the most important consideration
  • Compiler efficiency is improved
  • Compiler portability is enhanced

7 of 47

Tokens, patterns and lexemes

  • A token is a pair consisting of a token name and an optional attribute value.
  • A pattern is a description of the form that the lexemes of a token may take.
  • A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

8 of 47

Tokens example

1. One token for each keyword. The pattern for a keyword is the same as

the keyword itself.

2. Tokens for the1 operators, either individually or in classes such as the token comparison mentioned in Fig. 3.2.

3. One token representing all identifiers.

4. One or more tokens representing constants, such as numbers and literal

5. Tokens for each punctuation symbol, such as left and right parentheses,

comma, and semicolon.

9 of 47

Attributes for tokens

  • In many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token;
  • The token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.
  • The appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier

10 of 47

11 of 47

Lexical errors

  • It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error
  • fi
  • Since f i is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler — probably the parser in this case — handle an error due to transposition of the letters.
  • However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
  • The simplest recovery strategy is "panic mode" recovery

12 of 47

  • We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left
  • Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters

  • The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.

13 of 47

Input buffering

  • Let us examine some ways that the simple but important task of reading the source program can be speeded.
  • We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.

14 of 47

  • Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.
  • As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its buffer before determining it.

15 of 47

Sentinels

  • We must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer.
  • We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.

16 of 47

17 of 47

String and language

  • An alphabet is any finite set of symbols
  • The set {0,1} is the binary alphabet
  • ASCII is an important example of an alphabet; it is used in many software systems
  • A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
  • A language is any countable set of strings over some fixed alphabet

18 of 47

Operations on language

  • Union
  • Concatenation
  • Closure

19 of 47

Regular expression

20 of 47

Regular definition

21 of 47

22 of 47

Recognition of tokens

23 of 47

24 of 47

Our goal

25 of 47

Transition diagram

26 of 47

27 of 47

Handling reserved words

  • Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. When we find an identifier, a call to installlD places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found.
  • Create separate transition diagrams for each keyword; an example for the keyword then

28 of 47

29 of 47

30 of 47

31 of 47

  • We could arrange for the transition diagrams for each token to be tried sequentially.
  • We could run the various transition diagrams "in parallel," feeding the next input character to all of them and allowing each one to make whatever transitions it required.
  • The preferred approach, and the one we shall take up in the following sections, is to combine all the transition diagrams into one.

32 of 47

Finite Automata

  • At the heart of the transition is the formalism known as finite automata.
  • Finite automat are recognizers; they simply say "yes" or "no" about each possible input string
  • Finite automat a come in two flavors:
    • (a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and e, the empty string, is a possible label.
    • (b) Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet exactly one edge with that symbol leaving that state.

33 of 47

Non deterministic finite automata

34 of 47

Representing NFA

35 of 47

Example of an NFA

36 of 47

Transition table

The string aabb is accepted by the NFA of Fig. 3.24

37 of 47

Deterministic Finite Automata

  • There are no moves on input ɛ, and
  • For each state s and input symbol a, there is exactly one edge out of s
  • If we are using a transition table to represent a DFA, then each entry is a single state, we may therefore represent this state without the curly braces that we use to form sets

38 of 47

Simulating a DFA

39 of 47

Operations on NFA states

40 of 47

Subset construction

41 of 47

42 of 47

43 of 47

44 of 47

From regular expression to NFA

45 of 47

46 of 47

47 of 47