
Compiler Design

UNIT -I


Compiler Design by Dr. S.Siva Nageswara Rao


Introduction

Translator: It is a program that translates one language to another Language. Example: Compiler

Compiler: A compiler is a program that accepts a program written in a High Level Language and produces an object (low-level) program.

Properties of Compiler:

1. The compiler itself must be bug-free.
2. It must generate correct machine code.
3. The generated machine code must run fast.
4. The compiler itself must run fast.
5. The compiler must be portable.
6. It must give good diagnostics and error messages.
7. The generated code must work well with existing debuggers.
8. It must perform consistent optimizations.


  • Compilation can be done in two parts: analysis and synthesis.
  • In the analysis part, the source program is read and broken down into constituent pieces. The syntax and the meaning of the source string are determined, and an intermediate code is created from the input source program.
  • In the synthesis part, this intermediate form of the source language is taken and converted into an equivalent target program. During this process, if certain code has to be optimized for efficient execution, the required code is optimized.


Language Processing System


  • The Language Processing System can be represented as shown in the figure below.
  • PRE-PROCESSOR: A pre-processor produces input to compilers. It may perform the following functions.
  • Macro processing: A pre-processor may allow a user to define macros that are shorthands for longer constructs.
  • File inclusion: A pre-processor may include header files into the program text.
  • Rational pre-processors: These pre-processors augment older languages with more modern flow-of-control and data-structuring facilities.
  • Language extensions: These pre-processors attempt to add capabilities to the language by what amounts to built-in macros.


COMPILER:

  • A compiler is a translator: a program that translates a program written in a high-level language (HLL), the source program, into an equivalent program in machine-level language (MLL), the target program.

ASSEMBLER:

  • Programmers found it difficult to write or read programs in machine language.
  • They began to use a mnemonic (symbol) for each machine instruction, which they would subsequently translate into machine language.
  • Such a mnemonic machine language is now called an assembly language.
  • Programs known as assemblers were written to automate the translation of assembly language into machine language.
  • The input to an assembler is called the source program; the output is a machine-language translation (object program).


Difference between Compiler & interpreter

  • A compiler converts high-level instructions into machine language, while an interpreter converts high-level instructions into an intermediate form.
  • The compiler translates the entire program before execution, whereas an interpreter translates the first line, executes it, and so on.
  • A list of errors is produced by the compiler after the compilation process, while an interpreter stops translating at the first error.
  • An independent executable file is created by the compiler, whereas an interpreted program requires the interpreter each time it runs.
  • The compiler produces object code, whereas an interpreter does not.
  • In compilation, the program is analyzed only once and then the code is generated, whereas a source program is analyzed and interpreted every time it is executed; hence an interpreter is less efficient than a compiler.
  • Example of an interpreter: the UPS debugger is basically a graphical source-level debugger, but it contains a built-in C interpreter which can handle multiple source files. Example of a compiler: the Borland C compiler or Turbo C compiler compiles programs written in C or C++.


LOADER AND LINK-EDITOR:

  • Once the assembler produces an object program, that program must be placed into memory and executed.
  • The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine-language program to be executed.
  • This would waste core by leaving the assembler in memory while the user’s program was being executed. Also, the programmer would have to retranslate his program with each execution, thus wasting translation time.
  • To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
  • “A loader is a program that places programs into memory and prepares them for execution.” It would be more efficient if subroutines could be translated into object form which the loader could “relocate” directly behind the user’s program.
  • The task of adjusting programs so that they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.


Phases of Compiler

A compiler operates in phases. A phase is a logically cohesive operation that takes the source program in one representation and produces output in another representation.

There are two phases of compilation.

  • Analysis Phase or Front-end (Machine Independent/Language Dependent)
  • Synthesis Phase or Back-end (Machine Dependent/Language independent)

The compilation process is partitioned into a number of sub-processes called ‘phases’.


Lexical analysis:

  • It is also called a scanner, because it scans (reads) the program.
  • It is the first phase of the compiler. It takes the source program as input and produces tokens as output.
  • It reads the characters one by one, from left to right, and forms tokens.

Tokens: It represents a logically cohesive sequence of characters such as keywords, operators, identifiers, special symbols etc.

Example: position = initial + rate * 60

Here, position, initial, rate, =, +, * and 60 are all separate tokens.

  • Group of characters forming a token is called the Lexeme.
  • The lexical analyser not only generates a token but also enters the lexeme into the symbol table if it is not already there.
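As a sketch of this phase, a tiny scanner for the running example can be written with Python regular expressions (the token names and patterns here are illustrative, not the author's):

```python
import re

# Token patterns, tried in order within the combined regex.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace separates tokens but is not one
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Read characters left to right and group them into (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position = initial + rate * 60"))
```

Running this on the example above yields the seven tokens listed in the text.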


Syntax Analysis:-

  • It is the second phase of the compiler. It is also known as parser.
  • It gets the token stream as input from the lexical analyser of the compiler and generates syntax tree as the output.
  • Syntax tree: It is a tree in which the interior nodes are operators and the leaf (exterior) nodes are operands.

Example: For position=initial + rate*60, syntax tree is
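A sketch of such a syntax tree, built by hand in Python (the node class is illustrative, not the author's representation):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

# Interior nodes are operators, leaves are operands:
#         =
#        / \
#  position  +
#           / \
#     initial  *
#             / \
#          rate  60
tree = Node("=",
            Node("position"),
            Node("+",
                 Node("initial"),
                 Node("*", Node("rate"), Node("60"))))

def inorder(n):
    """An in-order walk recovers the source order of the lexemes."""
    if n is None:
        return []
    return inorder(n.left) + [n.value] + inorder(n.right)

print(inorder(tree))
```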


Semantic Analysis:

  • It is the third phase of the compiler; it checks the source program for semantic errors.
  • It uses the output of the syntax-analysis phase to identify the operators and operands of expressions and statements.
  • Semantic analysis performs type checking, i.e., it checks that each operator has operands that are permitted by the source-language specification.


Intermediate Code Generations:-

  • It is the fourth phase of the compiler.
  • It gets input from the semantic analysis and converts the input into output as intermediate code such as three-address code.
  • The three-address code consists of a sequence of instructions, each of which has at most three operands.

Example: for position = initial + rate * 60, the three-address code is

t1 = rate * 60
t2 = initial + t1
position = t2

Code Optimization:- It is the fifth phase of the compiler.

  • It gets the intermediate code as input and produces optimized intermediate code as output.
  • This phase reduces the redundant code and attempts to improve the intermediate code so that faster-running machine code will result.


  • Code optimization does not affect the result of the program.
  • To improve the generated code, optimization involves:
  • detection and removal of dead (unreachable) code;
  • calculation of constant expressions and terms (constant folding);
  • collapsing of repeated (common) subexpressions into a temporary variable;
  • loop unrolling;
  • moving loop-invariant code outside the loop;
  • removal of unwanted temporary variables.
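Two of the optimizations above, constant calculation and removal of a temporary, can be sketched on three-address quadruples in Python (the representation is illustrative, not the author's compiler):

```python
import operator

# Map operator symbols to functions so constant operands can be evaluated.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(code):
    """Replace quadruples whose operands are numeric literals with their value,
    dropping the temporary that held the constant."""
    out, consts = [], {}
    for dest, op, a, b in code:
        a = consts.get(a, a)
        b = consts.get(b, b) if b is not None else None
        if op in OPS and a.isdigit() and b is not None and b.isdigit():
            consts[dest] = str(OPS[op](int(a), int(b)))   # fold, drop the temp
        else:
            out.append((dest, op, a, b))
    return out

before = [
    ("t1", "*", "60", "1"),         # t1 = 60 * 1   -- constant expression
    ("t2", "*", "rate", "t1"),      # t2 = rate * t1
    ("position", "=", "t2", None),  # position = t2
]
print(fold_constants(before))
```

The constant 60 * 1 is computed at compile time and the temporary t1 disappears from the emitted code.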


Code Generation:

  • It is the final phase of the compiler.
  • It gets input from code optimization phase and produces the target code or object code as result.
  • Intermediate instructions are translated into a sequence of machine instructions that perform the same task.
  • The code generation involves

- allocation of registers and memory
- generation of correct references
- generation of correct data types
- generation of missing code


Symbol Table Management :-

  • Symbol table is used to store all the information about identifiers used in the program.
  • It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
  • It allows to find the record for each identifier quickly and to store or retrieve data from that record.
  • Whenever an identifier is detected in any of the phases, it is stored in the symbol table.

Data structures used for the symbol table:

  • Linear Table
  • List
  • Trees
  • Hash table


Symbol table example

Operations: Operations on the symbol table are categorized into two types.

Block storage structure: more than one block in the program: { } { } { }

Operations: insert, lookup, modify, and delete

Non-block storage structure: only one block in the entire program: { }

Operations: insert and lookup
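A minimal hash-table symbol table supporting these operations can be sketched in Python (the class and field names are illustrative):

```python
# Python dicts are hash tables, one of the data structures listed above.
class SymbolTable:
    def __init__(self):
        self.table = {}

    def insert(self, name, **attrs):
        """Add an identifier (or extend an existing entry) with attributes."""
        self.table.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        """Return the entry for name, or None if it is not in the table."""
        return self.table.get(name)

    def modify(self, name, **attrs):
        if name in self.table:
            self.table[name].update(attrs)

    def delete(self, name):
        self.table.pop(name, None)

st = SymbolTable()
st.insert("rate", type="float", scope=0)
st.modify("rate", type="int")
print(st.lookup("rate"))
```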


Error Handlers:-

  • Each phase can encounter errors. After detecting an error, a phase must handle it so that compilation can proceed.
  • In lexical analysis, errors occur in the separation of tokens. Examples: Int X;  inti x,y:  float m=5..7;
  • In syntax analysis, errors occur during construction of the syntax tree. Example: if(no=200)
  • In semantic analysis, errors occur when the compiler detects constructs with the right syntactic structure but no meaning, and during type conversion. Example: x = a + b, where x is int, a is char, and b is boolean.
  • In code optimization, errors occur when the result is affected by the optimization.
  • In code generation, errors are reported when code is missing, etc.


Lexical Analysis

  • Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
  • A program or function which performs lexical analysis is called a lexical analyzer or scanner.
  • A lexer often exists as a single function which is called by a parser or another function.

The Role Of The Lexical Analyzer:

  • The lexical analyzer is the first phase of a compiler.
  • Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
  • Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters until it can identify the next token.


Why is the Lexical Analyzer separated from the Syntax Analyzer?

  • To make the design of the compiler simpler.
  • To improve the efficiency of the compiler.
  • To enhance compiler portability.


Token: A token is a sequence of characters that can be treated as a unit with logical meaning.

Examples:

Keywords: for, while, do, if

Identifiers: variable names, function names

Operators: +, ++, -, /

Separators: , . ;


Lexeme: It is a sequence of characters in the source program that is matched by the pattern for a token.

Example: “float”, “=”, “222”, “;”

Pattern: It is a rule describing all the lexemes that can represent a particular token in a source language.


Attributes for Tokens

  • Some tokens have attributes that can be passed back to the parser.
  • The lexical analyzer collects information about tokens into their associated attributes.
  • The attributes influence the translation of tokens.

i) Constant : value of the constant

ii) Identifiers: pointer to the corresponding symbol table entry.

Error Recovery Strategies In Lexical Analysis:

The following are the error-recovery actions in lexical analysis

1) Deleting an extraneous character.

2) Inserting a missing character.

3) Replacing an incorrect character by a correct character.

4) Transposing two adjacent characters.

5) Panic mode recovery: Deletion of successive characters from the token until error is resolved.


Input Buffering

  • Input buffering is an essential technique in the design of compilers.
  • In this phase, the compiler reads the source code character by character to identify tokens, which are the fundamental units of meaning such as keywords, identifiers, operators, and symbols.
  • The basic idea of input buffering is to use a buffer, which is a block of memory where the source code is temporarily stored.
  • In the absence of input buffering, the compiler would read each character individually, a process that is both slow and inefficient.
  • Input buffering addresses this issue by loading large chunks of characters into memory simultaneously, thereby reducing the number of read operations required.


How Input Buffering Works

There are typically two types of buffers used:

Single Buffer: A single large block of memory that holds part of the source code.

  • The lexical analyzer processes the buffer character by character to detect tokens.
  • Once the buffer is depleted, it is refilled with the next set of characters, and the cycle continues.
  • Although straightforward, this approach can be inefficient as it requires processing to halt each time the buffer is replenished.


2. Double Buffer: Two blocks of memory, used alternately, to ensure that while one buffer is being processed, the other can be filled with new characters from the source file

  • Double buffering is a more efficient method where two buffers are utilized.
  • As the lexical analyzer processes characters from one buffer, the second buffer is simultaneously loaded with the next set of characters from the source file.
  • This concurrent processing and reading ensure a steady stream of characters and minimizes wait times.


Sentinels

  • For each character read, we make two tests: one for the end of the buffer, and one to determine what character is read.
  • We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
  • The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
  • In the sentinel arrangement, each buffer half ends with the sentinel character eof.

  • Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.
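The double-buffer-with-sentinel idea can be sketched in Python (the buffer size and sentinel choice here are illustrative):

```python
EOF = "\0"          # sentinel; assumed not to occur in the source text
BUF = 8             # capacity of each buffer half (illustrative)

def buffered_chars(source):
    """Yield source characters using alternating buffer halves, each ending
    in a sentinel. Hitting the sentinel is the only end-of-buffer test."""
    halves = [source[i:i + BUF] + EOF for i in range(0, len(source), BUF)] or [EOF]
    for half in halves:
        i = 0
        while True:
            c = half[i]
            i += 1
            if c == EOF:     # one test covers "end of buffer half" and true EOF
                break        # refill: move on to the other half
            yield c

text = "position = initial + rate * 60"
assert "".join(buffered_chars(text)) == text
```

Each character is examined with a single comparison; only when the sentinel is seen does the code pay the extra cost of deciding whether to refill.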


Advantages of Input Buffering

1. Efficiency: By reading large blocks of data at once, input buffering reduces the number of input operations, making the process faster.

2. Reduced Latency: Double buffering ensures that while one buffer is being processed, the other is being filled, reducing waiting time and increasing the overall speed of the lexical analysis.

3. Smooth Processing: The use of sentinels helps in seamless buffer transitions, avoiding constant end-of-buffer checks.


REGULAR EXPRESSIONS

A regular expression is a formula that describes a possible set of strings.

Components of a regular expression:

  • x - the character x
  • . - any character, usually except a newline
  • [xyz] - any one of the characters x, y, z
  • R? - an R or nothing (i.e., R is optional)
  • R* - zero or more occurrences of R
  • R+ - one or more occurrences of R
  • R1R2 - an R1 followed by an R2
  • R1 | R2 - either an R1 or an R2
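These operators can be checked quickly with Python's re module, whose syntax matches the notation above (a sketch):

```python
import re

# Each assertion exercises one of the components listed above.
assert re.fullmatch(r"ab?c", "ac")        # R? : the b is optional
assert re.fullmatch(r"ab*c", "abbbc")     # R* : zero or more b's
assert re.fullmatch(r"ab+c", "abc")       # R+ : one or more b's
assert not re.fullmatch(r"ab+c", "ac")    # R+ needs at least one b
assert re.fullmatch(r"a|b", "b")          # R1 | R2 : either alternative
assert re.fullmatch(r"[xyz]", "y")        # [xyz] : any one of x, y, z
print("all regex checks passed")
```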


  • Identifier: defined to be a letter followed by zero or more letters or digits. In regular-expression notation we would write:

identifier = letter (letter | digit)*

Here are the rules that define the regular expressions over an alphabet Σ:

  • ε is a regular expression denoting { ε }, the language containing only the empty string.
  • For each ‘a’ in Σ, a is a regular expression denoting { a }, the language with only one string, consisting of the single symbol ‘a’.

If R and S are regular expressions, then

  • (R) | (S) denotes Lr ∪ Ls
  • R.S denotes Lr.Ls
  • R* denotes Lr*


REGULAR DEFINITIONS

  • Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.

Example:

  • ab*|cd? is equivalent to (a(b*)) | (c(d?))

Pascal identifier:

  • letter - A | B | … | Z | a | b | … | z
  • digit - 0 | 1 | 2 | … | 9
  • id - letter (letter | digit)*
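The regular definition for identifiers can be tried directly as a Python pattern (a sketch):

```python
import re

# id -> letter (letter | digit)* as a Python regular expression.
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

assert identifier.fullmatch("rate")
assert identifier.fullmatch("x25")
assert not identifier.fullmatch("25x")   # must begin with a letter
print("identifier pattern ok")
```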


Recognition of tokens:

The starting point is the language grammar, which tells us what the tokens are:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε

expr → term relop term
     | term

term → id
     | number

For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals”, because this presents an interesting structure of lexemes.


Recognition of tokens:

Now we know how to specify the tokens for one language. But how do we write a program to recognize them?

To handle white space:

ws → (blank | tab | newline)+


TRANSITION DIAGRAM:

  • As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart, called a transition diagram.
  • A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input, looking for a lexeme that matches one of several patterns.
  • Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.
  • Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
  • The analyzer finds a lexeme by moving from position to position in the diagram as characters are read.


Components of a Transition Diagram

  1. One state is labelled the start state. It is the initial state of the transition diagram, where control resides when we begin to recognize a token.
  2. Positions in a transition diagram are drawn as circles and are called states.
  3. The states are connected by arrows called edges. Labels on edges indicate the input characters.
  4. Zero or more final states, or accepting states, in which a token has been found, are represented by double circles.


  • We begin in state 0, the start state.
  • If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=, go to state 1, and look at the next character. 
  • If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator.
  • If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an indication that the not-equals operator has been found.
  • On any other character, the lexeme is <, and we enter state 4 to return that information.
  • Note: State 4 has a * to indicate that we must retract the input one position.


  • If in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5.
  • The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see the = sign), or just >.   
  • Note: if in state 0 we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
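The relop transition diagram described above can be transcribed directly into code; a Python sketch (state numbers follow the text, attribute names are illustrative):

```python
def relop(chars):
    """Return (token, attribute, retract) for a relational-operator prefix,
    or None if the input does not start with a relop lexeme."""
    c = next(chars, "")
    if c == "<":                                   # state 1
        c = next(chars, "")
        if c == "=":  return ("relop", "LE", 0)    # state 2: lexeme <=
        if c == ">":  return ("relop", "NE", 0)    # state 3: lexeme <>
        return ("relop", "LT", 1)                  # state 4: retract one char
    if c == "=":                                   # state 5: lexeme =
        return ("relop", "EQ", 0)
    if c == ">":                                   # state 6
        c = next(chars, "")
        if c == "=":  return ("relop", "GE", 0)    # state 7: lexeme >=
        return ("relop", "GT", 1)                  # state 8: retract one char
    return None                                    # not a relop lexeme

print(relop(iter("<=")))
print(relop(iter("<5")))
```

The third element of the result plays the role of the * annotation: it records how many characters must be retracted.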


Recognition of Reserved Words and Identifiers

There are two ways that we can handle reserved words that look like identifiers: (Method-I)

  • Install the reserved words in the symbol table initially
  • installID() checks if the lexeme is already in the table. If it is not present, the lexeme is installed as an id token. In either case a pointer to the entry is returned.
  • getToken() examines the lexeme and returns the token name, either id or a name corresponding to a reserved keyword.
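Method I can be sketched in Python, with the keywords pre-installed in the symbol table (the function and field names are illustrative):

```python
# Pre-install the reserved words, so installID()/getToken() need no special cases.
symtab = {kw: {"token": kw} for kw in ("if", "then", "else", "while")}

def installID(lexeme):
    """Return the symbol-table entry, installing the lexeme as an id if new."""
    return symtab.setdefault(lexeme, {"token": "id"})

def getToken(lexeme):
    """Return the token name: either 'id' or the matching reserved keyword."""
    return installID(lexeme)["token"]

print(getToken("then"))   # reserved word found in the table
print(getToken("rate"))   # ordinary identifier, now installed
```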


Method-II

  • Create a separate transition diagram for each keyword; an example for the keyword then is shown in the figure.
  • Such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a “non-letter-or-digit”, i.e., any character that cannot be the continuation of an identifier.


Completion of the Running Example

Figure: Transition diagram for unsigned numbers


  • We start in state 12; if we see a digit, we go to state 13.
  • In state 13 we can read any number of additional digits.
  • At this point we have seen a number in the form of an integer, e.g. 123.
  • If we see a dot (.) in state 13, then we have an “optional fraction”: state 14 is entered, and we require one or more additional digits, for which state 15 is used.
  • If in state 15 we see an ‘E’, we have an “optional exponent”, whose recognition is the job of states 16 through 19.
  • If in state 15 we instead see anything but ‘E’ or a digit, we have come to the end of the fraction; there is no exponent, and we return the lexeme found via state 21.
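The language recognized by states 12 through 21 can be written as a single regular expression and tried in Python (a sketch of the unsigned-number pattern: digits, optional fraction, optional exponent):

```python
import re

# digit+ ( . digit+ )? ( E (+|-)? digit+ )?
number = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

for lexeme in ("123", "3.14", "6.02E+23", "1E5"):
    assert number.fullmatch(lexeme), lexeme
assert not number.fullmatch("12.")    # a fraction needs at least one digit
print("number pattern ok")
```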


Recognizing Whitespace

The diagram itself is quite simple reflecting the simplicity of the corresponding regular expression.

  • The delim in the diagram represents any of the whitespace characters, space, tab, and newline.
  • The final state carries a * because we must find a non-whitespace character in order to know when the whitespace ends; that character begins the next token.
  • There is no action performed at the accepting state. Indeed, the lexer does not return to the parser, but starts again from its beginning, as it still must find the next token.


LEX

  • Lex is a computer program that generates lexical analyzers.
  • Lex is commonly used with the YACC parser generator.
  • Creating a lexical analyser

First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language.

Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.

Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.


Lex Specification

A Lex program consists of three parts:

{ definitions }

%%

{ rules }

%%

{ user subroutines }

  • Definitions include declarations of variables, constants, and regular definitions.
  • Rules are statements of the form p1 {action1} p2 {action2} … pn {actionn}, where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
  • User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.


YACC

  • YACC (Yet Another Compiler-Compiler) provides a general tool for describing the input to a computer program.
  • The Yacc user specifies the structure of the input, together with code to be invoked as each such structure is recognized.
  • Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user’s application handled by this subroutine.


FINITE AUTOMATA

  • A finite automaton is a mathematical model that consists of a number of states and edges.
  • It is a transition diagram that recognizes a regular expression or grammar.

Types of Finite Automata: There are two types of finite automata:

  • Non-deterministic Finite Automata (NFA)
  • Deterministic Finite Automata (DFA)

Non-deterministic Finite Automata: An NFA is a mathematical model that consists of five tuples, denoted by M = {Qn, Σ, δ, q0, fn}, where:

  • Qn - finite set of states
  • Σ - finite set of input symbols
  • δ - transition function that maps state-symbol pairs to sets of states
  • q0 - starting state
  • fn - set of final states


Deterministic Finite Automata: A DFA is a special case of an NFA in which

i) no state has an ε-transition, and

ii) there is at most one transition from each state on any input.

A DFA has five tuples, denoted by M = {Qd, Σ, δ, q0, fd}, where:

  • Qd - finite set of states
  • Σ - finite set of input symbols
  • δ - transition function that maps each state-symbol pair to at most one state
  • q0 - starting state
  • fd - set of final states

Construction of a DFA from a regular expression

The following steps are involved in constructing a DFA from a regular expression:

i) Convert the RE to an NFA using Thompson’s rules.

ii) Convert the NFA to a DFA (subset construction).

iii) Construct the minimized DFA.
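The end product of these steps is a DFA that can be run as a simple transition table; a hand-built example in Python for the classic pattern (a|b)*abb (illustrative; the alphabet is {a, b}):

```python
# Minimized DFA for (a|b)*abb: exactly one transition per state and symbol.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, ACCEPT = 0, {3}

def accepts(s):
    """Run the table on s; accept iff we end in an accepting state."""
    state = START
    for ch in s:
        state = DFA[state][ch]   # deterministic: no choice, no backtracking
    return state in ACCEPT

print(accepts("aabb"), accepts("ab"))
```

Because the automaton is deterministic, recognition takes one table lookup per input character.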
