1 of 42

6.1120: Parsing

Basic Concepts

2 of 42

Goal: Tokenization for (2-1)+1

Op

+

Open

(

Close

)

Int

1

Op

-

Int

2

Int

1

Token:

Text:

3 of 42

Lexical Categories Example

IfKeyword = if
WhileKeyword = while
Operator = +|-|*|/
Integer = [0-9] [0-9]*
Float = [0-9]*. [0-9]*
Identifier = [a-z]([a-z]|[0-9])*
Note that [0-9] = (0|1|2|3|4|5|6|7|8|9)

[a-z] = (a|b|c|…|y|z)

Will use lexical categories in next level

4 of 42

Lexical Structure in Languages

Each language typically has several categories of words. In a typical programming language:

Keywords (if, while)
Arithmetic Operations (+, -, *, /)
Integer numbers (1, 2, 45, 67)
Floating point numbers (1.0, .2, 3.337)
Identifiers (abc, i, j, ab345)

Typically have a lexical category for each keyword and/or each category
Each lexical category defined by regexp

5 of 42

Write a Regular Expression

All strings of the wedge alphabet { <, > }

(<|>)*

Strings with open wedges followed by closed wedges

<*>*

Strings with matching wedges

Not with a regular expression!

6 of 42

Nested Expressions

Are regular languages sufficient for specifying syntax?
Try the following

(a+(b-c))*(d-(x-(y-z)))
if (x < y) if (y < z) a = 5 else a = 6 else a = 7

7 of 42

Programming Language Syntax

Regular languages suboptimal for specifying programming language syntax
Why? Constructs with nested syntax

(a+(b-c))*(d-(x-(y-z)))
if (x < y) if (y < z) a = 5 else a = 6 else a = 7

Regular languages lack state required to model nesting
Canonical example: nested expressions
No regular expression for language of parenthesized expressions

8 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

9 of 42

Solution – Context-Free Grammar

Set of terminals

{ Op, Int, Open, Close }

Each terminal defined

by regular expression

Set of nonterminals

{ Start, Expr }

Set of productions

Single nonterminal on LHS
Sequence of terminals and nonterminals on RHS

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

Start → Expr

Expr → Expr Op Expr

Expr → Int

Expr → Open Expr Close

10 of 42

Generation

start with Start nonterminal

repeat

choose a nonterminal

choose a production with that nonterminal in LHS

replace nonterminal with RHS of production

until no more nonterminals in the string

11 of 42

Sample Derivation for <2-1>+1

Start

Expr

Expr Op Expr

Open Expr Close Op Expr

Open Expr Op Expr Close Op Expr

Open Int Op Expr Close Op Expr

Open Int Op Expr Close Op Int

Open Int Op Int Close Op Int

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) Start → Expr

2) Expr → Expr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

12 of 42

Sample Derivation for <2-1>+1

Open Int Op Int Close Op Int

< Int Op Int Close Op Int

< [0-9][0-9]* Op Int Close Op Int

< 2 Op Int Close Op Int

< 2 +|-|*|/ Int Close Op Int

< 2 - Int Close Op Int

< 2 - [0-9][0-9]* Close Op Int

< 2 - 1 Close Op Int

< 2 - 1 > Op Int

< 2 - 1 > +|-|*|/ Int

< 2 - 1 > + Int

< 2 - 1 > + [0-9][0-9]*

< 2 - 1 > + 1

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) Start → Expr

2) Expr → Expr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

13 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

14 of 42

Parse Tree

Internal Nodes: Nonterminals
Leaves: Terminals
Edges:

From Nonterminal of LHS of production
To Nodes from RHS of production

Captures derivation of string

Gives the first specification of the categories and structure of our language primitives!

15 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

1) Start → Expr

2) Expr → Expr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

16 of 42

Recognition (Parsing)

Parser: Converts program into a parse tree

Text -> Parse Tree (Recognition)

Can be written by hand
Or produced automatically by parser generator

Accepts a grammar as input
Produces a parser as output

17 of 42

Terminals (aka Lexing)

< 2 - 1 > + 1

Open 2 - 1 > + 1

Open Int - 1 > + 1

Open Int Op 1 > + 1

Open Int Op Int > + 1

Open Int Op Int Close + 1

Open Int Op Int Close Op 1

Open Int Op Int Close Op Int

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) Start → Expr

2) Expr → Expr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

18 of 42

Parsing for <2-1>+1

Open Int Op Int Close Op Int

Open Expr Op Int Close Op Int

Open Expr Op Expr Close Op Int

Open Expr Close Op Int

Expr Op Int

Expr Op Expr

Expr

Start

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) Start → Expr

2) Expr → Expr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

19 of 42

What is this Grammar?

What is the language of the following grammar?

1) Start → S

2) S → ( L )

3) S → a

4) L → L , S

5) L → S

20 of 42

1) Start → S

2) S → ( L )

3) S → a

4) L → L , S

5) L → S

Consider: (a, (a, a))

(

a

,

(

a

,

a

)

S

L

S

L

S

21 of 42

Parse Tree

Internal Nodes: Nonterminals
Leaves: Terminals
Edges:

From Nonterminal of LHS of production
To Nodes from RHS of production

Captures derivation of string

Gives the first specification of the categories and structure of our language primitives!

22 of 42

Parser

Converts program into a parse tree
Can be written by hand
Or produced automatically by parser generator

Accepts a grammar as input
Produces a parser as output

Next lecture: implementation techniques

23 of 42

A Problem

Start

Expr

Op

+

Expr

Op

-

Int

2

Int

1

Int

1

Parse Tree: 2-1+1

Tree corresponding

to <2-1>+1

24 of 42

A Problem

Start

Expr

Op

+

Expr

Op

-

Int

2

Int

1

Int

1

Start

Expr

Op

-

Expr

Op

+

Int

1

Int

1

Int

2

Parse Tree: 2-1+1

Tree corresponding

to <2-1>+1

Tree corresponding

to 2-<1+1>

25 of 42

Ambiguity in Grammar

Grammar is ambiguous if there are multiple derivations (therefore multiple parse trees) for a single string

Derivation and parse tree usually reflect semantics of the program

Ambiguity in grammar often reflects ambiguity in semantics of language

(which is considered undesirable)

26 of 42

Eliminating Ambiguity

Solution: hack the grammar

Conceptually, makes all operators associate to left

Original Grammar

Start → Expr

Expr → Expr Op Expr

Expr → Int

Expr → Open Expr Close

Hacked Grammar

Start → Expr

Expr → Expr Op Int

Expr → Int

Expr → Open Expr Close

27 of 42

Parse Trees for Hacked Grammar

Start

Expr

Op

+

Expr

Op

-

Int

2

Int

1

Int

1

Start

Expr

Op

-

Expr

Op

+

Int

1

Int

1

Int

2

Only one parse tree for 2-1+1!

Valid parse tree

No longer valid parse tree

28 of 42

Precedence Violations

All operators associate to left

Start

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

29 of 42

Precedence Violations

All operators associate to left
Violates precedence of * over +

2-3*4 associates like <2-3>*4

Start

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

We’ll come back to this later

30 of 42

Parser

Converts program into a parse tree
Can be written by hand
Or produced automatically by parser generator

Accepts a grammar as input
Produces a parser as output

Strategy:

Start with intuitive but ambiguous grammar
Hack grammar to make it unambiguous

31 of 42

Summary

Lexical and Syntactic Levels of Structure

Lexical – regular expressions and automata
Syntactic – grammars

Generation versus Recognition Approaches

Generation more convenient for specification
Recognition required in implementation

Grammar ambiguities and Precedence

Hacked grammars

Next up: algorithms for implementation and implementation concerns

32 of 42

End

33 of 42

Precedence Violations

All operators associate to left

Start

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

We’ll come back to this later

34 of 42

Precedence Violations

All operators associate to left
Violates precedence of * over +

2-3*4 associates like <2-3>*4

Start

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

35 of 42

Hacking Around Precedence

Original Grammar

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

Start → Expr

Expr → Expr Op Int

Expr → Int

Expr → Open Expr Close

Hacked Grammar

AddOp = +|-

MulOp = *|/

Int = [0-9] [0-9]*

Open = <

Close = >

Start → Expr

Expr → Expr AddOp Term

Expr → Term

Term → Term MulOp Num

Term → Num

Num → Int

Num → Open Expr Close

36 of 42

Parse Tree Changes

Start

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Old parse tree

for 2-3*4

Start

Expr

AddOp

-

Term

New parse tree

for 2-3*4

Term

MulOp

*

Num

Int

4

Num

Int

2

Num

Int

3

37 of 42

General Idea

Group Operators into Precedence Levels

* and / in one level level, bind strongest
+ and – in another level, bind next strongest

Nonterminal for each Precedence Level

Term is nonterminal for * and /
Expr is nonterminal for + and –

Stronger precedence level, lower in grammar/parse tree
Can make operators left or right associative within each level
Generalizes for arbitrary levels of precedence

38 of 42

Parse Tree

Practical problem

Parse tree for hacked grammar is complicated
Would like to start with more intuitive parse tree

Abstract versus Concrete Syntax

Abstract syntax corresponds to “intuitive” way of thinking of structure of program

Omits details like superfluous keywords that are there to make the language unambiguous
Abstract syntax may be ambiguous

Concrete Syntax corresponds to full grammar used to parse the language

Parsers are often written to produce abstract syntax trees.

39 of 42

Abstract Syntax Trees

Start with intuitive but ambiguous grammar
Hack grammar to make it unambiguous

Concrete parse trees
Less intuitive

Convert concrete parse trees to abstract syntax trees

Correspond to intuitive grammar for language
Simpler for program to manipulate

40 of 42

Example

Intuitive but Ambiguous Grammar

Op = *|/|+|-

Int = [0-9] [0-9]*

Start → Expr

Expr → Expr Op Expr

Expr → Int

Hacked Unambiguous Grammar

AddOp = +|-

MulOp = *|/

Int = [0-9] [0-9]*

Open = <

Close = >

Start → Expr

Expr → Expr AddOp Term

Expr → Term

Term → Term MulOp Num

Term → Num

Num → Int

Num → Open Expr Close

41 of 42

Concrete parse tree

for <2-3>*4

Start

Expr

Op

*

Expr

Op

-

Int

2

Expr

Int

4

Expr

Int

3

Abstract syntax tree

for <2-3>*4

Uses intuitive grammar
Eliminates superfluous terminals

Open
Close

Start

Expr

AddOp

-

Term

MulOp

*

Num

Int

4

Num

Int

2

Num

Int

3

42 of 42

Start

Expr

Op

*

Op

-

Int

2

Int

4

Int

3

Start

Expr

Op

*

Expr

Op

-

Int

2

Expr

Int

4

Expr

Int

3

Abstract parse tree

for <2-3>*4

Further simplified abstract syntax tree

for <2-3>*4