1 of 42

6.1120: Parsing

Basic Concepts

2 of 42

Goal: Tokenization for (2-1)+1

Op

+

Open

(

Close

)

Int

1

Op

-

Int

2

Int

1

Token:

Text:

3 of 42

Lexical Categories Example

  • IfKeyword = if
  • WhileKeyword = while
  • Operator = +|-|*|/
  • Integer = [0-9] [0-9]*
  • Float = [0-9]*. [0-9]*
  • Identifier = [a-z]([a-z]|[0-9])*
  • Note that [0-9] = (0|1|2|3|4|5|6|7|8|9)

[a-z] = (a|b|c|…|y|z)

  • Will use lexical categories in next level

4 of 42

Lexical Structure in Languages

Each language typically has several categories of words. In a typical programming language:

    • Keywords (if, while)
    • Arithmetic Operations (+, -, *, /)
    • Integer numbers (1, 2, 45, 67)
    • Floating point numbers (1.0, .2, 3.337)
    • Identifiers (abc, i, j, ab345)
  • Typically have a lexical category for each keyword and/or each category
  • Each lexical category defined by regexp

5 of 42

Write a Regular Expression

  • All strings of the wedge alphabet { <, > }
    • (<|>)*

  • Strings with open wedges followed by closed wedges
    • <*>*

  • Strings with matching wedges
    • Not with a regular expression!

6 of 42

Nested Expressions

  • Are regular languages sufficient for specifying syntax?
  • Try the following
    • (a+(b-c))*(d-(x-(y-z)))
    • if (x < y) if (y < z) a = 5 else a = 6 else a = 7

7 of 42

Programming Language Syntax

  • Regular languages suboptimal for specifying programming language syntax
  • Why? Constructs with nested syntax
    • (a+(b-c))*(d-(x-(y-z)))
    • if (x < y) if (y < z) a = 5 else a = 6 else a = 7
  • Regular languages lack state required to model nesting
  • Canonical example: nested expressions
  • No regular expression for language of parenthesized expressions

8 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Expr

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

9 of 42

Solution – Context-Free Grammar

  • Set of terminals

{ Op, Int, Open, Close }

Each terminal defined

by regular expression

  • Set of nonterminals

{ Start, Expr }

  • Set of productions
    • Single nonterminal on LHS
    • Sequence of terminals and nonterminals on RHS

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

StartExpr

ExprExpr Op Expr

Expr → Int

Expr → Open Expr Close

10 of 42

Generation

start with Start nonterminal

repeat

choose a nonterminal

choose a production with that nonterminal in LHS

replace nonterminal with RHS of production

until no more nonterminals in the string

11 of 42

Sample Derivation for <2-1>+1

Start

Expr

Expr Op Expr

Open Expr Close Op Expr

Open Expr Op Expr Close Op Expr

Open Int Op Expr Close Op Expr

Open Int Op Expr Close Op Int

Open Int Op Int Close Op Int

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

12 of 42

Sample Derivation for <2-1>+1

Open Int Op Int Close Op Int

< Int Op Int Close Op Int

< [0-9][0-9]* Op Int Close Op Int

< 2 Op Int Close Op Int

< 2 +|-|*|/ Int Close Op Int

< 2 - Int Close Op Int

< 2 - [0-9][0-9]* Close Op Int

< 2 - 1 Close Op Int

< 2 - 1 > Op Int

< 2 - 1 > +|-|*|/ Int

< 2 - 1 > + Int

< 2 - 1 > + [0-9][0-9]*

< 2 - 1 > + 1

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

13 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Expr

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

14 of 42

Parse Tree

  • Internal Nodes: Nonterminals
  • Leaves: Terminals
  • Edges:
    • From Nonterminal of LHS of production
    • To Nodes from RHS of production
  • Captures derivation of string

  • Gives the first specification of the categories and structure of our language primitives!

15 of 42

Goal: Parse Tree for <2-1>+1

Start

Expr

Expr

Expr

Op

+

Open

<

Close

>

Expr

Int

1

Op

-

Expr

Int

2

Expr

Int

1

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

16 of 42

Recognition (Parsing)

  • Parser: Converts program into a parse tree
    • Text -> Parse Tree (Recognition)
  • Can be written by hand
  • Or produced automatically by parser generator
    • Accepts a grammar as input
    • Produces a parser as output

17 of 42

Terminals (aka Lexing)

< 2 - 1 > + 1

Open 2 - 1 > + 1

Open Int - 1 > + 1

Open Int Op 1 > + 1

Open Int Op Int > + 1

Open Int Op Int Close + 1

Open Int Op Int Close Op 1

Open Int Op Int Close Op Int

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

18 of 42

Parsing for <2-1>+1

Open Int Op Int Close Op Int

Open Expr Op Int Close Op Int

Open Expr Op Expr Close Op Int

Open Expr Close Op Int

Expr Op Int

Expr Op Expr

Expr

Start

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

19 of 42

What is this Grammar?

What is the language of the following grammar?

1) StartS

2) S → ( L )

3) S → a

4) LL , S

5) L S

20 of 42

1) StartS

2) S → ( L )

3) S → a

4) LL , S

5) L S

Consider: (a, (a, a))

(

a

,

(

a

,

a

)

)

S

S

S

L

L

L

S

L

S

21 of 42

Parse Tree

  • Internal Nodes: Nonterminals
  • Leaves: Terminals
  • Edges:
    • From Nonterminal of LHS of production
    • To Nodes from RHS of production
  • Captures derivation of string

  • Gives the first specification of the categories and structure of our language primitives!

22 of 42

Parser

  • Converts program into a parse tree
  • Can be written by hand
  • Or produced automatically by parser generator
    • Accepts a grammar as input
    • Produces a parser as output

  • Next lecture: implementation techniques

23 of 42

A Problem

Start

Expr

Expr

Expr

Op

+

Expr

Expr

Op

-

Int

2

Int

1

Int

1

Parse Tree: 2-1+1

Tree corresponding

to <2-1>+1

24 of 42

A Problem

Start

Expr

Expr

Expr

Op

+

Expr

Expr

Op

-

Int

2

Int

1

Int

1

Start

Expr

Expr

Expr

Op

-

Expr

Expr

Op

+

Int

1

Int

1

Int

2

Parse Tree: 2-1+1

Tree corresponding

to <2-1>+1

Tree corresponding

to 2-<1+1>

25 of 42

Ambiguity in Grammar

Grammar is ambiguous if there are multiple derivations (therefore multiple parse trees) for a single string

Derivation and parse tree usually reflect semantics of the program

Ambiguity in grammar often reflects ambiguity in semantics of language

(which is considered undesirable)

26 of 42

Eliminating Ambiguity

Solution: hack the grammar

Conceptually, makes all operators associate to left

Original Grammar

StartExpr

ExprExpr Op Expr

Expr → Int

Expr → Open Expr Close

Hacked Grammar

StartExpr

ExprExpr Op Int

Expr → Int

Expr → Open Expr Close

27 of 42

Parse Trees for Hacked Grammar

Start

Expr

Expr

Op

+

Expr

Op

-

Int

2

Int

1

Int

1

Start

Expr

Expr

Expr

Op

-

Expr

Expr

Op

+

Int

1

Int

1

Int

2

Only one parse tree for 2-1+1!

Valid parse tree

No longer valid parse tree

28 of 42

Precedence Violations

  • All operators associate to left

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

29 of 42

Precedence Violations

  • All operators associate to left
  • Violates precedence of * over +
    • 2-3*4 associates like <2-3>*4

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

We’ll come back to this later

30 of 42

Parser

  • Converts program into a parse tree
  • Can be written by hand
  • Or produced automatically by parser generator
    • Accepts a grammar as input
    • Produces a parser as output
  • Strategy:
    • Start with intuitive but ambiguous grammar
    • Hack grammar to make it unambiguous

31 of 42

Summary

  • Lexical and Syntactic Levels of Structure
    • Lexical – regular expressions and automata
    • Syntactic – grammars
  • Generation versus Recognition Approaches
    • Generation more convenient for specification
    • Recognition required in implementation
  • Grammar ambiguities and Precedence
    • Hacked grammars

  • Next up: algorithms for implementation and implementation concerns

32 of 42

End

33 of 42

Precedence Violations

  • All operators associate to left

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

We’ll come back to this later

34 of 42

Precedence Violations

  • All operators associate to left
  • Violates precedence of * over +
    • 2-3*4 associates like <2-3>*4

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Parse tree for

2-3*4

35 of 42

Hacking Around Precedence

Original Grammar

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

StartExpr

ExprExpr Op Int

Expr → Int

Expr → Open Expr Close

Hacked Grammar

AddOp = +|-

MulOp = *|/

Int = [0-9] [0-9]*

Open = <

Close = >

StartExpr

ExprExpr AddOp Term

ExprTerm

TermTerm MulOp Num

TermNum

Num → Int

Num → Open Expr Close

36 of 42

Parse Tree Changes

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Int

3

Int

4

Old parse tree

for 2-3*4

Start

Expr

Expr

AddOp

-

Term

New parse tree

for 2-3*4

Term

Term

MulOp

*

Num

Int

4

Num

Int

2

Num

Int

3

37 of 42

General Idea

  • Group Operators into Precedence Levels
    • * and / in one level level, bind strongest
    • + and – in another level, bind next strongest
  • Nonterminal for each Precedence Level
    • Term is nonterminal for * and /
    • Expr is nonterminal for + and –
  • Stronger precedence level, lower in grammar/parse tree
  • Can make operators left or right associative within each level
  • Generalizes for arbitrary levels of precedence

38 of 42

Parse Tree

  • Practical problem
    • Parse tree for hacked grammar is complicated
    • Would like to start with more intuitive parse tree
  • Abstract versus Concrete Syntax
    • Abstract syntax corresponds to intuitive way of thinking of structure of program
      • Omits details like superfluous keywords that are there to make the language unambiguous
      • Abstract syntax may be ambiguous
    • Concrete Syntax corresponds to full grammar used to parse the language
  • Parsers are often written to produce abstract syntax trees.

39 of 42

Abstract Syntax Trees

  • Start with intuitive but ambiguous grammar
  • Hack grammar to make it unambiguous
    • Concrete parse trees
    • Less intuitive
  • Convert concrete parse trees to abstract syntax trees
    • Correspond to intuitive grammar for language
    • Simpler for program to manipulate

40 of 42

Example

Intuitive but Ambiguous Grammar

Op = *|/|+|-

Int = [0-9] [0-9]*

StartExpr

ExprExpr Op Expr

Expr → Int

Hacked Unambiguous Grammar

AddOp = +|-

MulOp = *|/

Int = [0-9] [0-9]*

Open = <

Close = >

StartExpr

ExprExpr AddOp Term

ExprTerm

TermTerm MulOp Num

TermNum

Num → Int

Num → Open Expr Close

41 of 42

Concrete parse tree

for <2-3>*4

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Expr

Int

4

Expr

Int

3

Abstract syntax tree

for <2-3>*4

  • Uses intuitive grammar
  • Eliminates superfluous terminals
    • Open
    • Close

Start

Expr

Expr

AddOp

-

Term

Term

Term

MulOp

*

Num

Int

4

Num

Int

2

Num

Int

3

42 of 42

Start

Expr

Expr

Op

*

Op

-

Int

2

Int

4

Int

3

Start

Expr

Expr

Op

*

Expr

Op

-

Int

2

Expr

Int

4

Expr

Int

3

Abstract parse tree

for <2-3>*4

Further simplified abstract syntax tree

for <2-3>*4