1 of 42

6.1120: Parsing

Top-Down Parsing

2 of 42

Example

Boolean Term()

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

Boolean TermPrime()

if (token = *)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else if (token = /)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else return(true)

Term → Int Term

Term* Int Term

Term/ Int Term

Term → ε

3 of 42

Example: the skeleton

4 of 42

Generative Approach: �<2-1>+1

Start

Expr

Expr Op Expr

Open Expr Close Op Expr

Op = +|-|*|/

Int = [0-9] [0-9]*

Open = <

Close = >

1) StartExpr

2) ExprExpr Op Expr

3) Expr → Int

4) Expr → Open Expr Close

Which production to choose?

Which nonterminal �to expand?

5 of 42

Top-Down Parsing Algorithm So Far

Fundamentally, there are two design choices:

  • Which nonterminal to expand
    • Left-most derivation
  • Which production to choose?
    • Predictive parsing
  • Implementation Concerns
    • Will the algorithm terminate? (Eliminate Left Recursion)

Yesterday: fast overview of everything

Today: systematic thinking for predictive parsing.

6 of 42

Predictive Parsing + Hand Coding = Recursive Descent Parser

  • One procedure per nonterminal NT
    • Productions NT → β1 , …, NT → βn
    • Procedure examines the current input symbol T to determine which production to apply
      • If T∈Firstk)
      • Apply production k
      • Consume terminals in βk (check for correct terminal)
      • Recursively call procedures for nonterminals in βk
    • Current input symbol stored in global variable token
  • Procedures return
    • true if parse succeeds
    • false if parse fails

7 of 42

Example

Boolean Term()

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

Boolean TermPrime()

if (token = *)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else if (token = /)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else return(true)

Term → Int Term

Term* Int Term

Term/ Int Term

Term → ε

8 of 42

Nonterminals

  • What about productions with nonterminals?

NT NT1 α1

NT NT2 α 2

  • Must choose based on possible first terminals that NT1 and NT2 can generate
  • What if NT1 or NT2 can generate ε?
    • Must choose based on α1 and α2

9 of 42

First(β)

  • T∈ First(β ) if T can appear as the first symbol in a derivation starting from β

1) T∈First(T )

2) First(S ) ⊆ First(S β)

3) NT derives ε implies First(β) ⊆ First(NT β)

4) NTS β implies First(S β) ⊆ First(NT )

  • Notation
    • T is a terminal, NT is a nonterminal, S is a terminal or nonterminal, and β is a sequence of terminals or nonterminals

10 of 42

First(β)

  • T∈ First(β ) if T can appear as the first symbol in a derivation starting from β

1) T∈First(T )

2) First(S ) ⊆ First(S β)

3) NT derives ε implies First(β) ⊆ First(NT β)

4) NTS β implies First(S β) ⊆ First(NT )

  • Notation
    • T is a terminal, NT is a nonterminal, S is a terminal or nonterminal, and β is a sequence of terminals or nonterminals

11 of 42

NT derives ε

  • Two rules
    • NT → ε implies NT derives ε
    • NTNT1 ... NTn and for all 1i n NTi derives ε implies NT derives ε

12 of 42

Fixed Point Algorithm for Derives ε

for all nonterminals NT

set NT derives ε to be false

for all productions of the form NT → ε

set NT derives ε to be true

while (some NT derives ε changed in last iteration)

for all productions of the form NTNT1 ... NTn

if (for all 1i n NTi derives ε)

set NT derives ε to be true

13 of 42

First(β)

  • T∈ First(β ) if T can appear as the first symbol in a derivation starting from β

1) T∈First(T )

2) First(S ) ⊆ First(S β)

3) NT derives ε implies First(β) ⊆ First(NT β)

4) NTS β implies First(S β) ⊆ First(NT )

  • Notation
    • T is a terminal, NT is a nonterminal, S is a terminal or nonterminal, and β is a sequence of terminals or nonterminals

14 of 42

Rules + Request Generate System of Subset Inclusion Constraints

Grammar

Term* Int Term

Term/ Int Term

Term → ε

Request: What is First(Term )?

Constraints

First(* Int Term ) ⊆ First(Term )

First(/ Int Term ) ⊆ First(Term )

First(*) ⊆ First(* Int Term )

First(/) ⊆ First(/ Int Term )

*∈First(*)

/ ∈First(/)

Rules

1) T∈First(T )

2) First(S) ⊆ First(S β)

3) NT derives ε implies First(β) ⊆ First(NT β)

4) NTS β implies

First(S β) ⊆ First(NT )

15 of 42

Constraint Propagation Algorithm

Constraints

First(* Int Term ) ⊆ First(Term )

First(/ Int Term ) ⊆ First(Term )

First(*) ⊆ First(* Int Term )

First(/) ⊆ First(/ Int Term )

*∈First(*)

/ ∈First(/)

Solution

First(Term ) = {}

First(* Int Term ) = {}

First(/ Int Term ) = {}

First(*) = {}

First(/) = {}

Initialize Sets to {}

Propagate Constraints Until Fixed Point

16 of 42

Constraint Propagation Algorithm

Constraints

First(* Num Term ) ⊆ First(Term )

First(/ Num Term ) ⊆ First(Term )

First(*) ⊆ First(* Num Term )

First(/) ⊆ First(/ Num Term )

*∈First(*)

/ ∈First(/)

Solution

First(Term ) = {}

First(* Num Term ) = {}

First(/ Num Term ) = {}

First(*) = {}

First(/) = {}

Grammar

Term* Int Term

Term/ Int Term

Term → ε

17 of 42

Constraint Propagation Algorithm

Constraints

First(* Num Term ) ⊆ First(Term )

First(/ Num Term ) ⊆ First(Term )

First(*) ⊆ First(* Num Term )

First(/) ⊆ First(/ Num Term )

*∈First(*)

/ ∈First(/)

Solution

First(Term ) = {}

First(* Num Term ) = {}

First(/ Num Term ) = {}

First(*) = {*}

First(/) = {/}

Grammar

Term* Int Term

Term/ Int Term

Term → ε

18 of 42

Constraint Propagation Algorithm

Solution

First(Term ) = {}

First(* Num Term ) = {*}

First(/ Num Term ) = {/}

First(*) = {*}

First(/) = {/}

Constraints

First(* Num Term ) ⊆ First(Term )

First(/ Num Term ) ⊆ First(Term )

First(*) ⊆ First(* Num Term )

First(/) ⊆ First(/ Num Term )

*∈First(*)

/ ∈First(/)

Grammar

Term* Int Term

Term/ Int Term

Term → ε

19 of 42

Constraint Propagation Algorithm

Solution

First(Term ) = {*,/}

First(* Num Term ) = {*}

First(/ Num Term ) = {/}

First(*) = {*}

First(/) = {/}

Constraints

First(* Num Term ) ⊆ First(Term )

First(/ Num Term ) ⊆ First(Term )

First(*) ⊆ First(* Num Term )

First(/) ⊆ First(/ Num Term )

*∈First(*)

/ ∈First(/)

Grammar

Term* Int Term

Term/ Int Term

Term → ε

20 of 42

Predictive Parsing + Hand Coding = Recursive Descent Parser

  • One procedure per nonterminal NT
    • Productions NT → β1 , …, NT → βn
    • Procedure examines the current input symbol T to determine which production to apply
      • If T∈Firstk)
      • Apply production k
      • Consume terminals in βk (check for correct terminal)
      • Recursively call procedures for nonterminals in βk
    • Current input symbol stored in global variable token
  • Procedures return
    • true if parse succeeds
    • false if parse fails

21 of 42

Example

Boolean Term()

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

Boolean TermPrime()

if (token = *)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else if (token = /)

token = NextToken();

if (token = Int n) token = NextToken(); return(TermPrime())

else return(false)

else return(true)

Term → Int Term

Term* Int Term

Term/ Int Term

Term → ε

First(* Num Term ) = {*}

First(/ Num Term ) = {/}

22 of 42

Act 5

23 of 42

Multiple Productions With Same Prefix in RHS

  • Example Grammar

NT → if then

NT → if then else

  • Assume NT is current position in parse tree, and if is the next token
  • Unclear which production to apply
    • Multiple k such that T∈First(βk)
    • if ∈ First(if then)
    • if ∈ First(if then else)

24 of 42

Solution: Left Factor the Grammar

  • New Grammar Factors Common Prefix Into Single Production

NT → if then NT

NT → else

NT → ε

  • No choice when next token is if!
  • All choices have been unified in one production.

25 of 42

Act 6

26 of 42

How far we have come

Unambiguous/Precedence

TermTerm * Int

TermTerm / Int

Term → Int

Eliminate Left Recursion�

Term → Int Term

Term* Int Term

Term/ Int Term

Term ε

Original Grammar Fragment�

ExprExpr Op Expr

Expr → Int

Op = (* | /)

27 of 42

Building A Parse Tree

  • Have each procedure return the section of the parse tree for the part of the string it parsed
  • Use exceptions to make code structure clean

Term

Int

2

Term

Int

3

*

Term

Int

4

*

Term

ε

Parse Tree

28 of 42

Building Parse Tree In Example

Term()

if (token = Int n)

oldToken = token; token = NextToken();

node = TermPrime();

if (node == NULL) return oldToken;

else return(new TermNode(oldToken, node);

else throw SyntaxError

TermPrime()

if (token = *) || (token = /)

first = token; next = NextToken();

if (next = Int n)

token = NextToken();

return(new TermPrimeNode(first, next, TermPrime())

else throw SyntaxError

else return(NULL)

29 of 42

Why Use Hand-Coded Parser?

  • Why not use parser generator?
  • What do you do if your parser doesnt work?
    • Recursive descent parser – write more code
    • Parser generator
      • Hack grammar
      • But if parser generator doesnt work, nothing you can do
  • If you have complicated grammar
    • Increase chance of going outside comfort zone of parser generator
    • Your parser may NEVER work

30 of 42

Bottom Line

  • Recursive descent parser properties
    • Probably more work
    • But less risk of a disaster - you can almost always make a recursive descent parser work
    • May have easier time dealing with resulting code
      • Single language system
      • No need to deal with potentially flaky parser generator
      • No integration issues with automatically generated code
  • If your parser development time is small compared to rest of project, or you have a really complicated language, use hand-coded recursive descent parser

31 of 42

Summary

  • Top-Down Parsing
  • Use Lookahead to Avoid Backtracking
  • Parser is
    • Hand-Coded
    • Set of Mutually Recursive Procedures

32 of 42

Backup

33 of 42

How far we have come

Unambiguous/Precedence

TermTerm * Int

TermTerm / Int

Term → Int

Eliminate Left Recursion�

Term → Int Term

Term* Int Term

Term/ Int Term

Term ε

Original Grammar Fragment�

ExprExpr Op Expr

Expr → Int

Op = (* | /)

34 of 42

Parse Tree for 2*3*4

Term

Int

2

Term

Int

3

*

Term

Int

4

*

Term

ε

Term

Int

3

*

Term

Int

4

*

Int

2

Concrete

Parse Tree

Desired

Abstract

Parse Tree

35 of 42

Direct Generation of Abstract Tree

  • TermPrime builds an incomplete tree
    • Missing leftmost child
    • Returns root and incomplete node
  • (root, incomplete) = TermPrime()
    • Called with token = *
    • Remaining tokens = 3 * 4

Term

Int

3

*

Term

Int

4

*

root

incomplete

Missing left child to be filled in by caller

36 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Int

2

token

2*3*4

Input to parse

37 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Int

2

token

2*3*4

Input to parse

38 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Int

2

token

2*3*4

Input to parse

39 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Term

Int

3

*

Term

Int

4

*

Int

2

root

incomplete

leftmostInt

2*3*4

Input to parse

40 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Term

Int

3

*

Term

Int

4

*

Int

2

root

incomplete

leftmostInt

2*3*4

Input to parse

41 of 42

Code for Term

Term()

if (token = Int n)

leftmostInt = token; token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL) return leftmostInt;

incomplete.leftChild = leftmostInt;

return root;

else throw SyntaxError

Term

Int

3

*

Term

Int

4

*

Int

2

root

incomplete

leftmostInt

2*3*4

Input to parse

42 of 42

Code for TermPrime

TermPrime()

if (token = *) || (token = /)

op = token; next = NextToken();

if (next = Int n)

token = NextToken();

(root, incomplete) = TermPrime();

if (root == NULL)

root = new ExprNode(NULL, op, next);

return (root, root);

else newChild = new ExprNode(NULL, op, next);

incomplete.leftChild = newChild;

return(root, newChild);

else throw SyntaxError

else return(NULL,NULL)

Missing left child to be filled in by caller