1 of 32

Id3 = 0

while (id3 < 10) {

println(“”,id3);� id3 = id3 + 1 }

source code

i�d3

=�

0�LF

id3�=�0�while�(�id3

<�10�)

lexer

characters

words

(tokens)

trees

parser

assign

while

i 0

assign

a[i]

Compiler

2 of 32

Parse Tree vs Abstract Syntax Tree (AST)

while (x > 0) x = x - 1

Pretty printer: takes abstract syntax tree (AST) and outputs the �leaves of one possible (concrete) parse tree.

parse(prettyPrint(ast)) ≈ ast

3 of 32

Parse Tree vs Abstract Syntax Tree (AST)

Each node in parse tree has children corresponding precisely to right-hand side of grammar rules. The definition of parse trees is fixed given the grammar

Often compiler never actually builds parse trees in memory

Nodes in abstract syntax tree (AST) contain only useful information and usually omit the punctuation signs. We can choose our own syntax trees, to make it convenient for both construction in parsing and for later stages of our compiler or interpreter

A compiler typically directly builds AST

4 of 32

Abstract Syntax Trees for Statements

statmt ::= println ( stringConst , ident )

| ident = expr

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

abstract class Statmt

case class PrintlnS(msg : String, var : Identifier) extends Statmt

case class Assignment(left : Identifier, right : Expr) extends Statmt

case class If(cond : Expr, trueBr : Statmt,

falseBr : Option[Statmt]) extends Statmt

case class While(cond : Expr, body : Expr) extends Statmt

case class Block(sts : List[Statmt]) extends Statmt

grammar:

AST classes:

5 of 32

Abstract Syntax Trees for Statements

statmt ::= println ( stringConst , ident )

| ident = expr

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

abstract class Statmt

case class PrintlnS(msg : String, var : Identifier) extends Statmt

case class Assignment(left : Identifier, right : Expr) extends Statmt

case class If(cond : Expr, trueBr : Statmt,

falseBr : Option[Statmt]) extends Statmt

case class While(cond : Expr, body : Statmt) extends Statmt

case class Block(sts : List[Statmt]) extends Statmt

6 of 32

Our Parser Produced Nothing ☹

def skip(t : Token) : Unit = � if (lexer.token==t) lexer.next

else error(“Expected”+ t)

def expr : Unit = { … }

// statmt ::=

def statmt : Unit = {

// println ( stringConst , ident )

if (lexer.token == Println) { lexer.next;

skip(OParen)� skip(StringConst); skip(Comma)

skipIdentifier; skip(CParen)

// | ident = expr

} else if (lexer.token == Ident) { lexer.next;

skip(Equality); expr

7 of 32

New Parser: Returning an AST ☺

def getIdent: String = { lexer.token match {� case ID(str) => str� case _ => error(“Expected an identifier but found” + lexer.token)�}}

def expr : Expr = { … }

def statmt : Statmt = {

// println ( stringConst , ident )

if (lexer.token == Println) { lexer.next;

skip(openParen); val s = getString

skip(comma)

val id = getIdent; skip(closedParen)

PrintlnS(s, id)

} else if …

8 of 32

Constructing Tree for ‘if’

def statmt : Statmt = {

…

// if ( expr ) statmt (else statmt)^?

// case class If(cond : Expr, trueBr: Statmt, falseBr: Option[Statmt])

} else if (lexer.token == ifKeyword) { lexer.next;

skip(openParen); val c = expr; skip(closedParen);

val trueBr = statmt

val elseBr = if (lexer.token == elseKeyword) {

lexer.next; Some(statmt) } else None

If(c, trueBr, elseBr) // we made a tree node ☺

}

9 of 32

Task: Constructing AST for ‘while’

def statmt : Statmt = {

…

// while ( expr ) statmt

// case class While(cond : Expr, body : Expr) extends Statmt

} else if (lexer.token == WhileKeyword) { ��

} else …

10 of 32

Here each alternative started with a different token

statmt ::=

println ( stringConst , ident )

| ident = expr

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

What if this is not the case?

11 of 32

Left Factoring Example: Function Calls

statmt ::=

println ( stringConst , ident )

| ident = expr

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

| ident (expr (, expr )* )

code to parse the grammar:

} else if (lexer.token.class == Ident) {

???� }

foo = 42 + x

foo ( u , v )

12 of 32

Left Factoring Example: Function Calls

statmt ::=

println ( stringConst , ident )

| ident assignmentOrCall

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

assignmentOrCall ::= “=“ expr | (expr (, expr )* )

code to parse the grammar:

} else if (lexer.token.class == Ident) {

val id = getIdentifier(lexer.token); lexer.next

assignmentOrCall(id)� }

// Factoring pulls common parts from alternatives

13 of 32

Beyond Statements:�Parsing Expressions

14 of 32

While Language with Simple Expressions

expr ::= intLiteral | ident

| expr ( + | / ) expr

statmt ::=

println ( stringConst , ident )

| ident = expr

| if ( expr ) statmt (else statmt)^?

| while ( expr ) statmt� | { statmt* }

15 of 32

Abstract Syntax Trees for Expressions

abstract class Expr

case class IntLiteral(x : Int) extends Expr

case class Variable(id : Identifier) extends Expr

case class Plus(e1 : Expr, e2 : Expr) extends Expr

case class Divide(e1 : Expr, e2 : Expr) extends Expr

expr ::= intLiteral | ident

| expr + expr | expr / expr

foo + 42 / bar + arg

16 of 32

Parser That Follows the Grammar?

def expr : Expr = {

if (??) IntLiteral(getInt(lexer.token))

else if (??) Variable(getIdent(lexer.token))

else if (??) {

val e1 = expr; val op = lexer.token; val e2 = expr

op match Plus {

case PlusToken => Plus(e1, e2)

case DividesToken => Divides(e1, e2)

} }

expr ::= intLiteral | ident

| expr + expr | expr / expr

When should parser enter the recursive case?!

input:

foo + 42 / bar + arg

17 of 32

Ambiguous Grammars

expr ::= intLiteral | ident

| expr + expr | expr / expr

ident + intLiteral / ident + ident

Ambiguous grammar: if some token sequence has multiple parse trees

(then it is has multiple abstract trees).

Each node in parse tree is given by

one grammar alternative.

18 of 32

Ambiguous Expression Grammar

expr ::= intLiteral | ident

| expr + expr | expr / expr

Show that the input above has two parse trees!

Each node in parse tree is given by

one grammar alternative.

ident + intLiteral / ident + ident

19 of 32

Exercise: Balanced Parentheses I

Show that the following balanced parentheses grammar is ambiguous (by finding two parse trees for some input sequence).

B ::= ε | ( B ) | B B �

20 of 32

Remark

The same parse tree can be derived using two different derivations, e.g.

B -> (B) -> (BB) -> ((B)B) -> ((B)) -> (())

B -> (B) -> (BB) -> ((B)B) -> (()B) -> (())

this correspond to different orders in which nodes in the tree are expanded.

Ambiguity refers to the fact that there are actually multiple parse trees, not just multiple derivations.

21 of 32

Exercise: Balanced Parentheses

Show that the following balanced parentheses grammar is ambiguous (by finding two parse trees for some input sequence) and find unambiguous grammar for the same language.

B ::= ε | ( B ) | B B �

22 of 32

Solution for Unambiguous Parenthesis Grammar

Proposed solution:

B ::= ε | B (B)

How to come up with it?
Clearly, rule B::= B B generates any sequence of B's. We can also encode it like this:� B ::= C*� C ::= (B)
Now we express sequence using right-recursive rule that does not create ambiguity:

B ::= ε | C B� C ::= (B)

but now, look, we "inline" C back into the rules for so we get exactly the rule

B ::= ε | B (B)

This grammar is not ambiguous and is the solution. We did not prove unambiguity (we only tried to find ambiguous trees but did not find any).

23 of 32

Exercise: �Left Recursive and Right Recursive

We call a production rule “left recursive” if it is of the form

A ::= A p

for some sequence of symbols p. Similarly, a "right-recursive" rule is of a form

A ::= q A

Is every context free grammar that contains both left and right recursive rule for a some nonterminal A ambiguous?

Answer: it is ambiguous if such A is such that� S => … => uAv => … => w for w a string of terminals

24 of 32

An attempt to rewrite the grammar

def simpleExpr : Expr = { … }

def expr : Expr = {

var e = simpleExpr

while (lexer.token == PlusToken ||

lexer.token == DividesToken)) {

val op = lexer.token

val eNew = simpleExpr

op match {

case TokenPlus => { e = Plus(e, eNew) }

case TokenDiv => { e = Divide(e, eNew) }� }

}

e }

expr ::= simpleExpr (( + | / ) simpleExpr)*

simpleExpr ::= intLiteral | ident

foo + 42 / bar + arg

Not ambiguous, but gives wrong tree.

25 of 32

Making Grammar Unambiguous�and Constructing Correct Trees

26 of 32

Goal: Build Expression Trees

abstract class Expr

case class Variable(id : Identifier) extends Expr

case class Minus(e1 : Expr, e2 : Expr) extends Expr

case class Exp(e1 : Expr, e2 : Expr) extends Expr

different parse trees give different ASTs:

Minus(e1, Minus(e2,e3)) e1 - (e2 - e3)

Minus(Minus(e1,e2),e3) (e1 - e2) - e3

we want priorities and associativity such that

a – b^c^d – e becomes, just like (a – (b^(c^d))) – e

Minus(Minus(a, Exp(b, Exp(c, d))), e)

27 of 32

1) Layer the grammar by priorities

expr ::= term (- term)*

term ::= factor (^ factor)*

factor ::= id | (expr)

lower priority binds weaker,

so it goes outside

expr ::= ident | expr - expr | expr ^ expr | (expr)

28 of 32

2) Building trees: left-associative "-"

LEFT-associative operator

x – y – z ➔ (x – y) – z

Minus(Minus(Var(“x”),Var(“y”)), Var(“z”))

def expr : Expr = {

var e =

while (lexer.token == MinusToken) {

lexer.next

}

29 of 32

3) Building trees: right-associative "^"

RIGHT-associative operator – using recursion � (or also loop and then reverse a list)

x ^ y ^ z ➔ x ^ (y ^ z)

Exp(x, Exp(y,z) )

def expr : Expr = {

val e = factor

if (lexer.token == ExpToken) {

lexer.next

Exp(e, expr)

} else e

}

30 of 32

Manual Construction of Parsers

Typically one applies previous transformations to get a nice grammar
Then we write recursive descent parser as set of mutually recursive procedures that check if input is well formed
Then enhance such procedures to construct trees, paying attention to the associativity and priority of operators

31 of 32

Exercise: Unary Minus

1) Show that the grammar

A ::= − A � A ::= A − id � A ::= id

is ambiguous by finding a string that has two different parse trees. Show those parse trees.

2) Make two different unambiguous grammars for the same language:� a) One where prefix minus binds stronger than infix minus.� b) One where infix minus binds stronger than prefix minus.�3) Show the syntax trees using the new grammars for the string you used to prove the original grammar ambiguous.

4) Give a regular expression describing the same language.

32 of 32

Unary Minus Solution Sketch

1) An example of a string with two parse trees is� - id - id�The two parse trees are generated by these imaginary parentheses (shown red): -(id-id) (-id)-id

and can generated by these derivations that give different parse trees� A => -A => - A - id => - id - id� A => A - id => - A - id => - id - id

2) a) prefix minus binds stronger:� A ::= B | A - id B ::= -B | id� b) infix minus binds stronger

A ::= C | -A C ::= id | C - id�3) in two trees that used to be ambiguous instead of some A’s we have B’s in a) grammar or C’s in b) grammar.

4) -*id(-id)*