6.1120: Parsing
Basic Concepts
Goal: Tokenization for (2-1)+1
Op
+
Open
(
Close
)
Int
1
Op
-
Int
2
Int
1
Token:
Text:
Lexical Categories Example
[a-z] = (a|b|c|…|y|z)
Lexical Structure in Languages
Each language typically has several categories of words. In a typical programming language:
Write a Regular Expression
Nested Expressions
Programming Language Syntax
Goal: Parse Tree for <2-1>+1
Start
Expr
Expr
Expr
Op
+
Open
<
Close
>
Expr
Int
1
Op
-
Expr
Int
2
Expr
Int
1
Solution – Context-Free Grammar
{ Op, Int, Open, Close }
Each terminal defined
by regular expression
{ Start, Expr }
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr Op Expr
Expr → Int
Expr → Open Expr Close
Generation
start with Start nonterminal
repeat
choose a nonterminal
choose a production with that nonterminal in LHS
replace nonterminal with RHS of production
until no more nonterminals in the string
Sample Derivation for <2-1>+1
Start
Expr
Expr Op Expr
Open Expr Close Op Expr
Open Expr Op Expr Close Op Expr
Open Int Op Expr Close Op Expr
Open Int Op Expr Close Op Int
Open Int Op Int Close Op Int
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
Sample Derivation for <2-1>+1
Open Int Op Int Close Op Int
< Int Op Int Close Op Int
< [0-9][0-9]* Op Int Close Op Int
< 2 Op Int Close Op Int
< 2 +|-|*|/ Int Close Op Int
< 2 - Int Close Op Int
< 2 - [0-9][0-9]* Close Op Int
< 2 - 1 Close Op Int
< 2 - 1 > Op Int
< 2 - 1 > +|-|*|/ Int
< 2 - 1 > + Int
< 2 - 1 > + [0-9][0-9]*
< 2 - 1 > + 1
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
Goal: Parse Tree for <2-1>+1
Start
Expr
Expr
Expr
Op
+
Open
<
Close
>
Expr
Int
1
Op
-
Expr
Int
2
Expr
Int
1
Parse Tree
Goal: Parse Tree for <2-1>+1
Start
Expr
Expr
Expr
Op
+
Open
<
Close
>
Expr
Int
1
Op
-
Expr
Int
2
Expr
Int
1
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
Recognition (Parsing)
Terminals (aka Lexing)
< 2 - 1 > + 1
Open 2 - 1 > + 1
Open Int - 1 > + 1
Open Int Op 1 > + 1
Open Int Op Int > + 1
Open Int Op Int Close + 1
Open Int Op Int Close Op 1
Open Int Op Int Close Op Int
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
Parsing for <2-1>+1
Open Int Op Int Close Op Int
Open Expr Op Int Close Op Int
Open Expr Op Expr Close Op Int
Open Expr Close Op Int
Expr Op Int
Expr Op Expr
Expr
Start
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
What is this Grammar?
What is the language of the following grammar?
1) Start → S
2) S → ( L )
3) S → a
4) L → L , S
5) L → S
1) Start → S
2) S → ( L )
3) S → a
4) L → L , S
5) L → S
Consider: (a, (a, a))
(
a
,
(
a
,
a
)
)
S
S
S
L
L
L
S
L
S
Parse Tree
Parser
A Problem
Start
Expr
Expr
Expr
Op
+
Expr
Expr
Op
-
Int
2
Int
1
Int
1
Parse Tree: 2-1+1
Tree corresponding
to <2-1>+1
A Problem
Start
Expr
Expr
Expr
Op
+
Expr
Expr
Op
-
Int
2
Int
1
Int
1
Start
Expr
Expr
Expr
Op
-
Expr
Expr
Op
+
Int
1
Int
1
Int
2
Parse Tree: 2-1+1
Tree corresponding
to <2-1>+1
Tree corresponding
to 2-<1+1>
Ambiguity in Grammar
Grammar is ambiguous if there are multiple derivations (therefore multiple parse trees) for a single string
Derivation and parse tree usually reflect semantics of the program
Ambiguity in grammar often reflects ambiguity in semantics of language
(which is considered undesirable)
Eliminating Ambiguity
Solution: hack the grammar
Conceptually, makes all operators associate to left
Original Grammar
Start → Expr
Expr → Expr Op Expr
Expr → Int
Expr → Open Expr Close
Hacked Grammar
Start → Expr
Expr → Expr Op Int
Expr → Int
Expr → Open Expr Close
Parse Trees for Hacked Grammar
Start
Expr
Expr
Op
+
Expr
Op
-
Int
2
Int
1
Int
1
Start
Expr
Expr
Expr
Op
-
Expr
Expr
Op
+
Int
1
Int
1
Int
2
Only one parse tree for 2-1+1!
Valid parse tree
No longer valid parse tree
Precedence Violations
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Int
3
Int
4
Parse tree for
2-3*4
Precedence Violations
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Int
3
Int
4
Parse tree for
2-3*4
We’ll come back to this later
Parser
Summary
End
Precedence Violations
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Int
3
Int
4
Parse tree for
2-3*4
We’ll come back to this later
Precedence Violations
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Int
3
Int
4
Parse tree for
2-3*4
Hacking Around Precedence
Original Grammar
Op = +|-|*|/
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr Op Int
Expr → Int
Expr → Open Expr Close
Hacked Grammar
AddOp = +|-
MulOp = *|/
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr AddOp Term
Expr → Term
Term → Term MulOp Num
Term → Num
Num → Int
Num → Open Expr Close
Parse Tree Changes
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Int
3
Int
4
Old parse tree
for 2-3*4
Start
Expr
Expr
AddOp
-
Term
New parse tree
for 2-3*4
Term
Term
MulOp
*
Num
Int
4
Num
Int
2
Num
Int
3
General Idea
Parse Tree
Abstract Syntax Trees
Example
Intuitive but Ambiguous Grammar
Op = *|/|+|-
Int = [0-9] [0-9]*
Start → Expr
Expr → Expr Op Expr
Expr → Int
Hacked Unambiguous Grammar
AddOp = +|-
MulOp = *|/
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr AddOp Term
Expr → Term
Term → Term MulOp Num
Term → Num
Num → Int
Num → Open Expr Close
Concrete parse tree
for <2-3>*4
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Expr
Int
4
Expr
Int
3
Abstract syntax tree
for <2-3>*4
Start
Expr
Expr
AddOp
-
Term
Term
Term
MulOp
*
Num
Int
4
Num
Int
2
Num
Int
3
Start
Expr
Expr
Op
*
Op
-
Int
2
Int
4
Int
3
Start
Expr
Expr
Op
*
Expr
Op
-
Int
2
Expr
Int
4
Expr
Int
3
Abstract parse tree
for <2-3>*4
Further simplified abstract syntax tree
for <2-3>*4