Parsers and Interpreters
Part 1: Grammar and ASTs
Our View Of The World So Far
Program Source Code (string)
Compiler / Interpreter
(Magic Happens Here)
Results (string / image)
Overview
Definition: An interpreter is a program that runs other programs.
Example: Your web browser runs JavaScript programs, thus it has a JavaScript interpreter. (Strictly speaking, your web browser has a Just-in-time (JIT) compiler, but an interpreter is an essential part of a JIT.)
This week’s project: You will write an interpreter for a tiny portion of JavaScript. (If you want to learn more, take COMPSCI 497P.)
Example Behavior
function interpProgram(inputString) {
...
return resultValue;
}
interpProgram("print(2 + 3)"); // prints 5
interpProgram("let x = 2; print(x * 10)"); // prints 20
interpProgram("let n = 5; let r = 0;" +
"while (n > 0) { r = r * n; n = n - 1; }" +
"print(r);"); // prints 120
interp does not directly process inputString.
Why interp does not directly process inputString
1 + 2 * 3
= 3 * 3
= 9
What does “1 + 2 * 3” mean?
1 + 2 * 3
= 1 * 6
= 6
It is mathematical convention to treat “1 + 2 * 3” as “1 + (2 * 3)”. However, this choice is arbitrary. An interpreter could produce either answer, but we need to choose one.
Project in 187
Operator Precedence: �Exploiting Convention To Resolve Ambiguity
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Operator_Precedence
Examples of Disambiguation by Operator Precedence
let a = new MyClass().toString();
let b = ++foo.bar;
c += ++d;
(Don't write such code!)
Why interp does not directly process inputString
Abstract syntax tree (AST): internal representation of programs, without ambiguity.
(1 + 2) * 3
1 + (2 * 3)
1
2
3
+
*
3
1
2
*
+
Concrete syntax: what programs look like to the programmer.
7
6
Result of interp.
Notice: no parenthesis in the AST. They are only needed in the concrete syntax.
parse (provided)
interp (project)
Parsing
This section introduces parsing. For the project, you will not need to write a parser yourself. However, you will need to understand what it does to write the interpreter.
A parser is a program that turns concrete syntax into an abstract syntax tree. If the input program is not parsable, the parser produces and error, along with an error message that describes what went wrong. A grammar describes the strings that the parser accepts.
A Grammar for Arithmetic
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
Grammar: Terminals
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Terminals must match strings exactly. What about combinations?
Grammar: Combinations
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Expressions e ::= n1 + n2
What about different forms of combinations?
4. Invalid Expressions:� 12 + � 12 + +12� 1 - 2�
Grammar: Alternative rules
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Expressions e ::= n1 + n2
| n1 - n2
These expressions can only match strings with up to three tokens (number op number).
What about longer expressions?
Grammar: Recursive forms
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Expressions e ::= n
| e1 + e2
| e1 - e2
Composition Via Grammar
Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
Composition Via Grammar
Foo n ::= ... | -2 | -1 | 0 | 1 | 2 | ...
Bar e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
Parsing Example 1
Numbers n ::= ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
e1 + e2 | ||||||
| | ( e1 ) | ||||
| | | e1 * e2 | | ||
n | | | n | | n | |
1 | + | ( | 2 | * | 3 | ) |
An example of a parse tree, that shows which rule is triggered at each level. The leaves of the tree must be terminals that appear in the program.
1 + (2 * 3)
Parsing Example 2
Numbers n ::= ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
e1 + e2 | ||||||
| | ( e1 ) | ||||
| | | e1 * e2 | | ||
n | | | n | | | |
1 | + | ( | 2 | * | * | ) |
Nothing “fits” here. We would get an eror.
1 + (2 * *)
Parsing Example 3
Numbers n ::= ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
e1 + e2 | ||||
| | e1 * e2 | ||
n | | n | | n |
1 | + | 2 | * | 3 |
e1 * e2 | ||||
e1 + e2 | | | ||
n | | n | | n |
1 | + | 2 | * | 3 |
Unfortunately, we can parse this program in two different ways. The problem is that the grammar for the language is ambiguous.
1 + 2 * 3
Source Of Ambiguity
Numbers n ::= ...
Expressions e ::= n
| e1 + e2
| e1 - e2
| e1 * e2
| ( e1 )
e1 * e2 | ||||
e1 + e2 | | | ||
n | | n | | n |
1 | + | 2 | * | 3 |
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
Things that can be multiplied
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
Things that can be added
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
You can no longer directly multiply �"1 + 2" by "3"
Why?
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
You can no longer directly multiply �"1 + 2" by "3"�
"3" matches n, which matches m.
However, "1 + 2" only matches e, which cannot be multiplied by m.
Grammar Disambiguation
This grammar is not ambiguous. It accepts the same concrete syntax strings as the previous grammar, but it is designed so that strings are only parsed in one way. Designing a non-ambiguous grammar is a bit of a black art. To learn more, take a course on compilers.
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
Grammar Disambiguation
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
Is this completely disambiguated?
Can you come up with different parses for the same string?
Hint:
"1 + 2 + 3"
"4 - 5 - 6"
Associativity
"left-to-right": Match the left parts first, then the right.
"right-to-left": Match the right parts first, then the left.
Associativity
e1 + e2 | ||||
| | e1 + e2 | ||
n | | n | | n |
1 | + | 2 | + | 3 |
e1 + e2 | ||||
e1 + e2 | | | ||
n | | n | | n |
1 | + | 2 | + | 3 |
1 + 2 + 3
right-to-left
left-to-right:
Used by most languages for +, -
Grammar Disambiguation
This grammar is not ambiguous. It accepts the same concrete syntax strings as the previous grammar, but it is designed so that strings are only parsed in one way. Designing a non-ambiguous grammar is a bit of a black art. To learn more, take a course on compilers.
Numbers n ::= ...
Atoms a ::= n
| ( e )
Multiplicands m ::= a
| m * a
Addends e ::= m
| e + m
| e - m
e + m | ||||
e | | m | ||
m | | m * a | ||
| | m | | |
a | | a | | a |
n | | n | | n |
1 | + | 2 | * | 3 |
You do not have to worry about disambiguation in the project. The parser that we provide takes care of it for you.
Interpreting Abstract Syntax
Try this in Ocelot
> parser.parseExpression("1 + 2 * 3");
{
value: { ... },
kind: "ok"
}
> parser.parseExpression("1 + 2 * * 3");
{
message: "Parse error. Expected '(', 'false', 'true', integer, or variable name",
kind: "error"
}
The error message lists the terminals that would have worked here.
Recall this pattern from lecture on exception handling!
> parser.parseExpression("1 + 2 * 3");
{
value: {
kind: "operator",
op: "+",
e1: {
kind: "number",
value: 1
},
e2: {
kind: "operator",
op: "*",
e1: {
kind: "number",
value: 2
},
e2: {
kind: "number",
value: 3
}
}
},
kind: "ok"
}
Check for “ok”
1
2
3
+
*
Notice: no parenthesis in the AST. They are only needed in the concrete syntax.
The shape of this object
Note how numbers are parsed
> parser.parseExpression("1");
{� value: {� kind: "number",� value: 1� },� kind: "ok"�}
All sub-expressions have a kind. This is necessary for interp.
Nodes in the AST
{� kind: "operator" | "number"
...�}
Why no "bracket" ?
Other fields depend on kind
Operator Nodes in the AST
{� kind: "operator"
op: "+" | "-" | "*" | "/"
e1:
e2: �}
Exists only if kind === "operator"
These are children AST nodes
Exercise: What does this AST look like?
{
value: {
kind: "operator", op: "+",
e1: {
kind: "number", value: 1
},
e2: {
kind: "operator", op: "+",
e1: {
kind: "operator", op: "*",
e1: {
kind: "number", value: 2
},
e2: {
kind: "number", value: 3
}
},
e2: {
kind: "number", value: 4
}
}
},
kind: "ok"
}
Exercise: What does this AST look like?
{
value: {
kind: "operator", op: "+",
e1: {
kind: "number", value: 1
},
e2: {
kind: "operator", op: "+",
e1: {
kind: "operator", op: "*",
e1: {
kind: "number", value: 2
},
e2: {
kind: "number", value: 3
}
},
e2: {
kind: "number", value: 4
}
}
},
kind: "ok"
}
1
*
4
+
+
2
3
Exercise: What is the source expression?
let expression = ???;
let r = parser.parseExpression(expression);
console.log(r.kind);
console.log(r.value.kind);
console.log(r.value.op);
console.log(r.value.e1);
console.log(r.value.e2);
ok
operator
*
{
kind: "number",
value: 5
}
{
kind: "number",
value: 3
}
Exercise: What is the source expression?
let expression = 5 * 3;
let r = parser.parseExpression(expression);
console.log(r.kind);
console.log(r.value.kind);
console.log(r.value.op);
console.log(r.value.e1);
console.log(r.value.e2);
ok
operator
*
{
kind: "number",
value: 5
}
{
kind: "number",
value: 3
}
Implementing interpExpression
Input: AST for Expression
Output: Result of executing the AST
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
???
} else {
???
}
}
Implementing interpExpression
Input: AST for Expression
Output: Result of executing the AST
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
???
} else {
assert(false); // Not a valid AST node that we understand
}
}
Implementing interpExpression
Input: AST for Expression
Output: Result of executing the AST
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
// Check e.op and accordingly do different things
} else {
assert(false);
}
}
Implementing interpExpression
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
if (e.op === '+') {
return ???;
} else if (e.op === '-') {
return ???;
} else {
???
}
} else {
assert(false);
}
}
Implementing interpExpression
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
if (e.op === '+') {
return e.e1 + e.e2;
} else if (e.op === '-') {
return ???;
} else {
???
}
} else {
assert(false);
}
}
Will this work?
Implementing interpExpression
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
if (e.op === '+') {
return e.e1 + e.e2;
} else if (e.op === '-') {
return ???;
} else {
???
}
} else {
assert(false);
}
}
e1 and e2 are not numbers. They are expressions.
Implementing interpExpression
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
} else if (e.kind === 'operator') {
if (e.op === '+') {
return interpExpression(e.e1) + interpExpression(e.e2);
} else if (e.op === '-') {
return ???;
} else {
???
}
} else {
assert(false);
}
}
Call interpExpression recursively to evaluate them
Examples: What should these print?
interpExpression({ kind: "number", value: 1 });
interpExpression({ kind: "operator", op: "+",
e1: { kind: "number", value: 2 },� e2: { kind: "number", value: 5 }
});
interpExpression({ kind: "operator", op: "+",
e1: { kind: "number", value: 4 },� e2: { kind: "operator", op: "*",
e1: { kind: "number", value: 2 },� e2: { kind: "number", value: 3 },
}
});
Examples
> interpExpression({ kind: "number", value: 1 });
1�
> interpExpression({
kind: "operator",
op: "*",
e1: { kind: "number", value: 2 },
e2: { kind: "number", value: 3 }
});
6
function interpExpression(expr) {
if (e.kind === 'number') {
return e.value;
}
else if (e.kind === 'operator') {
if (e.op === '+') {
return e.e1 + e.e2;
}
else if (e.op === '-') {
return e.e1 - e.e2;
}
else if (e.op === '*') {
return e.e1 * e.e2;
}
}
else {
assert(false);
}
}
Examples
Solution
e1 and e2 are not numbers. They are expressions.
JavaScript AST
So far, we have investigated the AST of a very small subscript of JavaScript.
What does the AST of full-feature JavaScript look like?
Next Lecture: �1. From Expressions to Programs
2. Variables
3. Flow control