1 of 52

Parsers and Interpreters

Part 1: Grammar and ASTs

2 of 52

Our View Of The World So Far

Program Source Code (string)

Compiler / Interpreter

(Magic Happens Here)

Results (string / image)

No, this is not magic
We can actually implement much of this with what we have learned so far in 220!

3 of 52

Overview

Definition: An interpreter is a program that runs other programs.

Example: Your web browser runs JavaScript programs, thus it has a JavaScript interpreter. (Strictly speaking, your web browser has a Just-in-time (JIT) compiler, but an interpreter is an essential part of a JIT.)

This week’s project: You will write an interpreter for a tiny portion of JavaScript. (If you want to learn more, take COMPSCI 497P.)

4 of 52

Example Behavior

function interpProgram(inputString) {

...

return resultValue;

}

interpProgram("print(2 + 3)"); // prints 5

interpProgram("let x = 2; print(x * 10)"); // prints 20

interpProgram("let n = 5; let r = 0;" +

"while (n > 0) { r = r * n; n = n - 1; }" +

"print(r);"); // prints 120

interp does not directly process inputString.

5 of 52

Why interp does not directly process inputString

1 + 2 * 3

= 3 * 3

= 9

What does “1 + 2 * 3” mean?

1 + 2 * 3

= 1 * 6

= 6

It is mathematical convention to treat “1 + 2 * 3” as “1 + (2 * 3)”. However, this choice is arbitrary. An interpreter could produce either answer, but we need to choose one.

6 of 52

Project in 187

7 of 52

Operator Precedence: �Exploiting Convention To Resolve Ambiguity

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Operator_Precedence

8 of 52

Examples of Disambiguation by Operator Precedence

let a = new MyClass().toString();

let b = ++foo.bar;

c += ++d;

(Don't write such code!)

9 of 52

Why interp does not directly process inputString

Abstract syntax tree (AST): internal representation of programs, without ambiguity.

(1 + 2) * 3

1 + (2 * 3)

1

2

3

+

*

3

1

2

*

+

Concrete syntax: what programs look like to the programmer.

7

6

Result of interp.

Notice: no parenthesis in the AST. They are only needed in the concrete syntax.

parse (provided)

interp (project)

10 of 52

Parsing

This section introduces parsing. For the project, you will not need to write a parser yourself. However, you will need to understand what it does to write the interpreter.

A parser is a program that turns concrete syntax into an abstract syntax tree. If the input program is not parsable, the parser produces and error, along with an error message that describes what went wrong. A grammar describes the strings that the parser accepts.

11 of 52

A Grammar for Arithmetic

The numbers and the symbols (, ), +, -, and * are called terminals.
The letters e and n are called non-terminals. Non-terminals do not actually appear in the concrete syntax. They are effectively variables of the grammar.
The grammar describes the shape of a tree that matches a string.

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

12 of 52

Grammar: Terminals

Consider the simplified grammar, that consists only of numbers.
These strings are valid given the grammar:�12�11�-12�-100�
These strings are not valid:�12.0�12-�11-12�

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Terminals must match strings exactly. What about combinations?

13 of 52

Grammar: Combinations

Non-terminals define how terminals (exact strings) and other non-terminals can be combined.
Note that we now have the string "+" as a terminal that must be matched for an expression, between two numbers.
Valid Expressions:�12 + 1�-1 + 2�-9 + -8

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Expressions e ::= n₁ + n₂

What about different forms of combinations?

4. Invalid Expressions:� 12 + � 12 + +12� 1 - 2�

14 of 52

Grammar: Alternative rules

A grammar may optionally have several alternative rules
Expressions can be addition, or subtraction
Valid Expressions:�12 + 1�-1 + 2�3 - 2�-9 - -8
Invalid expressions:�1 + 2 + 3

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Expressions e ::= n₁ + n₂

| n₁ - n₂

These expressions can only match strings with up to three tokens (number op number).

What about longer expressions?

15 of 52

Grammar: Recursive forms

A grammar may have recursive definitions
Note: An expression now consists of other expressions, not just numbers
A number is also an expression
Valid Expressions:�23�-45�12 + 1 - 3�-1 + 2�3 - 2�-9 - -8

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

16 of 52

Composition Via Grammar

This defines the grammar of simple set of arithmetic expressions ("Expressions")
Note: In the grammar definition, there is nothing special about the labels "Numbers" and "Expressions"
They can be changed to other names, and grammar can include other forms as well (as we shall see)

Numbers n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

17 of 52

Composition Via Grammar

This defines the grammar of simple set of arithmetic expressions ("Bar")

Foo n ::= ... | -2 | -1 | 0 | 1 | 2 | ...

Bar e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

18 of 52

Parsing Example 1

Numbers n ::= ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

e1 + e2
		( e1 )
			e1 * e2
n			n		n
1	+	(	2	*	3	)

An example of a parse tree, that shows which rule is triggered at each level. The leaves of the tree must be terminals that appear in the program.

1 + (2 * 3)

19 of 52

Parsing Example 2

Numbers n ::= ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

e1 + e2
		( e1 )
			e1 * e2
n			n
1	+	(	2	*	*	)

Nothing “fits” here. We would get an eror.

1 + (2 * *)

20 of 52

Parsing Example 3

Numbers n ::= ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

e1 + e2
		e1 * e2
n		n		n
1	+	2	*	3

e1 * e2
e1 + e2
n		n		n
1	+	2	*	3

Unfortunately, we can parse this program in two different ways. The problem is that the grammar for the language is ambiguous.

1 + 2 * 3

21 of 52

Source Of Ambiguity

Numbers n ::= ...

Expressions e ::= n

| e₁ + e₂

| e₁ - e₂

| e₁ * e₂

| ( e₁ )

e1 * e2
e1 + e2
n		n		n
1	+	2	*	3

Multiplication expression defined as e₁ * e₂
Addition expression defined as e₁ + e₂
Hence note that the e₁in multiplication can be an addition expression, without brackets!
Can we disallow multiplication of addition expressions without brackets?

22 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

23 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

Things that can be multiplied

24 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

Things that can be added

25 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

You can no longer directly multiply �"1 + 2" by "3"

Why?

26 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

You can no longer directly multiply �"1 + 2" by "3"�

"3" matches n, which matches m.

However, "1 + 2" only matches e, which cannot be multiplied by m.

27 of 52

Grammar Disambiguation

This grammar is not ambiguous. It accepts the same concrete syntax strings as the previous grammar, but it is designed so that strings are only parsed in one way. Designing a non-ambiguous grammar is a bit of a black art. To learn more, take a course on compilers.

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

28 of 52

Grammar Disambiguation

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

Is this completely disambiguated?

Can you come up with different parses for the same string?

Hint:

"1 + 2 + 3"

"4 - 5 - 6"

29 of 52

Associativity

"left-to-right": Match the left parts first, then the right.

"right-to-left": Match the right parts first, then the left.

30 of 52

Associativity

e1 + e2
		e1 + e2
n		n		n
1	+	2	+	3

e1 + e2
e1 + e2
n		n		n
1	+	2	+	3

1 + 2 + 3

right-to-left

left-to-right:

Used by most languages for +, -

31 of 52

Grammar Disambiguation

This grammar is not ambiguous. It accepts the same concrete syntax strings as the previous grammar, but it is designed so that strings are only parsed in one way. Designing a non-ambiguous grammar is a bit of a black art. To learn more, take a course on compilers.

Numbers n ::= ...

Atoms a ::= n

| ( e )

Multiplicands m ::= a

| m * a

Addends e ::= m

| e + m

| e - m

e + m
e		m
m		m * a
		m
a		a		a
n		n		n
1	+	2	*	3

You do not have to worry about disambiguation in the project. The parser that we provide takes care of it for you.

32 of 52

Interpreting Abstract Syntax

33 of 52

Try this in Ocelot

> parser.parseExpression("1 + 2 * 3");

{

value: { ... },

kind: "ok"

}

> parser.parseExpression("1 + 2 * * 3");

{

message: "Parse error. Expected '(', 'false', 'true', integer, or variable name",

kind: "error"

}

The error message lists the terminals that would have worked here.

Recall this pattern from lecture on exception handling!

34 of 52

> parser.parseExpression("1 + 2 * 3");

{

value: {

kind: "operator",

op: "+",

e1: {

kind: "number",

value: 1

},

e2: {

kind: "operator",

op: "*",

e1: {

kind: "number",

value: 2

},

e2: {

kind: "number",

value: 3

}

},

kind: "ok"

}

Check for “ok”

1

2

3

+

*

Notice: no parenthesis in the AST. They are only needed in the concrete syntax.

The shape of this object

35 of 52

Note how numbers are parsed

> parser.parseExpression("1");

{� value: {� kind: "number",� value: 1� },� kind: "ok"�}

All sub-expressions have a kind. This is necessary for interp.

36 of 52

Nodes in the AST

{� kind: "operator" | "number"

...�}

Why no "bracket" ?

Other fields depend on kind

37 of 52

Operator Nodes in the AST

{� kind: "operator"

op: "+" | "-" | "*" | "/"

e1:

e2: �}

Exists only if kind === "operator"

These are children AST nodes

38 of 52

Exercise: What does this AST look like?

{

value: {

kind: "operator", op: "+",

e1: {

kind: "number", value: 1

},

e2: {

kind: "operator", op: "+",

e1: {

kind: "operator", op: "*",

e1: {

kind: "number", value: 2

},

e2: {

kind: "number", value: 3

}

},

e2: {

kind: "number", value: 4

}

},

kind: "ok"

}

39 of 52

Exercise: What does this AST look like?

{

value: {

kind: "operator", op: "+",

e1: {

kind: "number", value: 1

},

e2: {

kind: "operator", op: "+",

e1: {

kind: "operator", op: "*",

e1: {

kind: "number", value: 2

},

e2: {

kind: "number", value: 3

}

},

e2: {

kind: "number", value: 4

}

},

kind: "ok"

}

1

*

4

+

2

3

40 of 52

Exercise: What is the source expression?

let expression = ???;

let r = parser.parseExpression(expression);

console.log(r.kind);

console.log(r.value.kind);

console.log(r.value.op);

console.log(r.value.e1);

console.log(r.value.e2);

ok

operator

*

{

kind: "number",

value: 5

}

{

kind: "number",

value: 3

}

41 of 52

Exercise: What is the source expression?

let expression = 5 * 3;

let r = parser.parseExpression(expression);

console.log(r.kind);

console.log(r.value.kind);

console.log(r.value.op);

console.log(r.value.e1);

console.log(r.value.e2);

ok

operator

*

{

kind: "number",

value: 5

}

{

kind: "number",

value: 3

}

42 of 52

Implementing interpExpression

Input: AST for Expression

Output: Result of executing the AST

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

???

} else {

???

}

43 of 52

Implementing interpExpression

Input: AST for Expression

Output: Result of executing the AST

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

???

} else {

assert(false); // Not a valid AST node that we understand

}

44 of 52

Implementing interpExpression

Input: AST for Expression

Output: Result of executing the AST

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

// Check e.op and accordingly do different things

} else {

assert(false);

}

45 of 52

Implementing interpExpression

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

if (e.op === '+') {

return ???;

} else if (e.op === '-') {

return ???;

} else {

???

}

} else {

assert(false);

}

46 of 52

Implementing interpExpression

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

if (e.op === '+') {

return e.e1 + e.e2;

} else if (e.op === '-') {

return ???;

} else {

???

}

} else {

assert(false);

}

Will this work?

47 of 52

Implementing interpExpression

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

if (e.op === '+') {

return e.e1 + e.e2;

} else if (e.op === '-') {

return ???;

} else {

???

}

} else {

assert(false);

}

e1 and e2 are not numbers. They are expressions.

48 of 52

Implementing interpExpression

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

} else if (e.kind === 'operator') {

if (e.op === '+') {

return interpExpression(e.e1) + interpExpression(e.e2);

} else if (e.op === '-') {

return ???;

} else {

???

}

} else {

assert(false);

}

Call interpExpression recursively to evaluate them

49 of 52

Examples: What should these print?

interpExpression({ kind: "number", value: 1 });

interpExpression({ kind: "operator", op: "+",

e1: { kind: "number", value: 2 },� e2: { kind: "number", value: 5 }

});

interpExpression({ kind: "operator", op: "+",

e1: { kind: "number", value: 4 },� e2: { kind: "operator", op: "*",

e1: { kind: "number", value: 2 },� e2: { kind: "number", value: 3 },

}

});

50 of 52

Examples

> interpExpression({ kind: "number", value: 1 });

1�

> interpExpression({

kind: "operator",

op: "*",

e1: { kind: "number", value: 2 },

e2: { kind: "number", value: 3 }

});

6

function interpExpression(expr) {

if (e.kind === 'number') {

return e.value;

}

else if (e.kind === 'operator') {

if (e.op === '+') {

return e.e1 + e.e2;

}

else if (e.op === '-') {

return e.e1 - e.e2;

}

else if (e.op === '*') {

return e.e1 * e.e2;

}

else {

assert(false);

}

Examples

Solution

e1 and e2 are not numbers. They are expressions.

51 of 52

JavaScript AST

So far, we have investigated the AST of a very small subscript of JavaScript.

What does the AST of full-feature JavaScript look like?

https://astexplorer.net/

52 of 52

Next Lecture: �1. From Expressions to Programs

2. Variables

3. Flow control