1 of 25

Syntax Analysis

res = 14 + arg * 3 (characters in a file)

Lexer gives:

“res”, “=”, “14”, “+”, “arg”, “*”, “3” (tokens)

Parser gives:

ASSIGN(res, PLUS(CONST(14), TIMES(VAR(arg), CONST(3)))) (syntax tree)

3 of 25

ASSIGN(res, PLUS(CONST(14), TIMES(VAR(arg), CONST(3))))

[Diagram labels: program; interpreter for syntax trees; compiler for syntax trees; world (operating system); (stuff happens); res; arg]

generated code:

load #arg
push 3
i32.mul

4 of 25

Why do we need interpreters?

An interpreter is a simpler way to implement a language
It is usually easier to build than a compiler
It can be used to define the meaning of programs in a language: a program (even if compiled) should compute the same result as the interpreter returns
Your first lab: build an interpreter

The front end (lexer and parser) will be given to you

A compiler uses the fact that the program is known in advance to prepare instructions and make later execution efficient
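
The contrast between the two approaches can be sketched on the right-hand side of the running example. This is only an illustration: the Scala types `Expr` and `Instr`, the functions `eval`, `compile`, and `run`, and the stack instructions are all assumptions made here, not the lab's actual interface or a real instruction set.

```scala
// Hypothetical AST; constructor names follow the tree on the earlier slides.
sealed trait Expr
case class Const(value: Int) extends Expr
case class Var(name: String) extends Expr
case class Plus(left: Expr, right: Expr) extends Expr
case class Times(left: Expr, right: Expr) extends Expr

// Interpreter: walks the tree every time the expression is evaluated.
def eval(e: Expr, env: Map[String, Int]): Int = e match {
  case Const(v)    => v
  case Var(x)      => env(x) // fails if x has no value in env
  case Plus(l, r)  => eval(l, env) + eval(r, env)
  case Times(l, r) => eval(l, env) * eval(r, env)
}

// Hypothetical stack-machine instructions (assumed for this sketch).
sealed trait Instr
case class Push(value: Int) extends Instr
case class Load(name: String) extends Instr
case object Add extends Instr
case object Mul extends Instr

// Compiler: walks the tree once and emits instructions in postfix order.
def compile(e: Expr): List[Instr] = e match {
  case Const(v)    => List(Push(v))
  case Var(x)      => List(Load(x))
  case Plus(l, r)  => compile(l) ++ compile(r) ++ List(Add)
  case Times(l, r) => compile(l) ++ compile(r) ++ List(Mul)
}

// Runs compiled code on an operand stack (the head of the list is the top).
def run(code: List[Instr], env: Map[String, Int]): Int =
  code.foldLeft(List.empty[Int]) {
    case (stack, Push(v))      => v :: stack
    case (stack, Load(x))      => env(x) :: stack
    case (b :: a :: rest, Add) => (a + b) :: rest
    case (b :: a :: rest, Mul) => (a * b) :: rest
    case (_, instr)            => sys.error(s"stack underflow at $instr")
  }.head

// Right-hand side of res = 14 + arg * 3.
val rhs: Expr = Plus(Const(14), Times(Var("arg"), Const(3)))
```

For any environment that binds `arg`, `run(compile(rhs), env)` and `eval(rhs, env)` should agree, which is the sense in which the interpreter defines what compiled code must compute.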

5 of 25

Formal Languages and their algebra

Fundamental math, “the basic arithmetic” of programming languages.

6 of 25

Languages Formally

A word (aka string) is a finite (possibly empty) sequence of elements from some set A

A – alphabet, A^* – set of all words over A

If a₁,a₂,a₃ ∈ A, then a₁a₂a₃ denotes the word with those 3 letters

u v denotes the concatenation of words u and v

By a language we mean any set of words (any subset of A^*)

Thus, for languages we have the notions of union, intersection, and complement wrt. A^*
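
A minimal sketch of these set operations, assuming a language is represented by its membership test. The names `Language`, `union`, `intersection`, `complement`, and the two example languages are chosen here for illustration.

```scala
// A language over alphabet A, represented by its membership test.
type Language[A] = List[A] => Boolean

def union[A](l1: Language[A], l2: Language[A]): Language[A] =
  w => l1(w) || l2(w)

def intersection[A](l1: Language[A], l2: Language[A]): Language[A] =
  w => l1(w) && l2(w)

// Complement with respect to A*: every word over A that is not in l.
def complement[A](l: Language[A]): Language[A] =
  w => !l(w)

// Two example languages over Char (hypothetical).
val startsWithA: Language[Char] = w => w.headOption.contains('a')
val evenLength: Language[Char] = w => w.length % 2 == 0
```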

7 of 25

Examples of Languages

A = {a,b} alphabet of two symbols

A^* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, ...}

Representation in math: aab can be represented as e.g. (a, (a, (b, ()))) or { (0, a), (1, a), (2, b) }

Examples of two languages, subsets of A^* :

L₁= {a, bb, ab} (finite)

L₂= {ab, abab, ababab, ... } = { (ab)ⁿ| n ≥ 1 } (infinite)
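
The two example languages could be written down as follows. This sketch uses `String` for words for brevity; the names `l1` and `inL2` are assumptions.

```scala
// L1 is finite, so it can be listed explicitly.
val l1: Set[String] = Set("a", "bb", "ab")

// L2 = { (ab)^n | n >= 1 } is infinite, so it is given by a membership test:
// w must be nonempty and equal to "ab" repeated |w|/2 times.
def inL2(w: String): Boolean =
  w.nonEmpty && w.length % 2 == 0 && w == "ab" * (w.length / 2)
```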

8 of 25

Formal language theory:                     Scala representation:

A – alphabet                                A – type (not just Char)
A* – words over A                           List[A]
w₁∙w₂ or w₁w₂ (w₁, w₂ ∈ A*)                 w1 ++ w2 (w1, w2: List[A])
w_(i) ∈ A                                   w(i): A
ε – empty word                              List()
c ∈ A → c ∈ A*                              if c: A then List(c): List[A]
|w| – word length                           w.length
w_p..q = w_(p) w_(p+1) … w_(q-1)            w.slice(p, q)
  (if w = w₍₀₎ w₍₁₎ … w_(|w|-1))
L ⊆ A* – a language                         L: List[List[A]] (finite L)
                                            L: List[A] => Boolean
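
The correspondence can be tried out directly. The concrete words below are made up for illustration.

```scala
val w1: List[Char] = List('a', 'b')
val w2: List[Char] = List('c', 'd', 'e')
val w: List[Char] = w1 ++ w2            // concatenation w1 w2

val third: Char = w(2)                  // w_(2); indices start at 0
val emptyWord: List[Char] = List()      // the empty word ε
val single: List[Char] = List('c')      // a letter viewed as a one-letter word
val len: Int = w.length                 // |w|
val middle: List[Char] = w.slice(1, 4)  // w_1..4 = w_(1) w_(2) w_(3)
```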

9 of 25

Properties of Words

Concatenation is associative:

(u ∙ v)∙ w= u ∙ (v∙ w)

Empty word ε is left and right identity for ∙ :

w ∙ ε = w and ε ∙ w = w

We have an associative operation and an identity element

In the terminology of abstract algebra, the structure

(A*, ∙, ε) is a monoid.
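
In the List representation the monoid laws can be spot-checked on concrete words. This tests instances only and proves nothing; the helper names are assumptions.

```scala
// (u ∙ v) ∙ w = u ∙ (v ∙ w) for the given words.
def isAssociative[A](u: List[A], v: List[A], w: List[A]): Boolean =
  ((u ++ v) ++ w) == (u ++ (v ++ w))

// w ∙ ε = w and ε ∙ w = w for the given word.
def hasIdentity[A](w: List[A]): Boolean =
  (w ++ List()) == w && (List() ++ w) == w
```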

10 of 25

Cancellation Laws

Property:

If u v = u w

then v = w

Property:

If v u = w u

then v= w

(u, v, w are arbitrary words)

We say we have a left- and right-cancellative monoid.

11 of 25

Fact about Indexing Concatenation

Concatenation of w and v has these letters:

w₍₀₎ … w_(|w|-1) v₍₀₎ … v_(|v|-1)

Thus, for every i where 0 ≤ i ≤ |w|+|v|-1 :

(wv)_(i)= w_(i) , if i < |w|

(wv)_(i)= v_(i-|w|) , if i ≥ |w|
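
The case split translates into a lookup that never builds the concatenated word. `concatAt` is a name chosen here, and the index is assumed to be in range.

```scala
// Letter at position i of w v, for 0 <= i <= |w| + |v| - 1.
def concatAt[A](w: List[A], v: List[A], i: Int): A =
  if (i < w.length) w(i)   // i falls inside w
  else v(i - w.length)     // i falls inside v, shifted by |w|
```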

12 of 25

Fact About Slicing

If u v = w

then |u|+|v| = |w|

and u = w_0..|u| and v = w_|u|..|w|
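
The same fact phrased with `slice`, as a check on concrete words (the function name is an assumption):

```scala
// For w = u v: |u| + |v| = |w|, u = w_0..|u| and v = w_|u|..|w|.
def slicesRecoverParts[A](u: List[A], v: List[A]): Boolean = {
  val w = u ++ v
  u.length + v.length == w.length &&
  w.slice(0, u.length) == u &&
  w.slice(u.length, w.length) == v
}
```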

13 of 25

More Properties of Length

|ε|=0

|c|=1 iff c∈ A

|w₁ w₂| = |w₁| + |w₂| (w₁, w₂ ∈ A*)

Reverse of a Word: (abcd)^-1= dcba

ε^-1= ε

c^-1=c if c∈ A

(w₁ w₂)^-1= w₂^-1w₁^-1
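
The equations for reverse can be read as a recursive definition. `rev` is a hypothetical name; Scala's built-in `reverse` computes the same result.

```scala
// Reverse by recursion on the word, following the equations above.
def rev[A](w: List[A]): List[A] = w match {
  case Nil       => Nil                  // ε^-1 = ε
  case c :: rest => rev(rest) ++ List(c) // (c rest)^-1 = rest^-1 c^-1 = rest^-1 c
}
```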

14 of 25

Sets of Words are Languages

Formally, any set of words is a language (there are as many of them as real numbers), e.g.:

- Empty set of words ∅

- Set of all words A^*

A language can be finite: {w₁,...,w_n}

A language can be infinite and impossible to describe (e.g. “random”-like, “irrational” languages)

A language can be infinite but described by rules; those are of most interest to us here

15 of 25

Language Concatenation:

Concatenate all Pairs of Elements

Formal language theory:

L₁ ⊆ A* , L₂ ⊆ A*

L₁∙L₂ = {u₁u₂|u₁∈L₁, u₂∈L₂}

L⁰= {ε}

Lⁿ⁺¹ = L ∙ Lⁿ

Scala (for finite languages)

type Lang[A] = List[List[A]]

def concatAll[A](L1: Lang[A], L2: Lang[A]): Lang[A] =
  for (w1 <- L1; w2 <- L2) yield (w1 ++ w2)

{Peter, Paul, Mary} ∙ {France, Germany} =
  {PeterFrance, PeterGermany,
   PaulFrance, PaulGermany,
   MaryFrance, MaryGermany}
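
The recursive definition of Lⁿ can be added on top of `concatAll`. In this sketch `power`, `names`, `countries`, and `pairs` are names introduced for illustration, and `concatAll` is repeated so that the block stands alone.

```scala
type Lang[A] = List[List[A]]

def concatAll[A](l1: Lang[A], l2: Lang[A]): Lang[A] =
  for (w1 <- l1; w2 <- l2) yield w1 ++ w2

// L^0 = {ε}, L^(n+1) = L ∙ L^n (n is assumed to be non-negative).
def power[A](l: Lang[A], n: Int): Lang[A] =
  if (n <= 0) List(List()) else concatAll(l, power(l, n - 1))

val names: Lang[Char] = List("Peter", "Paul", "Mary").map(_.toList)
val countries: Lang[Char] = List("France", "Germany").map(_.toList)

// The six concatenated words, shown as strings.
val pairs: List[String] = concatAll(names, countries).map(_.mkString)
```

Because the representation is a List, a word may appear more than once in a result when two different pairs concatenate to the same word.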

16 of 25

Concatenation of Sets: Properties

Consider an alphabet A and all languages L ⊆ A*
Is this a monoid?

18 of 25

Concatenation of Sets: Properties

Consider an alphabet A and all languages L ⊆ A*
Is this a monoid? - Yes!

(L₁ L₂) L₃ = L₁ (L₂ L₃) = { w₁w₂w₃ | w₁∈L₁, w₂∈L₂, w₃∈L₃}
L{ε} = {ε} L = L

Does the cancellation law hold? - No!

If L L₁ = L L₂, does it follow that L₁ = L₂?

No: take L = ∅. Then ∅ L₁ = ∅ L₂ = ∅ for e.g. L₁={a}, L₂={aa}, but L₁ ≠ L₂
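
The failure of cancellation can be checked on finite languages, here represented as `Set[String]` so that equality is set equality. The second counterexample, with a nonempty L, is not from the slides; it is added here to show that the failure does not depend on ∅. All names are assumptions.

```scala
def concat(l1: Set[String], l2: Set[String]): Set[String] =
  for (u <- l1; v <- l2) yield u + v

// Counterexample with L = ∅.
val none: Set[String] = Set()
val la: Set[String] = Set("a")
val laa: Set[String] = Set("aa")

// Counterexample with nonempty L = {ε, a}.
val l: Set[String] = Set("", "a")
val m1: Set[String] = Set("", "a", "aa")
val m2: Set[String] = Set("", "aa")
```

Both `concat(l, m1)` and `concat(l, m2)` come out as {ε, a, aa, aaa}, although `m1` and `m2` differ.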

19 of 25

Empty Language vs Empty String

Empty language is empty set ∅

contains no words!

∅ L = {w₁w₂ | w₁∈∅, w₂∈L} = ∅

Language containing only the empty string {ε}

{ε} L = {w₁w₂ | w₁∈{ε}, w₂∈L}

= {w₁w₂ | w₁=ε , w₂∈L}

= {ε w₂ | w₂∈L}

= {w₂ | w₂∈L} = L

Unlike ∅, the language {ε} contains one word: ε
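
In the finite-language representation the difference is between an empty list and a list holding one empty list. A small sketch, with names chosen here:

```scala
type Lang[A] = List[List[A]]

def concatAll[A](l1: Lang[A], l2: Lang[A]): Lang[A] =
  for (w1 <- l1; w2 <- l2) yield w1 ++ w2

val emptyLanguage: Lang[Char] = List()        // ∅: no words at all
val onlyEmptyWord: Lang[Char] = List(List())  // {ε}: exactly one word
val sample: Lang[Char] = List("a".toList, "bb".toList)
```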

20 of 25

Definition of Star of a Language

L* = { w₁ … w_n | n ≥ 0, w₁, …, w_n ∈ L }

= ⋃_(n ≥ 0) Lⁿ, where Lⁿ⁺¹ = L Lⁿ, L⁰ = {ε}.

(obviously also Lⁿ⁺¹ = Lⁿ L)

{a}* = {ε, a, aa, aaa, …}

{aa}* = {ε, aa, aaaa, aaaaaa,…}

{a,bb}*= {ε, a, bb, abb, bba, aa, bbbb, aabb,…}

Star allows us to define infinite languages starting from finite ones; we can use it to describe some of those infinite but reasonable languages. (Can L* be finite?)
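
Since L* is usually infinite, a program can only list a bounded part of it. The sketch below collects the words of L* with at most `maxLen` letters for a finite L. `starUpTo` is a hypothetical helper, words are `String`s, and `maxLen` is assumed to be non-negative.

```scala
// Words of L* of length at most maxLen, found by repeatedly appending
// words of L until nothing new appears.
def starUpTo(l: Set[String], maxLen: Int): Set[String] = {
  def grow(known: Set[String]): Set[String] = {
    val extended =
      for (u <- known; v <- l if (u + v).length <= maxLen) yield u + v
    val next = known ++ extended
    if (next == known) known else grow(next)
  }
  grow(Set("")) // start from L^0 = {ε}
}
```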

21 of 25

When is L* finite?

Only in these two cases:

∅* = {ε} (because ∅⁰={ε})

{ε}* = {ε}


23 of 25

Further Examples

Let A = {a,b}

Let L = {a,ab}

L L = { aa, aab, aba, abab }

compute LLL

L* = {ε, a, ab, aa, aab, aba, abab, aaa, ... }

Is bb inside L* ?

Question: Is it the case that

L*={ w | immediately left of each b is an a }

If yes, prove it. If no, give a counterexample.
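
Before attempting a proof, the question can be explored by brute force on short words. This is a sanity check on finitely many words, not a proof. The names `inS`, `inStar`, `words`, and `agree` are introduced here.

```scala
// The candidate description: immediately left of each b is an a.
def inS(w: String): Boolean =
  w.indices.forall(i => w(i) != 'b' || (i > 0 && w(i - 1) == 'a'))

// Membership in L* for a finite L: w is ε, or w = u w' where
// u is a nonempty word of L and w' is again in L*.
def inStar(l: Set[String], w: String): Boolean =
  w.isEmpty ||
    l.exists(u => u.nonEmpty && w.startsWith(u) && inStar(l, w.drop(u.length)))

// All words over the alphabet with exactly n letters.
def words(alphabet: List[Char], n: Int): List[String] =
  if (n == 0) List("")
  else for (c <- alphabet; w <- words(alphabet, n - 1)) yield s"$c$w"

// Do the two descriptions agree on every word over {a,b} of length at most 8?
val agree: Boolean =
  (0 to 8).forall { n =>
    words(List('a', 'b'), n).forall(w => inS(w) == inStar(Set("a", "ab"), w))
  }
```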

24 of 25

Precise Statement and Proof

Reminder: L* = { w₁ … w_n | n ≥ 0, w₁, …, w_n ∈ L }

Claim: {a,ab}* = S where

S = {w ∈ {a,b}* | ∀ 0≤i<|w|. if w_(i)=b then: i > 0 and w_(i-1)=a}

Proof. We show 1) {a,ab}* ⊆ S and 2) S ⊆ {a,ab}*.

1) {a,ab}* ⊆ S: We show: for all n, {a,ab}ⁿ ⊆ S, by induction on n

- Base case, n=0. {a,ab}⁰={ε} and |ε|=0, so 0≤i<|w| is always false and the implication is vacuously true.

- Suppose {a,ab}ⁿ ⊆ S. Showing {a,ab}ⁿ⁺¹ ⊆ S. Let w∈{a,ab}ⁿ⁺¹.
Then w = vw' where w'∈{a,ab}ⁿ, v∈{a,ab}. Let i < |w| and w_(i)=b.
v₍₀₎=a, so w₍₀₎=a and thus w₍₀₎≠b. Therefore i > 0. Two cases:
1.1) v=a. Then w'_(i-1)=w_(i)=b. By I.H. i-1>0 and w'_(i-2)=a. Thus w_(i-1)=w'_(i-2)=a.
1.2) v=ab. If i=1, then w_(i-1)=w₍₀₎=a, as needed. Else, i>1 so w'_(i-2)=w_(i)=b and by I.H. i-2>0 and w'_(i-3)=a. Thus w_(i-1)=(vw')_(i-1)=w'_(i-3)=a.

25 of 25

Proof Continued

recall: S = {w ∈ {a,b}*|∀0≤i<|w|. if w_(i)=b then: i > 0 and w_(i-1)=a}

For the second direction, we first prove:

(*) If w∈S and w=w'v then w'∈S.

Proof of (*): Let i<|w'|, w'_(i)=b. Then w_(i)=b, so since w∈S we have i>0 and w_(i-1)=a, and thus w'_(i-1)=a.

2) S ⊆{a,ab}*. We prove, by induction on n, that for all n,

for all w, if w∈S and n=|w| then w∈{a,ab}*.

- Base case: n=0. Then w is empty string and thus in {a,ab}*.

- Let n>0. Suppose property holds for all k < n. Let w∈S, |w|=n.

There are two cases, depending on the last letter of w.

2.1) w=w'a. Then w'∈S by (*), so by IH w'∈{a,ab}*, so w∈{a,ab}*.

2.2) w=vb. By w∈S, |w|≥2 and w_(|w|-2)=a, so w=w'ab. By (*), w'∈S, and by IH w'∈{a,ab}*, so w∈{a,ab}*.

In any case, w∈{a,ab}*. We proved the entire equality.
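
The second direction of the proof is constructive: it peels a or ab off the right end of the word. Read as a program, it might look like the following sketch, where `decompose` is a hypothetical name and words are `String`s.

```scala
// Splits w into pieces "a" and "ab" by peeling from the right,
// mirroring cases 2.1 and 2.2. Returns None when no such split exists.
def decompose(w: String): Option[List[String]] = {
  def loop(rest: String, acc: List[String]): Option[List[String]] =
    if (rest.isEmpty) Some(acc)                                         // base case
    else if (rest.endsWith("ab")) loop(rest.dropRight(2), "ab" :: acc)  // case 2.2
    else if (rest.endsWith("a")) loop(rest.dropRight(1), "a" :: acc)    // case 2.1
    else None // last letter is a b with no a before it, or not in {a,b}
  loop(w, Nil)
}
```

For example, `decompose("aabab")` yields the pieces a, ab, ab, while `decompose("abb")` yields `None` because the final b has no a immediately to its left.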