1 of 25

Syntax Analysis

res = 14 + arg * 3 (characters in a file)

Lexer gives:

Parser gives:

ASSIGN(res,� PLUS(CONST(14),

TIMES(VAR(arg),CONST(3))))

“res”, “=”, “14”, “+”, “arg”, “*”, “3” (tokens)

:=

+

*

res

arg

3

14

2 of 25

ASSIGN(res,� PLUS(CONST(14),

TIMES(VAR(arg),CONST(3))))

:=

+

*

res

arg

3

14

interpreter for

syntax trees

world �(operating system)

program

(stuff happens)

3 of 25

ASSIGN(res,� PLUS(CONST(14),

TIMES(VAR(arg),CONST(3))))

:=

+

*

res

arg

3

14

interpreter for

syntax trees

compiler for

syntax trees

world �(operating system)

program

generated code:

load #arg

push 3

i32.mul

(stuff happens)

(stuff happens)

4 of 25

Why do we need interpreters?

  • Interpreter is a simpler way to implement a language
  • Usually it is easier to build it than a compiler
  • Can be used to define the meaning of programs in a language: programs (even if compiled) should compute same as what the interpreter returned
  • Your first lab: build an interpreter
    • front end (lexer and parser) will be given to you
  • Compiler uses the fact that the program is known to prepare instructions and make later execution efficient

5 of 25

Formal Languages�and their algebra

Fundamental math, “the basic arithmetic” of programming languages.

6 of 25

Languages Formally

A word (akka string) is a finite (possibly empty) sequence of elements from some set A

A – alphabet, A* - set of all words over A

If a1,a2,a3 ∈ A, then a1a2a3 denoes word with those 3 letters

u v denotes concatenation of words u and v

By a language we mean any set of words (any subset of A* )

    • thus, for languages we have notions of:�union, intersection, and complement wrt. A*

7 of 25

Examples of Languages

A = {a,b} alphabet of two symbols

A* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, ...}

Representation in math: aab can be represented as� e.g. (a, (a, (b, ())) or { (0, a), (1, a), (2, b) }

Examples of two languages, subsets of A* :

L1 = {a, bb, ab} (finite)

L2 = {ab, abab, ababab, ... } = { (ab)n | n ≥ 1 } (infinite)

8 of 25

Formal language theory:

  • A – alphabet
  • A* - words over A
  • w1∙w2 or w1 w2 ( w1,w2∈A*)
  • w(i)∈A
  • ε – empty word
  • c ∈ A → c ∈ A*
  • |w| - word length
  • wp..q = w(p)w(p+1) …w(q-1)if w = w(0)w(1) …w(|w|-1)
  • L ⊆ A* - a language

Scala representation:

  • A – type (not just Char)
  • List[A]
  • w1 ++ w2 w1,w2:List[A]
  • w(i) : A
  • List()
  • if c:A then List(c):List[A]
  • w.length
  • w.slice(p,q)�
  • L : List[List[A]] (finite L)

L : List[A] => Boolean

9 of 25

Properties of Words

Concatenation is associative:

(u ∙ v)∙ w = u ∙ (v ∙ w)

Empty word ε is left and right identity for :

w ∙ ε = w� ε ∙ w = w

We have associative operation and the identity

In the terminology of abstract algebra, the structure

(A*, ∙, ε) is a monoid.

10 of 25

Cancellation Laws

Property:

If u v = u w

then v = w

Property:

If v u = w u

then v = w

(u, v, w are arbitrary words)

We say we have a left- and right-cancellative monoid.

11 of 25

Fact about Indexing Concatenation

Concatenation of w and v has these letters:

w(0) … w(|w|-1) v(0) … v(|v|-1)

Thus, for every i where 0 ≤ i |w|+|v|-1 :

(wv)(i) = w(i) , if i < |w|

(wv)(i) = v(i-|w|) , if i ≥ |w|

12 of 25

Fact About Slicing

If u v = w

then |u|+|v| = |w|

and u = w0..|u| and v = w|u|..|w|

13 of 25

More Properties of Length

|ε|=0

|c|=1 iff c∈ A

|w1 w2| = |w1|+|w2| w,wi∈A* �

Reverse of a Word: (abcd)-1 = dcba

ε-1= ε

c-1 =c if c∈ A

(w1 w2)-1 = w2-1 w1-1

14 of 25

Sets of Words are Languages

Formally, any set of words is a language�(there are as many of them as real numbers), e.g.:

-Empty set of words

-Set of all words A*

Language can be finite: {w1,...,wn}

Language can be infinite and impossible to describe �(e.g. “random”-like, “irrational” languages)

Can be infinite but with rules that can describe them -- those are of most interest to us here

15 of 25

Language Concatenation:

Concatenate all Pairs of Elements

Formal language theory:

L1 ⊆ A* , L2 ⊆ A*

L1L2 = {u1u2|u1∈L1 , u2∈L2}

L0 = {ε}

Ln+1 = L Ln

Scala (for finite languages)

type Lang[A] = List[List[A]]

def concatAll[A](L1: Lang[A],

L2 : Lang[A]): Lang[A]= for (w1 <- L1; w2 <- L2)

yield (w1 ++ w2)

{ Peter, Paul, Mary} { France, Germany} = � {PeterFrance, PeterGermany, � PaulFrance, PaulGermany, � MaryFrance,MaryGermany}

16 of 25

Concatenation of Sets: Properties

  • Consider an alphabet A and all languages L ⊆ A*
  • Is this a monoid?

17 of 25

Concatenation of Sets: Properties

  • Consider an alphabet A and all languages L ⊆ A*
  • Is this a monoid? - Yes!
    • (L1 L2) L3 = L1 (L2 L3) = � { w1w2w3 | w1 L1, w2 L2, w3 L3}
    • L {ε} = {ε} L = L
  • Does the cancellation law hold?

18 of 25

Concatenation of Sets: Properties

  • Consider an alphabet A and all languages L ⊆ A*
  • Is this a monoid? - Yes!
    • (L1 L2) L3 = L1 (L2 L3) = � { w1w2w3 | w1 L1, w2 L2, w3 L3}
    • L {ε} = {ε} L = L
  • Does the cancellation law hold? - No!

if L L1 = L L2 is it then L1 = L2 ?

No, ∅ L2 = ∅ L3 = ∅ for e.g. L1={a}, L2={aa}

19 of 25

Empty Language vs Empty String

Empty language is empty set ∅

contains no words!

∅ L = {w1 w2 | w1 , w2 L } = ∅

Language containing only the empty string {ε}

{ε} L = {w1 w2 | w1 {ε}, w2 L }

= {w1 w2 | w1 , w2 L }

= {ε w2 | w2 L }

= {w2 | w2 L } = L

Unlike, ∅, the language {ε} contains (one) word, ε

20 of 25

Definition of Star of a Language

L* = { w1 … wn | n ≥ 0, w1 … wn ∈ L }

= Un Ln where Ln+1 = L Ln , L0 ={ε}.

(obviously also Ln+1 = Ln L)

{a}* = {ε, a, aa, aaa, …}

{aa}* = {ε, aa, aaaa, aaaaaa,…}

{a,bb}*= {ε, a, bb, abb, bba, aa, bbbb, aabb,…}

Star allows us to define infinite languages starting from finite ones; we can use it to describe some of those infinite but reasonable languages. (Can L* be finite?)

21 of 25

When is L* finite?

Only in these two cases:

∅* = {ε} (because0={ε})

{ε}* = {ε}

22 of 25

Further Examples

Let A = {a,b}

Let L = {a,ab}

L L = { aa, aab, aba, abab }

compute LLL

L* = {ε, a, ab, aa, aab, aba, abab, aaa, ... }

Is bb inside L* ?

23 of 25

Further Examples

Let A = {a,b}

Let L = {a,ab}

L L = { aa, aab, aba, abab }

compute LLL

L* = {ε, a, ab, aa, aab, aba, abab, aaa, ... }

Is bb inside L* ?

Question: Is it the case that

L*={ w | immediately left of each b is an a }

If yes, prove it. If no, give a counterexample.

24 of 25

Precise Statement and Proof

Reminder: L* = { w1 … wn | n ≥ 0, w1 … wn ∈ L }

Claim: {a,ab}*= S where

S = {w ∈ {a,b}*|∀0≤i<|w|. if w(i) =b then: i > 0 and w(i-1)=a}

Proof. We show 1) {a,ab}*⊆S and 2) S⊆{a,ab}*.

1) {a,ab}* ⊆ S: We show: for all n, {a,ab}n ⊆ S, by induction on n

- Base case, n=0. {a,ab}0={ε}, so i<|w| is always false and '->' is true.

- Suppose {a,ab}n ⊆ S. Showing {a,ab}n+1 ⊆ S. Let w∈{a,ab}n+1 . �Then w = vw’ where w’∈{a,ab}n, v∈{a,ab}. Let i < |w| and w(i)=b.�v(0)=a, so w(0) =a and thus w(0) !=b. Therefore i > 0. Two cases:�1.1) v=a. Then w(i)=w'(i-1) . By I.H. i-1>0 and w'(i-2)=a. Thus w(i-1)=a.�1.2) v=ab. If i=1, then w(i-1)=w(0)=a, as needed. Else, i>1 so � w'(i-2)=b and by I.H. w'(i-3)=a. Thus w(i-1) =(vw')(i-1) = w'(i-3) =a.

25 of 25

Proof Continued

recall: S = {w ∈ {a,b}*|∀0≤i<|w|. if w(i) =b then: i > 0 and w(i-1)=a}

For the second direction, we first prove:

(*) If w∈S and w=w'v then w'∈S.

Proof of (*): Let i<|w'|, w'(i)=b. Then w(i)=b so w(i-1)=a and thus w'(i-1)=a.

2) S ⊆{a,ab}*. We prove, by induction on n, that for all n,

for all w, if w∈S and n=|w| then w∈{a,ab}*.

- Base case: n=0. Then w is empty string and thus in {a,ab}*.

- Let n>0. Suppose property holds for all k < n. Let w∈S, |w|=n.

There are two cases, depending on the last letter of w.

2.1) w=w'a. Then w'∈S by (*), so by IH w'∈{a,ab}*, so w∈{a,ab}*.

2.2) w=vb. By w∈S , w(|w|-2)=a, so w=w'ab. By (*), w'∈S, by IH w'∈{a,ab}*, so w∈{a,ab}*. In any case, w∈{a,ab}*. We proved the entire equality.