Syntax Analysis
res = 14 + arg * 3 (characters in a file)
Lexer gives:
Parser gives:
ASSIGN(res,� PLUS(CONST(14),
TIMES(VAR(arg),CONST(3))))
“res”, “=”, “14”, “+”, “arg”, “*”, “3” (tokens)
:=
+
*
res
arg
3
14
ASSIGN(res,� PLUS(CONST(14),
TIMES(VAR(arg),CONST(3))))
:=
+
*
res
arg
3
14
interpreter for
syntax trees
world �(operating system)
program
(stuff happens)
ASSIGN(res,� PLUS(CONST(14),
TIMES(VAR(arg),CONST(3))))
:=
+
*
res
arg
3
14
interpreter for
syntax trees
compiler for
syntax trees
world �(operating system)
program
generated code:
load #arg
push 3
i32.mul
(stuff happens)
(stuff happens)
Why do we need interpreters?
Formal Languages�and their algebra
Fundamental math, “the basic arithmetic” of programming languages.
Languages Formally
A word (akka string) is a finite (possibly empty) sequence of elements from some set A
A – alphabet, A* - set of all words over A
If a1,a2,a3 ∈ A, then a1a2a3 denoes word with those 3 letters
u v denotes concatenation of words u and v
By a language we mean any set of words (any subset of A* )
Examples of Languages
A = {a,b} alphabet of two symbols
A* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, ...}
Representation in math: aab can be represented as� e.g. (a, (a, (b, ())) or { (0, a), (1, a), (2, b) }
Examples of two languages, subsets of A* :
L1 = {a, bb, ab} (finite)
L2 = {ab, abab, ababab, ... } = { (ab)n | n ≥ 1 } (infinite)
Formal language theory:
Scala representation:
L : List[A] => Boolean
Properties of Words
Concatenation is associative:
(u ∙ v)∙ w = u ∙ (v ∙ w)
Empty word ε is left and right identity for ∙ :
w ∙ ε = w� ε ∙ w = w
We have associative operation and the identity
In the terminology of abstract algebra, the structure
(A*, ∙, ε) is a monoid.
Cancellation Laws
Property:
If u v = u w
then v = w
Property:
If v u = w u
then v = w
(u, v, w are arbitrary words)
We say we have a left- and right-cancellative monoid.
Fact about Indexing Concatenation
Concatenation of w and v has these letters:
w(0) … w(|w|-1) v(0) … v(|v|-1)
Thus, for every i where 0 ≤ i ≤ |w|+|v|-1 :
(wv)(i) = w(i) , if i < |w|
(wv)(i) = v(i-|w|) , if i ≥ |w|
Fact About Slicing
If u v = w
then |u|+|v| = |w|
and u = w0..|u| and v = w|u|..|w|
More Properties of Length
|ε|=0
|c|=1 iff c∈ A
|w1 w2| = |w1|+|w2| w,wi∈A* �
Reverse of a Word: (abcd)-1 = dcba
ε-1= ε
c-1 =c if c∈ A
(w1 w2)-1 = w2-1 w1-1
Sets of Words are Languages
Formally, any set of words is a language�(there are as many of them as real numbers), e.g.:
-Empty set of words ∅
-Set of all words A*
Language can be finite: {w1,...,wn}
Language can be infinite and impossible to describe �(e.g. “random”-like, “irrational” languages)
Can be infinite but with rules that can describe them -- those are of most interest to us here
Language Concatenation:
Concatenate all Pairs of Elements
Formal language theory:
L1 ⊆ A* , L2 ⊆ A*
L1∙L2 = {u1u2|u1∈L1 , u2∈L2}
L0 = {ε}
Ln+1 = L ∙ Ln
Scala (for finite languages)
type Lang[A] = List[List[A]]
def concatAll[A](L1: Lang[A],
L2 : Lang[A]): Lang[A]= for (w1 <- L1; w2 <- L2)
yield (w1 ++ w2)
{ Peter, Paul, Mary} ∙ { France, Germany} = � {PeterFrance, PeterGermany, � PaulFrance, PaulGermany, � MaryFrance,MaryGermany}
Concatenation of Sets: Properties
Concatenation of Sets: Properties
Concatenation of Sets: Properties
if L L1 = L L2 is it then L1 = L2 ?
No, ∅ L2 = ∅ L3 = ∅ for e.g. L1={a}, L2={aa}
Empty Language vs Empty String
Empty language is empty set ∅
contains no words!
∅ L = {w1 w2 | w1 ∈ ∅, w2 ∈ L } = ∅
Language containing only the empty string {ε}
{ε} L = {w1 w2 | w1 ∈ {ε}, w2 ∈ L }
= {w1 w2 | w1 =ε , w2 ∈ L }
= {ε w2 | w2 ∈ L }
= {w2 | w2 ∈ L } = L
Unlike, ∅, the language {ε} contains (one) word, ε
Definition of Star of a Language
L* = { w1 … wn | n ≥ 0, w1 … wn ∈ L }
= Un Ln where Ln+1 = L Ln , L0 ={ε}.
(obviously also Ln+1 = Ln L)
{a}* = {ε, a, aa, aaa, …}
{aa}* = {ε, aa, aaaa, aaaaaa,…}
{a,bb}*= {ε, a, bb, abb, bba, aa, bbbb, aabb,…}
Star allows us to define infinite languages starting from finite ones; we can use it to describe some of those infinite but reasonable languages. (Can L* be finite?)
When is L* finite?
Only in these two cases:
∅* = {ε} (because ∅0={ε})
{ε}* = {ε}
Further Examples
Let A = {a,b}
Let L = {a,ab}
L L = { aa, aab, aba, abab }
compute LLL
L* = {ε, a, ab, aa, aab, aba, abab, aaa, ... }
Is bb inside L* ?
Further Examples
Let A = {a,b}
Let L = {a,ab}
L L = { aa, aab, aba, abab }
compute LLL
L* = {ε, a, ab, aa, aab, aba, abab, aaa, ... }
Is bb inside L* ?
Question: Is it the case that
L*={ w | immediately left of each b is an a }
If yes, prove it. If no, give a counterexample.
Precise Statement and Proof
Reminder: L* = { w1 … wn | n ≥ 0, w1 … wn ∈ L }
Claim: {a,ab}*= S where
S = {w ∈ {a,b}*|∀0≤i<|w|. if w(i) =b then: i > 0 and w(i-1)=a}
Proof. We show 1) {a,ab}*⊆S and 2) S⊆{a,ab}*.
1) {a,ab}* ⊆ S: We show: for all n, {a,ab}n ⊆ S, by induction on n
- Base case, n=0. {a,ab}0={ε}, so i<|w| is always false and '->' is true.
- Suppose {a,ab}n ⊆ S. Showing {a,ab}n+1 ⊆ S. Let w∈{a,ab}n+1 . �Then w = vw’ where w’∈{a,ab}n, v∈{a,ab}. Let i < |w| and w(i)=b.�v(0)=a, so w(0) =a and thus w(0) !=b. Therefore i > 0. Two cases:�1.1) v=a. Then w(i)=w'(i-1) . By I.H. i-1>0 and w'(i-2)=a. Thus w(i-1)=a.�1.2) v=ab. If i=1, then w(i-1)=w(0)=a, as needed. Else, i>1 so � w'(i-2)=b and by I.H. w'(i-3)=a. Thus w(i-1) =(vw')(i-1) = w'(i-3) =a.
Proof Continued
recall: S = {w ∈ {a,b}*|∀0≤i<|w|. if w(i) =b then: i > 0 and w(i-1)=a}
For the second direction, we first prove:
(*) If w∈S and w=w'v then w'∈S.
Proof of (*): Let i<|w'|, w'(i)=b. Then w(i)=b so w(i-1)=a and thus w'(i-1)=a.
2) S ⊆{a,ab}*. We prove, by induction on n, that for all n,
for all w, if w∈S and n=|w| then w∈{a,ab}*.
- Base case: n=0. Then w is empty string and thus in {a,ab}*.
- Let n>0. Suppose property holds for all k < n. Let w∈S, |w|=n.
There are two cases, depending on the last letter of w.
2.1) w=w'a. Then w'∈S by (*), so by IH w'∈{a,ab}*, so w∈{a,ab}*.
2.2) w=vb. By w∈S , w(|w|-2)=a, so w=w'ab. By (*), w'∈S, by IH w'∈{a,ab}*, so w∈{a,ab}*. In any case, w∈{a,ab}*. We proved the entire equality.