Regular Languages and Grammars
LIN 103B - Linguistic Analysis
Masoud Jasbi
Formal Language Theory
A branch of linguistics, mathematics, and computer science
It builds artificial mathematical languages
Investigates their properties and complexity
In linguistics we use them as tools to better understand properties and complexities of natural languages
Formal Grammar
Formal Language
Natural Language
English
Strings
Rules
Meta-Language
Object Language
Phenomena
Model
World
The Chomsky-Schützenberger Hierarchy
Regular Languages and Finite State Automata
{ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, 👀👀👀👀👀👀, 👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀, …}
(👀)*
[FSA diagram: accepting state S0 with a 👀 self-loop]
Regular Language ↔ Regular Grammar (Expression) ↔ FSA
Strings
First let’s define an alphabet as a set of symbols called letters:
My Alphabet = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
A string as a sequence of the letters:
String1 = 👻👀🙈
String2 = 😊👍
If S1 and S2 are strings, let’s define the concatenation string S1.S2 as the first string followed by the second
String1.String2 = 👻👀🙈😊👍
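As a quick sketch (not part of the slides), string concatenation corresponds directly to Python's `+` operator on strings:

```python
# Strings over the emoji alphabet; concatenation is Python's + operator.
string1 = "👻👀🙈"
string2 = "😊👍"
concatenated = string1 + string2
print(concatenated)  # 👻👀🙈😊👍
```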
Formal Languages
Given an alphabet Σ, the set of all strings over it is defined as Σ*
My Alphabet = {👀}
My Alphabet* = {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, 👀👀👀👀👀👀, 👀👀👀👀👀👀👀, ...}
Σ* is always infinite (for any non-empty Σ)!
A formal language L is simply a subset of Σ* (i.e. a set of strings over your alphabet)
My Language = {👻, 👻🙈, 👻😊, 👻😈😊, 👻👀👌😐}
If L1 and L2 are languages, then L1.L2 = {S1.S2 | S1 ∈ L1 & S2 ∈ L2}
If {👻, 👻🙈} and {👻😈😊} are languages, then {👻, 👻🙈}.{👻😈😊} = {👻👻😈😊, 👻🙈👻😈😊}
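Language concatenation can be sketched as a set comprehension, assuming languages are finite Python sets (`concat` is a hypothetical helper name):

```python
def concat(l1, l2):
    """L1.L2 = {s1.s2 | s1 ∈ L1 and s2 ∈ L2}."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

result = concat({"👻", "👻🙈"}, {"👻😈😊"})
print(result)  # {'👻👻😈😊', '👻🙈👻😈😊'} (set order may vary)
```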
Exponents
If L is a language then:
L⁰ = {ε}
L¹ = L
L² = L ᐧ L
Lⁱ = L ᐧ Lⁱ⁻¹
Kleene Closure:
L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ … = ⋃ᵢ Lⁱ
MyLanguage = {👀}
MyLanguage⁰ = {ε}
MyLanguage¹ = {👀}
MyLanguage² = {👀👀}
MyLanguage³ = {👀👀👀}
MyLanguage* = {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, ...}
MyLanguage+ = {👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, ...}
Exponents
If L is a language then:
L⁰ = {ε}
L¹ = L
L² = L ᐧ L
Lⁱ = L ᐧ Lⁱ⁻¹
Kleene Closure:
L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ … = ⋃ᵢ Lⁱ
MyLanguage = {👻,🙈}
MyLanguage⁰ = {ε}
MyLanguage¹ = {👻, 🙈}
MyLanguage² = {👻👻, 👻🙈, 🙈👻, 🙈🙈}
MyLanguage³ = {👻👻👻, 👻👻🙈, 👻🙈👻, 👻🙈🙈, 🙈👻👻, …}
MyLanguage* = {ε, 👻, 👻👻,👻🙈,🙈👻,🙈🙈, 👻👻👻, …}
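A minimal Python sketch of these definitions, assuming finite language sets; `power` and `kleene_star` are hypothetical helper names, and since L* is infinite the star is truncated at a cutoff:

```python
def concat(l1, l2):
    # L1.L2 = {s1 + s2 | s1 in L1, s2 in L2}
    return {s1 + s2 for s1 in l1 for s2 in l2}

def power(language, i):
    # L^0 = {ε}; L^i = L . L^(i-1)
    result = {""}
    for _ in range(i):
        result = concat(language, result)
    return result

def kleene_star(language, up_to):
    # Approximates L* = L^0 ∪ L^1 ∪ L^2 ∪ … by stopping at up_to.
    out = set()
    for i in range(up_to + 1):
        out |= power(language, i)
    return out

print(sorted(power({"👻", "🙈"}, 2)))  # ['👻👻', '👻🙈', '🙈👻', '🙈🙈']
```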
Defining a Regular Grammar (Expression)
Definition: r ⩴ ∅ | ε | a (for each letter a ∈ Σ) | (r + r) | (r ᐧ r) | (r)*
Example: Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
Stephen Kleene
(1909-1994)
Regular Grammars vs. Regular Languages
Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
Interpretation rules map each regular expression to the language it denotes:
👀 ↔ {👀}
👻 ↔ {👻}
👻ᐧ👀 ↔ {👻👀}
😐ᐧ👀ᐧ🙈 ↔ {😐👀🙈}
😐+👀 ↔ {😐, 👀}
😐+👀+🙈 ↔ {😐, 👀, 🙈}
(👀)* ↔ {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, …}
(👻ᐧ👀)* ↔ {ε, 👻👀, 👻👀👻👀, 👻👀👻👀👻👀, …}
Regular Grammars vs. Regular Languages
Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐}
((😐+👍)ᐧ(👀)*) ↔ {😐, 👍, 😐👀, 👍👀, 😐👀👀, 👍👀👀, …}
😐ᐧ(👻ᐧ👀)* ↔ {😐, 😐👻👀, 😐👻👀👻👀, 😐👻👀👻👀👻👀, …}
(😐+👍+👌+🙉+🙈+😈+👻)ᐧ(👀) ↔ {😐👀, 👍👀, 👌👀, 🙉👀, 🙈👀, 😈👀, 👻👀}
(😐)ᐧ(👀+👍+👌+🙉+🙈+😈+👻) ↔ {😐👀, 😐👍, 😐👌, 😐🙉, 😐🙈, 😐😈, 😐👻}
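These denotations can be checked with Python's `re` module as a rough analogue (+ becomes |, ᐧ becomes juxtaposition, * is the same); `in_language` is a hypothetical helper, and the regex syntax is an illustration, not the formal definition:

```python
import re

def in_language(pattern, string):
    # Membership check: does the whole string match the expression?
    return re.fullmatch(pattern, string) is not None

# ((😐+👍)ᐧ(👀)*)  ->  (😐|👍)(👀)*
print(in_language("(😐|👍)(👀)*", "👍👀👀"))  # True
# 😐ᐧ(👻ᐧ👀)*  ->  😐(👻👀)*
print(in_language("😐(👻👀)*", "😐👻👀👻👀"))  # True
print(in_language("😐(👻👀)*", "😐👻"))        # False
```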
Example REs
What languages are described by these REs?
(<ᐧ3)*
((nᐧo)ᐧ(o)*)
(wᐧ(oᐧ((o)*ᐧw)))
((c+m)ᐧ(aᐧ(t+p)))
((a+(e+(i+(o+u)))))*
What REs capture these sets of strings?
{ha, haha, hahaha, hahahaha, ...}
{maybe, maaybe, maaaybe, maaaaybe, … }
{tape, cape, ape, vape}
Grammars (Descriptions) ↔ Languages ↔ Machines (FSAs)
Σ (alphabet) = {a, b, c, …, z}
Σ* = the set of all strings over Σ; a formal language is a subset of Σ*
(a)* ↔ {ε, a, aa, aaa, aaaa, aaaaa, aaaaaa, ...}
ha(ha)* ↔ {ha, haha, hahaha, hahahaha, … }
(t+(c+(v+ε)))ᐧ(aᐧ(pᐧe)) ↔ {tape, cape, ape, vape}
… ↔ {AliceHitBo, BoSawAlice, TheBookHitAlice, …} (English? Japanese?)
Exercise
Write a regular grammar that would describe the following formal language, assuming the alphabet is the English alphabet.
{do, doable, undo, undoable, touch, touchable, untouch, untouchable, see, unsee, seeable, unseeable, think, unthink, thinkable, unthinkable, make, unmake, makeable, unmakeable, label, unlabel, labelable, unlabelable, ravel, unravel, ravelable, unravelable}
Exercise
Write a regular grammar that would describe the following formal language, assuming the alphabet is the English alphabet.
{the student picked the pen, a man saw a tree, the cat killed a bird, every bird catches a worm, some humans eat some cheese, few books contain any pictures, many houses have no roof}
Finite State Automata
Finite State Automata
Definition: a finite state automaton consists of:
a finite set of states S
an alphabet Σ
a start state s0 ∈ S
a set of final (accepting) states F ⊆ S
a transition function TR: S × Σ → S
Example:
S = {s0, s1, s2}
Σ = {h, a}
F = {s2}
TR(s0, h) = s1
TR(s1, a) = s2
[FSA diagram: s0 -h-> s1 -a-> s2]
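A minimal Python sketch of such an automaton (`accepts` is a hypothetical helper; transitions are stored as a dict keyed by state and letter):

```python
def accepts(string, start, finals, tr):
    """Run a deterministic FSA: follow TR letter by letter,
    accept iff the run ends in a final state."""
    state = start
    for letter in string:
        if (state, letter) not in tr:
            return False  # no transition defined: reject
        state = tr[(state, letter)]
    return state in finals

# The automaton above: S = {s0, s1, s2}, Σ = {h, a}, F = {s2}
tr = {("s0", "h"): "s1", ("s1", "a"): "s2"}
print(accepts("ha", "s0", {"s2"}, tr))  # True
print(accepts("h", "s0", {"s2"}, tr))   # False
```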
Finite State Automata
{ha, haha, hahaha, ....}
{ε, a, aa, aaa, ....}
[FSA diagram: states S0, S1, S2; transitions labeled h, a, and ε]
ha.(ha)*
[FSA diagram: accepting state S0 with an a self-loop]
(a)*
{tape, cape, vape, ape}
[FSA diagram: states S0–S4; transitions t, c, v, or ε from S0, then a, p, e in sequence]
((t+c+v+ε)ᐧ(aᐧ(pᐧe)))
Finite-State / Regular Morphology
{play, playful, playing}
((p.l.a.y).((f.(u.l))+(i.(n.g))+ε))
(p.l.a.y).((f.u.l)+(i.n.g)+ε)
From Twitter account @happyautomata
{admire, admiration, admires}
(a.d.m.i.r).((e+(e.s))+(a.t.i.o.n))
{everybody, everyday, everyone}
(e.v.e.r.y).((b.o.d.y)+(d.a.y)+(o.n.e))
From Twitter account @happyautomata
{lock, lockable, locker, locking, locked, locks, unlock, unlockable, unlocker, unlocking, unlocks, unlocked, interlock, interlockable, interlocker, interlocking, interlocks, interlocked, deadlock, deadlockable, deadlocker, deadlocking, deadlocks, deadlocked, gridlock, gridlockable, gridlocker, gridlocking, gridlocks, gridlocked}
(ε+un+inter+dead+grid).(lock).(ε+able+er+ing+s+ed)
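The expression above can be checked against the word set in Python's `re` syntax (ε becomes an empty alternative; this is a sketch, not the slide's notation):

```python
import re

# (ε+un+inter+dead+grid).(lock).(ε+able+er+ing+s+ed)
pattern = "(|un|inter|dead|grid)lock(|able|er|ing|s|ed)"

for word in ["lock", "unlockable", "interlocking", "gridlocked", "deadlocks"]:
    assert re.fullmatch(pattern, word)

# The grammar undergenerates forms outside the listed set:
assert re.fullmatch(pattern, "lockdown") is None
```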
{lockdown, locksmith, lockbox, ...}
Structural Ambiguity: “unlockable”?
[FSA diagram: S0 -prefix (ε, un, inter, dead, grid)-> S1 -stem (lock)-> S2 -suffix (ε, able, er, ing, s, ed)-> S3]
Transitional Probabilities
Transitional probabilities can act as cues to language structure
[FSA diagram: S0 -prefix (ε, un, inter, dead, grid; each 1/5)-> S1 -stem (lock; 1)-> S2 -suffix (ε, able, er, ing, s, ed; each 1/6)-> S3]
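Under the uniform probabilities in the diagram (an assumption: each arc out of a state is equally likely), the probability of a path through the automaton is the product of its arc probabilities; a quick sketch:

```python
# Uniform transitional probabilities: 5 prefix arcs, 1 stem arc, 6 suffix arcs.
p_prefix = 1 / 5   # ε, un, inter, dead, grid
p_stem = 1         # lock is the only stem
p_suffix = 1 / 6   # ε, able, er, ing, s, ed

# Probability of generating "unlockable" along its unique path:
p_unlockable = p_prefix * p_stem * p_suffix
print(p_unlockable)  # 1/30 ≈ 0.0333
```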
The un.lock.able door can.not be open.ed
Finite State / Regular Syntax
{CaliforniansAreHappy, CaliforniansAreReallyHappy, CaliforniansAreReallyReallyHappy, …, JoeIsHappy, JoeIsVeryVeryHappy, ...}
(Californians+Joe+You+MrBond+Somebody).
(Is+Are+Was+Were).(Really+Very)*.(Happy)
[FSA diagram: S0 -(Californians, Joe, You, MrBond, Somebody)-> S1 -(Is, Are, Was, Were)-> S2 (Really/Very loop) -Happy-> S4]
Finite State / Regular Syntax
{CaliforniansAreHappy, CaliforniansAreReallyHappy, CaliforniansAreReallyReallyHappy, …, JoeIsHappy, JoeIsVeryVeryHappy, ...}
N.AUX.ADV*.ADJ
[FSA diagram: S0 -N-> S1 -AUX-> S2 (ADV loop) -ADJ-> S4]
N = {Californians, Joe, You, MrBond, Somebody}
AUX = {Are, Is, Was, Were}
ADV = {Really, Very}
ADJ = {Happy}
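Working over part-of-speech tags rather than words keeps the automaton small; a sketch using hypothetical one-letter tag codes (N, X for AUX, D for ADV, J for ADJ):

```python
import re

# N.AUX.ADV*.ADJ over tag sequences, one character per tag
pattern = "NXD*J"
print(bool(re.fullmatch(pattern, "NXJ")))    # CaliforniansAreHappy -> True
print(bool(re.fullmatch(pattern, "NXDDJ")))  # JoeIsVeryVeryHappy -> True
print(bool(re.fullmatch(pattern, "NDJ")))    # missing AUX -> False
```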
Finite State / Regular Syntax
{CaliforniansLoveTheStudents, YouLoveSomeCats, SomebodyLovedAllDogs, JoeLovesTheDogs, YouLovedAllPeople, ... }
(Californians+Joe+You+MrBond+Somebody).(Love+Loves+Loved).
(The+Some+All+Many).(Students+Others+People+Cats+Dogs)
[FSA diagram: S0 -(Californians, Joe, You, MrBond, Somebody)-> S1 -(Love, Loves, Loved)-> S2 -(The, Some, All, Many)-> S3 -(Students, Others, People, Cats, Dogs)-> S4]
Examples with POSs
N = {Alice, Bo, book, car}
V = {hit, saw, stop}
A = {green, big, exciting}
D = {the, a}
What are the denotations of these REs? What FSAs do they correspond to?
(NᐧV)
((DᐧN)ᐧV)
((Dᐧ((A)* ᐧN))ᐧV)
((Dᐧ((A)* ᐧN))ᐧ(VᐧN))
((Dᐧ((A)* ᐧN))ᐧ(Vᐧ(Dᐧ((A)* ᐧN))))
Regular Languages
Defining Regular Languages
Any language that is definable by regular grammars (expressions) or finite state automata is a regular language.
∅ is a regular language
For all the symbols in our alphabet, a ∈ Σ, {a} is a regular language
If two languages are regular, their concatenation, disjunction, and Kleene closure are also regular
L ᐧ L
L + L
L*
Properties of Regular Languages
Closure:
A class of languages ℒ is said to be closed under some operation ⦿ iff:
whenever two languages L1 and L2 are in the class (L1 ∈ ℒ, L2 ∈ ℒ)
L1 ⦿ L2 is also in the class (L1 ⦿ L2 ∈ ℒ)
Regular Languages are closed under:
Concatenation
Union
Kleene Star
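Closure under these operations is constructive: given expressions for L1 and L2, expressions for the results can be built directly, sketched here in Python regex syntax (| for union; the sample expressions are arbitrary illustrations):

```python
import re

p1, p2 = "ab", "c+"            # expressions for two regular languages
union = f"({p1})|({p2})"       # L1 ∪ L2
concatenation = f"({p1})({p2})"  # L1 . L2
star = f"({p1})*"              # L1*

assert re.fullmatch(union, "ab") and re.fullmatch(union, "ccc")
assert re.fullmatch(concatenation, "abcc")
assert re.fullmatch(star, "ababab") and re.fullmatch(star, "")
```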
Is Natural Language Regular?
Generative Capacity of a Grammar
Weak Generative Capacity: the language (set of strings) generated by the grammar
A grammar has weak generative capacity when it generates the strings it should
A grammar undergenerates when it does not generate all the grammatical strings of the language
A grammar overgenerates when it produces ungrammatical strings for the language
Strong Generative Capacity: the structures assigned by the grammar to the language (set of strings)
A grammar has strong generative capacity when it assigns the right structure descriptions
A small regular grammar
Let’s write a regular grammar or a finite state automaton over the alphabet Σ = {They, are, Vgnd, Adj, N} that would capture these generalizations about English:
“They are NP” is a sentence of English.
“They are Vgnd NP” is a sentence of English, Vgnd = running, playing, …
An NP can consist of a N preceded by zero or more Adj.
A Vgnd can also act as an Adj.
[FSA diagram: states S0–S5 with transitions labeled They, are, Vgnd, Adj, N, and ε]
Structural Ambiguity
Consider the sentence “They are flying planes” (Chomsky 1956):
How many meanings does it have? Why?
Does our grammar generate the sentence?
Does it assign appropriate structure analyses given the meanings?
The different paths in the FSA can represent different structures for the sentence.
A small regular grammar
Let’s write a regular grammar or a finite state automaton over the alphabet Σ = {D, N, P} that would capture these generalizations about English:
[FSA diagram: S0 -D-> S1 -N-> S4 -P-> S0]
Syntactic Ambiguity
Now consider these phrases:
a book
a book about the teacher
a book about the teacher with an umbrella
a book about the teacher with an umbrella on the street
…
What are the syntactic ambiguities in these phrases?
What happens if you keep adding prepositional phrases?
Are these structural/syntactic properties captured by our regular grammar?
[FSA diagram: S0 -D-> S1 -N-> S4 -P-> S0]
Strong Generative Capacity of RGs
Regular grammars fail to exhibit strong generative capacity for natural languages.
For example, as we saw, they cannot assign the right structural descriptions to noun phrases modified by prepositional phrases.
And this is not limited to noun phrases: structural ambiguity is pervasive in natural languages, and regular grammars typically fail to capture it.
Weak Generative Capacity of RGs
Regular grammars can successfully describe or generate many fragments of natural languages like nouns with prepositional phrases.
They have “weak generative capacity” for those fragments.
However, they fail even to generate the nested structures that are common in natural languages.
Nested Structures
If if students work hard then they pass, then the course is fair.
If if if students work hard then they pass, then the course is fair, then I’m happy.
The rock that the squirrel likes can be found in the garden.
The rock that the squirrel that the dog chases likes can be found in the garden.
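The problem can be abstracted: each embedded clause opener must be matched by a closer, like aⁿ…bⁿ, and a regular approximation cannot enforce the matching. A sketch, where i stands for an "if", t for a "then", and s for the innermost sentence (all hypothetical codes):

```python
import re

# A regular approximation of nested if/then: any number of openers,
# then the core sentence, then any number of closers.
approx = "i*st*"
print(bool(re.fullmatch(approx, "iistt")))  # balanced: True
print(bool(re.fullmatch(approx, "iist")))   # unbalanced, but still True:
                                            # the regular grammar overgenerates
```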
Weak Generative Capacity of RGs
Therefore, regular grammars are not suitable for describing natural languages.
Next week we are going to look at phrase structure grammars and push-down automata.