Regular Languages and Grammars
LIN 103B - Linguistic Analysis
Masoud Jasbi
Formal Language Theory
A branch of linguistics, mathematics, and computer science
It builds artificial mathematical languages
Investigates their properties and complexity
In linguistics we use them as tools to better understand properties and complexities of natural languages
Formal Grammar
Formal Language
Natural Language
English
Strings
Rules
Meta-Language
Object Language
Phenomena
Model
World
The Chomsky-Schützenberger Hierarchy
Regular Languages and Finite State Automata
{ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, 👀👀👀👀👀👀, 👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀, 👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀, …}
(👀)*
[FSA diagram: accepting state S0 with a 👀 self-loop]
Regular Language ↔ Regular Grammar (Expression) ↔ FSA
Strings
First let’s define an alphabet as a set of symbols called letters:
My Alphabet = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
A string as a sequence of the letters:
String1 = 👻👀🙈
String2 = 😊👍
If S1 and S2 are strings, let’s define the concatenation string S1.S2 as the first string followed by the second
String1.String2 = 👻👀🙈😊👍
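As a quick sketch (not part of the slides), string concatenation corresponds directly to Python's `+` operator on strings:

```python
# Strings over the emoji alphabet; concatenation is Python's + operator.
string1 = "👻👀🙈"
string2 = "😊👍"
concatenated = string1 + string2
print(concatenated)  # 👻👀🙈😊👍
```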
Formal Languages
Given an alphabet Σ, the set of all strings over it is defined as Σ*
My Alphabet = {👀}
My Alphabet* = {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, 👀👀👀👀👀👀, 👀👀👀👀👀👀👀, ...}
Σ* is always infinite (for any non-empty Σ)!
A formal language L is simply a subset of Σ* (i.e. a set of strings over your alphabet)
My Language = {👻, 👻🙈, 👻😊, 👻😈😊, 👻👀👌😐}
If L1 and L2 are languages, then L1.L2 = {S1.S2 | S1 ∈ L1 & S2 ∈ L2}
If {👻, 👻🙈} and {👻😈😊} are languages, then {👻, 👻🙈}.{👻😈😊} = {👻👻😈😊, 👻🙈👻😈😊}
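Language concatenation can be sketched as a set comprehension, assuming languages are finite Python sets (`concat` is a hypothetical helper name):

```python
def concat(l1, l2):
    """L1.L2 = {s1.s2 | s1 ∈ L1 and s2 ∈ L2}."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

result = concat({"👻", "👻🙈"}, {"👻😈😊"})
print(result)  # {'👻👻😈😊', '👻🙈👻😈😊'} (set order may vary)
```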
Exponents
If L is a language then:
L⁰ = {ε}
L¹ = L
L² = L ᐧ L
Lⁱ = L ᐧ Lⁱ⁻¹
Kleene Closure:
L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ … = ⋃ᵢ Lⁱ
MyLanguage = {👀}
MyLanguage⁰ = {ε}
MyLanguage¹ = {👀}
MyLanguage² = {👀👀}
MyLanguage³ = {👀👀👀}
MyLanguage* = {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, ...}
MyLanguage+ = {👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, ...}
Exponents
If L is a language then:
L⁰ = {ε}
L¹ = L
L² = L ᐧ L
Lⁱ = L ᐧ Lⁱ⁻¹
Kleene Closure:
L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ … = ⋃ᵢ Lⁱ
MyLanguage = {👻,🙈}
MyLanguage⁰ = {ε}
MyLanguage¹ = {👻, 🙈}
MyLanguage² = {👻👻, 👻🙈, 🙈👻, 🙈🙈}
MyLanguage³ = {👻👻👻, 👻👻🙈, 👻🙈👻, 👻🙈🙈, 🙈👻👻, …}
MyLanguage* = {ε, 👻, 👻👻,👻🙈,🙈👻,🙈🙈, 👻👻👻, …}
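A minimal Python sketch of these definitions, assuming finite language sets; `power` and `kleene_star` are hypothetical helper names, and since L* is infinite the star is truncated at a cutoff:

```python
def concat(l1, l2):
    # L1.L2 = {s1 + s2 | s1 in L1, s2 in L2}
    return {s1 + s2 for s1 in l1 for s2 in l2}

def power(language, i):
    # L^0 = {ε}; L^i = L . L^(i-1)
    result = {""}
    for _ in range(i):
        result = concat(language, result)
    return result

def kleene_star(language, up_to):
    # Approximates L* = L^0 ∪ L^1 ∪ L^2 ∪ … by stopping at up_to.
    out = set()
    for i in range(up_to + 1):
        out |= power(language, i)
    return out

print(sorted(power({"👻", "🙈"}, 2)))  # ['👻👻', '👻🙈', '🙈👻', '🙈🙈']
```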
Defining a Regular Grammar (Expression)
Definition: r ⩴ ∅ | ε | a (for each letter a ∈ Σ) | (r + r) | (r ᐧ r) | (r)*
Example: Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
Stephen Kleene
(1909-1994)
Regular Grammars vs. Regular Languages
Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐, 😊}
Interpretation rules map each regular expression to the language it denotes:
👀 ↔ {👀}
👻 ↔ {👻}
👻ᐧ👀 ↔ {👻👀}
😐ᐧ👀ᐧ🙈 ↔ {😐👀🙈}
😐+👀 ↔ {😐, 👀}
😐+👀+🙈 ↔ {😐, 👀, 🙈}
(👀)* ↔ {ε, 👀, 👀👀, 👀👀👀, 👀👀👀👀, 👀👀👀👀👀, …}
(👻ᐧ👀)* ↔ {ε, 👻👀, 👻👀👻👀, 👻👀👻👀👻👀, …}
Regular Grammars vs. Regular Languages
Σ = {👀, 👻, 👍, 👌, 🙉, 🙈, 😈, 😐}
((😐+👍)ᐧ(👀)*) ↔ {😐, 👍, 😐👀, 👍👀, 😐👀👀, 👍👀👀, …}
😐ᐧ(👻ᐧ👀)* ↔ {😐, 😐👻👀, 😐👻👀👻👀, 😐👻👀👻👀👻👀, …}
(😐+👍+👌+🙉+🙈+😈+👻)ᐧ(👀) ↔ {😐👀, 👍👀, 👌👀, 🙉👀, 🙈👀, 😈👀, 👻👀}
(😐)ᐧ(👀+👍+👌+🙉+🙈+😈+👻) ↔ {😐👀, 😐👍, 😐👌, 😐🙉, 😐🙈, 😐😈, 😐👻}
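These denotations can be checked with Python's `re` module as a rough analogue (+ becomes |, ᐧ becomes juxtaposition, * is the same); `in_language` is a hypothetical helper, and the regex syntax is an illustration, not the formal definition:

```python
import re

def in_language(pattern, string):
    # Membership check: does the whole string match the expression?
    return re.fullmatch(pattern, string) is not None

# ((😐+👍)ᐧ(👀)*)  ->  (😐|👍)(👀)*
print(in_language("(😐|👍)(👀)*", "👍👀👀"))  # True
# 😐ᐧ(👻ᐧ👀)*  ->  😐(👻👀)*
print(in_language("😐(👻👀)*", "😐👻👀👻👀"))  # True
print(in_language("😐(👻👀)*", "😐👻"))        # False
```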
Example REs
What languages are described by these REs?
(<ᐧ3)*
((nᐧo)ᐧ(o)*)
(wᐧ(oᐧ((o)*ᐧw)))
((c+m)ᐧ(aᐧ(t+p)))
((a+(e+(i+(o+u)))))*
What REs capture these sets of strings?
{ha, haha, hahaha, hahahaha, ...}
{maybe, maaybe, maaaybe, maaaaybe, … }
{tape, cape, ape, vape}
Grammars (Descriptions) ↔ Languages ↔ Machines (FSAs)
Σ (alphabet) = {a, b, c, …, z}
Σ* = the set of all strings over Σ; a formal language is a subset of Σ*
(a)* ↔ {ε, a, aa, aaa, aaaa, aaaaa, aaaaaa, ...}
ha(ha)* ↔ {ha, haha, hahaha, hahahaha, … }
(t+(c+(v+ε)))ᐧ(aᐧ(pᐧe)) ↔ {tape, cape, ape, vape}
… ↔ {AliceHitBo, BoSawAlice, TheBookHitAlice, …} (English? Japanese?)
Exercise
Write a regular grammar that would describe the following formal language, assuming the alphabet is the English alphabet.
{do, doable, undo, undoable, touch, touchable, untouch, untouchable, see, unsee, seeable, unseeable, think, unthink, thinkable, unthinkable, make, unmake, makeable, unmakeable, label, unlabel, labelable, unlabelable, ravel, unravel, ravelable, unravelable}
Exercise
Write a regular grammar that would describe the following formal language, assuming the alphabet is the English alphabet.
{the student picked the pen, a man saw a tree, the cat killed a bird, every bird catches a worm, some humans eat some cheese, few books contain any pictures, many houses have no roof}
Finite State Automata
Finite State Automata
Definition: a finite state automaton consists of:
a finite set of states S
an alphabet Σ
a start state s0 ∈ S
a set of final (accepting) states F ⊆ S
a transition function TR: S × Σ → S
Example:
S = {s0, s1, s2}
Σ = {h, a}
F = {s2}
TR(s0, h) = s1
TR(s1, a) = s2
[FSA diagram: s0 -h-> s1 -a-> s2]
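A minimal Python sketch of such an automaton (`accepts` is a hypothetical helper; transitions are stored as a dict keyed by state and letter):

```python
def accepts(string, start, finals, tr):
    """Run a deterministic FSA: follow TR letter by letter,
    accept iff the run ends in a final state."""
    state = start
    for letter in string:
        if (state, letter) not in tr:
            return False  # no transition defined: reject
        state = tr[(state, letter)]
    return state in finals

# The automaton above: S = {s0, s1, s2}, Σ = {h, a}, F = {s2}
tr = {("s0", "h"): "s1", ("s1", "a"): "s2"}
print(accepts("ha", "s0", {"s2"}, tr))  # True
print(accepts("h", "s0", {"s2"}, tr))   # False
```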
Finite State Automata
{ha, haha, hahaha, ....}
{ε, a, aa, aaa, ....}
[FSA diagram: states S0, S1, S2; transitions labeled h, a, and ε]
ha.(ha)*
[FSA diagram: accepting state S0 with an a self-loop]
(a)*
{tape, cape, vape, ape}
[FSA diagram: states S0–S4; transitions t, c, v, or ε from S0, then a, p, e in sequence]
((t+c+v+ε)ᐧ(aᐧ(pᐧe)))
Finite-State / Regular Morphology
{play, playful, playing}
((p.l.a.y).((f.(u.l))+(i.(n.g))+ε))
(p.l.a.y).((f.u.l)+(i.n.g)+ε)
From Twitter account @happyautomata
{admire, admiration, admires}
(a.d.m.i.r).((e+(e.s))+(a.t.i.o.n))
{everybody, everyday, everyone}
(e.v.e.r.y).((b.o.d.y)+(d.a.y)+(o.n.e))
From Twitter account @happyautomata
{lock, lockable, locker, locking, locked, locks, unlock, unlockable, unlocker, unlocking, unlocks, unlocked, interlock, interlockable, interlocker, interlocking, interlocks, interlocked, deadlock, deadlockable, deadlocker, deadlocking, deadlocks, deadlocked, gridlock, gridlockable, gridlocker, gridlocking, gridlocks, gridlocked}
(ε+un+inter+dead+grid).(lock).(ε+able+er+ing+s+ed)
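The expression above can be checked against the word set in Python's `re` syntax (ε becomes an empty alternative; this is a sketch, not the slide's notation):

```python
import re

# (ε+un+inter+dead+grid).(lock).(ε+able+er+ing+s+ed)
pattern = "(|un|inter|dead|grid)lock(|able|er|ing|s|ed)"

for word in ["lock", "unlockable", "interlocking", "gridlocked", "deadlocks"]:
    assert re.fullmatch(pattern, word)

# The grammar undergenerates forms outside the listed set:
assert re.fullmatch(pattern, "lockdown") is None
```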
{lockdown, locksmith, lockbox, ...}
Structural Ambiguity: “unlockable”?
[FSA diagram: S0 -prefix (ε, un, inter, dead, grid)-> S1 -stem (lock)-> S2 -suffix (ε, able, er, ing, s, ed)-> S3]
Transitional Probabilities
Transitional probabilities can act as cues to language structure
[FSA diagram: S0 -prefix (ε, un, inter, dead, grid; each 1/5)-> S1 -stem (lock; 1)-> S2 -suffix (ε, able, er, ing, s, ed; each 1/6)-> S3]
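Under the uniform probabilities in the diagram (an assumption: each arc out of a state is equally likely), the probability of a path through the automaton is the product of its arc probabilities; a quick sketch:

```python
# Uniform transitional probabilities: 5 prefix arcs, 1 stem arc, 6 suffix arcs.
p_prefix = 1 / 5   # ε, un, inter, dead, grid
p_stem = 1         # lock is the only stem
p_suffix = 1 / 6   # ε, able, er, ing, s, ed

# Probability of generating "unlockable" along its unique path:
p_unlockable = p_prefix * p_stem * p_suffix
print(p_unlockable)  # 1/30 ≈ 0.0333
```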
The un.lock.able door can.not be open.ed
Finite State / Regular Syntax
{CaliforniansAreHappy, CaliforniansAreReallyHappy, CaliforniansAreReallyReallyHappy, …, JoeIsHappy, JoeIsVeryVeryHappy, ...}
(Californians+Joe+You+MrBond+Somebody).
(Is+Are+Was+Were).(Really+Very)*.(Happy)
[FSA diagram: S0 -(Californians, Joe, You, MrBond, Somebody)-> S1 -(Is, Are, Was, Were)-> S2 (Really/Very loop) -Happy-> S4]
Finite State / Regular Syntax
{CaliforniansAreHappy, CaliforniansAreReallyHappy, CaliforniansAreReallyReallyHappy, …, JoeIsHappy, JoeIsVeryVeryHappy, ...}
N.AUX.ADV*.ADJ
[FSA diagram: S0 -N-> S1 -AUX-> S2 (ADV loop) -ADJ-> S4]
N = {Californians, Joe, You, MrBond, Somebody}
AUX = {Are, Is, Was, Were}
ADV = {Really, Very}
ADJ = {Happy}
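Working over part-of-speech tags rather than words keeps the automaton small; a sketch using hypothetical one-letter tag codes (N, X for AUX, D for ADV, J for ADJ):

```python
import re

# N.AUX.ADV*.ADJ over tag sequences, one character per tag
pattern = "NXD*J"
print(bool(re.fullmatch(pattern, "NXJ")))    # CaliforniansAreHappy -> True
print(bool(re.fullmatch(pattern, "NXDDJ")))  # JoeIsVeryVeryHappy -> True
print(bool(re.fullmatch(pattern, "NDJ")))    # missing AUX -> False
```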
Finite State / Regular Syntax
{CaliforniansLoveTheStudents, YouLoveSomeCats, SomebodyLovedAllDogs, JoeLovesTheDogs, YouLovedAllPeople, ... }
(Californians+Joe+You+MrBond+Somebody).(Love+Loves+Loved).
(The+Some+All+Many).(Students+Others+People+Cats+Dogs)
[FSA diagram: S0 -(Californians, Joe, You, MrBond, Somebody)-> S1 -(Love, Loves, Loved)-> S2 -(The, Some, All, Many)-> S3 -(Students, Others, People, Cats, Dogs)-> S4]
Examples with POSs
N = {Alice, Bo, book, car}
V = {hit, saw, stop}
A = {green, big, exciting}
D = {the, a}
What are the denotations of these REs? What FSAs do they correspond to?
(NᐧV)
((DᐧN)ᐧV)
((Dᐧ((A)* ᐧN))ᐧV)
((Dᐧ((A)* ᐧN))ᐧ(VᐧN))
((Dᐧ((A)* ᐧN))ᐧ(Vᐧ(Dᐧ((A)* ᐧN))))
Regular Languages
Defining Regular Languages
Any language that is definable by regular grammars (expressions) or finite state automata is a regular language.
∅ is a regular language
For all the symbols in our alphabet, a ∈ Σ, {a} is a regular language
If two languages are regular, their concatenation, disjunction, and Kleene closure are also regular
L ᐧ L
L + L
L*
Properties of Regular Languages
Closure:
A class of languages ℒ is said to be closed under some operation ⦿ iff:
whenever two languages L1 and L2 are in the class (L1 ∈ ℒ, L2 ∈ ℒ)
L1 ⦿ L2 is also in the class (L1 ⦿ L2 ∈ ℒ)
Regular Languages are closed under:
Concatenation
Union
Kleene Star
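Closure under these operations is constructive: given expressions for L1 and L2, expressions for the results can be built directly, sketched here in Python regex syntax (| for union; the sample expressions are arbitrary illustrations):

```python
import re

p1, p2 = "ab", "c+"            # expressions for two regular languages
union = f"({p1})|({p2})"       # L1 ∪ L2
concatenation = f"({p1})({p2})"  # L1 . L2
star = f"({p1})*"              # L1*

assert re.fullmatch(union, "ab") and re.fullmatch(union, "ccc")
assert re.fullmatch(concatenation, "abcc")
assert re.fullmatch(star, "ababab") and re.fullmatch(star, "")
```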
Is Natural Language Regular?
Generative Capacity of a Grammar
Weak Generative Capacity: the language (set of strings) generated by the grammar
A grammar has weak generative capacity when it generates the strings it should
A grammar undergenerates when it does not generate all the grammatical strings of the language
A grammar overgenerates when it produces ungrammatical strings for the language
Strong Generative Capacity: the structures assigned by the grammar to the language (set of strings)
A grammar has strong generative capacity when it assigns the right structure descriptions
A small regular grammar
Let’s write a regular grammar or a finite state automaton over the alphabet Σ = {They, are, Vgnd, Adj, N} that would capture these generalizations about English:
“They are NP” is a sentence of English.
“They are Vgnd NP” is a sentence of English, Vgnd = running, playing, …
An NP can consist of a N preceded by zero or more Adj.
A Vgnd can also act as an Adj.
[FSA diagram: states S0–S5 with transitions labeled They, are, Vgnd, Adj, N, and ε]
Structural Ambiguity
Consider the sentence “They are flying planes” (Chomsky 1956):
How many meanings does it have? Why?
Does our grammar generate the sentence?
Does it assign appropriate structure analyses given the meanings?
The different paths in the FSA can represent different structures for the sentence.
A small regular grammar
Let’s write a regular grammar or a finite state automaton over the alphabet Σ = {D, N, P} that would capture these generalizations about English:
[FSA diagram: S0 -D-> S1 -N-> S4 -P-> S0]
Syntactic Ambiguity
Now consider these phrases:
a book
a book about the teacher
a book about the teacher with an umbrella
a book about the teacher with an umbrella on the street
…
What are the syntactic ambiguities in these phrases?
What happens if you keep adding prepositional phrases?
Are these structural/syntactic properties captured by our regular grammar?
[FSA diagram: S0 -D-> S1 -N-> S4 -P-> S0]
Strong Generative Capacity of RGs
Regular grammars fail to exhibit strong generative capacity for natural languages.
For example, as we saw, they cannot assign the right structural descriptions to noun phrases modified by prepositional phrases.
And this is not limited to noun phrases: structural ambiguity is pervasive in natural languages, and regular grammars typically fail to capture it.
Weak Generative Capacity of RGs
Regular grammars can successfully describe or generate many fragments of natural languages like nouns with prepositional phrases.
They have “weak generative capacity” for those fragments.
However, they fail even to generate the nested structures that are common in natural languages.
Nested Structures
If if students work hard then they pass, then the course is fair.
If if if students work hard then they pass, then the course is fair, then I’m happy.
The rock that the squirrel likes can be found in the garden.
The rock that the squirrel that the dog chases likes can be found in the garden.
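The problem can be abstracted: each embedded clause opener must be matched by a closer, like aⁿ…bⁿ, and a regular approximation cannot enforce the matching. A sketch, where i stands for an "if", t for a "then", and s for the innermost sentence (all hypothetical codes):

```python
import re

# A regular approximation of nested if/then: any number of openers,
# then the core sentence, then any number of closers.
approx = "i*st*"
print(bool(re.fullmatch(approx, "iistt")))  # balanced: True
print(bool(re.fullmatch(approx, "iist")))   # unbalanced, but still True:
                                            # the regular grammar overgenerates
```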
Weak Generative Capacity of RGs
Therefore, regular grammars are not suitable for describing natural languages.
Next week we are going to look at phrase structure grammars and push-down automata.