Chomsky Normal Form
Robert Keller
April 2016
Prerequisite
This discussion assumes familiarity with context-free (type 2) grammars.
A Slight Issue with Context-Free Grammars
Consider the acceptance problem:��ACFG = �� {<G, x> | G is a context-free grammar and x ∈ L(G)}
Is ACFG recognizable? decidable?�
e.g. is there an algorithm that will determine, from <G> and x, whether or not x ∈ L(G)?
Example
G: x: bab
S → ASA l aB
A → B I S
B → b I 𝛆�
According to prover9, this string is in the language:�It is “recognized” by this derivation sequence:�[[S],[A,S,A],[A,S,B],[A,S],[B,S],[b,S],[b,a,B],[b,a,b]]
But what if a string is not generated? How do we know when to stop trying?
The Problem with Arbitrary CFG’s
Due to productions that can erase, we can’t simply bound the number of steps in generation to tell us when to stop generating:
B → 𝛆
We would like to have only productions that make progress,
i.e. each non-terminal in a derivation will ultimately generate a
non-empty string.
That way, when the length of the sentential form exceeds the length of the input, we know that we need not pursue further steps from that form.
Exception: S → 𝛆, where S is the start-symbol not appearing on any RHS.
Example
Grammar:
S → ASA l aB
A → B I S
B → b I 𝛆�
A “progressive” (no erasing) equivalent grammar
S0 → ASA l aB | a | AS | SA
S → ASA l aB | a | AS | SA
A → ASA l aB | a | AS | SA | b
B → b
Chomsky Normal Form
A Chomsky Normal Form (CNF) for a context-free grammar has a number of important applications, such as:
In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)
A → 𝛔 where 𝛔 is a terminal
S0 → 𝛆, but only if S0 is the start symbol
and S0 does not appear on the RHS of any production.
Chomsky Normal Form
In a CNF grammar, if a string of length n > 0 is derivable at all, it is derivable by applying n-1 productions of the first type below, and n productions of the second type.��Since the number of different sequences of length n is finite, we can, in the worst case, try them all. Thus ACFG is decidable.�
In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)
A → 𝛔 where 𝛔 is a terminal
S0 → 𝛆, but only if S0 is the start symbol
and S0 does not appear on the RHS of any production.
(A string of length 0 can only be produced by S0 → 𝛆 in a CNF grammar).
Conversion to Chomsky Normal Form
Any context-free grammar can be converted to Chomsky Normal Form. �
The steps are:
Eliminating 𝛆 Productions in a CFG
First step:
Add a new start symbol S0 and a production S0 → S to the original start symbol. S0 will never appear on a RHS.�
If 𝛆 is in the language, S0 → 𝛆 will end up being the only �𝛆 production. Otherwise there will be none.�
(If the original start symbol does not appear on any RHS, we can skip this step.)
Eliminating 𝛆 Productions in a CFG
In a context-free (not necessarily regular) grammar, it is useful to be able to eliminate productions of the form A → 𝛆, which we call 𝛆 Productions.
Obviously we cannot eliminate S0 → 𝛆 if present, but we can remove all others.
Eliminating 𝛆 Productions in a CFG
Repeat the following steps for each production of the form A → 𝛆,
one at a time.
Remove A → 𝛆 and for each production B → x where A appears on the RHS, add new productions for each possible removal of A.
The rationale is that we are preserving the strings that would be generated if A → 𝛆 were used.
For example, if B → aAbAc, and A → 𝛆, we would add to the B productions �
B → abAc replacing first A
B → aAbc replacing second A
B → abc replacing both A’s
still retaining the original B productions.
If A occurs n times on the RHS, then 2n-1 productions could be added.
Then we drop A → 𝛆.
Eliminating Unit Productions in a CFG
In a general context-free grammar, unit productions �A → B (both sides non-terminal) can be removed stepwise.
Remove A → B, and for each production of the form �B → x, add a production A → x, unless this introduces a unit production that was already removed.
Also, do not add any productions of the form A → A.
The rationale here is that we are eliminating the middle use of B while preserving the strings generated.
Final Touch-Ups: Add Cascade Productions as Needed
If A→x1x2x3…xn, where n > 2, we remove it and add new non-terminals B2, …, Bn-1 and add new productions� A→x1B2 � B2→x2B3� B3→x3B4 � . . .�� Bn-1→xn-1xn
So that each RHS has length 2.
A ⇒ x1B2⇒ x1x2B3 ⇒ … ⇒ x1x2…xn-1xn preserves the language.
Final Touch-Ups: Substitute Non-Terminals for Terminals on the RHS of length 2
If A→𝛔1𝛔2, where either 𝛔i ∈ Σ, we replace it with Bi ∈ N, where Bi is a new non-terminal, and add a new production�� Bi → 𝛔i�
So that A ⇒ B1B2⇒ 𝛔1B2 ⇒ 𝛔1𝛔2 thus preserving the language.
Sipser Example 2.10
Remove 𝛆 Productions
Original Grammar: S → ASA l aB A → B I S B → b I 𝛆 | Remove B → 𝛆: (new underlined) S0 → S S → ASA l aB | a A → B I S I 𝛆 B → b |
Add new start symbol S0: S0 → S S → ASA l aB A → B I S B → b I 𝛆 | Remove A → 𝛆: S0 → S S → ASA l aB | a | AS | SA A → B I S B → b |
Sipser Example 2.10
Remove Unit Productions
Grammar with no 𝛆 Productions S0 → S S → ASA l aB | a | AS | SA A → B I S B → b | Remove A → S: (new underlined) S0 → S S → ASA l aB | a | AS | SA A → ASA l aB | a | AS | SA | b B → b |
Remove A → B: S0 → S S → ASA l aB | a | AS | SA A → S | b B → b | Remove S0 → S: S0 → ASA l aB | a | AS | SA S → ASA l aB | a | AS | SA A → ASA l aB | a | AS | SA | b B → b No unit productions remain. |
Example Adding “Cascade” Productions
In the example, we handle all of these at once:
S0 → ASA, S → ASA, and A → ASA
Add a new non-terminal C and replace the productions with:
S0 → AC, S → AC, and A → AC
C → SA�
The example grammar is now transformed to:
S0 → AC l aB | a | AS | SA
S → AC l aB | a | AS | SA
A → AC l aB | a | AS | SA | b
C → SA
B → b
Final Step in Chomsky Normal Form
For any RHS that has two symbols, if either is a terminal, replace it with a new non-terminal and add a production from the non-terminal to the terminal.��Below we have to do this with the RHS aB. Add a new non-terminal D and a production D → a
S0 → AC l aB | a | AS | SA S → AC l aB | a | AS | SA A → AC l aB | a | AS | SA | b C → SA B → b | Final Grammar in CNF: S0 → AC l DB | a | AS | SA S → AC l DB | a | AS | SA A → AC l DB | a | AS | SA | b C → SA B → b D → a |
What’s so great about CNF?
CNF has properties, although not uniquely so, that make certain decision problems straightforward to answer.
Note that in CNF, we can bound the length of a derivation sequence that would derive a given string.
Each production produces either a terminal symbol or two non-terminals.
So for a string of length n, we need n productions to produce the terminals, and n-1 productions to produce the non-terminals that produce them, for a total of 2n-1 productions.
Example: For the previous grammar, is baa in the language? How about b?�baa, by the derivation sequence S0 ⇒ AC ⇒ bC ⇒ bSA ⇒ baA ⇒ baa
Prover9 baa: $F # answer([[S0],[A,S],[A,a],[A,S,a],[A,a,a],[b,a,a]])
Example: ECFG is decidable
ECFG = {<G> | G is a CFG and L(G) = ∅} is decidable.
Proof: Given <G>, convert G into Chomsky Normal Form G′ with non-terminal set N and start symbol S. Determine the set T of non-terminals of G′ that generate terminal strings working backward as follows:�� T := ∅
V := {A ∈ N | ∃𝛔∈𝚺 A → 𝛔}
while V ≠ ∅
T := T ⋃ V
V := {A ∈ N - T | ∃B∈T ∃C∈T A → BC}�
L(G) = ∅ iff S ∉ T
Example 1 of the Emptiness Algorithm
Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:
S → DB, A → CB, A → a, B → SA, C → b, D → AC.
We trace the values of T and V starting with initialization and going through the loop:
T := ∅ V := {A, C}
T := {A, C} V := {D}
T := {A, C, D} V := ∅
As S ∉ T, we conclude that this grammar generates the empty language.
Example 2 of the Emptiness Algorithm
Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:
S → DB, A → CB, A → a, B → DA, C → b, D → AC.
We trace the values of T and V starting with initialization and going through the loop:
T := ∅ V := {A, C}
T := {A, C} V := {D}
T := {A, C, D} V := {B}
T := {A, B, C, D} V := {S}
T := {A, B, C, D, S} V := ∅
As S ∈ T, we conclude that this grammar generates a non-empty language.
Application of CNF:�Pumping Lemma for Context-Free Languages
Let L be a context-free language. Then there is a number p = pump(L) such that ��if s ∈ L and |s| ≥ p then � there are strings u, v, x, y, z, such that�
Pumping Lemma as a Predicate Logic Formula
∀L (context-free(L) →� �(∃p ∀s (� (s ∈ L ∧ |s| ≥ p) →� � (∃u ∃v ∃x ∃y ∃z (� s = uvxyz� ∧ |vy| > 0� ∧ |vxy| < p� ∧ ∀m (m > 0 → u vm x ym z ∈ L)))))) �
Proof of the Pumping Lemma
Proof of the Pumping Lemma
CNF Derivation Tree Deriving aaabbb
S
A
B
C
D
D
E
B
C
a
a
b
b
D
B
a
b
Showing S ⇒* aCb
S
A
B
C
D
D
E
B
C
a
a
b
b
D
B
a
b
Showing C ⇒* aCb
S
A
B
C
D
D
E
B
C
a
a
b
b
D
B
a
b
Showing C ⇒ ab
S
A
B
C
D
D
E
B
C
a
a
b
b
D
B
a
b
How to Remember the Pumping Lemma
Think of this “nested wedges” picture (used in the proof):
S
C2
C1
u
v
x
y
z
∈ L |vxy| ≤ p
|uvxyz| ≥p
Start symbol
Two occurrences of�the same non-terminal C.
Bounds in the Pumping Lemma
Suppose there are n non-terminals.
In order for there to be a repeated non-terminal that is a descendant of itself, the tree height of the sub-tree for vxy does not have to be higher than n+1 in the worst case. But this translates into a string of length 2n+1-1.
So p = 2n+1-1 is an adequate value for the length of a string that can be pumped.
Impact of the Pumping Lemma (PL)
Proof that {ak bk ck | k > 0} is not context-free using the pumping lemma
Detailed case analysis
Proof that {ww | w ∈ {0, 1}*} is not context-free using the pumping lemma
Suppose L = {ww | w ∈ {0, 1}*} is context-free. We will derive a contradiction using the pumping lemma.
Let p be the integer that exists according to the pumping lemma. How do we choose an s ∈ L to get a contradiction?
One possible choice is s = 0p10p1 ∈ L. However, as Sipser points out, this choice does not work because it can be “pumped”: There is a way that uvxyz = 0p10p1 such that
∀k uvkxykz ∈ L with |vxy| ≤ p and |vy| > 0,
namely v = 0, x = 1, y = 0.
So we need try a different choice.
Proof that {ww | w ∈ {0, 1}*} is not context-free second choice
As s = 0p10p1 doesn’t work, we can try 0p1p0p1p∈ L.
Suppose 0p1p0p1p = uvxyz with |vxy| ≤ p and |vy| > 0.
Then we can break into 3 cases:
Case 1 gives a contradiction: uv2xy2z ∉ L because the second half of uv2xy2z would begin with 1 (as a result of 1’s in the left half being “pushed over”), and the 1 cannot match the 0 that begins the left half.
Case 2 gives a contradiction for the symmetric reason.
Proof that {ww | w ∈ {0, 1}*} is not context-free concluded
Therefore case 3 is the only remaining possibility:
vxy straddles the two halves of 0p1p0p1p.
In this case we select k = 0 (“pumping down”) to get a contradiction, i.e. show
uv0xy0z ∉ L.
Because vxy straddles, we know vxy = 1r0s for some r, s, and either r < p or s < p (otherwise no straddle).
These cases are symmetric, so focus on r < p. Then we have fewer than p 1’s in an initial prefix of uv0xy0z, but p 1’s in the final suffix, so this string is 0p1p-r0p-s1p where r < p, and is thus not of the form ww.