1 of 39

Chomsky Normal Form

Robert Keller

April 2016

2 of 39

Prerequisite

This discussion assumes familiarity with context-free (type 2) grammars.

3 of 39

A Slight Issue with Context-Free Grammars

Consider the acceptance problem:��ACFG = �� {<G, x> | G is a context-free grammar and x ∈ L(G)}

Is ACFG recognizable? decidable?�

e.g. is there an algorithm that will determine, from <G> and x, whether or not x ∈ L(G)?

4 of 39

Example

G: x: bab

S → ASA l aB

A → B I S

B → b I 𝛆�

According to prover9, this string is in the language:�It is “recognized” by this derivation sequence:�[[S],[A,S,A],[A,S,B],[A,S],[B,S],[b,S],[b,a,B],[b,a,b]]

But what if a string is not generated? How do we know when to stop trying?

5 of 39

The Problem with Arbitrary CFG’s

Due to productions that can erase, we can’t simply bound the number of steps in generation to tell us when to stop generating:

B → 𝛆

We would like to have only productions that make progress,

i.e. each non-terminal in a derivation will ultimately generate a

non-empty string.

That way, when the length of the sentential form exceeds the length of the input, we know that we need not pursue further steps from that form.

Exception: S → 𝛆, where S is the start-symbol not appearing on any RHS.

6 of 39

Example

Grammar:

S → ASA l aB

A → B I S

B → b I 𝛆�

A “progressive” (no erasing) equivalent grammar

S0 → ASA l aB | a | AS | SA

S → ASA l aB | a | AS | SA

A → ASA l aB | a | AS | SA | b

B → b

7 of 39

Chomsky Normal Form

A Chomsky Normal Form (CNF) for a context-free grammar has a number of important applications, such as:

  • The pumping lemma for context-free languages.
  • The CYK parsing algorithm.

In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)

A → 𝛔 where 𝛔 is a terminal

S0 → 𝛆, but only if S0 is the start symbol

and S0 does not appear on the RHS of any production.

8 of 39

Chomsky Normal Form

In a CNF grammar, if a string of length n > 0 is derivable at all, it is derivable by applying n-1 productions of the first type below, and n productions of the second type.��Since the number of different sequences of length n is finite, we can, in the worst case, try them all. Thus ACFG is decidable.�

In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)

A → 𝛔 where 𝛔 is a terminal

S0 → 𝛆, but only if S0 is the start symbol

and S0 does not appear on the RHS of any production.

(A string of length 0 can only be produced by S0 → 𝛆 in a CNF grammar).

9 of 39

Conversion to Chomsky Normal Form

Any context-free grammar can be converted to Chomsky Normal Form. �

The steps are:

  • Add a new start symbol.
  • Remove 𝛆-productions.
  • Remove unit productions.
  • Final touch-ups.

10 of 39

Eliminating 𝛆 Productions in a CFG

First step:

Add a new start symbol S0 and a production S0 → S to the original start symbol. S0 will never appear on a RHS.�

If 𝛆 is in the language, S0 → 𝛆 will end up being the only �𝛆 production. Otherwise there will be none.�

(If the original start symbol does not appear on any RHS, we can skip this step.)

11 of 39

Eliminating 𝛆 Productions in a CFG

In a context-free (not necessarily regular) grammar, it is useful to be able to eliminate productions of the form A → 𝛆, which we call 𝛆 Productions.

Obviously we cannot eliminate S0 → 𝛆 if present, but we can remove all others.

12 of 39

Eliminating 𝛆 Productions in a CFG

Repeat the following steps for each production of the form A → 𝛆,

one at a time.

Remove A → 𝛆 and for each production B → x where A appears on the RHS, add new productions for each possible removal of A.

The rationale is that we are preserving the strings that would be generated if A → 𝛆 were used.

For example, if B → aAbAc, and A → 𝛆, we would add to the B productions �

B → abAc replacing first A

B → aAbc replacing second A

B → abc replacing both A’s

still retaining the original B productions.

If A occurs n times on the RHS, then 2n-1 productions could be added.

Then we drop A → 𝛆.

13 of 39

Eliminating Unit Productions in a CFG

In a general context-free grammar, unit productions �A → B (both sides non-terminal) can be removed stepwise.

Remove A → B, and for each production of the form �B → x, add a production A → x, unless this introduces a unit production that was already removed.

Also, do not add any productions of the form A → A.

The rationale here is that we are eliminating the middle use of B while preserving the strings generated.

14 of 39

Final Touch-Ups: Add Cascade Productions as Needed

If A→x1x2x3…xn, where n > 2, we remove it and add new non-terminals B2, …, Bn-1 and add new productions� A→x1B2 � B2→x2B3� B3→x3B4 � . . .�� Bn-1→xn-1xn

So that each RHS has length 2.

A ⇒ x1B2⇒ x1x2B3 ⇒ … ⇒ x1x2…xn-1xn preserves the language.

15 of 39

Final Touch-Ups: Substitute Non-Terminals for Terminals on the RHS of length 2

If A→𝛔1𝛔2, where either 𝛔i ∈ Σ, we replace it with Bi ∈ N, where Bi is a new non-terminal, and add a new production�� Bi → 𝛔i�

So that A ⇒ B1B2⇒ 𝛔1B2 ⇒ 𝛔1𝛔2 thus preserving the language.

16 of 39

Sipser Example 2.10

Remove 𝛆 Productions

Original Grammar:

S → ASA l aB

A → B I S

B → b I 𝛆

Remove B𝛆: (new underlined)

S0 → S

S → ASA l aB | a

A → B I S I 𝛆

B → b

Add new start symbol S0:

S0 → S

S → ASA l aB

A → B I S

B → b I 𝛆

Remove A → 𝛆:

S0 → S

S → ASA l aB | a | AS | SA

A → B I S

B → b

17 of 39

Sipser Example 2.10

Remove Unit Productions

Grammar with no 𝛆 Productions

S0 → S

S → ASA l aB | a | AS | SA

AB I S

B → b

Remove AS: (new underlined)

S0 → S

S → ASA l aB | a | AS | SA

AASA l aB | a | AS | SA | b

B → b

Remove AB:

S0 → S

S → ASA l aB | a | AS | SA

A → S | b

B → b

Remove S0S:

S0ASA l aB | a | AS | SA

S → ASA l aB | a | AS | SA

A → ASA l aB | a | AS | SA | b

B → b

No unit productions remain.

18 of 39

Example Adding “Cascade” Productions

In the example, we handle all of these at once:

S0 → ASA, S → ASA, and A → ASA

Add a new non-terminal C and replace the productions with:

S0 → AC, S → AC, and A → AC

C → SA�

The example grammar is now transformed to:

S0 → AC l aB | a | AS | SA

S → AC l aB | a | AS | SA

A → AC l aB | a | AS | SA | b

C → SA

B → b

19 of 39

Final Step in Chomsky Normal Form

For any RHS that has two symbols, if either is a terminal, replace it with a new non-terminal and add a production from the non-terminal to the terminal.��Below we have to do this with the RHS aB. Add a new non-terminal D and a production D → a

S0 → AC l aB | a | AS | SA

S → AC l aB | a | AS | SA

A → AC l aB | a | AS | SA | b

C → SA

B → b

Final Grammar in CNF:

S0 → AC l DB | a | AS | SA

S → AC l DB | a | AS | SA

A → AC l DB | a | AS | SA | b

C → SA

B → b

D → a

20 of 39

What’s so great about CNF?

CNF has properties, although not uniquely so, that make certain decision problems straightforward to answer.

Note that in CNF, we can bound the length of a derivation sequence that would derive a given string.

Each production produces either a terminal symbol or two non-terminals.

So for a string of length n, we need n productions to produce the terminals, and n-1 productions to produce the non-terminals that produce them, for a total of 2n-1 productions.

Example: For the previous grammar, is baa in the language? How about b?�baa, by the derivation sequence S0 ⇒ AC ⇒ bC ⇒ bSA ⇒ baA ⇒ baa

Prover9 baa: $F # answer([[S0],[A,S],[A,a],[A,S,a],[A,a,a],[b,a,a]])

21 of 39

Example: ECFG is decidable

ECFG = {<G> | G is a CFG and L(G) = ∅} is decidable.

Proof: Given <G>, convert G into Chomsky Normal Form G′ with non-terminal set N and start symbol S. Determine the set T of non-terminals of G′ that generate terminal strings working backward as follows:�� T := ∅

V := {A ∈ N | ∃𝛔∈𝚺 A → 𝛔}

while V ≠ ∅

T := T ⋃ V

V := {A ∈ N - T | ∃B∈T ∃C∈T A → BC}�

L(G) = ∅ iff S ∉ T

22 of 39

Example 1 of the Emptiness Algorithm

Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:

S → DB, A → CB, A → a, B → SA, C → b, D → AC.

We trace the values of T and V starting with initialization and going through the loop:

T := ∅ V := {A, C}

T := {A, C} V := {D}

T := {A, C, D} V := ∅

As S ∉ T, we conclude that this grammar generates the empty language.

23 of 39

Example 2 of the Emptiness Algorithm

Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:

S → DB, A → CB, A → a, B → DA, C → b, D → AC.

We trace the values of T and V starting with initialization and going through the loop:

T := ∅ V := {A, C}

T := {A, C} V := {D}

T := {A, C, D} V := {B}

T := {A, B, C, D} V := {S}

T := {A, B, C, D, S} V := ∅

As S ∈ T, we conclude that this grammar generates a non-empty language.

24 of 39

Application of CNF:�Pumping Lemma for Context-Free Languages

Let L be a context-free language. Then there is a number p = pump(L) such that ��if s ∈ L and |s| ≥ p then � there are strings u, v, x, y, z, such that�

    • s = uvxyz
    • |vy| > 0 (at least one of v or y is non-empty)
    • |vxy| < p
    • (∀m > 0) u vm x ym z ∈ L

25 of 39

Pumping Lemma as a Predicate Logic Formula

∀L (context-free(L) →� �(∃p ∀s (� (s ∈ L ∧ |s| ≥ p) →� � (∃u ∃v ∃x ∃y ∃z (� s = uvxyz� ∧ |vy| > 0� ∧ |vxy| < p� ∧ ∀m (m > 0 → u vm x ym z ∈ L)))))) �

26 of 39

Proof of the Pumping Lemma

  • The proof is analogous to the proof of the Pumping Lemma for regular languages.�
  • However, we don’t have states to help us. �So we use the auxiliary symbols instead.

27 of 39

Proof of the Pumping Lemma

  • Let G be a Chomsky Normal Form grammar for L.
  • If a string in L is sufficiently long, then some non-terminal will be repeated as a descendant of itself in the derivation tree.
  • Let’s say that C is a non-terminal closest to the terminal string in the derivation tree that has this property. An example is shown on the next page.
  • aaabbb is the string derived in the tree, which can be broken down as S ⇒* aCb combined with C ⇒* aCb gives aaCbb, which then rewrites using C ⇒ ab as aaabbb.
  • By virtue of C ⇒* aCb, we also have C ⇒* akCbk for every .
  • Therefore S ⇒* akbk for every k > 0.�

28 of 39

CNF Derivation Tree Deriving aaabbb

S

A

B

C

D

D

E

B

C

a

a

b

b

D

B

a

b

29 of 39

Showing S ⇒* aCb

S

A

B

C

D

D

E

B

C

a

a

b

b

D

B

a

b

30 of 39

Showing C ⇒* aCb

S

A

B

C

D

D

E

B

C

a

a

b

b

D

B

a

b

31 of 39

Showing C ⇒ ab

S

A

B

C

D

D

E

B

C

a

a

b

b

D

B

a

b

32 of 39

How to Remember the Pumping Lemma

Think of this “nested wedges” picture (used in the proof):

S

C2

C1

u

v

x

y

z

∈ L |vxy| ≤ p

|uvxyz| ≥p

Start symbol

Two occurrences of�the same non-terminal C.

33 of 39

Bounds in the Pumping Lemma

Suppose there are n non-terminals.

In order for there to be a repeated non-terminal that is a descendant of itself, the tree height of the sub-tree for vxy does not have to be higher than n+1 in the worst case. But this translates into a string of length 2n+1-1.

So p = 2n+1-1 is an adequate value for the length of a string that can be pumped.

34 of 39

Impact of the Pumping Lemma (PL)

  • The pumping lemma can be used to show a language is not context free, since it provides a necessary condition �(L is CF → L pumpable), thus �(L not pumpable L → not CF).�
  • The PL cannot be used to show that a language is context-free.

35 of 39

Proof that {ak bk ck | k > 0} is not context-free using the pumping lemma

  • Suppose {ak bk ck | k > 0} were context-free.

  • Let p be the integer that exists according to the pumping lemma. Consider s = ap bp cp and decompose into uvxyz.�
  • At least one of v and y is not ε. Suppose v ≠ ε. The other case is symmetric. By the lemma, uv2xy2z must be in L.

  • Analyzing the cases for v as to whether it consists of all of one letter or of two different letters, in all cases we get a contradiction.

36 of 39

Detailed case analysis

  • ap bp cp = uvxyz ∈ L, �where |vxy| < p and |vy| > 0.�
  • Due to |vxy| < p, either � vxy ∈ {a}*{b}* or vxy ∈ {b}*{c}*.

  • If vxy ∈ {a}*{b}*, then uv2xy2z has the wrong number of c’s to be in L.�
  • If vxy ∈ {b}*{c}*, then uv2xy2z has the wrong number of a’s to be in L.
  • So in either case we have a contradiction.

37 of 39

Proof that {ww | w ∈ {0, 1}*} is not context-free using the pumping lemma

Suppose L = {ww | w ∈ {0, 1}*} is context-free. We will derive a contradiction using the pumping lemma.

Let p be the integer that exists according to the pumping lemma. How do we choose an s ∈ L to get a contradiction?

One possible choice is s = 0p10p1 ∈ L. However, as Sipser points out, this choice does not work because it can be “pumped”: There is a way that uvxyz = 0p10p1 such that

∀k uvkxykz ∈ L with |vxy| ≤ p and |vy| > 0,

namely v = 0, x = 1, y = 0.

So we need try a different choice.

38 of 39

Proof that {ww | w ∈ {0, 1}*} is not context-free second choice

As s = 0p10p1 doesn’t work, we can try 0p1p0p1p∈ L.

Suppose 0p1p0p1p = uvxyz with |vxy| ≤ p and |vy| > 0.

Then we can break into 3 cases:

  • vxy is a substring of the left half 0p1p.
  • vxy is a substring of the right half 0p1p.
  • vxy straddles the two halves of 0p1p0p1p.

Case 1 gives a contradiction: uv2xy2z ∉ L because the second half of uv2xy2z would begin with 1 (as a result of 1’s in the left half being “pushed over”), and the 1 cannot match the 0 that begins the left half.

Case 2 gives a contradiction for the symmetric reason.

39 of 39

Proof that {ww | w ∈ {0, 1}*} is not context-free concluded

Therefore case 3 is the only remaining possibility:

vxy straddles the two halves of 0p1p0p1p.

In this case we select k = 0 (“pumping down”) to get a contradiction, i.e. show

uv0xy0z ∉ L.

Because vxy straddles, we know vxy = 1r0s for some r, s, and either r < p or s < p (otherwise no straddle).

These cases are symmetric, so focus on r < p. Then we have fewer than p 1’s in an initial prefix of uv0xy0z, but p 1’s in the final suffix, so this string is 0p1p-r0p-s1p where r < p, and is thus not of the form ww.