3 of 39

A Slight Issue with Context-Free Grammars

Consider the acceptance problem:��A_CFG = �� {<G, x> | G is a context-free grammar and x ∈ L(G)}

Is A_CFG recognizable? decidable?�

e.g. is there an algorithm that will determine, from <G> and x, whether or not x ∈ L(G)?

4 of 39

Example

G: x: bab

S → ASA l aB

A → B I S

B → b I 𝛆�

According to prover9, this string is in the language:�It is “recognized” by this derivation sequence:�[[S],[A,S,A],[A,S,B],[A,S],[B,S],[b,S],[b,a,B],[b,a,b]]

But what if a string is not generated? How do we know when to stop trying?

5 of 39

The Problem with Arbitrary CFG’s

Due to productions that can erase, we can’t simply bound the number of steps in generation to tell us when to stop generating:

B → 𝛆

We would like to have only productions that make progress,

i.e. each non-terminal in a derivation will ultimately generate a

non-empty string.

That way, when the length of the sentential form exceeds the length of the input, we know that we need not pursue further steps from that form.

Exception: S → 𝛆, where S is the start-symbol not appearing on any RHS.

6 of 39

Example

Grammar:

S → ASA l aB

A → B I S

B → b I 𝛆�

A “progressive” (no erasing) equivalent grammar

S₀ → ASA l aB | a | AS | SA

S → ASA l aB | a | AS | SA

A → ASA l aB | a | AS | SA | b

B → b

7 of 39

Chomsky Normal Form

A Chomsky Normal Form (CNF) for a context-free grammar has a number of important applications, such as:

The pumping lemma for context-free languages.
The CYK parsing algorithm.

In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)

A → 𝛔 where 𝛔 is a terminal

S₀ → 𝛆, but only if S₀ is the start symbol

and S₀ does not appear on the RHS of any production.

8 of 39

Chomsky Normal Form

In a CNF grammar, if a string of length n > 0 is derivable at all, it is derivable by applying n-1 productions of the first type below, and n productions of the second type.��Since the number of different sequences of length n is finite, we can, in the worst case, try them all. Thus A_CFG is decidable.�

In a CNF grammar, each production is of one of three forms:� A → BC where B and C are non-terminals � (not necessarily distinct)

A → 𝛔 where 𝛔 is a terminal

S₀ → 𝛆, but only if S₀ is the start symbol

and S₀ does not appear on the RHS of any production.

(A string of length 0 can only be produced by S₀ → 𝛆 in a CNF grammar).

9 of 39

Conversion to Chomsky Normal Form

Any context-free grammar can be converted to Chomsky Normal Form. �

The steps are:

Add a new start symbol.
Remove 𝛆-productions.
Remove unit productions.
Final touch-ups.

10 of 39

Eliminating 𝛆 Productions in a CFG

First step:

Add a new start symbol S₀ and a production S₀→ S to the original start symbol. S₀will never appear on a RHS.�

If 𝛆 is in the language, S₀→ 𝛆 will end up being the only �𝛆 production. Otherwise there will be none.�

(If the original start symbol does not appear on any RHS, we can skip this step.)

11 of 39

Eliminating 𝛆 Productions in a CFG

In a context-free (not necessarily regular) grammar, it is useful to be able to eliminate productions of the form A → 𝛆, which we call 𝛆 Productions.

Obviously we cannot eliminate S₀→ 𝛆 if present, but we can remove all others.

12 of 39

Eliminating 𝛆 Productions in a CFG

Repeat the following steps for each production of the form A → 𝛆,

one at a time.

Remove A → 𝛆 and for each production B → x where A appears on the RHS, add new productions for each possible removal of A.

The rationale is that we are preserving the strings that would be generated if A → 𝛆 were used.

For example, if B → aAbAc, and A → 𝛆, we would add to the B productions �

B → abAc replacing first A

B → aAbc replacing second A

B → abc replacing both A’s

still retaining the original B productions.

If A occurs n times on the RHS, then 2ⁿ-1 productions could be added.

Then we drop A → 𝛆.

13 of 39

Eliminating Unit Productions in a CFG

In a general context-free grammar, unit productions �A → B (both sides non-terminal) can be removed stepwise.

Remove A → B, and for each production of the form �B → x, add a production A → x, unless this introduces a unit production that was already removed.

Also, do not add any productions of the form A → A.

The rationale here is that we are eliminating the middle use of B while preserving the strings generated.

14 of 39

Final Touch-Ups: Add Cascade Productions as Needed

If A→x₁x₂x₃…x_n, where n > 2, we remove it and add new non-terminals B₂, …, B_n-1and add new productions� A→x₁B_2 �B₂→x₂B_3�B₃→x₃B_{4 � . . .��}B_n-1→x_n-1x_n

So that each RHS has length 2.

A⇒ x₁B₂⇒ x₁x₂B₃ ⇒ … ⇒ x₁x₂…x_n-1x_n preserves the language.

15 of 39

Final Touch-Ups: Substitute Non-Terminals for Terminals on the RHS of length 2

If A→𝛔₁𝛔₂, where either 𝛔_i∈ Σ, we replace it with B_i∈ N, where B_iis a new non-terminal, and add a new production�� B_i→ 𝛔_i�

So that A⇒ B₁B₂⇒ 𝛔₁B₂ ⇒ 𝛔₁𝛔₂ thus preserving the language.

16 of 39

Sipser Example 2.10

Remove 𝛆 Productions

Original Grammar: S → ASA l aB A → B I S B → b I 𝛆	Remove B → 𝛆: (new underlined) S₀ → S S → ASA l aB \| a A → B I S I 𝛆 B → b
Add new start symbol S₀: S₀ → S S → ASA l aB A → B I S B → b I 𝛆	Remove A → 𝛆: S₀ → S S → ASA l aB \| a \| AS \| SA A → B I S B → b

17 of 39

Sipser Example 2.10

Remove Unit Productions

Grammar with no 𝛆 Productions S₀ → S S → ASA l aB \| a \| AS \| SA A → B I S B → b	Remove A → S: (new underlined) S₀ → S S → ASA l aB \| a \| AS \| SA A → ASA l aB \| a \| AS \| SA \| b B → b
Remove A → B: S₀ → S S → ASA l aB \| a \| AS \| SA A → S \| b B → b	Remove S₀ → S: S₀ → ASA l aB \| a \| AS \| SA S → ASA l aB \| a \| AS \| SA A → ASA l aB \| a \| AS \| SA \| b B → b No unit productions remain.

18 of 39

Example Adding “Cascade” Productions

In the example, we handle all of these at once:

S₀ → ASA, S → ASA, and A → ASA

Add a new non-terminal C and replace the productions with:

S₀ → AC, S → AC, and A → AC

C → SA�

The example grammar is now transformed to:

S₀ → AC l aB | a | AS | SA

S → AC l aB | a | AS | SA

A → AC l aB | a | AS | SA | b

C → SA

B → b

19 of 39

Final Step in Chomsky Normal Form

For any RHS that has two symbols, if either is a terminal, replace it with a new non-terminal and add a production from the non-terminal to the terminal.��Below we have to do this with the RHS aB. Add a new non-terminal D and a production D → a

S₀ → AC l aB \| a \| AS \| SA S → AC l aB \| a \| AS \| SA A → AC l aB \| a \| AS \| SA \| b C → SA B → b	Final Grammar in CNF: S₀ → AC l DB \| a \| AS \| SA S → AC l DB \| a \| AS \| SA A → AC l DB \| a \| AS \| SA \| b C → SA B → b D → a

20 of 39

What’s so great about CNF?

CNF has properties, although not uniquely so, that make certain decision problems straightforward to answer.

Note that in CNF, we can bound the length of a derivation sequence that would derive a given string.

Each production produces either a terminal symbol or two non-terminals.

So for a string of length n, we need n productions to produce the terminals, and n-1 productions to produce the non-terminals that produce them, for a total of 2n-1 productions.

Example: For the previous grammar, is baa in the language? How about b?�baa, by the derivation sequence S₀ ⇒ AC ⇒ bC ⇒ bSA ⇒ baA ⇒ baa

Prover9 baa: $F # answer([[S0],[A,S],[A,a],[A,S,a],[A,a,a],[b,a,a]])

21 of 39

Example: E_CFG is decidable

E_CFG = {<G> | G is a CFG and L(G) = ∅} is decidable.

Proof: Given <G>, convert G into Chomsky Normal Form G′ with non-terminal set N and start symbol S. Determine the set T of non-terminals of G′ that generate terminal strings working backward as follows:�� T := ∅

V := {A ∈ N | ∃𝛔∈𝚺 A → 𝛔}

while V ≠ ∅

T := T ⋃ V

V := {A ∈ N - T | ∃B∈T ∃C∈T A → BC}�

L(G) = ∅ iff S ∉ T

22 of 39

Example 1 of the Emptiness Algorithm

Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:

S → DB, A → CB, A → a, B → SA, C → b, D → AC.

We trace the values of T and V starting with initialization and going through the loop:

T := ∅ V := {A, C}

T := {A, C} V := {D}

T := {A, C, D} V := ∅

As S ∉ T, we conclude that this grammar generates the empty language.

23 of 39

Example 2 of the Emptiness Algorithm

Consider this CNF grammar with 𝚺 = {a, b}, start symbol S, and productions:

S → DB, A → CB, A → a, B → DA, C → b, D → AC.

We trace the values of T and V starting with initialization and going through the loop:

T := ∅ V := {A, C}

T := {A, C} V := {D}

T := {A, C, D} V := {B}

T := {A, B, C, D} V := {S}

T := {A, B, C, D, S} V := ∅

As S ∈ T, we conclude that this grammar generates a non-empty language.

24 of 39

Application of CNF:�Pumping Lemma for Context-Free Languages

Let L be a context-free language. Then there is a number p = pump(L) such that ��if s ∈ L and |s| ≥ p then � there are strings u, v, x, y, z, such that�

s = uvxyz
|vy| > 0 (at least one of v or y is non-empty)
|vxy| < p
(∀m > 0) u v^mx y^mz ∈ L

25 of 39

Pumping Lemma as a Predicate Logic Formula

∀L (context-free(L) →� �(∃p ∀s (� (s ∈ L ∧ |s| ≥ p) →� � (∃u ∃v ∃x ∃y ∃z (� s = uvxyz� ∧ |vy| > 0� ∧ |vxy| < p� ∧ ∀m (m > 0 → u v^mx y^mz ∈ L)))))) �

26 of 39

Proof of the Pumping Lemma

The proof is analogous to the proof of the Pumping Lemma for regular languages.�
However, we don’t have states to help us. �So we use the auxiliary symbols instead.

27 of 39

Proof of the Pumping Lemma

Let G be a Chomsky Normal Form grammar for L.
If a string in L is sufficiently long, then some non-terminal will be repeated as a descendant of itself in the derivation tree.
Let’s say that C is a non-terminal closest to the terminal string in the derivation tree that has this property. An example is shown on the next page.
aaabbb is the string derived in the tree, which can be broken down as S ⇒* aCb combined with C ⇒* aCb gives aaCbb, which then rewrites using C ⇒ ab as aaabbb.
By virtue of C ⇒* aCb, we also have C ⇒* a^kCb^k for every .
Therefore S ⇒* a^kb^k for every k > 0.�

28 of 39

CNF Derivation Tree Deriving aaabbb

29 of 39

Showing S ⇒* aCb

30 of 39

Showing C ⇒* aCb

31 of 39

Showing C ⇒ ab

32 of 39

How to Remember the Pumping Lemma

Think of this “nested wedges” picture (used in the proof):

C₂

C₁

∈ L |vxy| ≤ p

|uvxyz| ≥p

Start symbol

Two occurrences of�the same non-terminal C.

33 of 39

Bounds in the Pumping Lemma

Suppose there are n non-terminals.

In order for there to be a repeated non-terminal that is a descendant of itself, the tree height of the sub-tree for vxy does not have to be higher than n+1 in the worst case. But this translates into a string of length 2ⁿ⁺¹-1.

So p = 2ⁿ⁺¹-1 is an adequate value for the length of a string that can be pumped.

34 of 39

Impact of the Pumping Lemma (PL)

The pumping lemma can be used to show a language is not context free, since it provides a necessary condition �(L is CF → L pumpable), thus �(L not pumpable L → not CF).�
The PL cannot be used to show that a language is context-free.

35 of 39

Proof that {a^kb^kc^k| k > 0} is not context-free using the pumping lemma

Suppose {a^kb^kc^k| k > 0} were context-free.

Let p be the integer that exists according to the pumping lemma. Consider s = a^pb^pc^p and decompose into uvxyz.�
At least one of v and y is not ε. Suppose v ≠ ε. The other case is symmetric. By the lemma, uv²xy²z must be in L.

Analyzing the cases for v as to whether it consists of all of one letter or of two different letters, in all cases we get a contradiction.

36 of 39

Detailed case analysis

a^pb^pc^p= uvxyz ∈ L, �where |vxy| < p and |vy| > 0.�
Due to |vxy| < p, either � vxy ∈ {a}*{b}* or vxy ∈ {b}*{c}*.

If vxy ∈ {a}*{b}*, then uv²xy²z has the wrong number of c’s to be in L.�
If vxy ∈ {b}*{c}*, then uv²xy²z has the wrong number of a’s to be in L.
So in either case we have a contradiction.

37 of 39

Proof that {ww| w ∈ {0, 1}*} is not context-free using the pumping lemma

Suppose L = {ww| w ∈ {0, 1}*} is context-free. We will derive a contradiction using the pumping lemma.

Let p be the integer that exists according to the pumping lemma. How do we choose an s ∈ L to get a contradiction?

One possible choice is s = 0^p10^p1 ∈ L. However, as Sipser points out, this choice does not work because it can be “pumped”: There is a way that uvxyz = 0^p10^p1 such that

∀k uv^kxy^kz ∈ L with |vxy| ≤ p and |vy| > 0,

namely v = 0, x = 1, y = 0.

So we need try a different choice.

38 of 39

Proof that {ww| w ∈ {0, 1}*} is not context-free second choice

As s = 0^p10^p1 doesn’t work, we can try 0^p1^p0^p1^p∈ L.

Suppose 0^p1^p0^p1^p = uvxyz with |vxy| ≤ p and |vy| > 0.

Then we can break into 3 cases:

vxy is a substring of the left half 0^p1^p.
vxy is a substring of the right half 0^p1^p.
vxy straddles the two halves of 0^p1^p0^p1^p.

Case 1 gives a contradiction: uv²xy²z ∉ L because the second half of uv²xy²z would begin with 1 (as a result of 1’s in the left half being “pushed over”), and the 1 cannot match the 0 that begins the left half.

Case 2 gives a contradiction for the symmetric reason.

39 of 39

Proof that {ww| w ∈ {0, 1}*} is not context-free concluded

Therefore case 3 is the only remaining possibility:

vxy straddles the two halves of 0^p1^p0^p1^p.

In this case we select k = 0 (“pumping down”) to get a contradiction, i.e. show

uv⁰xy⁰z ∉ L.

Because vxy straddles, we know vxy = 1^r0^s for some r, s, and either r < p or s < p (otherwise no straddle).

These cases are symmetric, so focus on r < p. Then we have fewer than p 1’s in an initial prefix of uv⁰xy⁰z, but p 1’s in the final suffix, so this string is 0^p1^p-r0^p-s1^pwhere r < p, and is thus not of the form ww.

Grammar with no 𝛆 Productions S₀ → S S → ASA l aB \| a \| AS \| SA A → B I S B → b	Remove A → S: (new underlined) S₀ → S S → ASA l aB \| a \| AS \| SA A → ASA l aB \| a \| AS \| SA \| b B → b
Remove A → B: S₀ → S S → ASA l aB \| a \| AS \| SA A → S \| b B → b	Remove S₀ → S: S₀ → ASA l aB \| a \| AS \| SA S → ASA l aB \| a \| AS \| SA A → ASA l aB \| a \| AS \| SA \| b B → b No unit productions remain.