Lex & Yacc

Lex

  • lex is a program generator that produces lexical analyzers; it is widely used on Unix.
  • It is most often used together with the yacc parser generator.
  • It was written by Eric Schmidt and Mike Lesk.
  • It reads an input specification describing the lexical analyzer and outputs C source code implementing that lexical analyzer.
  • lex reads patterns (regular expressions) and produces C code for a lexical analyzer that scans the input for matching strings, such as identifiers.

    • A simple pattern: letter(letter|digit)*
  • This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits.
  • lex translates such regular expressions into a C program that mimics a finite state automaton (FSA).
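To make "mimics an FSA" concrete, here is a hedged hand-written C sketch of the two-state automaton for letter(letter|digit)* (match_identifier is our own name for illustration; lex actually generates a table-driven recognizer, not hand-coded logic like this):

```c
#include <ctype.h>

/* Mimics the two-state FSA for letter(letter|digit)*:
 * state START accepts only a letter; state IN_ID loops on letters/digits. */
int match_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))      /* START: first char must be a letter */
        return 0;
    for (s++; *s != '\0'; s++)            /* IN_ID: zero or more letters/digits */
        if (!isalnum((unsigned char)*s))
            return 0;
    return 1;
}
```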

  • lex has limitations: it cannot recognize nested structures such as balanced parentheses, because it has only states and transitions between states (no stack).
  • lex is therefore good at pattern matching, while the more challenging tasks are left to yacc.

Pattern Matching Primitives

  • Pattern matching examples.

  • The input structure to lex:

……Definitions section……

%%

……Rules section……

%%

……C code section (subroutines)……

  • ECHO is a predefined action (macro) in lex that writes out the text matched by the pattern.

Lex predefined variables include yytext (the text that matched the pattern), yyleng (its length), yyin (the input stream), and yyout (the output stream).

  • Whitespace must separate the defining term and the associated expression.
  • Code in the definitions section is copied as-is to the top of the generated C file and must be bracketed with "%{" and "%}" markers.
  • Substitutions in the rules section are surrounded by braces ({letter}) to distinguish them from literals.
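Putting these points together, here is a minimal sketch of a definitions-plus-rules lex file (the id-printing action and the letter/digit names are our own illustration, not taken from the slides):

```lex
letter  [a-zA-Z]
digit   [0-9]
%{
/* code here is copied verbatim to the top of the generated C file */
#include <stdio.h>
%}
%%
{letter}({letter}|{digit})*   { printf("id: %s\n", yytext); }
.|\n                          ECHO;
%%
```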

Yacc

  • Theory:
    • yacc reads a grammar and generates C code for a parser.

    • Grammars are written in Backus-Naur Form (BNF).

    • BNF grammars are used to express context-free languages.

    • e.g. to parse an expression, the parser performs the reverse of derivation, reducing the expression back toward the start symbol.

    • This is known as bottom-up or shift-reduce parsing.

    • It uses a stack (LIFO) for storage.
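As a hedged toy illustration of shift-reduce parsing with an explicit LIFO stack, the following C sketch evaluates a token stream of ids separated by + and -, reducing "expr op expr" as soon as both operands are on the stack (all names here are our own; a real yacc parser is table-driven and more general):

```c
enum { ID, PLUS, MINUS, END };

struct token { int type; int value; };   /* value is used only for ID */

/* Shift-reduce sketch for: expr -> expr + expr | expr - expr | id.
 * IDs are shifted onto the stack; whenever an id follows an operator,
 * "expr op expr" is reduced to a single value. */
int shift_reduce_eval(const struct token *t)
{
    int stack[64], top = 0, op = END;

    for (; t->type != END; t++) {
        if (t->type == ID) {
            stack[top++] = t->value;              /* shift */
            if (op != END) {                      /* reduce expr op expr */
                int b = stack[--top], a = stack[--top];
                stack[top++] = (op == PLUS) ? a + b : a - b;
                op = END;
            }
        } else {
            op = t->type;                         /* remember the operator */
        }
    }
    return stack[0];
}
```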

  • The input to yacc is divided into three sections.

... definitions ...

%%

... rules ...

%%

... subroutines ...

  • The definitions section consists of:
    • token declarations.
    • C code bracketed by "%{" and "%}".

  • The rules section consists of:
    • the BNF grammar.

  • The subroutines section consists of:
    • user subroutines.

yacc & lex Together

  • The grammar:

program -> program expr | ε

expr -> expr + expr | expr - expr | id

  • program and expr are nonterminals.
  • id is a terminal (a token returned by lex).

  • An expression may be:
    • the sum of two expressions,
    • the difference of two expressions,
    • or an identifier.
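As a hedged sketch, the grammar above could be written in yacc roughly as follows (the printf action, the %left precedence declaration, and the '\n' terminator are our own additions for illustration; ID's value is assumed to be supplied by the lexer via yylval):

```yacc
%{
#include <stdio.h>
%}
%token ID
%left '+' '-'
%%
program : program expr '\n'   { printf("= %d\n", $2); }
        | /* empty */
        ;
expr    : expr '+' expr       { $$ = $1 + $3; }
        | expr '-' expr       { $$ = $1 - $3; }
        | ID
        ;
%%
```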

Lex file

Yacc file

Linking lex & yacc

lex & yacc

Authors

  • John R. Levine
  • Tony Mason
  • Doug Brown

The Simplest Lex Program

%%

.|\n ECHO;

%%

  • The above lex program copies its standard input to its standard output.
  • . (period) matches any single character other than newline.
  • \n matches the newline character.

  • It acts very much like the UNIX cat command run with no arguments.
  • The special action ECHO prints the matched text on the output, copying any punctuation or other characters.

Recognizing Words with Lex

Word recognizer:

%{
/*
 * This sample program recognizes whether a word is a verb.
 */
%}
%%
[ \t]+        /* ignore whitespace */ ;
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
can |
could |
has |
have |
had |
go            { printf("%s: is a verb\n", yytext); }
[a-zA-Z]+     { printf("%s: is not a verb\n", yytext); }
.|\n          { ECHO; /* normal default anyway */ }
%%
main()
{
    yylex();
}

Input: did i have fun?

Output:

did: is a verb
i: is not a verb
have: is a verb
fun: is not a verb
?

  • If the input matches any of the verbs in the list, we execute the action, a C printf statement.
  • yytext contains the text that matched the pattern.
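The long chain of "is | am | ..." patterns amounts to a membership test on a word list; a hedged hand-rolled C equivalent looks like this (is_verb and the verbs array are our own names):

```c
#include <string.h>
#include <stddef.h>

/* The same verb list as the lex rules above, terminated by NULL. */
static const char *verbs[] = {
    "is", "am", "are", "were", "was", "be", "being", "been",
    "do", "does", "did", "can", "could", "has", "have", "had", "go", NULL
};

/* Returns 1 if word is in the verb list, 0 otherwise. */
int is_verb(const char *word)
{
    for (const char **v = verbs; *v != NULL; v++)
        if (strcmp(word, *v) == 0)
            return 1;
    return 0;
}
```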

  • "[a-zA-Z]+" indicates any alphabetic string with at least one character.
  • "-" denotes a range of characters, beginning with the character to the left of the "-" and ending with the character to its right.

Symbol Tables

  • A symbol table is a data structure used by a compiler to keep track of the identifiers used in the source program.
  • A symbol table is a table of words with a common structure, used in many lex and yacc applications.
  • The symbol table is a compile-time data structure; it is not used at run time.
  • C compilers store variables, structure names, labels, enumeration tags, and the other names used in a program in their symbol table.
  • The possible entries in a symbol table are:

1. type of symbol
2. variable name or type
3. reserved word
4. constant name
5. declaration scope

  • The symbol table is concerned with storing identifiers and their attributes.

  • The symbol table is mainly needed for semantic checking, for example to check that:

1. a variable is assigned before it is referenced.
2. a variable is not declared multiple times.

Example: Lexer with symbol table

%{
/*
 * Word recognizer with a symbol table.
 */

enum {
    LOOKUP = 0, /* default - looking rather than defining */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state;

int add_word(int type, char *word);
int lookup_word(char *word);
%}

add_word() puts a new word into the symbol table.

lookup_word() looks up a word which should already be entered.
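A hedged sketch of what could sit behind add_word() and lookup_word() is a simple linked list (the textbook's implementation differs in details; the enum mirrors the slide, the struct layout is our own):

```c
#include <stdlib.h>
#include <string.h>

/* LOOKUP means "word not found"; the other codes are parts of speech. */
enum { LOOKUP = 0, VERB, ADJ, ADV, NOUN, PREP, PRON, CONJ };

struct word {
    char        *name;
    int          type;
    struct word *next;
};

static struct word *word_list = NULL;

/* Returns the word's part-of-speech code, or LOOKUP if it is not entered. */
int lookup_word(const char *name)
{
    for (struct word *w = word_list; w != NULL; w = w->next)
        if (strcmp(w->name, name) == 0)
            return w->type;
    return LOOKUP;
}

/* Adds a new word; returns 0 if it was already defined, 1 on success. */
int add_word(int type, const char *name)
{
    if (lookup_word(name) != LOOKUP)
        return 0;
    struct word *w = malloc(sizeof *w);
    w->name = malloc(strlen(name) + 1);
    strcpy(w->name, name);
    w->type = type;
    w->next = word_list;
    word_list = w;
    return 1;
}
```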

Grammars

  • A sequence of tokens needs to be recognized and an action taken.
  • The description of such a set of actions is called a grammar.
  • Consider, for example, simple sentence forms:

noun verb

noun verb noun

  • The notation for describing a grammar is a set of rules of the form "symbol -> replacement", which means that a particular sequence of tokens or symbols can be replaced by a new symbol:

subject -> noun | pronoun

(a subject can be either a noun or a pronoun)

object -> noun

sentence -> subject verb object

Example: teacher teaches students.

Parser-Lexer Communication

  • When a lex scanner and a yacc parser are used together, this is called lexer-parser communication.
  • The parser calls the lexer, yylex(), whenever it needs a token from the input.
  • The lexer scans through the input recognizing tokens and returns them to the parser.
  • The parser is the higher-level routine.
  • In most programming languages the parser does not want to see comments and whitespace, so the lexer discards them.

  • lex produces a function called yylex().
  • yacc produces a function called yyparse().
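The handshake between the two functions can be sketched in C: yyparse() calls yylex() each time it needs the next token. This toy yylex reads from a string rather than stdin and skips whitespace, returning WORD for letter runs (the WORD token code, the input variable, and all names here are our own illustration, not values yacc generates):

```c
#include <ctype.h>

enum { END = 0, WORD = 258 };   /* yacc's own token codes also start above 256 */

static const char *input;       /* stands in for the stdin stream */

int yylex(void)
{
    while (*input == ' ' || *input == '\t')   /* the parser never sees whitespace */
        input++;
    if (*input == '\0')
        return END;                           /* 0 tells yyparse: end of input */
    if (isalpha((unsigned char)*input)) {
        while (isalpha((unsigned char)*input))
            input++;
        return WORD;
    }
    return (unsigned char)*input++;           /* punctuation: single-char token */
}
```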

A Yacc Parser(Yet Another Compiler Compiler)

The structure of a yacc parser is similar to that of a lex lexer. The first section, the definitions section, has a literal code block enclosed in "%{" and "%}". We use it here for a C comment (as with lex, C comments belong inside C code blocks, at least within the definitions section) and a single include file.

Then come definitions of all the tokens we expect to receive from the lexical analyzer.

Universal custom dictates that token names be all uppercase and other names in the parser mostly or entirely lowercase.

The first %% indicates the beginning of the rules section. The second %% indicates the end of the rules and the beginning of the user subroutines section.

The most important subroutine is main(), which repeatedly calls yyparse() until the lexer's input file runs out.

The routine yyparse() is the parser generated by yacc, so our main program repeatedly tries to parse sentences until the input runs out. Yacc is a tool for the automatic generation of parsers.

The Rules Section

The rules section describes the actual grammar as a set of production rules or simply rules. (Some people also call them productions.) Each rule consists of a single name on the left-hand side of the ":" operator, a list of symbols and action code on the right-hand side, and a semicolon indicating the end of the rule. By default, the first rule is the highest-level rule.

That is, the parser attempts to find a list of tokens which match this initial rule, or more commonly, rules found from the initial rule. The expression on the right-hand side of the rule is a list of zero or more names. A typical simple rule has a single symbol on the right-hand side, as in the object rule, which is defined to be a NOUN. The symbol on the left-hand side of the rule can then be used like a token in other rules.

Running Lex and Yacc

  • Lex program:

$ gedit filename.l

(type the program and save it)

$ lex filename.l

$ cc lex.yy.c -ll

$ ./a.out

Executing lex and Yacc together

$ gedit filename.l

$ gedit filename.y

$ lex filename.l

$ yacc -d filename.y

$ cc lex.yy.c y.tab.c -ll -ly

$ ./a.out

Lex vs. Hand-written Lexers

  • It is sometimes argued that writing a lexer in C is so easy that there is no point in going to the effort of learning lex.
  • A hand-written C lexer is indeed adequate for a simple command language that handles commands, numbers, strings, and newlines while ignoring whitespace and comments.
  • Even so, the program written in C takes about three times as long to write and debug as the equivalent lex program.

  • A useful rule of thumb: the number of bugs in a program is roughly proportional to its length.
  • For an example program, refer to the textbook.

Regular Expressions

  • Regular expressions are widely used within the UNIX environment, and lex uses a rich regular expression language.
  • A regular expression is a pattern description written in a "meta" language, a language used to describe particular patterns of interest.
  • The characters used in this metalanguage are part of the standard ASCII character set.