1 of 68

Lex & Yacc

2 of 68

Text Book

*Levine, John R., Tony Mason and Doug Brown [1992]. Lex & Yacc. O’Reilly & Associates, Inc. Sebastopol, California.

2

Hathal & Ahmad

*

3 of 68

Compiler

Source program

Target program

3

Hathal & Ahmad

compiler

*

4 of 68

Interpreter

Source program Input

Output

4

Hathal & Ahmad

Interpreter

*

5 of 68

Structure of Compiler

5

Hathal & Ahmad

*

6 of 68

Lexical Analysis

Groups the characters into meaningful sequences called lexemes

For example :<token-name, attribute-value>

position = initial + rate * 60

token <id, 1>

token <=>

token <id, 2>

token <+>

token <*>

token <60>

<id,l> < = > <id, 2> <+> <id, 3> <*> <60>

6

Hathal & Ahmad

*

7 of 68

Lexical Analysis

7

Hathal & Ahmad

*

8 of 68

Syntax Analysis

Create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree

8

Hathal & Ahmad

*

9 of 68

Lex and Yacc

9

Hathal & Ahmad

*

10 of 68

Lex and Yacc

Tokens
Lexer/Scanner
Lex specification
Regular Expression
Grammer
Parser
Simple Lex Program

%%

.|\n ECHO;

%%

10

Hathal & Ahmad

*

11 of 68

Lex syntax

……..Declaration and Definitions section……

%%

……Rules section……..

%%

……….C code section (subroutines)……..

The input structure to Lex.l

Echo is an action and predefined macro in lex that writes code matched by the pattern.

11

Hathal & Ahmad

*

12 of 68

Lex syntax

Declaration section

%{

headerfiles, symbolic constants,

Declarations and comments ,

c/c++ function declarations or definitions

%}

Example::

%{

#include <stdio.h>

#define NUMBER 400

#define COMMENT 401

#define TEXT 402

#define COMMAND 403

%}

12

Hathal & Ahmad

*

13 of 68

Lex syntax

Rules section

P1 {A1}

P2 {A2}

….

Pn {An}

%%

Example::

%%

[\t] ;

[0-9]+ printf(“found an intereger:%s\n”,yytext);

%%

13

Hathal & Ahmad

*

14 of 68

Lex syntax

C code section

%%

Void main()

{

yylex();

}

14

Hathal & Ahmad

*

15 of 68

Lex program

%{

/*

* this sample demonstrates (very) simple recognition:

* a verb/not a verb.

*/

%}

%%

[\t ]+ ;

is | am | are | were | was | be | being | been | do | does | did | will | would | should | can | could | has | have | had | go { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n { ECHO; /* normal default anyway */ }

%%

main()

{

yylex() ;

}

15

Hathal & Ahmad

*

16 of 68

Lex program con…

Example 1-1

did I have fun?

did: is a verb

I: is not a verb

have: is a verb

fun: is not a verb ?

^D

%

Symbol Table

noun dog cat horse cow

verb chew eat lick

Reserved words

Add_word()

Lookup()_word()

16

Hathal & Ahmad

*

17 of 68

Symbol table example

%{

/*

*

Word recognizer with a symbol table.

*/

enum { LOOKUP =0, /* default - looking rather than defining. */

VERB,

ADJ,

ADV,

NOUN,

REP,

PRON,

CONJ

};

int state;

int add_word(int type, char *word);

int lookup_word(char *word);

%}

17

Hathal & Ahmad

*

18 of 68

Grammar

Set of actions.

Example

noun verb.

noun verb noun

For instance:

subject → noun | pronoun

object → noun

sentence → subject verb object

Parser-lexer communication

It calls the lexer yylex() whenever it needs a token from the input .

The lexer then scans through the input recognizing tokens

Not all tokens are of interest to the parser

# define NOUN 257

# define PRONOUN 258

# define VERB 259

# define ADVERB 260 …….

18

Hathal & Ahmad

*

19 of 68

The parts of speech lexer

19

Hathal & Ahmad

*

20 of 68

Rules section

20

Hathal & Ahmad

*

21 of 68

Running lex and Yacc

21

Hathal & Ahmad

*

22 of 68

Lex vs handwritten lex

22

Hathal & Ahmad

*

23 of 68

Lex vs handwritten lex

23

Hathal & Ahmad

*

24 of 68

Lex

lex is a program (generator) that generates lexical analyzers, (widely used on Unix).
It is mostly used with Yacc parser generator.
Written by Eric Schmidt and Mike Lesk.
It reads the input stream (specifying the lexical analyzer ) and outputs source code implementing the lexical analyzer in the C programming language.
Lex will read patterns (regular expressions); then produces C code for a lexical analyzer that scans for identifiers.

24

Hathal & Ahmad

*

25 of 68

Lex

A simple pattern: letter(letter|digit)*

Regular expressions are translated by lex to a computer program that mimics an FSA.
This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits.

25

Hathal & Ahmad

*

26 of 68

Lex

Pattern Matching Primitives

26

Hathal & Ahmad

*

27 of 68

Regular Expression

Pattern Matching examples.

27

Hathal & Ahmad

.	Any single character
*	Zero or more copoies of preceding expression
[ ]	Any character within the bracket
^	Beginning of line as first character
$	End of line as last character
{ }	How many times previous pattern allowed
\	Escape metacharacters
+	One or more occurance
?	Zero or more occurance
\|	Either preceding or following
“…”	Marks literally
/	Preceding expression
( )	Groups a series of expression

*

28 of 68

Example of Regular expression

Pattern Matching examples.

28

Hathal & Ahmad

*

29 of 68

Lex syntax

……..Declaration and Definitions section……

%%

……Rules section……..

%%

……….C code section (subroutines)……..

The input structure to Lex.

Echo is an action and predefined macro in lex that writes code matched by the pattern.

29

Hathal & Ahmad

*

30 of 68

Lex syntax

Declaration section

%{

headerfiles,symbolic constants,

Declarations and comments ,

c/c++ function declarations or definitions

%}

Example::

%{

#include <stdio.h>

#define NUMBER 400

#define COMMENT 401

#define TEXT 402

#define COMMAND 403

%}

30

Hathal & Ahmad

*

31 of 68

Lex syntax

Rules section

P1 {A1}

P2 {A2}

….

Pn {An}

%%

Example::

%%

[\t] ;

[0-9]+ printf(“found an intereger:%s\n”,yytext);

%%

31

Hathal & Ahmad

*

32 of 68

Lex syntax

C code section

%%

Void main()

{

yylex();

}

32

Hathal & Ahmad

*

33 of 68

Lex variables

yyin : point to standard input device keyboard.can read data from the file
yyout : point to standard output device .can write data into the file
yytext :matched pattern is copied into yytext autoatically
yyleng :string length of yytext is copied in variable yyleng
yylval :integer value assosicated with token is returned by lexical analyzer
ECHO : write the matched character

33

Hathal & Ahmad

*

34 of 68

Lex Functions

yylex() : It starts the analysis, automatically generated by LEX. This function accepts the data from the keyboard or the input file and search for patterns specified in the rules section.
yywrap() :Called when yylex() encounters end of file . It return 0 or 1.
yyerror() :Called when lenxical analyzer encounters error

34

Hathal & Ahmad

*

35 of 68

Lex

Lex predefined variables.

35

Hathal & Ahmad

*

36 of 68

The word counting program

The definition section for our word count example is:

%{

unsigned charCount = 0, wordCount = 0, lineCount = 0;

%}

word [^ \t\n]+

eol \n

Hathal & Ahmad

36

*

37 of 68

The word counting program

The rules section contains the patterns and actions that specify the lexer. Here is our

sample word count’s rules section:

%%

{word} { wordCount++; charCount += yyleng; }

{eol} { charCount++; lineCount++; }

. charCount++;

Hathal & Ahmad

37

*

38 of 68

The word counting program

%%

main()

{

yylex();

printf("%d %d %d\n",lineCount, wordCount, charCount);

}

Hathal & Ahmad

38

*

39 of 68

The word counting program

main(argc,argv)

int argc;

char **argv;

{

if (argc > 1) {

FILE *file;

file = fopen(argv[1], "r");

if (!file) {

fprintf(stderr,"could not open %s\n",argv[1]);

exit(1);

}

yyin = file;

}

yylex();

printf("%u %u %u\n",charCount, wordCount, lineCount);

return 0;

}

Hathal & Ahmad

39

*

40 of 68

Parsing a command line

%{

unsigned verbose;

char *progName;

%}

%%

-h |

"-?" |

-help { printf("usage is: %s [-help | -h | -? ] [-verbose | -v] "

"[(-file| -f) filename]\n", progName);

}

-v |

-verbose { printf("verbose mode is on\n"); verbose = 1; }

%%

main(argc, argv)

int argc;

char **argv;

{

progName = *argv;

yylex();

}

Hathal & Ahmad

40

*

41 of 68

Parsing a command line

Hathal & Ahmad

41

*

42 of 68

Parsing a command line

Hathal & Ahmad

42

*

43 of 68

Start state

Hathal & Ahmad

43

*

44 of 68

Start state

Example1 :

Hathal & Ahmad

44

%s MAGIC

%%

<MAGIC>.+ { BEGIN 0; printf("Magic:"); ECHO; }

Magic BEGIN MAGIC;

%%

main()

{

yylex();

}

Example 2:

%s MAGIC

%%

magic BEGIN MAGIC;

.+ ECHO;

<MAGIC>.+ { BEGIN 0; printf ("Magic:"); ECHO; }

%%

main()

{

yylex();

}

*

45 of 68

Yacc

Theory:

Yacc reads the grammar and generate C code for a parser .

Grammars written in Backus Naur Form (BNF) .

BNF grammar used to express context-free languages .

e.g. to parse an expression , do reverse operation( reducing the expression)

This known as bottom-up or shift-reduce parsing .

Using stack for storing (LIFO).

45

Hathal & Ahmad

*

46 of 68

Yacc

Grammar

statement NAME = expression

expression NUMBER + NUMBER | NUMBER − NUMBER

Recursive Rules

expression 🡪NUMBER

| expression + NUMBER

| expression − NUMBER

46

Hathal & Ahmad

*

47 of 68

Yacc

47

Hathal & Ahmad

*

48 of 68

Shift/Reduce Parsing

As the parser reads tokens, each time it reads a token that doesn’t complete a rule it pushes the token on an internal stack and switches to a new state reflecting the token it just read-shift
When it has found all the symbols that constitute the right-hand side of a rule, it pops the right-hand side symbols off the stack, pushes the left-hand side symbol onto the stack, and switches to a new state reflecting the new symbol on the stack.-reduction
“fred = 12 + 13”
fred
fred =
fred = 12
fred = 12 +
fred = 12 + 13 fred = expression
“statement → NAME = expression”, so it pops fred, =, and expression and replaces them with statement

48

Hathal & Ahmad

*

49 of 68

Yacc

49

Hathal & Ahmad

*

50 of 68

What Yacc cannot parse

It cannot deal with ambiguous grammars.
Consider this ex‐ tremely contrived example:

phrase cart_animal AND CART

| work_animal AND PLOW

cart_animal HORSE | GOAT

work_animal HORSE | OX

If we changed the first rule to this:

phrase cart_animal CART

| work_animal PLOW

50

Hathal & Ahmad

*

51 of 68

Yacc parser

Input to yacc is divided into three sections.

... definitions ...

%%

... rules ...

%%

... subroutines ...

51

Hathal & Ahmad

*

52 of 68

Yacc

The definitions section consists of:

token declarations .
C declaration code bracketed by “%{“ and “%}”.
Start symbol
%token NAME NUMBER

%%

the rules section consists of:

%token NAME NUMBER
%%
statement: NAME '=' expression
| expression ;
expression: NUMBER '+' NUMBER
| NUMBER '−' NUMBER
;

the subroutines section consists of:

user subroutines .

52

Hathal & Ahmad

*

53 of 68

Symbol values and Actions

Every symbol in a yacc parser has a value-
a number, the value would be the particular number
a literal text string, the value would probably be a pointer to a copy of the string.
a variable in a program, the value would be a pointer to a symbol table entry describing the variable
Non-terminal symbols can have any values
. If you have multiple value types, you have to list all the value types used in a parser so that yacc can create a C union typedef called YYSTYPE to contain them
statement: NAME '=' expression |
expression { printf("= %d\n", $1); }

;

expression: expression '+' NUMBER { $$ = $1 + $3; }
| expression '−' NUMBER { $$ = $1 − $3; }
| NUMBER { $$ = $1; }
;

53

Hathal & Ahmad

*

54 of 68

The Lexer

54

Hathal & Ahmad

*

55 of 68

Running Simple Parser

55

Hathal & Ahmad

*

56 of 68

Arithmetic Expressions and Ambiguity

56

Hathal & Ahmad

*

57 of 68

Arithmetic Expressions and Ambiguity

57

Hathal & Ahmad

*

58 of 68

Arithmetic Expressions and Ambiguity

58

Hathal & Ahmad

Explicitly by:

%left '+' '-‘

%left '*' '/‘

%nonassoc UMINUS

When not to use prescedence rule??

two situations: in expression grammars, and to resolve the “dangling else” conflict in grammars for if-then-else language con‐ structs.

*

59 of 68

Variables and Typed Tokens

59

Hathal & Ahmad

*

60 of 68

Variables and Typed Tokens

60

Hathal & Ahmad

*

61 of 68

Variables and Typed Tokens

61

Hathal & Ahmad

To define the possible symbol types, in the definition section we add a %union decla‐ ration:

%union { double dval;

int vblno;

}

Here is the y.tab.h generated from this grammar: #define NAME 257

#define NUMBER 258

#define UMINUS 259

typedef union {

double dval;

int vblno;

}YYSTYPE;

extern YYSTYPE yylval;

*

62 of 68

Symbol Table

Hathal & Ahmad

62

*

63 of 68

Symbol Table

Hathal & Ahmad

63

*

64 of 68

Symbol Table

Hathal & Ahmad

64

*

65 of 68

Functions and reserve words

Hathal & Ahmad

65

*

66 of 68

Reserve words in symbol table

struct symtab {

char *name;

double (*funcptr)();

double value;

} symtab[NSYMS];

Hathal & Ahmad

66

*

67 of 68

Reserve words in symbol table

Hathal & Ahmad

67

*

68 of 68

Reserve words in symbol table

Hathal & Ahmad

68

*