1 of 68

Lex & Yacc

2 of 68

Text Book

*Levine, John R., Tony Mason and Doug Brown [1992]. Lex & Yacc. O’Reilly & Associates, Inc. Sebastopol, California.

2

Hathal & Ahmad

*

3 of 68

Compiler

Source program

Target program

3

Hathal & Ahmad

compiler

*

4 of 68

Interpreter

Source program Input

Output

4

Hathal & Ahmad

Interpreter

*

5 of 68

Structure of Compiler

5

Hathal & Ahmad

*

6 of 68

Lexical Analysis

  • Groups the characters into meaningful sequences called lexemes

For example :<token-name, attribute-value>

position =  initial +  rate  *  60 

  token <id, 1>

 token <=>

token <id, 2>

token <+>

 token <*>

token <60>

 <id,l>  < = > <id, 2> <+> <id, 3> <*> <60> 

6

Hathal & Ahmad

*

7 of 68

Lexical Analysis

7

Hathal & Ahmad

*

8 of 68

Syntax Analysis

  • Create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree 

8

Hathal & Ahmad

*

9 of 68

Lex and Yacc

9

Hathal & Ahmad

*

10 of 68

Lex and Yacc

  • Tokens
  • Lexer/Scanner
  • Lex specification
  • Regular Expression
  • Grammer
  • Parser
  • Simple Lex Program

%%

.|\n ECHO;

%%

10

Hathal & Ahmad

*

11 of 68

Lex syntax

……..Declaration and Definitions section……

%%

……Rules section……..

%%

……….C code section (subroutines)……..

  • The input structure to Lex.l

Echo is an action and predefined macro in lex that writes code matched by the pattern.

11

Hathal & Ahmad

*

12 of 68

Lex syntax

  • Declaration section

%{

headerfiles, symbolic constants,

Declarations and comments ,

c/c++ function declarations or definitions

%}

Example::

%{

#include <stdio.h>

#define NUMBER 400

#define COMMENT 401

#define TEXT 402

#define COMMAND 403

%}

12

Hathal & Ahmad

*

13 of 68

Lex syntax

  • Rules section

P1 {A1}

P2 {A2}

….

Pn {An}

%%

Example::

%%

[\t] ;

[0-9]+ printf(“found an intereger:%s\n”,yytext);

%%

13

Hathal & Ahmad

*

14 of 68

Lex syntax

  • C code section

%%

Void main()

{

yylex();

}

14

Hathal & Ahmad

*

15 of 68

Lex program

%{

/*

* this sample demonstrates (very) simple recognition:

* a verb/not a verb.

*/

%}

%%

[\t ]+ ;

is | am | are | were | was | be | being | been | do | does | did | will | would | should | can | could | has | have | had | go { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n { ECHO; /* normal default anyway */ }

%%

main()

{

yylex() ;

}

15

Hathal & Ahmad

*

16 of 68

Lex program con…

Example 1-1

did I have fun?

did: is a verb

I: is not a verb

have: is a verb

fun: is not a verb ?

^D

%

Symbol Table

noun dog cat horse cow

verb chew eat lick

Reserved words

Add_word()

Lookup()_word()

16

Hathal & Ahmad

*

17 of 68

Symbol table example

%{

/*

*

Word recognizer with a symbol table.

*/

enum { LOOKUP =0, /* default - looking rather than defining. */

VERB,

ADJ,

ADV,

NOUN,

REP,

PRON,

CONJ

};

int state;

int add_word(int type, char *word);

int lookup_word(char *word);

%}

17

Hathal & Ahmad

*

18 of 68

Grammar

Set of actions.

Example

noun verb.

noun verb noun

For instance:

subject → noun | pronoun

object → noun

sentence → subject verb object

Parser-lexer communication

It calls the lexer yylex() whenever it needs a token from the input .

The lexer then scans through the input recognizing tokens

Not all tokens are of interest to the parser

# define NOUN 257

# define PRONOUN 258

# define VERB 259

# define ADVERB 260 …….

18

Hathal & Ahmad

*

19 of 68

The parts of speech lexer

19

Hathal & Ahmad

*

20 of 68

Rules section

20

Hathal & Ahmad

*

21 of 68

Running lex and Yacc

21

Hathal & Ahmad

*

22 of 68

Lex vs handwritten lex

22

Hathal & Ahmad

*

23 of 68

Lex vs handwritten lex

23

Hathal & Ahmad

*

24 of 68

Lex

  • lex is a program (generator) that generates lexical analyzers, (widely used on Unix).
  • It is mostly used with Yacc parser generator.
  • Written by Eric Schmidt and Mike Lesk.
  • It reads the input stream (specifying the lexical analyzer ) and outputs source code implementing the lexical analyzer in the C programming language.
  • Lex will read patterns (regular expressions); then produces C code for a lexical analyzer that scans for identifiers.

24

Hathal & Ahmad

*

25 of 68

Lex

    • A simple pattern: letter(letter|digit)*
  • Regular expressions are translated by lex to a computer program that mimics an FSA.
  • This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits.

25

Hathal & Ahmad

*

26 of 68

Lex

Pattern Matching Primitives

26

Hathal & Ahmad

*

27 of 68

Regular Expression

  • Pattern Matching examples.

27

Hathal & Ahmad

.

Any single character

*

Zero or more copoies of preceding expression

[ ]

Any character within the bracket

^

Beginning of line as first character

$

End of line as last character

{ }

How many times previous pattern allowed

\

Escape metacharacters

+

One or more occurance

?

Zero or more occurance

|

Either preceding or following

“…”

Marks literally

/

Preceding expression

( )

Groups a series of expression

*

28 of 68

Example of Regular expression

  • Pattern Matching examples.

28

Hathal & Ahmad

*

29 of 68

Lex syntax

……..Declaration and Definitions section……

%%

……Rules section……..

%%

……….C code section (subroutines)……..

  • The input structure to Lex.

  • Echo is an action and predefined macro in lex that writes code matched by the pattern.

29

Hathal & Ahmad

*

30 of 68

Lex syntax

  • Declaration section

%{

headerfiles,symbolic constants,

Declarations and comments ,

c/c++ function declarations or definitions

%}

Example::

%{

#include <stdio.h>

#define NUMBER 400

#define COMMENT 401

#define TEXT 402

#define COMMAND 403

%}

30

Hathal & Ahmad

*

31 of 68

Lex syntax

  • Rules section

P1 {A1}

P2 {A2}

….

Pn {An}

%%

Example::

%%

[\t] ;

[0-9]+ printf(“found an intereger:%s\n”,yytext);

%%

31

Hathal & Ahmad

*

32 of 68

Lex syntax

  • C code section

%%

Void main()

{

yylex();

}

32

Hathal & Ahmad

*

33 of 68

Lex variables

  • yyin : point to standard input device keyboard.can read data from the file
  • yyout : point to standard output device .can write data into the file
  • yytext :matched pattern is copied into yytext autoatically
  • yyleng :string length of yytext is copied in variable yyleng
  • yylval :integer value assosicated with token is returned by lexical analyzer
  • ECHO : write the matched character

33

Hathal & Ahmad

*

34 of 68

Lex Functions

  • yylex() : It starts the analysis, automatically generated by LEX. This function accepts the data from the keyboard or the input file and search for patterns specified in the rules section.
  • yywrap() :Called when yylex() encounters end of file . It return 0 or 1.
  • yyerror() :Called when lenxical analyzer encounters error

34

Hathal & Ahmad

*

35 of 68

Lex

Lex predefined variables.

35

Hathal & Ahmad

*

36 of 68

The word counting program

The definition section for our word count example is:

%{

unsigned charCount = 0, wordCount = 0, lineCount = 0;

%}

word [^ \t\n]+

eol \n

Hathal & Ahmad

36

*

37 of 68

The word counting program

  • The rules section contains the patterns and actions that specify the lexer. Here is our

sample word count’s rules section:

%%

{word} { wordCount++; charCount += yyleng; }

{eol} { charCount++; lineCount++; }

. charCount++;

Hathal & Ahmad

37

*

38 of 68

The word counting program

%%

main()

{

yylex();

printf("%d %d %d\n",lineCount, wordCount, charCount);

}

Hathal & Ahmad

38

*

39 of 68

The word counting program

main(argc,argv)

int argc;

char **argv;

{

if (argc > 1) {

FILE *file;

file = fopen(argv[1], "r");

if (!file) {

fprintf(stderr,"could not open %s\n",argv[1]);

exit(1);

}

yyin = file;

}

yylex();

printf("%u %u %u\n",charCount, wordCount, lineCount);

return 0;

}

Hathal & Ahmad

39

*

40 of 68

Parsing a command line

%{

unsigned verbose;

char *progName;

%}

%%

-h |

"-?" |

-help { printf("usage is: %s [-help | -h | -? ] [-verbose | -v] "

"[(-file| -f) filename]\n", progName);

}

-v |

-verbose { printf("verbose mode is on\n"); verbose = 1; }

%%

main(argc, argv)

int argc;

char **argv;

{

progName = *argv;

yylex();

}

Hathal & Ahmad

40

*

41 of 68

Parsing a command line

Hathal & Ahmad

41

*

42 of 68

Parsing a command line

Hathal & Ahmad

42

*

43 of 68

Start state

Hathal & Ahmad

43

*

44 of 68

Start state

Example1 :

Hathal & Ahmad

44

%s MAGIC

%%

<MAGIC>.+ { BEGIN 0; printf("Magic:"); ECHO; }

Magic BEGIN MAGIC;

%%

main()

{

yylex();

}

Example 2:

%s MAGIC

%%

magic BEGIN MAGIC;

.+ ECHO;

<MAGIC>.+ { BEGIN 0; printf ("Magic:"); ECHO; }

%%

main()

{

yylex();

}

*

45 of 68

Yacc

  • Theory:
    • Yacc reads the grammar and generate C code for a parser .

    • Grammars written in Backus Naur Form (BNF) .

    • BNF grammar used to express context-free languages .

    • e.g. to parse an expression , do reverse operation( reducing the expression)

    • This known as bottom-up or shift-reduce parsing .

    • Using stack for storing (LIFO).

45

Hathal & Ahmad

*

46 of 68

Yacc

Grammar

statement NAME = expression

expression NUMBER + NUMBER | NUMBER − NUMBER

Recursive Rules

expression 🡪NUMBER

| expression + NUMBER

| expression − NUMBER

46

Hathal & Ahmad

*

47 of 68

Yacc

47

Hathal & Ahmad

*

48 of 68

Shift/Reduce Parsing

  • As the parser reads tokens, each time it reads a token that doesn’t complete a rule it pushes the token on an internal stack and switches to a new state reflecting the token it just read-shift
  • When it has found all the symbols that constitute the right-hand side of a rule, it pops the right-hand side symbols off the stack, pushes the left-hand side symbol onto the stack, and switches to a new state reflecting the new symbol on the stack.-reduction
  • fred = 12 + 13”
  • fred
  • fred =
  • fred = 12
  • fred = 12 +
  • fred = 12 + 13 fred = expression
  • “statement → NAME = expression”, so it pops fred, =, and expression and replaces them with statement

48

Hathal & Ahmad

*

49 of 68

Yacc

49

Hathal & Ahmad

*

50 of 68

What Yacc cannot parse

  • It cannot deal with ambiguous grammars.
  • Consider this ex‐ tremely contrived example:

phrase cart_animal AND CART

| work_animal AND PLOW

cart_animal HORSE | GOAT

work_animal HORSE | OX

If we changed the first rule to this:

phrase cart_animal CART

| work_animal PLOW

50

Hathal & Ahmad

*

51 of 68

Yacc parser

  • Input to yacc is divided into three sections.

... definitions ...

%%

... rules ...

%%

... subroutines ...

51

Hathal & Ahmad

*

52 of 68

Yacc

  • The definitions section consists of:
    • token declarations .
    • C declaration code bracketed by “%{“ and “%}”.
    • Start symbol
    • %token NAME NUMBER

%%

    • the rules section consists of:
      • %token NAME NUMBER
      • %%
      • statement: NAME '=' expression
      • | expression ;
      • expression: NUMBER '+' NUMBER
      • | NUMBER '−' NUMBER
      • ;
  • the subroutines section consists of:
    • user subroutines .

52

Hathal & Ahmad

*

53 of 68

Symbol values and Actions

  • Every symbol in a yacc parser has a value-
  • a number, the value would be the particular number
  • a literal text string, the value would probably be a pointer to a copy of the string.
  • a variable in a program, the value would be a pointer to a symbol table entry describing the variable
  • Non-terminal symbols can have any values
  • . If you have multiple value types, you have to list all the value types used in a parser so that yacc can create a C union typedef called YYSTYPE to contain them
  • statement: NAME '=' expression |
  • expression { printf("= %d\n", $1); }

;

  • expression: expression '+' NUMBER { $$ = $1 + $3; }
  • | expression '−' NUMBER { $$ = $1 − $3; }
  • | NUMBER { $$ = $1; }
  • ;

53

Hathal & Ahmad

*

54 of 68

The Lexer

54

Hathal & Ahmad

*

55 of 68

Running Simple Parser

55

Hathal & Ahmad

*

56 of 68

Arithmetic Expressions and Ambiguity

56

Hathal & Ahmad

*

57 of 68

Arithmetic Expressions and Ambiguity

57

Hathal & Ahmad

*

58 of 68

Arithmetic Expressions and Ambiguity

58

Hathal & Ahmad

Explicitly by:

%left '+' '-‘

%left '*' '/‘

%nonassoc UMINUS

When not to use prescedence rule??

two situations: in expression grammars, and to resolve the “dangling else” conflict in grammars for if-then-else language con‐ structs.

*

59 of 68

Variables and Typed Tokens

59

Hathal & Ahmad

*

60 of 68

Variables and Typed Tokens

60

Hathal & Ahmad

*

61 of 68

Variables and Typed Tokens

61

Hathal & Ahmad

To define the possible symbol types, in the definition section we add a %union decla‐ ration:

%union { double dval;

int vblno;

}

Here is the y.tab.h generated from this grammar: #define NAME 257

#define NUMBER 258

#define UMINUS 259

typedef union {

double dval;

int vblno;

}YYSTYPE;

extern YYSTYPE yylval;

*

62 of 68

Symbol Table

Hathal & Ahmad

62

*

63 of 68

Symbol Table

Hathal & Ahmad

63

*

64 of 68

Symbol Table

Hathal & Ahmad

64

*

65 of 68

Functions and reserve words

Hathal & Ahmad

65

*

66 of 68

Reserve words in symbol table

struct symtab {

char *name;

double (*funcptr)();

double value;

} symtab[NSYMS];

Hathal & Ahmad

66

*

67 of 68

Reserve words in symbol table

Hathal & Ahmad

67

*

68 of 68

Reserve words in symbol table

Hathal & Ahmad

68

*