1 of 76

COMP 520: Compilers!

Lecture 1: Intro to PA1 + Scanning

1

2 of 76

Announcements + Logistics

  • Programming Assignment 1 is Live!

2

3 of 76

The course project: a preview (1/7)

  • Goal: Build a compiler that accepts a large subset of Java and outputs a binary file that runs natively on modern Linux.
  • Java normally compiles to JVM bytecode and is cross-platform; we are targeting x86 Linux instead.
  • Your project will compile Java code, and the output will run on modern x86 machines!

3

4 of 76

The course project: a preview (2/7)

  • Five milestones, each associated with a Programming Assignment (or PA for short)
    • Each worth 10% of your final grade!

  1. PA1- Syntactic Analysis
  2. PA2- ASTs/Parsing
  3. PA3- Contextual Analysis
  4. PA4- Code Generation
  5. PA5- Final Submission

4

5 of 76

The course project: a preview (3/7)

  • PA1- Syntactic Analysis
  • Scan and interpret the source code as Tokens
  • Syntax: some tokens must be followed by particular tokens; was the expected token given?

  • PA1- Syntactic Analysis
  • PA2- ASTs
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

5

Token   Token Type
>       Relational operator
&&      Logical operator
y       Identifier

6 of 76

The course project: a preview (4/7)

  • PA2 - Abstract Syntax Trees

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

6

7 of 76

The course project: a preview (5/7)

  • PA3 - Contextual Analysis
  • Context is important!
  • Variables can only be used after they are declared.
  • Type-checking

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

7

8 of 76

The course project: a preview (6/7)

  • PA4 - Code Generation
  • Generate ELF files
  • Generate simple x86
  • Your compiler now generates machine code that will run on real hardware!

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

8

9 of 76

The course project: a preview (7/7)

  • PA5- Documentation & Extra Credit Opportunities
  • Properly document your compiler.
  • Add features for extra credit.

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

9

10 of 76

Programming Assignment 1

  • Programming Assignment 1 is Live!
  • PA1 has two parts: Lexical Analysis and Syntax Checking.
  • Start early, and visit office hours if you have any questions.
  • Due: 7/2/24 at 11:59pm

10

11 of 76

Goals

  • miniJava is a subset of the Java programming language.
  • A miniJava program is a valid Java program.
  • Semester Goal: Create a compiler that can take in a miniJava source code file and output a binary file that can be executed.

11

12 of 76

Review: Java Runtime (1/2)

  • Java compiles to bytecode (.class files), which runs on the JVM

12

13 of 76

Review: Java Runtime (2/2)

  • This allows Java programs to run “cross-platform”
  • The JVM interprets the compiled code and runs it on the underlying OS.

13

14 of 76

miniJava

  • The miniJava compiler will NOT output JVM bytecode.
  • Instead, the compiler will output native machine code that runs directly on a modern x86_64 Linux operating system.
  • We lose cross-platform capabilities
  • We gain learning how compilers target physical processors/OSes

14

15 of 76

Lexical vs. Syntactic Analysis

Lexical Analysis

  • Create the language’s lexicon
  • The lexical unit is the Token
  • Reads a few characters at a time and returns a Token
  • Responsible for removing whitespace and comments
  • Output is a stream of tokens (PA1)

Syntactic Analysis

  • Input is a stream of tokens
  • Only care about syntax
  • This analyzer doesn’t see whitespace or comments
  • Only concerned with code syntax
  • Output is an AST (PA2)

15

16 of 76

Whitespace!

  • Consider the following C code fragments:
    • --x   x--   x-y   -x
  • Pre-decrement, post-decrement, subtraction, negation

16

17 of 76

Scanning C++

  • Consider scanning the operator >> in C++ code.
  • Operator >>: it can be the right-shift / stream-extraction operator, or the two closing brackets of a nested template type such as vector<vector<int>>.
  • Question: How would you correctly scan >> ?

17

18 of 76

Lexical Analysis

18

19 of 76

Lexical Analysis: Goal

  • The input is just a string of characters:
    • “\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;”
  • Goal: Partition input string into substrings
    • Where the substrings are called tokens

19

if (i == j)
    z = 0;
else
    z = 1;

20 of 76

Tokens

  • A syntactic category
    • In English: a noun, verb, adjective
    • In a programming language: Identifier, Integer, Keyword, Whitespace
  • A token class corresponds to a set of strings
  • Classify program substrings according to role
  • Lexical analysis produces a stream of tokens which is input to the parser
  • Parser relies on token distinctions
    • Ex: An identifier is treated differently than a keyword

20

21 of 76

Designing a Lexer: Step 1

  • Define a finite set of tokens
    • Tokens describe all items of interest: Identifiers, integers, keywords
    • Choice of tokens depends on language + design of parser

21

22 of 76

Back to our example…

  • \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
  • Goal: Partition input string into substrings
    • Where the substrings are called tokens

22

if (i == j)
    z = 0;
else
    z = 1;

23 of 76

PA1- TokenType.java

  • This is the equivalent of Java’s TokenKind class (more on this soon)
  • In the TokenType enumeration, we list the possible types of tokens we want to stream to our syntactic analyzer.
  • But what are the types of tokens that we have?

23

24 of 76

Taking a look at Java’s TokenKind

  • Pretty much everything is in there!
    • If you want, you can organize your TokenType similarly!

24

25 of 76

Designing a Lexer: Step 2

  • Describe which strings belong to each token

25

26 of 76

Lexical Analyzer: Implementation Requirements

  1. Classify each substring as a token
  2. Return the lexeme (the matched text) of the token

  • Lexer returns token-lexeme pairs

26

27 of 76

“Lookahead”

  • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
  • “Lookahead” may be required to decide where one token ends and the next token begins

Examples:

  • i vs. if
  • = vs. ==
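A minimal sketch of how a scanner might handle the = vs. == case with one character of lookahead; _currentChar, nextChar(), and the TokenType values are hypothetical names, not from the starter code:

    // Sketch: one character of lookahead distinguishes '=' from '=='.
    // _currentChar and nextChar() are assumed scanner helpers.
    private Token scanEqualsOrAssign() {
        nextChar();                         // consume the first '='
        if (_currentChar == '=') {
            nextChar();                     // consume the second '='
            return new Token(TokenType.OPERATOR, "==");
        }
        return new Token(TokenType.OPERATOR, "=");  // your design may give '=' its own type
    }

The i vs. if case is usually handled differently: scan the whole run of letters first, then check whether the finished string is a keyword.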

27

28 of 76

Summary: Lexing

  • The goal of lexical analysis is to
    • Partition the input string into lexemes
    • Identify the token of each lexeme
  • Left-to-right scan => lookahead sometimes required

28

29 of 76

Expressing Syntax

29

30 of 76

Review: Languages

  • Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
  • Languages are sets of strings.

30

31 of 76

Languages Covered in 455

31

32 of 76

Regular Languages

  • Simple and useful theory
  • Easy to understand
  • Efficient implementations

32

33 of 76

Regular Expressions

  • Can anyone give me the regular expression for a US telephone number?

33

34 of 76

Regular Expressions

  • Can anyone give me the regular expression for a US telephone number?
    • 1?\h?(\d){3}\h*-?\h*(\d){3}\h*-?\h*(\d){4}
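In Java (the language of the course project), the slide’s pattern can be tried out like this; the class name and test numbers are just for illustration:

    import java.util.regex.Pattern;

    public class PhoneRegexDemo {
        public static void main(String[] args) {
            // \h matches a horizontal whitespace character (Java 8+).
            Pattern phone = Pattern.compile("1?\\h?(\\d){3}\\h*-?\\h*(\\d){3}\\h*-?\\h*(\\d){4}");
            System.out.println(phone.matcher("919-555-0123").matches());  // true
            System.out.println(phone.matcher("hello").matches());         // false
        }
    }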

34

35 of 76

Regular Expressions

  • What about email addresses?

35

36 of 76

Regular Expressions

  • What about email addresses?
    • A good guess: [\w-\._]+@[\w-]+\.[\w-]+

36


38 of 76

RFC-822 Compliant Email Regex

/(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)/

38

39 of 76

Better option? Yes!

  • Formally describing the syntax of a programming language is best done with a context-free grammar
  • Additional lexical rules are needed (comments, whitespace, etc.)
  • Note that identifiers cannot be reserved words: the lexical analyzer will emit such words with their reserved TokenType, and the syntactic analyzer will report an error when it gets a reserved word where an identifier was expected.

39

40 of 76

Context Free Grammars

40

41 of 76

CFGs

  • CFGs are an excellent way to describe the syntax of a language.
  • They are clear, easier to understand than regular expressions, and give you a bit more flexibility

41

42 of 76

CFGs

  • CFGs are an excellent way to describe the syntax of a language.
  • Consider the following definition of a sentence:

42

Sentence  ::= Subject Predicate Object
Subject   ::= Article Noun
Predicate ::= Verb
Object    ::= Article Noun
Article   ::= a | the
Noun      ::= dog | cat | mice
Verb      ::= chase | chases

43 of 76

Language generated from CFG

  • The language generated from our CFG is a set of sentences
  • Each sentence can be generated by repeated application of the rules (see the example derivation after the grammar below)

43

Sentence  ::= Subject Predicate Object
Subject   ::= Article Noun
Predicate ::= Verb
Object    ::= Article Noun
Article   ::= a | the
Noun      ::= dog | cat | mice
Verb      ::= chase | chases
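For example, “the dog chases a cat” can be derived by repeatedly rewriting the leftmost nonterminal:

    Sentence => Subject Predicate Object
             => Article Noun Predicate Object
             => the Noun Predicate Object
             => the dog Predicate Object
             => the dog Verb Object
             => the dog chases Object
             => the dog chases Article Noun
             => the dog chases a cat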

44 of 76

Recursion

  • What if we want to allow “the dog and the cat” as the subject?
  • Currently it is:

Sentence ::= Subject Predicate Object
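One way to allow it (a sketch; the next slide’s actual rule may differ) is to make Subject recursive:

    Subject ::= Article Noun | Subject and Article Noun

The recursive alternative lets Subject expand to “the dog and the cat”, “the dog and the cat and the mice”, and so on.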

44

45 of 76

Recursion

45

46 of 76

Limitations of CFGs

  • Can generate incorrect sentences.
  • “The cat chase a mice”
  • Thus, simple CFGs are suitable for describing syntax but not context: “chase” contextually needs to be “chases”, and “mice” cannot be preceded by “a”.

46

47 of 76

MiniTriangle

47

48 of 76

MiniTriangle Language

  • Expressions for mini-Triangle, for simple arithmetic operations
  • Exp ::= PrimExp | Exp Oper PrimExp
  • PrimExp ::= intlit | id | Oper PrimExp | ( Exp )
  • Oper ::= + | - | * | / | < | > | =

48

49 of 76

MiniTriangle Language

  • Commands for mini-Triangle, for simple arithmetic operations
  • Program ::= Cmd
  • Cmd ::= id := Exp | let Decl in Cmd
  • Decl ::= var id : type

49

50 of 76

Examples

  • Is this a valid command for mini-Triangle?
  • let var x: Integer in x := 5 + (2*10)
  • For each token, classify it according to the grammar.
    • Use Oper, intlit, id, let, : (colon), type, in, := (assign), var, lparen, rparen
  • Now group according to grammar symbols
    • What tokens constitute a Decl, Cmd, Exp, PrimExp?
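One possible classification and grouping (a sketch; check it against the grammar yourself):

    let(let)  var(var)  x(id)  :(colon)  Integer(type)  in(in)
    x(id)  :=(assign)  5(intlit)  +(Oper)  ((lparen)  2(intlit)  *(Oper)  10(intlit)  )(rparen)

    Decl:  var x : Integer
    Exp:   5 + (2*10)   (its PrimExps are 5 and (2*10); inside the parens, 2*10 is an Exp built from the PrimExps 2 and 10)
    Cmd:   x := 5 + (2*10), and the whole program is  let Decl in Cmd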

50

51 of 76

PA1

51

52 of 76

Simple Scanner/Parser Provided for miniTriangle

  • Courtesy of Dr. Prins, a simple Arithmetic Scanner and Parser is available on the course website.
  • This corresponds to arithmetic parsing in miniTriangle
  • You can use it as an example of how Scanners and Parsers are structured.

52

53 of 76

Starter Code

  • PA1 starter code is on the course website
  • You do not have to use the starter code!
  • The only requirement is that the Compiler class contains your main method and is in the miniJava package

53

54 of 76

ErrorReporter

  • Start with something easy: the ErrorReporter is an object that records errors as they arrive.
  • Complete the methods:
    • hasErrors - check whether the errorQueue is non-empty
    • outputErrors - iterate through the errorQueue and output all strings
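A minimal sketch of what that might look like, assuming the errors are kept in a List<String> named errorQueue (the actual field name and reportError signature in the starter code may differ):

    import java.util.ArrayList;
    import java.util.List;

    public class ErrorReporter {
        private final List<String> errorQueue = new ArrayList<>();

        // Record an error (PA5 extra credit: also take a line and column number).
        public void reportError(String message) {
            errorQueue.add(message);
        }

        public boolean hasErrors() {
            return !errorQueue.isEmpty();   // any recorded error means errors exist
        }

        public void outputErrors() {
            for (String error : errorQueue) {
                System.out.println(error);  // print every recorded error string
            }
        }
    }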

54

55 of 76

ErrorReporter -- Extra Credit

  • The error reporter contains one of the few PA5 extra credits that can be applied in PA1 without impacting the autograder.
  • Consider augmenting the error reporter by requiring any call to reportError to also specify a line and column number.
  • This will prove extremely helpful when debugging your code!

55

56 of 76

Token Class

  • Another easy class to complete is the Token class.
  • A token is classified by a String and a TokenType.
  • A token keeps this underlying string to represent the original text that produced its TokenType.
  • Implement the constructor, getTokenType, and getTokenText methods.
  • And thus ends the Java warmup; now we get into the fun stuff!
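A sketch of the Token class under these requirements; the constructor signature (TokenType first, String second) is an assumption about the starter code:

    // A token pairs a TokenType with the source text (lexeme) that produced it.
    public class Token {
        private final TokenType type;
        private final String text;   // the original source text

        public Token(TokenType type, String text) {
            this.type = type;
            this.text = text;
        }

        public TokenType getTokenType() {
            return type;
        }

        public String getTokenText() {
            return text;
        }
    }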

56

57 of 76

TokenType

  • What types of tokens do you want?
  • List these token types in the TokenType enumeration.
  • Note: You can optionally follow Java’s TokenKind implementation where nothing is consolidated, but it will make your Parser slightly lengthier.
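One possible starting point, assuming a consolidated design (one OPERATOR value rather than one value per operator); the exact set of values is up to you and these names are illustrative:

    public enum TokenType {
        IDENTIFIER,   // x, foo, MyClass
        INTLITERAL,   // 0, 42, 12345
        KEYWORD,      // class, while, if, this, ... (or one enum value per keyword)
        OPERATOR,     // + - * / > < == && ... (or one enum value per operator)
        LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET,
        SEMICOLON, COMMA, DOT,
        EOT,          // end of the token stream
        ERROR         // unrecognizable input
    }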

57

58 of 76

Compiler.java

  • Like the miniArith example, the PA1 starter files create a FileInputStream for the input file.
  • The input file path is given in args[0]
  • The autograder will always specify a file path, but it is good practice to check that the argument was actually supplied.
  • The Scanner object takes such an input stream to create tokens.
  • The Parser object takes a Scanner object and the ErrorReporter. It will report syntax errors.
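Putting that together, a sketch of Compiler.main; the Scanner and Parser constructors and the parse() entry point are assumptions based on the description above, and the Error/Success output follows the PA1 rules described at the end of this lecture:

    package miniJava;

    import java.io.FileInputStream;
    import java.io.IOException;

    public class Compiler {
        public static void main(String[] args) {
            ErrorReporter reporter = new ErrorReporter();

            // Check that an input file was actually supplied (the autograder always
            // supplies one, but the check is cheap; exact behavior here is up to you).
            if (args.length < 1) {
                System.out.println("Error");
                System.out.println("No input file specified");
                return;
            }

            try (FileInputStream in = new FileInputStream(args[0])) {
                Scanner scanner = new Scanner(in, reporter);    // assumed constructor
                Parser parser = new Parser(scanner, reporter);  // assumed constructor
                parser.parse();                                 // assumed entry point
            } catch (IOException e) {
                reporter.reportError("Could not read " + args[0] + ": " + e.getMessage());
            }

            if (reporter.hasErrors()) {
                System.out.println("Error");
                reporter.outputErrors();
            } else {
                System.out.println("Success");
            }
        }
    }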

58

59 of 76

PA1 - Scanner

59

60 of 76

Scanner Design
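As a rough illustration of the structure, here is a compact, self-contained sketch of a scanner core for a tiny token set, assuming the Token and TokenType classes sketched earlier; the real PA1 scanner must cover the full miniJava token set, comments, and multi-character operators:

    import java.io.IOException;
    import java.io.InputStream;

    public class SketchScanner {
        private final InputStream in;
        private int currentChar;   // -1 signals end of input

        public SketchScanner(InputStream in) throws IOException {
            this.in = in;
            currentChar = in.read();
        }

        public Token scan() throws IOException {
            // Skip whitespace; comments would be skipped here as well.
            while (currentChar == ' ' || currentChar == '\t'
                    || currentChar == '\r' || currentChar == '\n') {
                currentChar = in.read();
            }
            if (currentChar == -1) {
                return new Token(TokenType.EOT, "");
            }
            if (Character.isLetter(currentChar)) {
                StringBuilder sb = new StringBuilder();
                while (Character.isLetterOrDigit(currentChar)) {
                    sb.append((char) currentChar);
                    currentChar = in.read();
                }
                // A real scanner would check sb against the keyword list here.
                return new Token(TokenType.IDENTIFIER, sb.toString());
            }
            if (Character.isDigit(currentChar)) {
                StringBuilder sb = new StringBuilder();
                while (Character.isDigit(currentChar)) {
                    sb.append((char) currentChar);
                    currentChar = in.read();
                }
                return new Token(TokenType.INTLITERAL, sb.toString());
            }
            // Single-character operators/punctuation; multi-character operators
            // (==, &&, ...) need the one-character lookahead shown earlier.
            char c = (char) currentChar;
            currentChar = in.read();
            return new Token(TokenType.OPERATOR, String.valueOf(c));
        }
    }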

60


67 of 76

PA1 - Parser

67

68 of 76

Parser

  • Consider a conveyor belt of Tokens provided by the Scanner

68

69 of 76

Parser

  • Assume we are parsing the grammar rule: Reference ::= id | this | Reference . id
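A sketch of how a recursive-descent parser might handle this rule; the left recursion becomes a loop, accept(...) is the “expect this token or report an error” helper sketched a couple of slides later, and _currentToken and the TokenType names are assumptions:

    // Sketch: Reference ::= id | this | Reference . id
    // parsed iteratively as: Reference ::= (id | this) (. id)*
    private void parseReference() {
        if (_currentToken.getTokenType() == TokenType.THIS) {
            accept(TokenType.THIS);            // assumes a per-keyword token type
        } else {
            accept(TokenType.IDENTIFIER);
        }
        while (_currentToken.getTokenType() == TokenType.DOT) {
            accept(TokenType.DOT);
            accept(TokenType.IDENTIFIER);
        }
    }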

69

70 of 76

Error Checking

  • The parser may “want” a certain TokenType, but the scanner provided a different one!
  • What does this look like, and how should the Compiler proceed?
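One common approach is an accept (or expect) helper that advances on a match and otherwise records the error and throws; this is a sketch, and the exception class and field names (_currentToken, _scanner, _errorReporter) are illustrative, not prescribed by the starter code:

    // Sketch: check the current token against what the grammar expects.
    static class SyntaxError extends RuntimeException {}

    private void accept(TokenType expected) {
        if (_currentToken.getTokenType() == expected) {
            _currentToken = _scanner.scan();   // matched: advance to the next token
        } else {
            _errorReporter.reportError("Expected " + expected + " but got "
                    + _currentToken.getTokenType());
            throw new SyntaxError();           // unwind out of the parse methods
        }
    }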

70

71 of 76

Parser Errors

  • Consider: Type ::= int | boolean | id | (int|id)[]
  • ParameterList ::= Type id (,Type id)*

71


75 of 76

Error! Expected Identifier, but got IntToken

  • Should we go to the end of the ParameterList?
    • How would we know we ended the ParameterList?
  • It would involve scanning until you find an RParen “)” because ParameterList always has Parens around it.
    • But that sounds like it can get complicated quickly….
  • What if we only report the first error?
    • After all, a syntax error means the program is non-functional, so why check the remaining program at all?

75

76 of 76

When a Syntax Error Happens

  • Use exception handlers to “unwind” and get out of however deep you are in your parse methods.
  • Finally, in PA1, if any errors exist, print “Error” on ONE line (println), THEN output any errors that are relevant to debugging the input code.
  • If there are no errors, output “Success” on ONE line (println)
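A sketch of how the parser’s entry point might tie this together, assuming the SyntaxError exception from the accept sketch and a parseProgram method for the top grammar rule (both names are illustrative); Compiler.main then just checks hasErrors() and prints “Error” or “Success” as described above:

    // Sketch: the entry point catches the unwinding SyntaxError. The error text
    // was already recorded in the ErrorReporter before the throw.
    public void parse() {
        try {
            _currentToken = _scanner.scan();   // prime the token stream
            parseProgram();                    // assumed name of the top grammar rule
        } catch (SyntaxError e) {
            // Nothing more to do here; Compiler.main will report the errors.
        }
    }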

76