1 of 76

COMP 520: Compilers!

Lecture 1: Intro to PA1 + Scanning

1

2 of 76

Announcements + Logistics

  • Programming Assignment 1 is Live!

2

3 of 76

The course project: a preview (1/7)

  • Goal: Build a compiler that accepts a large subset of Java and outputs a binary file that runs natively on modern Linux.
  • Java normally compiles to JVM bytecode and is cross-platform; we are targeting x86 Linux instead.
  • Your project will compile Java code, and the output will run on modern x86 machines!

3

4 of 76

The course project: a preview (2/7)

  • Five milestones, each associated with a Programming Assignment (or PA for short)
    • Each worth 10% of your final grade!

  1. PA1- Syntactic Analysis
  2. PA2- ASTs/Parsing
  3. PA3- Contextual Analysis
  4. PA4- Code Generation
  5. PA5- Final Submission

4

5 of 76

The course project: a preview (3/7)

  • PA1- Syntactic Analysis
  • Scan and interpret the source code as Tokens
  • Syntax: some tokens must be followed by particular tokens; was the expected token given?

  • PA1- Syntactic Analysis
  • PA2- ASTs
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

5

Token   Token Type
>       Relational operator
&&      Logical operator
y       Identifier

6 of 76

The course project: a preview (4/7)

  • PA2 - Abstract Syntax Trees

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

6

7 of 76

The course project: a preview (5/7)

  • PA3 - Contextual Analysis
  • Context is important!
  • Variables can only be used after they are declared.
  • Type-checking

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

7

8 of 76

The course project: a preview (6/7)

  • PA4 - Code Generation
  • Generate ELF files
  • Generate simple x86
  • Your compiler now generates machine code that will run on real hardware!

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

8

9 of 76

The course project: a preview (7/7)

  • PA5- Documentation & Extra Credit Opportunities
  • Properly document your compiler.
  • Add features for extra credit.

  • PA1- Syntactic Analysis
  • PA2- ASTs/Parsing
  • PA3- Contextual Analysis
  • PA4- Code Generation
  • PA5- Final Submission

9

10 of 76

Programming Assignment 1

  • Programming Assignment 1 is Live!
  • PA1 has two parts: Lexical Analysis and Syntax Checking.
  • Start early, and visit office hours if you have any questions.
  • Due: 7/2/24 at 11:59pm

10

11 of 76

Goals

  • miniJava is a subset of the Java programming language.
  • A miniJava program is a valid Java program.
  • Semester Goal: Create a compiler that can take in a miniJava source code file and output a binary file that can be executed.

11

12 of 76

Review: Java Runtime (1/2)

  • Java compiles to bytecode (.class files), which runs on the JVM

12

13 of 76

Review: Java Runtime (2/2)

  • This allows Java programs to run “cross-platform”
  • The JVM interprets the compiled code and runs it on the underlying OS.

13

14 of 76

miniJava

  • The miniJava compiler will NOT output JVM bytecode.
  • Instead, the compiler will output native machine code that runs directly on a modern x86_64 Linux operating system.
  • We lose cross-platform capabilities
  • We gain learning how compilers target physical processors/OSes

14

15 of 76

Lexical vs. Syntactic Analysis

Lexical Analysis

  • Create the language’s lexicon
  • The lexical unit is the Token
  • Reads a few characters at a time and returns a Token
  • Responsible for removing whitespace and comments
  • Output is a stream of tokens (PA1)

Syntactic Analysis

  • Input is a stream of tokens
  • Only care about syntax
  • This analyzer doesn’t see whitespace or comments
  • Only concerned with code syntax
  • Output is an AST (PA2)

15

16 of 76

Whitespace!

  • Consider the following C code fragments:
    • --x   x--   x-y   -x
  • Pre-decrement, post-decrement, subtraction, negation

16

17 of 76

Scanning C++

  • Consider scanning the operator >> in C++ code.
  • Operator >>: it can be the right-shift / stream-extraction operator, or the two closing brackets of a nested template type such as vector<vector<int>>.
  • Question: How would you correctly scan >> ?

17

18 of 76

Lexical Analysis

18

19 of 76

Lexical Analysis: Goal

  • The input is just a string of characters:
    • “\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;”
  • Goal: Partition input string into substrings
    • Where the substrings are called tokens

19

if (i == j)
    z = 0;
else
    z = 1;

20 of 76

Tokens

  • A syntactic category
    • In English: a noun, verb, adjective
    • In a programming language: Identifier, Integer, Keyword, Whitespace
  • A token class corresponds to a set of strings
  • Classify program substrings according to role
  • Lexical analysis produces a stream of tokens which is input to the parser
  • Parser relies on token distinctions
    • Ex: An identifier is treated differently than a keyword

20

21 of 76

Designing a Lexer: Step 1

  • Define a finite set of tokens
    • Tokens describe all items of interest: Identifiers, integers, keywords
    • Choice of tokens depends on language + design of parser

21

22 of 76

Back to our example…

  • \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
  • Goal: Partition input string into substrings
    • Where the substrings are called tokens

22

if (i == j)
    z = 0;
else
    z = 1;

23 of 76

PA1- TokenType.java

  • This is the equivalent of Java’s TokenKind class (more on this soon)
  • In the TokenType enumeration, we list the possible types of tokens we want to stream to our syntactic analyzer.
  • But what are the types of tokens that we have?

23

24 of 76

Taking a look at Java’s TokenKind

  • Pretty much everything is in there!
    • If you want, you can organize your TokenType similarly!

24

25 of 76

Designing a Lexer: Step 2

  • Describe which strings belong to each token

25

26 of 76

Lexical Analyzer: Implementation Requirements

  1. Classify each substring as a token
  2. Return the lexeme (the matched text) of the token

  • Lexer returns token-lexeme pairs

26

27 of 76

“Lookahead”

  • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
  • “Lookahead” may be required to decide where one token ends and the next token begins

Examples:

  • i vs. if
  • = vs. ==
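A minimal sketch of how a scanner might handle the = vs. == case with one character of lookahead; _currentChar, nextChar(), and the TokenType values are hypothetical names, not from the starter code:

    // Sketch: one character of lookahead distinguishes '=' from '=='.
    // _currentChar and nextChar() are assumed scanner helpers.
    private Token scanEqualsOrAssign() {
        nextChar();                         // consume the first '='
        if (_currentChar == '=') {
            nextChar();                     // consume the second '='
            return new Token(TokenType.OPERATOR, "==");
        }
        return new Token(TokenType.OPERATOR, "=");  // your design may give '=' its own type
    }

The i vs. if case is usually handled differently: scan the whole run of letters first, then check whether the finished string is a keyword.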

27

28 of 76

Summary: Lexing

  • The goal of lexical analysis is to
    • Partition the input string into lexemes
    • Identify the token of each lexeme
  • Left-to-right scan => lookahead sometimes required

28

29 of 76

Expressing Syntax

29

30 of 76

Review: Languages

  • Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
  • Languages are sets of strings.

30

31 of 76

Languages Covered in 455

31

32 of 76

Regular Languages

  • Simple and useful theory
  • Easy to understand
  • Efficient implementations

32

33 of 76

Regular Expressions

  • Can anyone give me the regular expression for a US telephone number?

33

34 of 76

Regular Expressions

  • Can anyone give me the regular expression for a US telephone number?
    • 1?\h?(\d){3}\h*-?\h*(\d){3}\h*-?\h*(\d){4}
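In Java (the language of the course project), the slide’s pattern can be tried out like this; the class name and test numbers are just for illustration:

    import java.util.regex.Pattern;

    public class PhoneRegexDemo {
        public static void main(String[] args) {
            // \h matches a horizontal whitespace character (Java 8+).
            Pattern phone = Pattern.compile("1?\\h?(\\d){3}\\h*-?\\h*(\\d){3}\\h*-?\\h*(\\d){4}");
            System.out.println(phone.matcher("919-555-0123").matches());  // true
            System.out.println(phone.matcher("hello").matches());         // false
        }
    }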

34

35 of 76

Regular Expressions

  • What about email addresses?

35

36 of 76

Regular Expressions

  • What about email addresses?
    • A good guess: [\w-\._]+@[\w-]+\.[\w-]+

36


38 of 76

RFC-822 Compliant Email Regex

/(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)/

38

39 of 76

Better option? Yes!

  • Formally describing the syntax of a programming language is best done with a context-free grammar
  • Additional lexical rules are needed (comments, whitespace, etc.)
  • Note that identifiers cannot be reserved words: the lexical analyzer will emit such words with their reserved TokenType, and the syntactic analyzer will report an error when it gets a reserved word where an identifier was expected.

39

40 of 76

Context Free Grammars

40

41 of 76

CFGs

  • CFGs are an excellent way to describe the syntax of a language.
  • They are clear, easier to understand than regular expressions, and give you a bit more flexibility

41

42 of 76

CFGs

  • CFGs are an excellent way to describe the syntax of a language.
  • Consider the following definition of a sentence:

42

Sentence  ::= Subject Predicate Object
Subject   ::= Article Noun
Predicate ::= Verb
Object    ::= Article Noun
Article   ::= a | the
Noun      ::= dog | cat | mice
Verb      ::= chase | chases

43 of 76

Language generated from CFG

  • The language generated from our CFG is a set of sentences
  • Each sentence can be generated by repeated application of the rules (see the example derivation after the grammar below)

43

Sentence  ::= Subject Predicate Object
Subject   ::= Article Noun
Predicate ::= Verb
Object    ::= Article Noun
Article   ::= a | the
Noun      ::= dog | cat | mice
Verb      ::= chase | chases
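For example, “the dog chases a cat” can be derived by repeatedly rewriting the leftmost nonterminal:

    Sentence => Subject Predicate Object
             => Article Noun Predicate Object
             => the Noun Predicate Object
             => the dog Predicate Object
             => the dog Verb Object
             => the dog chases Object
             => the dog chases Article Noun
             => the dog chases a cat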

44 of 76

Recursion

  • What if we want to allow “the dog and the cat” as the subject?
  • Currently it is:

Sentence ::= Subject Predicate Object
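One way to allow it (a sketch; the next slide’s actual rule may differ) is to make Subject recursive:

    Subject ::= Article Noun | Subject and Article Noun

The recursive alternative lets Subject expand to “the dog and the cat”, “the dog and the cat and the mice”, and so on.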

44

45 of 76

Recursion

45

46 of 76

Limitations of CFGs

  • Can generate incorrect sentences.
  • “The cat chase a mice”
  • Thus, simple CFGs are suitable for describing syntax but not context: “chase” contextually needs to be “chases”, and “mice” cannot be preceded by “a”.

46

47 of 76

MiniTriangle

47

48 of 76

MiniTriangle Language

  • Expressions for mini-Triangle, for simple arithmetic operations
  • Exp ::= PrimExp | Exp Oper PrimExp
  • PrimExp ::= intlit | id | Oper PrimExp | ( Exp )
  • Oper ::= + | - | * | / | < | > | =

48

49 of 76

MiniTriangle Language

  • Commands for mini-Triangle, for simple arithmetic operations
  • Program ::= Cmd
  • Cmd ::= id := Exp | let Decl in Cmd
  • Decl ::= var id : type

49

50 of 76

Examples

  • Is this a valid command for mini-Triangle?
  • let var x: Integer in x := 5 + (2*10)
  • For each token, classify it according to the grammar.
    • Use Oper, intlit, id, let, : (colon), type, in, := (assign), var, lparen, rparen
  • Now group according to grammar symbols
    • What tokens constitute a Decl, Cmd, Exp, PrimExp?
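One possible classification and grouping (a sketch; check it against the grammar yourself):

    let(let)  var(var)  x(id)  :(colon)  Integer(type)  in(in)
    x(id)  :=(assign)  5(intlit)  +(Oper)  ((lparen)  2(intlit)  *(Oper)  10(intlit)  )(rparen)

    Decl:  var x : Integer
    Exp:   5 + (2*10)   (its PrimExps are 5 and (2*10); inside the parens, 2*10 is an Exp built from the PrimExps 2 and 10)
    Cmd:   x := 5 + (2*10), and the whole program is  let Decl in Cmd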

50

51 of 76

PA1

51

52 of 76

Simple Scanner/Parser Provided for miniTriangle

  • Courtesy of Dr. Prins, a simple Arithmetic Scanner and Parser is available on the course website.
  • This corresponds to arithmetic parsing in miniTriangle
  • You can use it as an example of how Scanners and Parsers are structured.

52

53 of 76

Starter Code

  • PA1 starter code is on the course website
  • You do not have to use the starter code!
  • The only requirement is that the Compiler class contains your main method and is in the miniJava package

53

54 of 76

ErrorReporter

  • Start with something easy: the ErrorReporter is an object that records errors as they arrive.
  • Complete the methods:
    • hasErrors - check whether the errorQueue is non-empty
    • outputErrors - iterate through the errorQueue and output all strings
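A minimal sketch of what that might look like, assuming the errors are kept in a List<String> named errorQueue (the actual field name and reportError signature in the starter code may differ):

    import java.util.ArrayList;
    import java.util.List;

    public class ErrorReporter {
        private final List<String> errorQueue = new ArrayList<>();

        // Record an error (PA5 extra credit: also take a line and column number).
        public void reportError(String message) {
            errorQueue.add(message);
        }

        public boolean hasErrors() {
            return !errorQueue.isEmpty();   // any recorded error means errors exist
        }

        public void outputErrors() {
            for (String error : errorQueue) {
                System.out.println(error);  // print every recorded error string
            }
        }
    }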

54

55 of 76

ErrorReporter -- Extra Credit

  • The error reporter contains one of the few PA5 extra credits that can be applied in PA1 without impacting the autograder.
  • Consider augmenting the error reporter by requiring any call to reportError to also specify a line and column number.
  • This will prove extremely helpful when debugging your code!

55

56 of 76

Token Class

  • Another easy class to complete is the Token class.
  • A token is classified by a String and a TokenType.
  • A token keeps this underlying string to represent the original text that produced its TokenType.
  • Implement the constructor, getTokenType, and getTokenText methods.
  • And thus ends the Java warmup; now we get into the fun stuff!
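A sketch of the Token class under these requirements; the constructor signature (TokenType first, String second) is an assumption about the starter code:

    // A token pairs a TokenType with the source text (lexeme) that produced it.
    public class Token {
        private final TokenType type;
        private final String text;   // the original source text

        public Token(TokenType type, String text) {
            this.type = type;
            this.text = text;
        }

        public TokenType getTokenType() {
            return type;
        }

        public String getTokenText() {
            return text;
        }
    }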

56

57 of 76

TokenType

  • What types of tokens do you want?
  • List these token types in the TokenType enumeration.
  • Note: You can optionally follow Java’s TokenKind implementation where nothing is consolidated, but it will make your Parser slightly lengthier.
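One possible starting point, assuming a consolidated design (one OPERATOR value rather than one value per operator); the exact set of values is up to you and these names are illustrative:

    public enum TokenType {
        IDENTIFIER,   // x, foo, MyClass
        INTLITERAL,   // 0, 42, 12345
        KEYWORD,      // class, while, if, this, ... (or one enum value per keyword)
        OPERATOR,     // + - * / > < == && ... (or one enum value per operator)
        LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET,
        SEMICOLON, COMMA, DOT,
        EOT,          // end of the token stream
        ERROR         // unrecognizable input
    }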

57

58 of 76

Compiler.java

  • Like the miniArith example, the PA1 starter files create a FileInputStream for the input file.
  • The input file path is given in args[0]
  • The autograder will always specify a file path, but it is good practice to check that the argument was actually supplied.
  • The Scanner object takes such an input stream to create tokens.
  • The Parser object takes a Scanner object and the ErrorReporter. It will report syntax errors.
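Putting that together, a sketch of Compiler.main; the Scanner and Parser constructors and the parse() entry point are assumptions based on the description above, and the Error/Success output follows the PA1 rules described at the end of this lecture:

    package miniJava;

    import java.io.FileInputStream;
    import java.io.IOException;

    public class Compiler {
        public static void main(String[] args) {
            ErrorReporter reporter = new ErrorReporter();

            // Check that an input file was actually supplied (the autograder always
            // supplies one, but the check is cheap; exact behavior here is up to you).
            if (args.length < 1) {
                System.out.println("Error");
                System.out.println("No input file specified");
                return;
            }

            try (FileInputStream in = new FileInputStream(args[0])) {
                Scanner scanner = new Scanner(in, reporter);    // assumed constructor
                Parser parser = new Parser(scanner, reporter);  // assumed constructor
                parser.parse();                                 // assumed entry point
            } catch (IOException e) {
                reporter.reportError("Could not read " + args[0] + ": " + e.getMessage());
            }

            if (reporter.hasErrors()) {
                System.out.println("Error");
                reporter.outputErrors();
            } else {
                System.out.println("Success");
            }
        }
    }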

58

59 of 76

PA1 - Scanner

59

60 of 76

Scanner Design
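As a rough illustration of the structure, here is a compact, self-contained sketch of a scanner core for a tiny token set, assuming the Token and TokenType classes sketched earlier; the real PA1 scanner must cover the full miniJava token set, comments, and multi-character operators:

    import java.io.IOException;
    import java.io.InputStream;

    public class SketchScanner {
        private final InputStream in;
        private int currentChar;   // -1 signals end of input

        public SketchScanner(InputStream in) throws IOException {
            this.in = in;
            currentChar = in.read();
        }

        public Token scan() throws IOException {
            // Skip whitespace; comments would be skipped here as well.
            while (currentChar == ' ' || currentChar == '\t'
                    || currentChar == '\r' || currentChar == '\n') {
                currentChar = in.read();
            }
            if (currentChar == -1) {
                return new Token(TokenType.EOT, "");
            }
            if (Character.isLetter(currentChar)) {
                StringBuilder sb = new StringBuilder();
                while (Character.isLetterOrDigit(currentChar)) {
                    sb.append((char) currentChar);
                    currentChar = in.read();
                }
                // A real scanner would check sb against the keyword list here.
                return new Token(TokenType.IDENTIFIER, sb.toString());
            }
            if (Character.isDigit(currentChar)) {
                StringBuilder sb = new StringBuilder();
                while (Character.isDigit(currentChar)) {
                    sb.append((char) currentChar);
                    currentChar = in.read();
                }
                return new Token(TokenType.INTLITERAL, sb.toString());
            }
            // Single-character operators/punctuation; multi-character operators
            // (==, &&, ...) need the one-character lookahead shown earlier.
            char c = (char) currentChar;
            currentChar = in.read();
            return new Token(TokenType.OPERATOR, String.valueOf(c));
        }
    }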

60


67 of 76

PA1 - Parser

67

68 of 76

Parser

  • Consider a conveyor belt of Tokens provided by the Scanner

68

69 of 76

Parser

  • Assume we are parsing the grammar rule: Reference ::= id | this | Reference . id
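A sketch of how a recursive-descent parser might handle this rule; the left recursion becomes a loop, accept(...) is the “expect this token or report an error” helper sketched a couple of slides later, and _currentToken and the TokenType names are assumptions:

    // Sketch: Reference ::= id | this | Reference . id
    // parsed iteratively as: Reference ::= (id | this) (. id)*
    private void parseReference() {
        if (_currentToken.getTokenType() == TokenType.THIS) {
            accept(TokenType.THIS);            // assumes a per-keyword token type
        } else {
            accept(TokenType.IDENTIFIER);
        }
        while (_currentToken.getTokenType() == TokenType.DOT) {
            accept(TokenType.DOT);
            accept(TokenType.IDENTIFIER);
        }
    }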

69

70 of 76

Error Checking

  • The parser may “want” a certain TokenType, but the scanner provided a different one!
  • What does this look like, and how should the Compiler proceed?
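One common approach is an accept (or expect) helper that advances on a match and otherwise records the error and throws; this is a sketch, and the exception class and field names (_currentToken, _scanner, _errorReporter) are illustrative, not prescribed by the starter code:

    // Sketch: check the current token against what the grammar expects.
    static class SyntaxError extends RuntimeException {}

    private void accept(TokenType expected) {
        if (_currentToken.getTokenType() == expected) {
            _currentToken = _scanner.scan();   // matched: advance to the next token
        } else {
            _errorReporter.reportError("Expected " + expected + " but got "
                    + _currentToken.getTokenType());
            throw new SyntaxError();           // unwind out of the parse methods
        }
    }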

70

71 of 76

Parser Errors

  • Consider: Type ::= int | boolean | id | (int|id)[]
  • ParameterList ::= Type id (,Type id)*

71


75 of 76

Error! Expected Identifier, but got IntToken

  • Should we go to the end of the ParameterList?
    • How would we know we ended the ParameterList?
  • It would involve scanning until you find an RParen “)” because ParameterList always has Parens around it.
    • But that sounds like it can get complicated quickly….
  • What if we only report the first error?
    • After all, a syntax error means the program is non-functional, so why check the remaining program at all?

75

76 of 76

When a Syntax Error Happens

  • Use exception handlers to “unwind” and get out of however deep you are in your parse methods.
  • Finally, in PA1, if any errors exist, print “Error” on ONE line (println), THEN output any errors that are relevant to debugging the input code.
  • If there are no errors, output “Success” on ONE line (println)
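A sketch of how the parser’s entry point might tie this together, assuming the SyntaxError exception from the accept sketch and a parseProgram method for the top grammar rule (both names are illustrative); Compiler.main then just checks hasErrors() and prints “Error” or “Success” as described above:

    // Sketch: the entry point catches the unwinding SyntaxError. The error text
    // was already recorded in the ErrorReporter before the throw.
    public void parse() {
        try {
            _currentToken = _scanner.scan();   // prime the token stream
            parseProgram();                    // assumed name of the top grammar rule
        } catch (SyntaxError e) {
            // Nothing more to do here; Compiler.main will report the errors.
        }
    }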

76