1 of 80

CS 3200 - Fall 2024

The Relational Model of Data

Mark Fontenot, PhD

Northeastern University

These notes are based upon and adapted via the generosity of Prof. Nate Derbinsky.

2 of 80

Reminders

  • Join Slack & CampusWire!!
  • Make sure you can see the course in GradeScope

  • Mini HW00
    • Early EC Deadline: Sunday 9/8 11:59 pm
    • Regular Deadline: Monday 9/10 11:59 pm

2

3 of 80

Applications and Data

  • Many applications are data intensive compared to compute intensive.
  • Many apps need to…
    • store data for itself or another app to use in the future (databases)
    • remember the results of prior expensive operations (caches)
    • allow users to search for data efficiently (indexes)
    • send messages with data to another application to handle (stream processing)
    • periodically process a large accumulation of data (batch processing)

3

From Kleppmann, Designing Data Intensive Applications, O’Reilly, 2017.

4 of 80

Stores of data?

4

5 of 80

What is a database?

  • Structured collection of related data
    • usually related to something in the real world
    • usually created…
      • for a specific group of users
      • to help these users perform some kind of tasks
      • to hopefully complete those tasks with some performance, redundancy, concurrency, and/or security considerations in mind

5

6 of 80

Notion

  • Intended users?
  • Intended tasks?
  • Considerations of
    • Performance?
    • Concurrency?
    • Redundancy?
    • Security?

6

7 of 80

Database Management Systems (DBMS)

  • Software that allow the creation and maintenance of databases
    • Support the encoding of some type of structure for the data
    • Persists the data
    • Support adding new data and updating existing data
    • Protects against failures and unauthorized access

7

8 of 80

Some Categories of DBMSs

Document-Oriented Databases

- Organizes and queries data based on the concept of a “document” often in JSON

- usually considered semi-structured

Graph Databases

- Organizes data by nodes, edges, labels

- Query about paths between nodes and node relationships

8

9 of 80

Some Categories of DBMSs

Key-Value Databases

- Everything is a key/value pair

- Based on associative array

Spatial Databases

- Stores data related to 2D/3D locations

- Query example: are 2 cars about to collide?

9

10 of 80

Some Categories of DBMSs

Vector Databases

  • unit of storage is a vector represent high-dimensional data
  • highly performant similarity searches
  • used extensively in LLMs

10

11 of 80

The category for this course…

  • Relational Database Management Systems
    • Based on storing data in tables and connections between those tables
    • Original concept developed in early 70s by EF Codd and colleagues

11

12 of 80

Relational Model of Data:

Overview

12

13 of 80

The Relational Database: Relation/Table

13

Image borrowed from this Medium post.

Customers

Relation/Table Name Attributes/Columns

Rows/

Tuples

Relation - the core construct in a relational database.

Relation Schema - represents the attributes and data types for a particular relation

Relation Instance - represents the state of the data in the relation at a particular point in time.

Row/Tuple - values for each relation’s attribute for one element of a relation instance

14 of 80

The Relational Database: Constraints

14

Image borrowed from this Medium post.

cannot be

null

Must be a valid dorm id in the dorm relation

Example

Constraints:

Constraint - conditions that must hold on all valid relation instances

Types:

  • Key Constraints
  • Entity Integrity Constraints
  • Referential Integrity constraints

ID

Name

Phone

Dorm

Age

GPA

1123141

Mark

555-1234

1

19

3.21

2323411

Kim

555-9876

2

25

3.53

17642352

Sam

555-6758

1

19

3.25

Student

15 of 80

The Relational Database: Relationships

15

Name

ID

Phone

Dorm

Age

GPA

Mark

1123141

555-1234

1

19

3.21

Kim

2323411

555-9876

2

25

3.53

Sam

17642352

555-6758

1

19

3.25

Dorm

Name

1

555 Huntington

2

Baker

ID

Class

1123141

COMP355

2323411

COMP355

17642352

MATH650

1123141

MATH650

2323411

BIOL110

Student

Dorm

Class

Some values in one table are related (by design) to values in another table

Referential Integrity Constraint Examples

16 of 80

The Relational Database: Queries

16

Name

ID

Phone

Dorm

Age

GPA

Mark

1123141

555-1234

1

19

3.21

Kim

2323411

555-9876

2

25

3.53

Sam

17642352

555-6758

1

19

3.25

Dorm

Name

1

555 Huntington

2

Baker

ID

Class

1123141

COMP355

2323411

COMP355

17642352

MATH650

1123141

MATH650

2323411

BIOL110

Student

Dorm

Class

Questions (Queries):

“Provide a list of student names and IDs with GPA between 3.0 and 3.5.”

17 of 80

The Relational Database: Queries

17

Name

ID

Phone

Dorm

Age

GPA

Mark

1123141

555-1234

1

19

3.21

Kim

2323411

555-9876

2

25

3.53

Sam

17642352

555-6758

1

19

3.25

Dorm

Name

1

555 Huntington

2

Baker

ID

Class

1123141

COMP355

2323411

COMP355

17642352

MATH650

1123141

MATH650

2323411

BIOL110

Student

Dorm

Class

Questions (Queries):

“Which students live in Baker Dorm?”

18 of 80

Databases in the Context of a Software System

18

Database

Tier

Application

Tier

Client

Tier

19 of 80

The Relational Model of Data:�Digging In

19

20 of 80

History 101

Codd, Edgar F. "A relational model of data for large shared data banks." Communications of the ACM 13.6 (1970): 377-387.

20

“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation)…”

21 of 80

History 101

  • The relational model provides a formal mathematical basis for the structure of and interaction with a relational database
    • Based on set theory and first-order predicate logic
    • The formal basis allows for robust scientific development of the model.
  • The (eternal) struggle of theory vs. practice…
    • Most modern RDBMSs don’t strictly adhere to the purest mathematical formalisms in the relational model.

21

22 of 80

A Little Set Theory Review…

  • What’s a set?�

  • What is the cardinality of a set? (denoted |S|)

22

A set is a collection of unique objects. The things (objects) inside a set are called elements.

cardinality → the number of unique elements in the set.

What are some examples of sets?

?

23 of 80

A Little Set Theory Review…

  • What is the Union of sets A and B (A B)?

  • What is the intersection of sets A and B (A B)?�
  • What is the set difference of A and B (A - B)?

23

A = {1, 3, 5} B = {3, 4, 5, 6}

24 of 80

A Little Set Theory Review…

  • What is the cartesian product of two sets?

A = {123, 435} B = {Chul, Kev, Sal}

  • What is the cartesian product of A and B (aka A x B)?

24

{ (123, Chul), (123, Kev), (123, Sal), (435, Chul), (435, Kev), (435, Sal) }

25 of 80

What’s a Relational Database?

  • A Relational Database consists of
    • a collection of relations, and
    • a collection of constraints.

  • A relational database is in a valid/consistent state if it satisfies all constraints (else, invalid/inconsistent state).

25

26 of 80

Relations

  • A relation consists of
    • its schemaa description of the structure of the relation (relation schema)
    • its statethe current data that is populated in the relation (relation instance)
  • The schema of a relation includes
    • the name of the relation
    • an list of n attributes each with an associated domain (what values that attribute can take on).
  • Notation: REL_NAME(Attrib1:Dom1, Attrib2:Dom2…)

26

27 of 80

More formally…

  • Let A1, A2, …, An be names of attributes of relation R with associated domains D1, D2, …, Dn, then � R(A1:D1, A2:D2,...An:Dn) �is a relation schema and n, the degree of R, represents the number of attributes of R.�
  • An instance of Relation R is a subset of the cartesian product of the domains of the attributes of R.

27

28 of 80

Relations - Example

  • Assume we have the following domains:
    • names → {‘Jared’, ‘Sakshi’}
    • id_nums → {all 9 digit positive integers starting with 00}
    • majors → {‘CS’, ‘DS’, ‘CY’}
  • Defining the TA relation schema:� TA(name: names, id: id_nums, major: majors)

28

Is the following a valid instance of TA?

{

(‘Jared’, 001928374, ‘CS’)� (‘Sakshi’, 001122334, ‘DS’)

}

?

29 of 80

Relations - Example

  • Assume we have the following domains:
    • names → {‘Jared’, ‘Sakshi’}
    • id_nums → {all 9 digit positive integers starting with 00}
    • majors → {‘CS’, ‘DS’, ‘CY’}
  • Defining the TA relation schema:� TA(name: names, id: id_nums, major: majors)

29

Is the following a valid instance of TA?

{

(‘Sakshi’, 001928374, ‘CS’)� (‘Sakshi’, 001122334, ‘CY’)

}

?

30 of 80

Relations - Example

  • Assume we have the following domains:
    • names → {‘Jared’, ‘Sakshi’}
    • id_nums → {all 9 digit positive integers starting with 00}
    • majors → {‘CS’, ‘DS’, ‘CY’}
  • Defining the TA relation schema:� TA(name: names, id: id_nums, major: majors)

30

Is the following a valid instance of TA?

{

(‘Sakshi’, 001928374, ‘CS’)� (‘Dylan’, 001122334, ‘DS’)

}

?

31 of 80

Relations - Example

  • Assume we have the following domains:
    • names → {‘Jared’, ‘Sakshi’}
    • id_nums → {all 9 digit positive integers starting with 00}
    • majors → {‘CS’, ‘DS’, ‘CY’}
  • Defining the TA relation schema:� TA(name: names, id: id_nums, major: majors)

31

Is the following a valid instance of TA?

{

(‘Sakshi’, 001928374, ‘CS’)� (‘Jared’, 001122334)

}

?

32 of 80

Relation Instance

  • A relation instance is a set of tuples (rows) from a relation at a particular point in time.
  • Each tuple (row) is an ordered sequence of values, one for each attribute (possibly null)
    • usually enclosed in < and >

32

Name

ID

Phone

Dorm

Age

GPA

Mark

1123141

555-1234

1

19

3.21

Kim

2323411

555-9876

2

25

3.53

Sam

17642352

555-6758

1

19

3.25

Student

1 Tuple

33 of 80

Tuples: Theory vs. Practice

  • Theory: a relation instance is defined as a set of tuples.
    • tuples are unordered
    • tuples cannot be duplicated
  • Practice: rows in a table are implemented as multisets and stored in some physical medium (disk)
    • if they’re on a disk, they are in some order
      • later.. physical design might dictate some specific order
    • bags/multisets imply duplicate tuples allowed.

33

34 of 80

Null Value

  • Null is a special value that may exist in the domain of an attribute
  • Could mean different things
    • value unknown
    • value unavailable right now
    • attribute doesn’t apply to this tuple
  • Does NOT mean:
    • zero (0)
    • the empty string (‘’)
  • (NULL != NULL) Comparing two values of NULL does NOT return true

34

35 of 80

Value of an Attr in a Tuple

  • Values should be atomic
    • Say NO to composite attributes (ex: address that includes city, state and zip)
    • Say NO to multi-valued attributes (ex: all email addresses for 1 person)

35

Name

ID

Address

Phone

Dorm

Age

GPA

Mark

1123141

121 Anystreet�Boston MA 02212

555-1234

555-1876

1

19

3.21

Kim

2323411

235 Huntington�Boston MA 02215

555-9876

2

25

3.53

Student

36 of 80

Super and Candidate Keys

  • key - a subset of attributes of a relation used to uniquely identify one tuple in a relation from other tuples

  • A super key of a relation R is a subset of the attributes of R such that no two distinct tuples in any possible relation instance will have the same values for the subset of attributes.
    • may not be minimal - could contain attributes that aren’t needed for unique determination
  • A candidate key of relation r is a minimal super key.
    • A relation may have more than one candidate keys.

36

37 of 80

Keys - Superkey

37

Some Possible Superkeys:

(customerNumber, customerName)

(customerNumber, salesRepEmployeeNumber)

(customerNumber)

Not Minimal

Minimal

Customers

38 of 80

Keys - Candidate Keys

38

Some Possible Superkeys:

(customerNumber, customerName)

(customerNumber, salesRepEmployeeNumber)

(customerNumber)

Not Candidate Keys

Candidate Key

Customers

39 of 80

The Primary Key

  • The primary key (PK) of relation R is chosen from the set of candidate keys
    • If a relation has only 1 candidate key, it becomes the PK.
    • If a relation has > 1 candidate key, the database designer chooses one based on business requirements
    • Every relation must have a PK
    • Entity Integrity Constraint → PK values must be unique and may NOT be null
    • (Usually) the PK is underlined in a relation schema or table

39

40 of 80

Keys - Primary Key

40

Some Possible Superkeys:

(customerNumber, customerName)

(customerNumber, salesRepEmployeeNumber)

(customerNumber)

Chosen as Primary Key

If there are 2+ Candidate Keys, DB Designer will choose one as the Primary key

Customers

41 of 80

Foreign Keys

  • Foreign Key (FK) - An attribute ai in one relation RC (the child relation) refers to/references the PK aj in another relation RP (parent relation) suck that all values of ai must either be NULL or contain a value from aj.

  • Self-Referential RelationRC and RP are the same relation
  • Foreign Key == Referential Integrity Constraint
  • Foreign Keys are the operationalization of relationships in a relational database

41

42 of 80

Foreign Key Example

42

Students

cID

sID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

sID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Parent Relation Child Relation

(the one being referenced) (the one referencing another relation)

Primary Keys Foreign Key

Note: in this example, Grades.sID cannot contain null values because it is part of the PK of Grades Relation.

43 of 80

FKs - but why???

43

Customers

Notice: empLast, empFirst, and empEmail attributes contain duplicated/repeated data. Opens up the possibility of insert or update anomalies.

So, put them in a separate table along with empNum and refer back to the new table (becomes the parent table)

44 of 80

Relational Algebra

44

45 of 80

Relational Algebra

  • Relational Algebra (RA) : a procedural query language for relations that allow us to retrieve information from a relational database
    • A query is an operation or set of operations applied to one or more relation instances
    • RA is closed, meaning the result/output of each query is another relation
    • In RA, order of operations matters
  • Forms the basis for the SELECT statement in SQL.

45

46 of 80

Quick Aside: Predicate Functions

  • Predicate functions - functions that return true or false.
  • We will use predicate functions (or just “predicates”) to determine if tuples in a relation instance should be returned by a query or not.
    • in the form attr <op> attr OR attr <op> <constant>
    • <op> can be any standard relational operator (=, !=, <, >, <=, etc)
    • predicates can be composed with ^ (and), v (or), (not).

Examples:

  • empID = 143
  • empID > 400
  • firstName = ‘Bob’ ^ empID = 333

46

empID

firstName

333

Bob

143

Sam

Employees

47 of 80

The First 6 Basic Operations of RA

  1. Select σ (Tuple Filtering)
  2. Project π (Attribute Filtering)
  3. Join ⨝
  4. Cartesian Product ✕
  5. Set-difference –
  6. Union/Intersection U
  7. Rename ρ

47

48 of 80

Relational Algebra: Select Operator

  • Select - return a relation containing tuples from relation R that satisfy predicate pred.

Notation:�

  • Can be thought of as a horizontal subset (subset of tuples/rows) of a relation instance

48

49 of 80

Relational Algebra: Select Operator

49

Dept

Class

Enrollment

CS

3200

40

CS

2500

643

CS

1800

680

DS

2000

412

Enrollments

Dept

Class

Enrollment

CS

2500

643

CS

1800

680

Enrollments

Result:

50 of 80

Relational Algebra: Select Operator

50

Dept

Class

Taught_by

CS

3200

CS

CS

2500

CS

CS

1800

Math

DS

2000

Math

Enrollments

Dept

Class

Taught_by

CS

3200

CS

Enrollments

Result:

51 of 80

Relational Algebra: Project Operator

Project - returns a relation with a subset of attributes (A1… Ak) from R.

�Notation:

Duplicate tuples will be removed from the resulting relation (because relations are sets).

51

52 of 80

Relational Algebra: Project Operator

52

Dept

Class

Enrollment

CS

3200

40

CS

2500

643

CS

1800

680

DS

2000

412

Enrollments

Dept

Class

CS

3200

CS

2500

CS

1800

DS

2000

Enrollments

Dept

CS

DS

Enrollments

Result:

Result:

53 of 80

Relational Algebra: Cartesian Product

Same operation we know from set theory.

53

Attr_1

Attr_2

123

abc

456

def

R

Attr_1

Attr_3

123

CS

789

DS

111

Cyber

S

R.Attr_1

R.Attr_2

S.Attr_1

S.Attr_3

123

abc

123

CS

123

abc

789

DS

123

abc

111

Cyber

456

def

123

CS

456

def

789

DS

456

def

111

Cyber

54 of 80

Union, Intersection & Difference

  • Essentially the same as what we know from set theory.

  • One small difference: relations must be schema compatible
    • same number of attributes
    • attributes’ domains must be compatible

54

55 of 80

More Complex RA Expressions

  • Simple RA expression can be composed into more complex expressions
    • Remember: output of each RA operation is another relation

55

56 of 80

Relational Algebra: Temporary Relation Names

For more complex RA queries, you can have:

  • one long query expression
  • an ordered list of smaller expressions, the result of each is given a temporary name with the ← operator

Example:

  • TEMP_NAME ←

TEMP_NAME can then be used as a relation in subsequent steps of the same query.

56

57 of 80

Relational Algebra: Rename Operator

Rename Operator (rho) – Allows us to “rename” a relation, the attributes of a relation, or both.

  • If only name is provided, the relation is being renamed (all attributes retain their original name.)

  • List of attributes in parentheses means renaming attributes, but not relation (assume attributes originally (employeeID, lastName, firstName)

  • Rename both.

57

58 of 80

Developing Relational Algebra Expressions

58

59 of 80

Writing RA Queries

  • Sometimes we need to evaluate an RA query against a database instance
    • Result is usually another relation instance/set of tuples/table
  • Other times we need to convert the narrative form of a query into a RA query
    • Example: “Provide a list of all info from the Employee relation where the empID is less than 400.”
      • Answer:

59

60 of 80

Writing RA Queries

60

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

Next few examples use this database schema.

61 of 80

Writing RA Queries

61

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

Write a RA query that returns the names of all students.

62 of 80

Writing RA Queries

62

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

Provide a list of all course numbers taught by the CS Department.

63 of 80

Writing RA Queries

63

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

List all 2nd year student names.

64 of 80

Writing RA Queries

64

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

What course or courses (dept and number) are taught by Professor Norris?

65 of 80

Writing RA Queries

65

Stu_ID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Cou_ID

Dept

Number

Prof

332

CS

3345

Lawrimore

221

CS

2341

Fontenot

535

MATH

2339

Norris

Students

Courses

C_ID

S_ID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

List the letter grades that Sam has earned (don’t need to include course info).

66 of 80

Joining Data from Two Relations

66

67 of 80

Relational Algebra: The Join Operator

  • Join - allows us to combine data from two relations.
    • Subset of the cartesian product of the two argument relations based on explicit or implicit join predicate.

  • 2 Versions:
    • Natural Join - Join relations on attributes with the same name

    • Theta Join (or condition join or simply Join) - Join relations with explicitly supplied join predicate

67

68 of 80

Natural Join

  • Given: A(R, S, T, U, V) and B(S, V, W, X)
  • Query:
  • Notice: A and B both have attributes named S & V.
  • Result:
    • Schema of resulting relation is

where attr(R) returns a set containing the attributes of R

  • so, attributes used in the implicit join condition are not duplicated in the result
    • contains any tuple from A x B where values for attributes S and V are equal
    • If A & B have no common attribute names, result is A X B.

68

69 of 80

Example

69

Students

cID

sID

Grade

332

234

B

332

332

A

535

336

C

221

134

A

535

332

C

221

332

A

221

336

B

Grades

sID

Name

Year

134

Chul

1

234

Kev

3

332

Sam

2

336

Ashwin

2

Query:

cID

sID

Grade

Name

Year

332

234

B

Kev

3

332

332

A

Sam

2

535

336

C

Ashwin

2

221

134

A

Chul

1

535

332

C

Sam

2

221

332

A

Sam

2

221

336

B

Ashwin

2

Result:

70 of 80

theta-Join (or Join… or Condition(al) Join)

  • Operator has an explicit join predicate (condition) (doesn’t rely upon attribute names)
  • Result is a subset of the cartesian product where provided predicate holds true
  • Assume: S(sID, name) and G(stu-ID, course, semester, grade)
  • Query:
  • Result:
    • Relation with schema (sID, name, stu-ID, course, semester, grade)
    • all tuples where S.sID = G.stu-ID
  • Note: join condition can use other operators besides =.

70

71 of 80

theta-Join (or Join… or Condition(al) Join)

71

A

B

1

Cat

2

Dog

3

Bird

C

D

1

Meow

3

Chirp

S

T

A

B

1

Cat

3

Bird

C

D

1

Meow

3

Chirp

72 of 80

New: Entity Relationship Diagram

72

Northwind Traders Data Model

- sample database model that has been around

for years; originally developed by Microsoft

- represents data model for fake Northwind Traders company, an importer/exporter of specialty foods

73 of 80

New: ER Diagram

73

Relation Name

Attributes

Primary Key Attributes

Relationships

74 of 80

Relations in Northwind

  • Suppliers: Suppliers and vendors of Northwind
  • Customers: Customers who buy products from Northwind
  • Employees: Employee details of Northwind traders
  • Products: Product information
  • Shippers: The details of the shippers who ship the products from the traders to the end-customers
  • Orders and Order_Details: Sales Order transactions taking place between the customers & the company
  • Categories: The categories a product can fall into

74

75 of 80

Practice Time!

75

For each of the following, compose a relational algebra query that satisfies the narrative-form of the question.

76 of 80

Query Time!

76

1. For each supplier, provide its name, contact person, and phone number.

77 of 80

Query Time!

77

2. Which products are supplied by FoodsRUs? Please include the product name and unit price.

78 of 80

Query Time!

78

3. Provide a list of all product names ordered by World Market (a customer’s name).

79 of 80

Query Time!

79

4. Which customers has employee Sam Johnson worked with? Provide the complete customer information.

80 of 80

Further Reading

  • Harrington Ch5 (OReilly)
  • Foundations of Computer Science - Ch 8 - The Relational Model

80