1 of 38

Data Engineering

Updates

1

2 of 38

It’s not sufficient to query…

  • We also need to be able to modify the data in relations
  • A modification command doesn’t return the result in the same way a query does, but it changes the relational instance

  • Three kinds of modifications:
    • Insert a tuple or tuples
    • Delete an existing tuple (or tuples)
    • Update the value(s) of an existing tuple (or tuples)

2

3 of 38

Inserting One or More Tuples

INSERT INTO Relation VALUES (<list of values>);

Example:

  • Stops (id, race, age, location)
  • INSERT INTO Stops VALUES (5000, ‘Asian’, 24, ‘West Oakland’);
  • INSERT INTO Stops VALUES

(5001, ‘Hispanic’, 37, ‘Rockridge'),

(5002, ‘White’, 68, ‘MacArthur');

3

4 of 38

Specifying Attributes in Insert

INSERT INTO Stops VALUES (5000, ‘Asian’, 24, ‘West Oakland’);

  • The previous syntax requires you to remember order of attributes

  • Can be more explicit:

INSERT INTO Stops (id, race, age, location) VALUES

(5000, ‘Asian’, 24, ‘West Oakland’);

  • Another benefit beyond remembering the order of attributes:
    • We don’t have values for all attributes, and we want the system to fill in NULL values for the rest

4

5 of 38

Inserting Many Tuples

  • We may insert the entire result of a query into a relation

INSERT INTO Relation (<Subquery>);

  • Stops (id, race, age, location) and Zips (location, zips)

INSERT INTO Zips(location) VALUES (

SELECT DISTINCT S.location

FROM Stops S

WHERE NOT EXISTS

(SELECT * FROM Zips Z WHERE S.location = Z.location));

  • Q: What do you think this query does?

5

6 of 38

Deleting Tuples

To delete tuples satisfying a condition from some relation:

DELETE FROM Relation WHERE <condition>;

Stops (id, race, age, location)

Remove all juvenile stops

DELETE FROM Stops WHERE age <18;

To delete all tuples from a relation simply say:

DELETE FROM Stops;

6

7 of 38

Updates

To change certain attributes of certain tuples in a relation:

UPDATE Relation

SET <list of attribute assignments>

WHERE <condition>

7

8 of 38

Example: Update

Stops (id, race, age, location)

Set NULL ages to be 18:

UPDATE Stops

SET age = 18

WHERE age IS NULL;

8

9 of 38

A Tricky Issue of Semantics

  • Now, we know at a high level how INSERTS, DELETES, and UPDATES work
  • There’s a special issue that impacts both DELETEs and UPDATEs

  • Suppose we have a relation Employee (name, salary, address, title)
  • We run the following update to increase salaries of all database designers who have a salary < 100k by 20%

UPDATE Employee SET salary = salary * 1.2

WHERE title= ‘Database Designer’ AND salary <100000

9

10 of 38

A Tricky Issue of Semantics

UPDATE Employee SET salary = salary * 1.2

WHERE title= ‘Database Designer’ AND salary <100000

      • Suppose the Employee relation has a tuple corresponding to a Database Designer “Bob” with salary <100,000, say $50,000
      • When the UPDATE statement hits Bob’s tuple, it changes that salary to 120% of the original one, $60,000.
      • If we’re not careful, we could keep increasing Bob’s salary…
      • Q: when will we stop?
      • We will stop when salary hits >= $100,000.

10

11 of 38

A Tricky Issue of Semantics

UPDATE Employee SET Salary = Salary * 1.2

WHERE Title= ‘Database Designer’ AND Salary <100000

How do we deal with this issue?

    • We do UPDATE (and DELETE) in two phases:
      • Phase 1:
        • We look each tuple and evaluate their WHERE condition
        • If the tuple matches the WHERE condition, we mark it for an update/delete
      • Phase 2:
        • We actually update/delete all the marked tuples
    • Convince yourselves that this always returns the same answer independent of the order of processing of tuples!

11

12 of 38

Data Engineering

Relational Design

12

13 of 38

So far…

  • We considered
    • RA: a core set of primitives, supported across multiple data systems: databases, Spark (DF API), dataframes
    • SQL queries on one more more relations that builds on/extends RA
    • SQL updates that allow you to populate and manipulate the data in your relations
  • Next, we’ll consider how to set up the relations so that you can then query them effectively

13

14 of 38

Defining a Relation

  • Simplest form of adding a relation is:

CREATE TABLE <name> (

<list of attributes and domains>

);

Recall that we created a table based on a subquery previously; this allows us to create an unrelated table with an entirely new schema

  • And removing a relation is:

DROP TABLE <name>;

14

15 of 38

List of Attributes and Domains

  • For each attribute in the relation we list the corresponding domain
  • Common types include:
    • INTEGER
    • REAL
    • CHAR (n) = fixed length string of n characters
    • VARCHAR (n) = variable length string of up to n characters
    • TEXT = variable length string
    • DATE = ‘yyyy-mm-dd’
    • TIME = ‘hr:min:sec’
    • TIMESTAMP = ‘yyyy-mm-dd hr:min:sec’
    • BOOLEAN

15

16 of 38

Example

  • Create a table called Zips

CREATE TABLE Zips (

location VARCHAR(20),

zipcode INTEGER

);

16

17 of 38

The Notion of Keys

  • It is very useful to be able to identify tuples in a relation
  • Also true in the real world:
    • CalnetID or email address to identify students
    • SSN or passport number to identify people
  • A set of attributes is said to form a key if no two tuples can have the same values for that combination of attributes

  • We do so using the PRIMARY KEY or UNIQUE keyword
    • (distinction later)

17

18 of 38

Example Primary Key Usage

CREATE TABLE Stops(

stopID INTEGER, race VARCHAR(10), location VARCHAR(20),

age INTEGER, arrest BOOLEAN,

PRIMARY KEY (stopID)

);

CREATE TABLE Zips (

location VARCHAR(20),zipcode INTEGER, area REAL,

PRIMARY KEY (location, zipcode)

);

Q: How do we interpret this?

A: combo of location and zipcode is unique across Zips tuples

18

19 of 38

UNIQUE vs. PRIMARY KEY

  • Difference 1:
    • There can be only one PRIMARY KEY (hence the name!)
    • There can be many UNIQUEs – aka keys

  • Difference 2:
    • No attribute of a PRIMARY KEY can be NULL
    • Attributes within UNIQUEs can be NULL, and there can be several tuples with NULL values for UNIQUE attributes
      • So NULL is excluded from enforcement of UNIQUE

19

20 of 38

Example Unique Usage

CREATE TABLE Stops(

stopID INTEGER, personID INTEGER, stopTime TIMESTAMP,

race VARCHAR(10), location VARCHAR(20),

age INTEGER, arrest BOOLEAN,

PRIMARY KEY (stopID),

UNIQUE (personID, stopTime)

);

20

21 of 38

Other Declarations

  • Two other declarations we can make for attributes:
    • NOT NULL means that the value for the attribute can never be NULL
      • Any changes that violate this declaration will be rejected
    • DEFAULT <value> means that if there is no specific value known for this attribute for a given tuple, use the stated <value> (as opposed to NULL)

21

22 of 38

Example Usage of Declarations

CREATE TABLE Stops(

stopID INTEGER, personID INTEGER, stopTime TIMESTAMP,

race VARCHAR(10), location VARCHAR(20) NOT NULL,

age INTEGER, arrest BOOLEAN DEFAULT False,

PRIMARY KEY (stopID),

UNIQUE (personID, stopTime)

);

22

23 of 38

Adding Attributes

  • We can add new columns to a relation by using the following:

ALTER TABLE <name>

ADD <attribute declaration>,

ADD <attribute declaration>, …;

ALTER TABLE Zips

ADD area REAL,

ADD timezone VARCHAR(5);

23

24 of 38

Deleting Attributes

  • We can delete columns from a relation by using the following:

ALTER TABLE <name>

DROP <attribute>,

DROP <attribute>, …;

ALTER TABLE Zips

DROP area,

DROP timezone;

24

25 of 38

Recap

  • So we’ve talked about how to set up and modify a relational schema
  • We’ve also seen a few examples of constraints
    • PRIMARY KEY, UNIQUE
    • NOT NULL
  • More generally, constraints are relationships across data elements that the data system is required to preserve or enforce
  • Next, a broader class of constraints

26 of 38

Kinds of Constraints

  • Key (via PRIMARY KEY or UNIQUE)
  • NON NULL constraints

  • Cross-table referential integrity or FOREIGN KEY constraints
  • Attribute based constraints
  • Tuple-based constraints but we won’t cover them
  • Other constraints (any SQL expression) but we won’t cover them

26

27 of 38

Example

    • Cast_info (person_id, movie_id),
    • Actor (id, name),
    • Movie (id, title)
  • We might expect that person_id in Cast_info must be present in the ids in Actor, and likewise, movie_id must be present in the ids in Movie

  • Such a constraint is called a foreign key constraint
  • Think of this as a pointer from a tuple in one table to a tuple in another table
  • Referenced attributes must be declared UNIQUE or PRIMARY KEY (i.e., a key), why?

27

person_id

movie_id

1

23

3

45

id

name

1

Ingrid Bergman

3

Samuel L Jackson

id

title

23

Casablanca

45

Snakes on a plane

28 of 38

Declaring a FK Constraint

Using REFERENCES

CREATE TABLE Actor (id INTEGER, name TEXT, PRIMARY KEY actor_id);

CREATE TABLE Cast_info (person_id INTEGER, movie_id INTEGER,

FOREIGN KEY (person_id) REFERENCES Actor (id),

FOREIGN KEY (movie_id) REFERENCES Movie (id));

Or

CREATE TABLE Cast_info (

person_id INTEGER REFERENCES Actor (id),

movie_id INTEGER REFERENCES Movie (id));

Less verbose but doesn’t support multi-attribute FK constraint

28

person_id

movie_id

1

23

3

45

id

name

1

Ingrid Bergman

3

Samuel L Jackson

id

title

23

Casablanca

45

Snakes on a plane

29 of 38

Enforcing Foreign Key Constraints

  • If there is a FK constraint from attributes of R to the keys of S, two violations are possible:
    • An insert/update to R introduces values not found in S
    • A deletion from/update to S causes a tuple in R to “dangle”
  • Q: Why are other cases not important?

R

S

person_id

movie_id

1

23

3

45

id

name

1

Ingrid Bergman

3

Samuel L Jackson

id

title

23

Casablanca

45

Snakes on a plane

30 of 38

Enforcing Foreign Key Constraints

  • If there is a FK constraint from attributes of R to the keys of S, two violations are possible:
    • An insert/update to R introduces values not found in S
    • A deletion from/update to S causes a tuple in R to “dangle”
  • Q: Why are other cases not important?
    • A deletion from R or insertion to S introduces no new potential violations
    • Likewise for updates to R/S that don’t introduce new values or cause tuples to dangle

30

31 of 38

Enforcing Foreign Key Constraints

  • Here, R: Cast_info, S: Actor
  • Referencing relation changes:
    • An insert or update to Cast_info that introduces a non-existent actor will be rejected
  • Referenced relation changes:
    • A deletion or update to Actor that causes some tuples in Cast_info to “dangle” can be handled in three ways:
      • Default: reject the modification
      • CASCADE: make the same change to Cast_info as in Actor
        • Deleted Actor: delete the Cast_info tuples that refer to it
        • Updated Actor: update the Cast_info tuples that refer to it
      • SET NULL: change the person_id in Cast_info be NULL

31

person_id

movie_id

1

23

3

45

3

76

id

name

1

Ingrid Bergman

3

Samuel L Jackson

32 of 38

Example

CREATE TABLE Cast_info (person_id INTEGER, movie_id INTEGER,

FOREIGN KEY (person_id) REFERENCES Actor (id)

ON DELETE SET NULL ON UPDATE CASCADE,

FOREIGN KEY (movie_id) REFERENCES Movie (id)

ON DELETE SET NULL);

Q: what if we

    • Delete a tuple from Actor corresponding to id = 1?
    • Change a tuple in Actor from id = 1 to id = 2?
    • Delete a tuple from Movie corresponding to id = 23?
    • Change a tuple in Movie from id = 23 to id = 24?
    • Insert a tuple into Cast_info that adds a new person_id not found in Actor?

32

person_id

movie_id

1

23

3

45

id

name

1

Ingrid Bergman

3

Samuel L Jackson

id

title

23

Casablanca

45

Snakes on a plane

33 of 38

Attribute Based Constraints

  • NOT NULL is one such constraint
  • Example: ensure that actor_ids are four-digit integers.

CREATE TABLE Actor (

id INTEGER CHECK (id >= 1000 AND id <= 9999),

name TEXT);

  • Checked on changes to Actor
    • Either when a new id is added or an existing one is updated, not when deleted (why?)
  • For this reason, we can’t use CHECKs to enforce referential integrity
    • Even if we use within Cast_info
      • …, person_id INTEGER CHECK

(EXISTS (SELECT * FROM Actor A WHERE person_id = A.id)), …

    • We will miss changes to the referenced relation (in this case Actor)

33

34 of 38

Kinds of Constraints

  • Key (via PRIMARY KEY or UNIQUE)
  • Referential integrity constraints as FOREIGN KEY constraints
  • Attribute based constraints

  • Tuple-based constraints
    • Constraints across multiple attributes in a tuple
    • but we won’t cover them (still via CHECK)
    • Listed at the end of the list of attributes
  • Other constraints (any SQL expression)
    • Constraints across multiple relations (via CREATE ASSERTION)
    • but we won’t cover them

34

35 of 38

Why Constraints?

  • Helps encode domain knowledge that the system can enforce
    • E.g., age is between 0->120, zipcode is five digits
  • May appear to be a specification and performance (more on that next) burden for users, but acts as guardrails and provides safety
  • Prevents unintended downstream errors
    • E.g., when new inserts violate the constraints

36 of 38

Triggers, Briefly

  • Constraints introduce overhead to a data system
    • Specifically, it’s also not clear when to check constraints, especially those that are across tables (as CHECK ASSERTIONs)
    • Worst case behavior is to check all constraints after every edit
      • Systems are better than this, but not much better
  • An alternative is triggers
    • Triggers are “set off” on specific events specified by the user
    • Often less overhead than constraints
    • We’ll hopefully learn about triggers later!

37 of 38

Quick Demo

  • Insert into the Actor relation

INSERT INTO Actor VALUES (5000, 'Adam', 'Driver', 'Feb 3, 2020’);

INSERT INTO Actor VALUES (5001, 'Brenda', 'Driver', 'Feb 3, 2020'), (5002, 'Charles', 'Driver', 'Feb 3, 2020’);

INSERT INTO Actor (actor_id, first_name, last_name, last_update) VALUES (5003, 'Didi', 'Driver', 'Feb 3, 2020');

  • Updates to the Actor relation

UPDATE Actor

SET actor_id = 6000

WHERE first_name = ‘Adam’ AND last_name = ‘Driver’

  • Deletes from the Actor relation

DELETE FROM Actor WHERE last_name = ‘Driver';

37

38 of 38

Quick Demo

  • Creating the manager relation

CREATE TABLE manager (manager_name VARCHAR(20), age INTEGER DEFAULT 50, manager_id CHAR(4), PRIMARY KEY (manager_id), UNIQUE (manager_name));

  • Adding a tuple (Success!)

INSERT INTO Manager (age, manager_id)

VALUES (23, ‘sd45’);

  • Adding a tuple (Failure)

INSERT INTO Manager (manager_name, age)

VALUES ('John Smith', 123);

  • Changing the schema by adding attribs

ALTER TABLE Manager

ADD address VARCHAR (20),

ADD income REAL DEFAULT 10000.0;

  • Changing the schema by dropping attribs

ALTER TABLE Manager DROP address, DROP income;

  • Deleting the relation entirely

DROP TABLE Manager;

38