Three-Day RDBMS Day 1, Session 1

part of http://philip.greenspun.com/teaching/three-day-rdbms/

Setting up Virtual Machine and Image

Download VirtualBox 4.0 from www.virtualbox.org

Obtain a copy of the course VM from:

http://web.mit.edu/philg/www/three-day-rdbms-vm.zip or

http://dl.dropbox.com/u/433362/three-day-rdbms-vm.zip

Unzip the archive, open the RDBMS VM folder, and double click on the RDBMS.vbox file to launch the image.

Setting up MySQL

Open a terminal (Applications -> Accessories -> Terminal)

To launch the mysql client and connect to the course database: mysql threedayrdbms -p

When prompted for the password type: dev

You now have a prompt from which you can begin entering SQL statements. Let's start with a very simple one:

select 2+2;

In MySQL and some other ANSI SQL databases, you can use the SELECT statement simply to run a function, such as this arithmetic example. In Oracle you can only use SELECT to pull data from a table, so they supply you with a one-row system table called "dual" (the glorious history and an explanation of the name may be found in http://en.wikipedia.org/wiki/DUAL_table ) and the corresponding command is

select 2+2 from dual;

Note that the semicolon is not part of the SQL statement and need not be used when sending SQL commands from a C or Java program, for example. When typing at a shell client, the semicolon instructs the client to send the SQL to the database.

In the virtual machine, we already created a database for you using the following commands:

create database threedayrdbms;

use threedayrdbms;

-- since we already created the database and specified it

-- when starting the mysql client we can skip the explicit

-- selection of a database with the use command

We also created a table using the following command:

create table dog_breeds (

        breed_name                varchar(50),

        characteristics        varchar(100)

);

Note that we configured the MySQL installation to use the InnoDB storage engine by default, so you don't need to specify the InnoDB storage engine at table creation time in order to get some rudimentary support for transactions. After we created the table, we inserted two rows:

insert into dog_breeds (breed_name, characteristics)

values ('border collie', 'ankle-biter'),

       ('golden retriever', 'vicious killer');

-- note that the MySQL shell client runs, by default in autocommit mode

-- and therefore these rows are instantly available for all sessions

-- to see

You can experiment with the table and the shell client using the following commands:

select * from dog_breeds;

delete from dog_breeds where breed_name like 'b%';

select * from dog_breeds;

-- visit http://localhost/phpmyadmin/, log in as "dev", "dev", click on "Databases", then click on "threedayrdbms" and then click on "dog_breeds" and then insert and you'll get a Web form to enter "toy poodle" and "man's dog" or similar

Hooking it up to the web server

The virtual machine image you've been provided with comes with Linux, Apache, MySQL, PHP, and phpMyAdmin pre-installed. If you wanted to recreate this environment from a basic Linux image you'd need to install the packages for each of these application along with any applicable connectors or libraries. Many guides have been written this subject which can be readily found by searching the Internet for the name of your distribution and the keywords LAMP install. For now we'll assume you're working in the VM or have a similar configuration already installed.

Open your text editor of choice (e.g. Applications -> Accessories -> Text Editor, or start another terminal shell and run "emacs -nw")

Open the /home/dev/php/index.php file.

By default the file contains code to select the entire dog_breeds table and display it. The mysql_query line can be edited to try out different queries and the loop edited to change how the output is displayed.

Save the file and view the results by opening a web browser (Applications -> Internet -> Chromium Web Browser) and pointing it to http://localhost/php

When you are done you can remove the dog_breeds table:

drop table dog_breeds;

ACID: Bank Data Model

We've preloaded the database with the following tables:

create table bank_customers (

        customer_id        int not null primary key auto_increment,

        customer_name        varchar(100) not null,

        -- defaults in MySQL must be constant, except for timestamp

        date_joined        timestamp not null default current_timestamp

);

create table bank_accounts (

        account_id                int not null auto_increment,

        -- normally we put a REFERENCES constraint with the column def

        -- but MySQL ignores those, so it is at the end of the statement

        customer_id        int not null,

        account_type        enum('checking','savings') not null,

        balance                decimal(20,2) not null default 0,

        primary key (account_id),

        foreign key (customer_id) references bank_customers(customer_id)

);

-- note that we need to say how much precision we want, unlike in

-- Oracle. A simple "numeric" will result in 3.75 being rounded up

-- to 4 upon insertion.

We've preloaded the tables with the following customers and accounts:

insert into bank_customers (customer_id, customer_name, date_joined)

VALUES

(1, "Joe Hedge Fund", "2006-12-24"),

(2, "Susie Average", "1997-03-14");

-- note standard YYYY-MM-DD format above

insert into bank_accounts (customer_id, account_type, balance)

VALUES

(1, 'checking', 200000000),

(1, 'savings', 175000000),

(2, 'checking', 4000),

(2, 'savings', 30000);

Use the show command to display a listing of tables currently in the database:

show tables;

Note that this and DESCRIBE, used below, are special commands of the MySQL DBMS. They are not SQL language statements. You can achieve similar results to the show command by manually selecting the information from the MySQL server's information_schema view:

select table_name from information_schema.tables

where table_schema='threedayrdbms';

For more information on the structure of each table use the describe command:

describe bank_customers;

describe bank_accounts;

Reference: 

ACID: Atomicity

Let's look at a client program's attempt to transfer $1000 from a customer's savings account to checking. The transfer will be done in two steps: (1) add $1000 to checking, (2) subtract $1000 from savings. A software or power failure occurs between steps 1 and 2. Let's see if the RDBMS can protect us.

-- get the pre-transfer sum of Joe's checking and savings balances

select sum(balance) from bank_accounts where customer_id=1;

-- start transfer by adding the amount to checking

update bank_accounts

  set balance=balance+1000

where customer_id=1

and account_type='checking';

-- and next we'll debit the savings account...

Type Control+C to exit the client (simulating the crash of the client program or someone unplugging the RDBMS server). At this point Joe's checking account has been credited but his savings account has not been debited. Let's reconnect to see what state the database has been left in:

mysql threedayrdbms -p (or just type "!!" since this was the last shell command)

When prompted for the password type: dev

Now back at the mysql console, query for the bank's total deposits:

select sum(balance) from bank_accounts;

In attempting the transfer, our bank just gave away $1,000 to Joe, certainly not a sustainable way to do business. Let's try this again but wrapping the credit and debit together into a single atomic transaction:

start transaction;

update bank_accounts

 set balance=balance+1000

where customer_id=1

and account_type='checking';

Type Control+C to exit the client (once again, simulating a failure) before we can debit the savings account. Let's reconnect to see what state the database has been left in:

mysql threedayrdbms -p (or "!!")

When prompted for the password type: dev

Now back at the mysql console, query for total deposits:

select sum(balance) from bank_accounts;

This time the transaction could not be completed so it was rolled back and our bank didn't give away any more free money.

Discussion: How do we build this? (Original MySQL answer: Don't bother. See http://dev.mysql.com/doc/refman/5.5/en/storage-engine-compare-transactions.html)

ACID: Isolation

Open a second terminal (Applications -> Accessories -> Terminal). The next steps are presented in a table to make it clear where each command should be run. The steps should be executed left-to-right and then top-to-bottom.

Terminal 1

Terminal 2

Launch the mysql client and connect to the course database:

mysql threedayrdbms -p

When prompted for the password type: dev

Back in the first terminal start another balance transfer transaction:

start transaction;

update bank_accounts

set balance=balance+1000

where customer_id=1 and

account_type='checking';

Now, before debiting the savings account, check Joe's total balance:

select balance from bank_accounts

where customer_id=1;

You will find that his total balance has increased by $1,000.

Now switch over to your other terminal and execute that same select:

select balance from bank_accounts

where customer_id=1;

This session was isolated from the changes made by an uncommitted transaction start by another session. Until a successful commit, the extra $1,000 is confined to the session attempting the updates.

Switch back to the first terminal and finish the transfer:

update bank_accounts set

balance=balance-1000

where customer_id=1 and

account_type='savings';

commit;

Now that the transaction has been committed the balance returned by our select statement should be the same in each terminal:

select balance from bank_accounts

where customer_id=1;

Go ahead and verify this by running the same select command in the second terminal window:

select balance from bank_accounts

where customer_id=1;

Discussion: How do we build this? (Two answers: multi-versioning or locks.)

Reference:

ACID: Consistency

Now lets try inserting some data that violates one of the constraints we defined.

insert into bank_customers (customer_id, customer_name, date_joined)

        VALUES (1, "New Guy", "2011-01-08");

This insert will fail because we've attempted to give New Guy and Joe Hedge Fund the same customer_id (which we defined as a primary key).

Similarly, we can't create a bank account that doesn't belong to anyone:

insert into bank_accounts (customer_id, account_type, balance)

        VALUES (37, 'checking', 2000);

We don't have a bank customer with a customer_id of 37 so this insert will fail (we defined customer_id as a foreign key constraint).

Discussion: How do we build this? (Let's talk about indexing, performance, and the original MySQL answer: Don't bother.)

The main enemy of software reliability: programmers. The last line of defense: integrity constraints in the RDBMS.

Sadly, MySQL fails to deliver much of the value that companies get when they buy Oracle or similar systems because MySQL does not enforce the CHECK constraint. Here's an example:

-- no need to cut and paste this; just for discussion

-- (it won't work properly without a bank_vice_presidents table)

create table bank_wire_transfers (

        -- in real life we'd have a wire_transfer_id column so we could                 -- refer to one row easily

account_id             int not null

references bank_accounts(account_id),

        -- all of the routing info

        destination     varchar(100),

        dollar_amount   decimal(20,2),

        wire_date                datetime,

        -- may be NULL for small wires

        approved_by     int,

        foreign key (account_id) references bank_accounts(account_id),

        foreign key (approved_by)

references bank_vice_presidents(employee_id),

        CHECK (dollar_amount < 10000 OR approved_by is not null)

);

insert into bank_wire_transfers

(account_id, destination, dollar_amount)

values

(1, "Embezzler Bank of Brazil", 10000000)

MySQL is perfectly happy to accept a $10 million wire transfer with no approval. Any of the commercial RDBMSes would generate an error if a transaction included the INSERT statement above and leave the database unmodified. The folks at MySQL don't seem to be concerned about this deficiency, having taken the trouble to document it: http://dev.mysql.com/doc/refman/5.5/en/create-table.html says "The CHECK clause is parsed but ignored by all storage engines".

ACID: Durability

The transactions above, once committed, should survive a power failure as they've been written to a transaction log on disk. Rip the hard drive or SSD out of your laptop and see if you can recover them, though! If not, your installation falls short in the D of ACID.

Discussion: How do we build this?

Start work on Facebooklet

Our application lets a user register, create a profile with “likes”, join one or more groups, and designate one or more other users as friends.

Sample data:

name (1), group (1 for now; more later), likes (0 or more), friends (0 or more)

George W. Bush|Yale|starting wars, Gitmo|Pamela Anderson, Borat

Barack Hussein Obama II|Harvard|health insurance, unemployment|Bruno, Paula Abdul

Pamela Anderson|Hollywood|plastic surgery, home video|George W. Bush

Borat|Kazakstan|farming, travel, singing|Pamela Anderson, Obama

Exercise: 1-table data model

Create a single table, named "facebook_simple" that can store the above data (suggest four columns). Load in the first row (George W. Bush), either using http://localhost/phpmyadmin/ or an INSERT statement.

Don't worry about keys, uniqueness, integrity constraints, or taste. The purpose of this exercise is going to be showing you how painful it is to represent everything in one table with multiple values in a column.

Possibly helpful: http://dev.mysql.com/doc/refman/5.5/en/data-types.html

Exercise: King Bush II's friends

Write a query to determine if "Bruno" is among George W. Bush's friends (should return one row with "1" or "t")

Write a query to determine if "Borat" is among George W. Bush's friends

Write a query to print out a list, one row per friend, of George W. Bush's friends (remember that we suggested loading only the first row, so you're not worried about friend symmetry or using information from more than one row of the table)

Write a query to retrieve the table, sorted by last name.

Hints:

Questions:

Let's talk about First Normal Form.

Discussions: Anomalies with Unnormalized relations

create table fbt_all_in_one (

        name                             varchar(100),

        group_name                  varchar(100),

        group_is_university_p          boolean,

        group_main_city           varchar(100),

        likes                              text,

        friends                          text

);

Can we insert a new group, e.g., Olin College, without inserting a person at the same time? If not, that's an insert anomaly, according to E.F. Codd (dead).

What if we delete George W. Bush? Suppose that he was the only Yalie. Have we also deleted the facts that Yale is a university and that Yale is in New Haven, Connecticut? If so, that's a delete anomaly, according to Codd.

What if Yale decided to spin off its university and concentrate on its core business of investment management. We issue the following database command

update fbt_all_in_one

set group_is_university_p = false

where group_name = 'Yale';

Assuming both John Kerry and George W. Bush had both been users of the service, would more than one row have been updated? If so, that's the update anomaly.

Extra credit: Compare your grades so far to those of the current senator and former president by visiting http://www.cbsnews.com/stories/2005/06/07/politics/main700170.shtml

Exercise: multi-table data model

Create a multi-table data model to hold the Facebooklet data. If it doesn't suffer from any of the anomalies listed above, it is probably correct.

Before we start, let's look at and discuss a real-world data model: http://philip.greenspun.com/teaching/three-day-rdbms/education.sql.txt

Representing friend requests and friendship: Like Facebook, we need to model that User A requested to become friends with User B, but User B has not yet accepted. Once User B accepts, the friendship is symmetric and everything should work just as if User B had initiated the friendship. Also think about a way to represented defriended relationships. If you need to make a design choice, push for cleanliness of representation and don't worry if the queries might get a little ugly or complex. It is always possible to create a view to hide the complexity of a query.

Hint: Add an extra mapping table to represent a many-to-many relation, e.g., that a user could belong to more than one group and a group can contain more than one user.

Our preloaded solution: http://philip.greenspun.com/teaching/three-day-rdbms/facebooklet

Exercise: JOINs

Let's have queries for the following reports:

This would be a good time to read http://philip.greenspun.com/sql/style so that you get into the habit of indented SQL readably.

Reference:

Exercise: OUTER JOINs

How about a report showing all of the groups in the system and their associated members? It seems simple, but what about groups that don't have any members yet? We want them in our report.

If all of the groups in your table have members, consider that Harvard has received a $5 billion donation that came with the condition for the school to change its name to "Hogwarts" and for Drew Gilpin Faust to change her name to "Professor Dumbledore". Add a new group with no members:

INSERT INTO groups (title) VALUES ('Hogwarts');

Hint: something like

select *

from groups g left outer join group_memberships gm on (g.group_id = gm.group_id)

is a good starting point. add a WHERE clause to keep the results to something manageable, e.g.,

select *

from groups g left outer join group_memberships gm on (g.group_id = gm.group_id)

where title like 'H%'

If you want to print user names rather than user IDs, you'll need to bring in the USERS table. Something like

SELECT stuff_you_want_in_your_report

FROM table_you_want_every_row_from

 LEFT OUTER JOIN table_with_mapping_relation ON (join_column_constraint)

 LEFT OUTER JOIN table_with_extra_info ON (join constraint)

LIMIT 20;

Exercise: GROUP BY

Please write a query giving a report of all of the names of the groups in the system that have at least one member next to the number of members in that group. You'll be using GROUP BY and probably count(*).

Write a second query that also includes the names of groups with no members (and a "0" for number of members in the second column of the report).

Hint: COUNT(column_name) will return 0 if all of the column values are NULL.

References: http://philip.greenspun.com/sql/complex-queries

Exercise: find the mode

Use your new GROUP BY skills to write a query that will return a single row containing the most-liked "like" (human-readable topic, not the like_id)  and the number of users who liked it. Assume that there is one clear winner among the "likes".

Hint: GROUP BY, ORDER BY, and the MySQL extension "limit" may be useful. Remember that ORDER BY can do ascending (ASC) or descending (DESC).

Extra credit: What if there are two or three "likes", each with the same number of users mapped to it? Can you generate a report showing all of the likes whose popularity is at the mode? You might need a subquery and/or a view.

Exercise: HAVING

Please write a query giving a report of all of the names of the groups in the system that have at least three members.

Discussion: Query Optimization

Take some of the queries above and feed them to MySQL once again but prefixed with "EXPLAIN" or "EXPLAIN EXTENDED".

Reference: http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html