Three-Day RDBMS Day 2, Session 1

part of http://philip.greenspun.com/teaching/three-day-rdbms/

What did everyone see in the previous day?

Now that we know what we're talking about, let's look at the following:

Discussion: Declarative Programming

History: early 1970s: SQL; 1979: Visicalc; 1990: Haskell

Quicksort in Haskell

qsort []     = []
qsort (x:xs) = qsort (filter (< x) xs) ++

               [x] ++

               qsort (filter (>= x) xs)

Discussion: Why RDBMS >> Network and Hiearchical

Airline DBMS structure made it quick and easy to ask "who is on Flight 37" but not "what flights has Joe Smith taken since January 1?" or "show me customers who've taken at least 17 flights in the last three years". Huge performance penalty in early implementations, though, which is why it wasn't until Oracle 6.0 in 1988 that it took off. Oracle as of January 9, 2011 has a market capitalization of $157 billion (compare to $245B for Microsoft; $184B for IBM; $63B for Ford).

Discussion: Why Companies love the RDBMS

Recap from Yesterday:

Discussion: Why the RDBMS dominated the early Web (1994-2010)

First dynamic Web applications were built by folks with an academic CS background. So they'd never heard of the RDBMS and its ability to process simultaneously updates. They knew about locks from their operating system course. A script would call "flock" and exclude any other cooperating scripts, including copies of itself. What if a script crashed, though, and left a lock behind? What if a new programmer was hired and didn't know that everyone else was using flock before modifying the files?

What are Web applications? Systems that receive, from multiple users, simultaneous streams of updates and queries to a common set of data. This is exactly the problem for which the standard corporate DBMS was built and the RDBMS happened to be the dominant DBMS technology in the mid-1990s.

Early database-backed Web sites:

Discussion: Under the Hood of Oracle

Why are we looking at Oracle? Philip couldn't find any good docs on how MySQL works (if indeed it does).

What's on disk and in RAM?

What does an index look like? Note that the pointers connecting leaf nodes make it fast to scan an index, but you won't find this data structure in any CS text.

What does this look like from the perspective of the operating system?


Start on the left at User Process. This is a Java or C "client" program or whatever that is sending SQL statements to the Oracle server (Shared Server Process, which requires the Dispatcher, or Dedicated Server Process).

A server process can read or write the latest database table block contents from the Database Buffer Cache in the System Global Area (if server has 32 GB of RAM, this could be 20 GB or more). In a perfect world, we could stop here, as long as the server processes took locks before altering the contents of the buffer cache.

The rest of the stuff is to deal with the possibilities of programmer error or database server failure (power, hardware, or software).

The redo log buffer contains information about what is to be changed by a transaction. The Log Writer writes redo entries to redo log files, at which point a transaction can be committed. The redo log files are precious. If you have a snapshot of the data files from 1990 and then all of the redo logs recording changes made since then, you can reconstruct the database up to any point in time between 1990 and now. If there are 50 simultaneous users and all send in a transaction at about the same time, Log Writer might be able to commit them all with just one disk write.

Database Writer copies updated blocks from memory to the hard drives. Remember that a transaction commits when the redo log is written. The datafiles on disk for table and index data may lag behind what's in RAM.

Checkpoint manages keeping track of how close to being up to date, by system change number (SCN), the data files are. A checkpoint results in the headers of all datafiles being updated. Despite the arrows in the figure, the Oracle docs say "The CKPT process does not write blocks to disk; the Database Writer(s) always performs that work."

Archiver copies redo logs (what was changed) to a separate disk or possibly a second "hot standby" Oracle installation.

PMON and SMON are processes that watch the other processes, clean up after crashes, restart what needs to be restarted, and otherwise paper over the inevitable failures in complex computer programs. They must work because Oracle can run for months or years.

RECO "is a background process used with the distributed database configuration that automatically resolves failures involving distributed transactions".

What happens during a transaction? Cut and paste from the Oracle Concepts book:

Before a transaction that modifies data is committed, the following has occurred:

When a transaction is committed, the following occurs:

  1. The internal transaction table for the associated undo tablespace records that the transaction has committed, and the corresponding unique system change number (SCN) of the transaction is assigned and recorded in the table.
  2. The log writer process (LGWR) writes redo log entries in the SGA's redo log buffers to the redo log file. It also writes the transaction's SCN to the redo log file. This atomic event constitutes the commit of the transaction.
  3. Oracle Database releases locks held on rows and tables.
  4. Oracle Database marks the transaction complete.

How does recovery work?

["Undo blocks" are what used to be called the Rollback Segment. They contain copies of database blocks before a transaction was started.]

Exercise: Permutations

Reconnect with mysql threedayrdbms -p

Write a query that will return all permutations of the numbers 1, 2, 3, and 4, one permutation on each row of the report (i.e., a four-column report).

Hints: You'll almost certainly have to start by creating a table with four rows, one for each number. Then think "self-join". Then remember that the join by default produces all possible combinations of rows from the joined tables; you can use one or more WHERE clauses to filter out the ones that aren't wanted.

Exercise: Histograms

Produce a report that shows a histogram of how many groups a typical Facebooklet user belongs to, e.g., a count of how many users are in 1 group, 2 groups, and so on.

Hint: You can SELECT from the results of another SELECT, i.e., query from a table that you're produced on the fly by SELECTing.

Second hint: We don't think this can be done except with two GROUP BYs (and you can only have one GROUP BY per SELECT).

Extra credit: Try the same thing to produce a histogram of how many friends users have.

Exercise: Unions

Suppose that our histogram shows that most users are in exactly two groups, but some are in just one and some are in more than one. How can we get a four-column report that shows each user's first and last name's next to either his one group's name (plus a NULL), both of his groups' names, or one group's name plus "N more". (See example output below.)

A reasonable way to start on this project is to think about a query that produces the correct results for a user with just one group membership. Then think about a couple more. Then UNION the results.

Hint: In a GROUP BY, all of the columns in the SELECT list must be either columns that you grouped by or aggregates of columns within the group. Aggregate functions include COUNT, SUM, MAX, and MIN. MAX and MIN are very helpful if you want to pull two values from rows within a group.

Reference: http://dev.mysql.com/doc/refman/5.5/en/union.html

Example output (apologies for the misaligned output, but Google Docs does not have a fixed-width font in which the lines fit, even for Courier 8 pt type):

+-------------+-----------+-----------------------------------------+-----------------------------------------+

| first_names | last_name | group1                                  | group2                                  |

+-------------+-----------+-----------------------------------------+-----------------------------------------+

| Barbara     | Anderson  | Harvard University                      | 2 more                                  |

| Charles     | Anderson  | Carnegie Mellon University              | New York University                     |

| Christopher | Anderson  | California Institute of Technology      | Franklin W. Olin College of Engineering |

| Daniel      | Anderson  | Dartmouth College                       | 2 more                                  |

| David       | Anderson  | Dartmouth College                       | 2 more                                  |

| Elizabeth   | Anderson  | California Institute of Technology      | Princeton University                    |

| James       | Anderson  | California Institute of Technology      | Duke University                         |

| Jennifer    | Anderson  | Brown University                        | 2 more                                  |

| John        | Anderson  | Tufts University                        | Wellesley College                       |

| Joseph      | Anderson  | Brown University                        | 2 more                                  |

| Linda       | Anderson  | Brown University                        | 2 more                                  |

| Mark        | Anderson  | New York University                     | NULL                                    |

| Mary        | Anderson  | Franklin W. Olin College of Engineering | Johns Hopkins University                |

| Michael     | Anderson  | Brown University                        | 2 more                                  |

| Patricia    | Anderson  | Brown University                        | NULL                                    |

Exercise: Intersection and Minus

Let's create an Oracle table with a politician's promises:

create table promises_made (

        promise    varchar(50) not null,

         date_made  date

);

and now a typically much smaller table...

create table promises_kept (

        promise    varchar(50) not null,

        date_kept  date

);

insert into promises_made (promise, date_made) values ('cut taxes', '2008-01-01');

insert into promises_made (promise, date_made) values ('reduce unemployment', '2008-01-12');

insert into promises_made (promise, date_made) values ('stop the wars', '2008-02-25');

insert into promises_made (promise, date_made) values ('close Gitmo', '2008-03-02');

-- throw in a second one on a new date

insert into promises_made (promise, date_made) values ('close Gitmo', '2008-03-03');

insert into promises_made (promise, date_made) values ('help unions', '2008-04-12');

insert into promises_made (promise, date_made) values ('complicate healthcare', '2008-04-16');

insert into promises_kept (promise, date_kept) values ('complicate healthcare', '2010-03-23');

insert into promises_kept (promise, date_kept) values ('help unions', '2009-06-01');

Let's try some queries...

SQL> select promise from promises_made intersect select promise from promises_kept;

PROMISE

-------------------------

complicate healthcare

help unions

SQL> select promise from promises_made minus select promise from promises_kept;

PROMISE

-------------------------

close Gitmo

cut taxes

reduce unemployment

stop the wars

Notice that the duplicate "close Gitmo" was removed, almost as if we'd used SELECT DISTINCT, because the minus operation returns a set and an element can be a member of a set only once.

MySql doesn't support the intersect and minus syntax. Exercise: write queries that calculate the intersection and set difference as above, using the facilities available to you in MySql.

Big Hint: Since we're doing only the simple case of one column ("1-tuple" in relational algebra parlance), we don't need anything as fancy as the general Oracle facilities that look at multiple values in each row.

Small Hints:

Extra credit: What if we had two or three values for each row? How could we do intersections and set differences then? Perhaps it would be easier to work with a simple table of coordinates in 3-space:

create table coordinates (

        x                 int not null,

        y                 int not null,

        z                 int not null,

        source                 varchar(20) not null

);

We've hired two surveyors, Herbert (source = 'herb') then Larry (source = 'larry'). Each has given us a set of points characterizing a topography. What would the queries look like to

Before starting, run the following INSERTs to populate the coordinates table with sample data:

-- Herb's findings

INSERT INTO coordinates (x,y,z,source)

VALUES

  (61,97,2,'herb'),

  (17,82,57,'herb'),

  (39,25,9,'herb'),

  (71,25,13,'herb'),

  (88,3,50,'herb'),

  (42,61,79,'herb'),

  (13,29,3,'herb'),

  (30,40,11,'herb'),

  (37,48,32,'herb'),

  (13,71,16,'herb'),

  (68,91,51,'herb'),

  (82,56,35,'herb');

-- Larry's findings

INSERT INTO coordinates (x,y,z,source)

VALUES

  (61,97,2,'larry'),

  (17,82,57,'larry'),

  (39,25,9,'larry'),

  (71,25,13,'larry'),

  (88,3,50,'larry'),

  (42,61,79,'larry'),

  (13,29,3,'larry'),

  (7,29,22,'larry'),

  (27,67,54,'larry'),

  (70,86,23,'larry');

Exercise: Three-valued logic (the full Winklevoss solution)

Suppose that our community is highly exclusive, in the way that Tyler and Cameron Winklevoss initially envisioned. New membership is by invitation only. A regular member can send out an invitation that expires after 30 days; an administrator can send out an invitation that never expires. One of the seemingly good things about SQL is that unknown information can be represented with a NULL, so it would seem to make sense to use NULL as the expiration date of an invite that never expires, but we can get into trouble querying...

create table all_invitations (

        user_id        int not null,

        -- would normally have a REFERENCES integrity constraint

        -- but we want this example to work standalone

        to_email        varchar(50) not null,

        -- if NULL, the invite does not expire

        expires         date

);

-- since MySQL lacks a CHECK constraint system, we might want to

-- define a trigger that runs a regexp and raises an error if the

-- email address does not match

delimiter $$

create trigger check_email before insert on all_invitations

        for each row begin

                if (select new.to_email

regexp "[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]+")=0 then

        -- send a user defined failure signal

                                signal sqlstate '45000'

set message_text='bad email address';

                end if;

        end;

$$

delimiter ;

In working with invitations day-to-day, there is no point in having every programmer add WHERE clauses to exclude the expired invitations, so let's create a view...

create or replace view invitations

as select * from all_invitations

where expires > current_date;

Let's insert a couple of invitations..

insert into all_invitations

(user_id, to_email, expires)

values

(1, 'billg@microsoft.com', '2050-01-01'),

(1, 'georgew@whitehouse.gov', '2009-01-20'),

(2, 'barackhussein@whitehouse.gov', '2013-01-20'),

(2, 'pamela@anderson.com', NULL);

And now select * from invitations

What did we do wrong that our INVITATIONS view excludes the still-valid Pamela Anderson invite?

Lecture: Data Warehousing

Discuss: http://philip.greenspun.com/sql/data-warehousing

Point out: extensions to SQL such as ROLLUP http://dev.mysql.com/doc/refman/5.5/en/group-by-modifiers.html

Exercise: Stored Procedures

Suppose that you want to query from the users table, which has grown to 1 million rows. You want only those rows that meet a complex criterion: whether the user is sociable, which we will define as having at least three friends or three group memberships. You can't think of a good way to express which rows meet this condition in standard SQL. You want to use standard procedural IF THEN logic. Here are the choices:

Here's how the code might look for the stored procedure:

--

-- Define a stored function

--

delimiter $$

create function is_sociable (v_user_id int)

returns boolean

reads sql data

begin

   declare friend_count int;

   declare membership_count int;

   select count(*) from friendships where (to_user_id = v_user_id or from_user_id = v_user_id) and accepted is not null into friend_count;

   select count(*) from group_memberships where (user_id = v_user_id) into membership_count;

   if (friend_count > 2 OR membership_count > 2) then

     return 1;

   else

     return 0;

   end if;

end $$

delimiter ;

Here's how the code might look for the stored procedure:

Discussion: Views for security

An external company wants to run some ads so you want them to see user_ids and likes, but not names or email addresses. How about creating a new database user, building a view, granting them SELECT on that view?

In the 1980s and 1990s, when client-server dominated IT, each individual accessing the database had a distinct user name. So the RDBMSes have extensive support for fine-grained access control. This is becoming a lost art, however, because the Web server tends to connect as a single RDBMS user.

Exercise: Friends of Friends

Write a query to return the following: for a given user (okay to hard code a user ID; pick "1" without loss of generality), return a list of friends of friends who are not already friends of the query user. This can then be show to the user as "friend suggestions".

Refine the query to restrict the report to just those people who have at least three friends in common.

Exercise: Getting Started with Android SDK

Android development is done using Java and the Android Software Development Kit. For more information on developing for Android (including great examples and reference material), check out the Android Developer Portal at http://developer.android.com. The VM you've been provided with comes pre-loaded with all of the tools you'll need to create an Android application.

We've already created and populated the VM with an Android application that fetches http://localhost/php/android/pulse.php (returns an XML file) and displays the following:

The application refreshes the display every 3 seconds so you can see the results of changes to the database pretty much in real time.

To get started launch the Eclipse IDE from: Applications -> Programming -> Eclipse. The development environment will load and open up the Facebooklet sample app. To see the application in action click the green play button (menu-based alternative: Run -> Run). This will launch an AVD (Android Virtual Device).

Note: If you get a "Debug Certificate Expired" error:

  1. Remove the expired cert file (open a command prompt and type:
    rm ~/.android/debug.keystore)
  2. Clean the Eclipse project (Project -> Clean)
  3. Close & restart Eclipse

Now that you've deleted the old certificate a new one will be generated. This will likely result in a "Re-installation failed due to different application signatures." error when you try to run the app in the AVD. To fix this problem you need to uninstall the application (these steps are done on the AVD):

  1. Menu -> Settings -> Apps -> Manage Apps
  2. Find the facebooklet app and select it
  3. Click the Uninstall button
  4. Click OK (twice)-

The AVD can take a few minutes to fully load so be patient. It's advisable to leave the AVD running in the background while you are developing to avoid experiencing this wait each time you recompile. If the AVD is left open, subsequent runs will connect to the existing AVD, download your application, and run it all in just a few seconds.

Experiment with the application. Try modifying the database with the mysql client and watching the interface on the AVD update. Read through the source code and try to familiarize yourself with what it's doing. The project consists of three Java files. facebooklet.java is the main file. It manages the user interface (GUI), sets up the connection to the webserver, and invokes the XML parser. FacebookletXMLHandler.java defines the inner workings of the XML parser such as the expected tag structure. Finally there's FacebookletData.java which is simply a data-structure for passing data from the XML parser to the GUI.

We don't want to permit programs running on the public internet, even our own, to connect directly to our database management system. The connection to the database is managed by PHP code running in our webserver, which produces an XML document that our application reads. For more insight into what's going on server-side launch your text editor of choice and open the file: /home/dev/php/android/pulse.php

Once you're feeling comfortable, modify the server and application code to add a display of the most recent status message.

Hints:

Extra credit: A basic Android application

We'll ignore the challenges of user authentication and have a completely open system. Pick a user_id and hard-wire it into your Android application. Write software on the Web server that accepts any user ID and will post status updates for that user and also return updates from friends, in reverse chronological order.

The Android application should have just one screen with a status entry box and, underneath, a periodically updated list of recent status updates.

We recommend building two scripts on the Web server side, http://localhost/php/android/status-updates.php?for_user_id=45 (example) and http://localhost/php/android/post-new-status.php that uses the POST method to supply the user_id and new message text.