Updated January 24, 2015
What did everyone see in the previous day?
Now that we know what we're talking about, let's look at the following:
History: early 1970s: SQL; 1979: Visicalc; 1990: Haskell
Quicksort in Haskell
qsort  = 
qsort (x:xs) = qsort (filter (< x) xs) ++
qsort (filter (>= x) xs)
Airline DBMS structure made it quick and easy to ask "who is on Flight 37" but not "what flights has Joe Smith taken since January 1?" or "show me customers who've taken at least 17 flights in the last three years". Huge performance penalty in early implementations, though, which is why it wasn't until Oracle 6.0 in 1988 that it took off. Oracle as of January 24, 2015 has a market capitalization of $194 billion (compare to $389B for Microsoft; $184B for IBM; $57B for Ford).
Recap from Yesterday:
First dynamic Web applications were built by folks with an academic CS background. So they'd never heard of the RDBMS and its ability to process simultaneously updates. They knew about locks from their operating system course. A script would call "flock" and exclude any other cooperating scripts, including copies of itself. What if a script crashed, though, and left a lock behind? What if a new programmer was hired and didn't know that everyone else was using flock before modifying the files?
What are Web applications? Systems that receive, from multiple users, simultaneous streams of updates and queries to a common set of data. This is exactly the problem for which the standard corporate DBMS was built and the RDBMS happened to be the dominant DBMS technology in the mid-1990s.
Early database-backed Web sites:
Why are we looking at Oracle? Philip couldn't find any good docs on how MySQL works (if indeed it does).
What's on disk and in RAM?
What does an index look like? Note that the pointers connecting leaf nodes make it fast to scan an index, but you won't find this data structure in any CS text.
What does this look like from the perspective of the operating system?
Start on the left at User Process. This is a Java or C "client" program or whatever that is sending SQL statements to the Oracle server (Shared Server Process, which requires the Dispatcher, or Dedicated Server Process).
A server process can read or write the latest database table block contents from the Database Buffer Cache in the System Global Area (consider a current 2U server from Dell, the R730xd, which can hold up to 768 GB of RAM; the Buffer Cache could be 500 GB in size). In a perfect world, we could stop here, as long as the server processes took locks before altering the contents of the buffer cache.
The rest of the stuff is to deal with the possibilities of programmer error or database server failure (power, hardware, or software).
The redo log buffer contains information about what is to be changed by a transaction. The Log Writer writes redo entries to redo log files, at which point a transaction can be committed. The redo log files are precious. If you have a snapshot of the data files from 1990 and then all of the redo logs recording changes made since then, you can reconstruct the database up to any point in time between 1990 and now. If there are 50 simultaneous users and all send in a transaction at about the same time, Log Writer might be able to commit them all with just one disk write.
Database Writer copies updated blocks from memory to the hard drives. Remember that a transaction commits when the redo log is written. The datafiles on disk for table and index data may lag behind what's in RAM.
Checkpoint manages keeping track of how close to being up to date, by system change number (SCN), the data files are. A checkpoint results in the headers of all datafiles being updated. Despite the arrows in the figure, the Oracle docs say "The CKPT process does not write blocks to disk; the Database Writer(s) always performs that work."
Archiver copies redo logs (what was changed) to a separate disk or possibly a second "hot standby" Oracle installation.
PMON and SMON are processes that watch the other processes, clean up after crashes, restart what needs to be restarted, and otherwise paper over the inevitable failures in complex computer programs. They must work because Oracle can run for months or years.
RECO "is a background process used with the distributed database configuration that automatically resolves failures involving distributed transactions".
What happens during a transaction? Cut and paste from the Oracle Concepts book:
Before a transaction that modifies data is committed, the following has occurred:
When a transaction is committed, the following occurs:
How does recovery work?
["Undo blocks" are what used to be called the Rollback Segment. They contain copies of database blocks before a transaction was started.]
Reconnect with mysql threedayrdbms -p
Write a query that will return all permutations of the numbers 1, 2, 3, and 4, one permutation on each row of the report (i.e., a four-column report).
[Foggy memories of 6.041? A "permutation" is an arrangement of the numbers 1, 2, 3, 4. So "4, 4, 1, 1" wouldn't qualify due to the fact that not all numbers are included.]
Produce a report that shows a histogram of how many groups a typical Facebooklet user belongs to, e.g., a count of how many users are in 1 group, 2 groups, and so on.
Hint: You can SELECT from the results of another SELECT, i.e., query from a table that you're produced on the fly by SELECTing.
Second hint: We don't think this can be done except with two GROUP BYs (and you can only have one GROUP BY per SELECT).
Extra credit: Try the same thing to produce a histogram of how many friends users have.
Suppose that our histogram shows that most users are in exactly two groups, but some are in just one and some are in more than one. How can we get a four-column report that shows each user's first and last name's next to either his one group's name (plus a NULL), both of his groups' names, or one group's name plus "N more". (See example output below.)
A reasonable "classic SQL" way to start on this project is to think about a query that produces the correct results for a user with just one group membership. Then think about a couple more. Then UNION the results.
Hint: In a GROUP BY, all of the columns in the SELECT list must be either columns that you grouped by or aggregates of columns within the group. Aggregate functions include COUNT, SUM, MAX, and MIN. MAX and MIN are very helpful if you want to pull two values from rows within a group.
If you are anti-union for any reason, here's a fragment of a clever solution that uses the newer, but reasonably standard, CASE expression:
WHEN 1 THEN NULL
WHEN 2 THEN min(g.title)
ELSE concat(count(gm.group_id)-1, ' more')
END as group2
If you are wondering "How did such stupid teachers come up with a clever solution?" the answer is that it came from Sarah Zenklusen, a former student. She is so clever that she has to live in Switzerland.
Example output (apologies for the misaligned output, but Google Docs does not have a fixed-width font in which the lines fit, even for Courier 8 pt type):
| first_names | last_name | group1 | group2 |
| Barbara | Anderson | Harvard University | 2 more |
| Charles | Anderson | Carnegie Mellon University | New York University |
| Christopher | Anderson | California Institute of Technology | Franklin W. Olin College of Engineering |
| Daniel | Anderson | Dartmouth College | 2 more |
| David | Anderson | Dartmouth College | 2 more |
| Elizabeth | Anderson | California Institute of Technology | Princeton University |
| James | Anderson | California Institute of Technology | Duke University |
| Jennifer | Anderson | Brown University | 2 more |
| John | Anderson | Tufts University | Wellesley College |
| Joseph | Anderson | Brown University | 2 more |
| Linda | Anderson | Brown University | 2 more |
| Mark | Anderson | New York University | NULL |
| Mary | Anderson | Franklin W. Olin College of Engineering | Johns Hopkins University |
| Michael | Anderson | Brown University | 2 more |
| Patricia | Anderson | Brown University | NULL |
We created an Oracle table with a politician's promises:
create table promises_made (
promise varchar(50) not null,
and now a typically much smaller table...
create table promises_kept (
promise varchar(50) not null,
insert into promises_made (promise, date_made) values ('cut taxes', '2008-01-01');
insert into promises_made (promise, date_made) values ('reduce unemployment', '2008-01-12');
insert into promises_made (promise, date_made) values ('stop the wars', '2008-02-25');
insert into promises_made (promise, date_made) values ('close Gitmo', '2008-03-02');
-- throw in a second one on a new date
insert into promises_made (promise, date_made) values ('close Gitmo', '2008-03-03');
insert into promises_made (promise, date_made) values ('help unions', '2008-04-12');
insert into promises_made (promise, date_made) values ('complicate healthcare', '2008-04-16');
insert into promises_kept (promise, date_kept) values ('complicate healthcare', '2010-03-23');
insert into promises_kept (promise, date_kept) values ('help unions', '2009-06-01');
We tried some queries...
SQL> select promise from promises_made intersect select promise from promises_kept;
SQL> select promise from promises_made minus select promise from promises_kept;
stop the wars
Notice that the duplicate "close Gitmo" was removed, almost as if we'd used SELECT DISTINCT, because the minus operation returns a set and an element can be a member of a set only once.
MySql doesn't support the intersect and minus syntax. Exercise: write queries that calculate the intersection and set difference as above, using the facilities available to you in MySql.
Big Hint: Since we're doing only the simple case of one column ("1-tuple" in relational algebra parlance), we don't need anything as fancy as the general Oracle facilities that look at multiple values in each row.
Extra credit: What if we had two or three values for each row? How could we do intersections and set differences then? Perhaps it would be easier to work with a simple table of coordinates in 3-space:
create table coordinates (
x int not null,
y int not null,
z int not null,
source varchar(20) not null
We've hired two surveyors, Herbert (source = 'herb') then Larry (source = 'larry'). Each has given us a set of points characterizing a topography. What would the queries look like to
Before starting, run the following INSERTs to populate the coordinates table with sample data:
-- Herb's findings
INSERT INTO coordinates (x,y,z,source)
-- Larry's findings
INSERT INTO coordinates (x,y,z,source)
Suppose that our community is highly exclusive, in the way that Tyler and Cameron Winklevoss initially envisioned. New membership is by invitation only. A regular member can send out an invitation that expires after 30 days; an administrator can send out an invitation that never expires. One of the seemingly good things about SQL is that unknown information can be represented with a NULL, so it would seem to make sense to use NULL as the expiration date of an invite that never expires, but we can get into trouble querying...
create table all_invitations (
user_id int not null,
-- would normally have a REFERENCES integrity constraint
-- but we want this example to work standalone
to_email varchar(50) not null,
-- if NULL, the invite does not expire
-- since MySQL lacks a CHECK constraint system, we might want to
-- define a trigger that runs a regexp and raises an error if the
-- email address does not match
create trigger check_email before insert on all_invitations
for each row begin
if (select new.to_email
regexp "[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]+")=0 then
-- send a user defined failure signal
signal sqlstate '45000'
set message_text='bad email address';
In working with invitations day-to-day, there is no point in having every programmer add WHERE clauses to exclude the expired invitations, so let's create a view...
create or replace view invitations
as select * from all_invitations
where expires > current_date;
-- Let's insert a few invitations..
insert into all_invitations
(user_id, to_email, expires)
(1, 'email@example.com', '2050-01-01'),
(1, 'firstname.lastname@example.org', '2009-01-20'),
(2, 'email@example.com', '2015-02-20'),
(2, 'firstname.lastname@example.org', NULL);
And now select * from invitations
What did we do wrong that our INVITATIONS view excludes the still-valid Pamela Anderson invite? Can you fix it?
Point out: extensions to SQL such as ROLLUP http://dev.mysql.com/doc/refman/5.5/en/group-by-modifiers.html
Suppose that you want to query from the users table, which has grown to 1 million rows. You want only those rows that meet a complex criterion: whether the user is sociable, which we will define as having at least three friends or three group memberships. You can't think of a good way to express which rows meet this condition in standard SQL. You want to use standard procedural IF THEN logic. Here are the choices:
Here's how the code might look for the stored procedure:
-- Define a stored function
create function is_sociable (v_user_id int)
reads sql data
declare friend_count int;
declare membership_count int;
select count(*) from friendships where (to_user_id = v_user_id or from_user_id = v_user_id) and accepted is not null into friend_count;
select count(*) from group_memberships where (user_id = v_user_id) into membership_count;
if (friend_count > 2 OR membership_count > 2) then
Task: Write a query that uses this function to print out a list of the names of the sociable users.
An external company wants to run some ads so you want them to see user_ids and likes, but not names or email addresses. How about creating a new database user, building a view, granting them SELECT on that view?
In the 1980s and 1990s, when client-server dominated IT, each individual accessing the database had a distinct user name. So the RDBMSes have extensive support for fine-grained access control. This is becoming a lost art, however, because the Web server tends to connect as a single RDBMS user.
Write a query to return the following: for a given user (okay to hard code a user ID; pick "1" without loss of generality), return a list of friends of friends who are not already friends of the query user. This can then be show to the user as "friend suggestions".
Refine the query to restrict the report to just those people who have at least three friends in common.
Android development is done using Java and the Android Software Development Kit (SDK). For more information on developing for Android (including great examples and reference material), check out the Android Developer Portal at http://developer.android.com.
For the exercises in this class, we'll use Android Studio. Before Day 2, you should download Android Studio, the java development kit and the accelerated emulator (if your system supports it). You should also launch Android Studio at least once, so that it can fetch a number of required components including emulator boot images and the SDK.
The exercises build on a sample Facebooklet application that we've provided. It consists of an Android client app, which you'll import into Android studio (instructions below), and a server backend that is already running on the VM.
The client application displays the following information, which it pulls from the server in JSON format:
You can inspect the JSON directly at http://localhost:8080/~dev/php/rdbmsapp/pulse.php from your laptop, or at http://localhost/~dev/php/rdbmsapp/pulse.php from inside the VM. The client refreshes the display every 3 seconds so you can see the results of changes to the database pretty much in real time. Note that the script may return an error if you have not completed the data model exercises from Day 1. If this is the case, you can load our solutions for the data model using the instructions in /home/dev/sql/facebooklet/facebooklet-loader.sql on the VM.
Instructions for importing the client app into Android Studio:
At this point you should be able to compile and run the app by selecting Run -> Run app (or clicking on the green play button in the toolbar). Before doing this, be sure to launch the Android Device Manager (AVD) from Tools -> Android -> AVD Manager and create a new virtual device for testing.
The AVD can take a few minutes to fully load so be patient. It's advisable to leave the AVD running in the background while you are developing to avoid experiencing this wait each time you recompile. If the AVD is left open, subsequent runs will connect to the existing AVD, download your application, and run it all in just a few seconds.
Experiment with the application. Try modifying the database with the mysql client and watching the interface on the AVD update. Read through the source code and try to familiarize yourself with what it's doing. MainActivity.java is the main file. It manages the user interface (GUI), sets up the connection to the webserver, and invokes the JSON parser defined in JSONParser.java.
We don't want to permit programs running on the public internet, even our own, to connect directly to our database management system. The connection to the database is managed by PHP code running in our webserver, which produces a JSON document that our application reads. For more insight into what's going on server-side launch your text editor of choice and open the file: /home/dev/public_html/php/rdbmsapp/pulse.php
Once you're feeling comfortable, modify the server and application code to add a display of the most recent status message.
We'll ignore the challenges of user authentication and have a completely open system. Pick a user_id and hard-wire it into your Android application. Write software on the Web server that accepts any user ID and will post status updates for that user and also return updates from friends, in reverse chronological order.
The Android application should have just one screen with a status entry box and, underneath, a periodically updated list of recent status updates.
We recommend building two scripts on the Web server side, http://localhost/php/android/status-updates.php?for_user_id=45 (example) and http://localhost/php/android/post-new-status.php that uses the POST method to supply the user_id and new message text.