Three-Day RDBMS Day 3, Session 1

part of http://philip.greenspun.com/teaching/three-day-rdbms/

Updated January 25, 2015

Lecture: Materialized Views and Event Scheduler

When queries are running too slowly, sometimes programmers create materialized views to hold intermediate computations. These end up being especially valuable in data warehouses, where the database cannot fit into RAM.

In Oracle:

CREATE MATERIALIZED VIEW sales_by_month_by_state
    REFRESH START WITH SYSDATE NEXT
TRUNC(SYSDATE) + 1 + 3/24
   
ENABLE QUERY REWRITE
    AS SELECT t.calendar_month_desc, c.cust_state_province,
       SUM(s.amount_sold) AS sum_sales
       FROM times t, sales s, customers c
       WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id
       GROUP BY t.calendar_month_desc, c.cust_state_province;

Should be rebuilt every morning at 3:00 am (sysdate is fractional after midnight, so the TRUNC function bashes it back to midnight, then we add 3/24 because 1 = one day (Oracle's native date-time handling is not ANSI standard SQL))

The part of the Oracle system that cannot be accomplished in MySQL is "enable query rewrite". The above materialized view will speed up queries against the TIMES, SALES, and CUSTOMERS table even though the queries don't reference the view.

Oracle has options to update the view periodically, e.g,. once every night. This can be accomplished with a MySQL Event, which can be scheduled to run periodically, much like Unix cron or the Microsoft Windows Task Scheduler. See http://dev.mysql.com/doc/refman/5.5/en/create-event.html

Oracle offers the option that the materialized view is updated whenever a change is made to the underlying tables, e.g., a row being inserted. This can be accomplished in MySQL with triggers on the underlying tables.

Exercise: de-duplication

Suppose that we are concerned that there are some duplicate users in our Facebooklet application. Someone registers, for example, as "Thomas Jones" and a Yahoo email address, but then comes back a few months later, forgets that he already has an account, and signs up as "Tom Jones" with a Gmail address. The UNIQUE integrity constraint on email address won't help here. Write a query to find potential dupes. Let's start with a report that shows the following

user_id1, name1, email1, user_id2, name2, email2

(6 columns) where rows have the same last name and the same first character of the first name.

Hints: this will require a self-join of the users table; LEFT(input_string, 1) will return the leftmost character of input_string, using the built-in MySQL string function LEFT.

Exercise: Statistical Data in the RDBMS

We've preloaded data from the U.S. Census Bureau Current Population Survey (http://www.census.gov/cps/) into census_cps_all and created a view called census_cps of the fields that you'll need to do these problems. This is based on gathering data, via interviews, on nearly 150,000 Americans every month. (There are 139,414 rows in the March 2014 data set that we loaded.)

[Thanks to Jean Roth of the National Bureau of Economic Research for giving us a flattened CSV-format file so that we didn't have to decode the fixed-width Census records.]

There are also some helper tables:

Each record in census_cps_all contains information about one person. It also contains information about his or her family and his or her household. When there are multiple people in a family or household this means the data are not fully normalized, e.g., because information about a household will be duplicated in multiple records. There are more than 700 columns in this table and the field names are described in the Current Population Survey, 2013 Annual Social and Economic (ASEC) Supplement (http://www.census.gov/prod/techdoc/cps/cpsmar13.pdf ).

Here's the view definition…

CREATE VIEW census_cps AS SELECT

    a_age as age,

    IF(a_sex=1, "M", "F") as sex,

    census_states_decode.state_name as state,

        if(pehruslt=-4, 40, pehruslt) as weekly_hours,

    wsal_val as annual_wages,

    census_education_decode.level_name as education_level,

    IF(alm_yn=0, NULL, IF(alm_yn=2, 0, 1)) as receiving_alimony, -- no info changed to null, yes to 1, and no to 0

        alm_val as annual_alimony,

    IF(csp_yn=0, NULL, IF(csp_yn=2, 0, 1)) as receiving_child_support, -- no info changed to null, yes to 1, and no to 0

        csp_val as annual_child_support,

        IF(halm_yn=0, NULL, IF(halm_yn=2, 0, 1)) as household_receiving_alimony,

    IF(hcsp_yn=0, NULL, IF(hcsp_yn=2, 0, 1)) as household_receiving_child_support,

        IF(resnssi1=1, 1, 0) as receiving_ssdi,

        IF(hfoodsp=1, 1, 0) as receiving_food_stamps

    FROM census_cps_all, census_states_decode, census_education_decode

    WHERE

        census_states_decode.state_id=census_cps_all.gestfips AND

        census_education_decode.level_id=census_cps_all.a_hga;

In 2010, the Treasury Department said that 567,887 Americans paid alimony (based on tax returns claiming a deduction; alimony is tax-deductible but child support is not). That's roughly 0.18 percent of 320 million Americans. We would therefore expect to find 251 people in our CPS data set receiving alimony prior to correcting for the fact that children under age 15 are mostly excluded, constitute roughly 20 percent of the U.S. population, and are very unlikely to have been married and divorced.  Question 1: How many Americans actually told the Census bureau that they were receiving alimony and what was their average age?

[Source: http://www.treasury.gov/tigta/auditreports/2014reports/201440022_oa_highlights.pdf ]

What's the total sum of alimony that survey respondents say that they reported, times 3000 (the "weight" that the Census expert says should be applied to estimate a nationwide total), and how does it compare to the "more than $10 billion" the Treasury says Americans reported paying?

What's the total sum of child support that survey respondents say that they received, times the 3000 weight? How does it compare to the $32 billion that the Federal Office of Child Support Enforcement says that state and federal authorities collected from deadbeat parents? (because child support is not tax-deductible there is no authoritative source for the total paid by "absent parents" who have not been reported for non-payment or late payment)

Question 2: Where do the men receiving alimony live, how old are they, and how much do they earn in wages, on top of the alimony that they receive? And how much is their alimony as a percentage total (expressed as an integer percent, e.g., "33") of their total wage+alimony income?

Question 3: Write a query to display a table of the number of men or women (pick sex = "M" or sex = "F"; your choice) age 30-40 in each state who are (a) receiving no child support, and (b) receiving child support. This table will have two rows for each state and three columns: state, receiving_child_support, count. Hints:

Question 4: Write a query to display a table of states and the percentage of men or women (pick one sex) age 30 to 40 in that state who are receiving child support, sorted in order of declining percentage. Hints:

Question 5: Create a query to find the average wages of the people surveyed on per-state basis (limit to annual_wages > 0 and weekly_hours > 30 so that you don't include people who aren't comparable to full-time workers). The report should have three columns: state, number of people included in average, mean salary. We suggest the use of ROUND.

Question 6: Find the median wage for Massachusetts residents between age 22 and 36 who have at least some wage income and who work at least 30 hours per week.

select count(*)

from census_cps

WHERE annual_wages > 0

AND age BETWEEN 22 AND 36

AND education_level = "Bachelor's"

AND weekly_hours >= 30

AND state = "Massachusetts";

Question 7: Can you do that in a single query? Search the Web for "median in mysql" and you'll see that the most straightforward queries involve a full join (O[N^2] performance). Try one of these out on a restricted data set, e.g., starting with a view:

create view census_cps_massachusetts_subgroup

as

select * from census_cps

WHERE annual_wages > 0

AND age BETWEEN 22 AND 36

AND education_level = "Bachelor's"

AND weekly_hours >= 30

AND state = "Massachusetts";

As there are only 90 rows in this subset, and there are indices on the underlying census_cps_all table that allow these rows to be pulled out without a complete table scan, it will be practical to handle a 90x90 join.

[What's good about Oracle instead of MySQL? Check out http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions091.htm#SQLRF06315 for how you can just call the MEDIAN function.]

Discussion: MySQL Procedures

On Day 2 we saw one kind of stored procedure or "stored routine", a MySQL function. Now we'll look at using a MySQL procedure to do a straightforward calculation of median wage in each state.

What's the difference between the two? A stored function returns a scalar value, such as a number, and is invoked as part of a SELECT. A stored procedure is invoked with a CALL and can return a set of rows.

DELIMITER $$

CREATE PROCEDURE census_rich_list ()

BEGIN

  SELECT state, age, (annual_wages + annual_alimony + annual_child_support) as total_income

  FROM census_cps

  ORDER BY total_income desc

  LIMIT 10;

END $$

DELIMITER ;

call census_rich_list;

What if we wanted the median wage for subgroups within each state? Here's a MySQL procedure already loaded into your virtual machine:

DELIMITER ;;

CREATE PROCEDURE median_income_by_state_and_sex()

BEGIN

    # declare variables

    DECLARE currentState, currentSex VARCHAR(20);

    DECLARE currentMedianOffset INT;

    DECLARE done INT DEFAULT FALSE;

 

    # declare cursor, select the offset of the median record for each state/sex combo for wage earners within the target age range (22 to 36 years old)

    DECLARE median_offset_cursor CURSOR FOR

        SELECT state, sex, (count(*)/2)

        FROM census_cps

        WHERE annual_wages > 0 AND age BETWEEN 22 AND 36

        GROUP BY state, sex;

 

    # declare handle

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    # create temporary table to hold results

    DROP TEMPORARY TABLE IF EXISTS median_wages;

    CREATE TEMPORARY TABLE median_wages (state varchar(20), sex varchar(1), median_wage int, sample_size int, avg_hours int);

    # open cursor

    OPEN median_offset_cursor;

 

    the_loop: LOOP

 

        # get the values of each column into our variables

        FETCH median_offset_cursor INTO currentState, currentSex, currentMedianOffset;

        IF done THEN

          LEAVE the_loop;

        END IF;

        # select the median value for this state/sex combination

        INSERT INTO median_wages (state, sex, median_wage, sample_size, avg_hours)

            SELECT state, sex, annual_wages, (SELECT currentMedianOffset)*2, (SELECT avg(weekly_hours) FROM census_cps WHERE annual_wages>0 AND age BETWEEN 22 AND 36 AND state = currentState AND sex = currentSex) FROM census_cps

            WHERE annual_wages > 0 AND age BETWEEN 22 AND 36

                AND state = currentState AND sex = currentSex

            ORDER BY annual_wages

            LIMIT currentMedianOffset,1;

 

    END LOOP the_loop;

 

    CLOSE median_offset_cursor;

    # output the table of medians

    SELECT state, sex, median_wage, sample_size, avg_hours

    FROM median_wages

    ORDER BY state, sex;

END;;

DELIMITER ;

You can test this out with

call median_income_by_state_and_sex;

Try out another predefined procedure

call median_income_with_bachelors_by_state_and_sex;

to see where you might want to settle after you get your bachelor's degree (this one is limited to those who work at least 30 hours per week).

Exercise: analyzing visit intervals (time series data)

We've supplied you with a table of simulated data on user visits. Here's the data model:

create table ts_users (

        user_id        int not null primary key auto_increment,

        -- cannot be null, no numbers please

      -- in Oracle 11g we'd add CHECK ( regexp_like(name, '^([a-zA-Z ])+$') )

        name                varchar(100) not null,

        -- can be null, but only numbers please (except in MySQL

        -- which doesn't respect CHECK constraints)

        telephone        varchar(20) check (upper(telephone) = lower(telephone))

);

create table ts_visits (

        visit_id                int not null primary key auto_increment,

        user_id        int not null,

        -- store visit time precise to the second

        visit_time        timestamp default current_timestamp,

        foreign key (user_id) references ts_users(user_id) on delete cascade

);

-- note: we use the prefix "ts_" to avoid interference with the

-- real Facebooklet data model

-- the ON DELETE CASCADE causes rows in this table to disappear when the

-- corresponding user is deleted

Note that timestamp(0) = timestamp in MySQL because precision down to the second is the best that it can do. The typical RDBMS can store timestamps with precision down to the microsecond (TIMESTAMP(6)) and Oracle supports nanosecond precision (TIMESTAMP(9)), which could come in handy for physics and electronics data! (Sometimes folks will use those last digits of a TIMESTAMP(6) as a random number.)

We've inserted users with the following command:

insert into ts_users (name) values ('moe'),('larry'),('curly');

and used the script at http://philip.greenspun.com/teaching/three-day-rdbms/populate-visits.sql to populate the table with more or less random data.

Please write a query that will show, for 'moe', the time intervals in days between his visits (this report should have N-1 rows where N is the number of visits by 'moe') along with the date-times of the two visits that generated that interval, ordered by ascending date.

Hint: a self-join, a subquery, and the use of the MAX function may be required.

Please write a single query that gives a report of the average interval between visits for each user and then an overall site average of time between visits for a particular user. For clarity: it should be a single SELECT statement that returns N+1 rows where N is the number of users.

Hint: the ROLLUP feature of GROUP BY may be useful. (*** I'm actually not sure if it will work to give the average per user or if it will mash all of the visit intervals together ***)

Please cut and paste to add a new user and a couple of visits:

insert into ts_users (name) values ('joe newbie');

insert into ts_visits

(user_id, visit_time)

values

(last_insert_id(), '2010-02-01'), (last_insert_id(), '2010-04-01');

Now let's have a report with three columns:

Hint: This can be done with GROUP BY, HAVING, MIN, MAX and UNION or with the CASE statement.

Lecture: High-performance Hosting of RDBMS in a virtualized and containerized world

Traditional hosting (people my age): One computer, enough RAM to hold all data regularly accessed, lots of separate physical disk drives, e.g., one for the table, a different one for each index on the table, one for the logs, etc. Thus an update that involves multiple disk writes can be done in the time consumed by single disk seek and write. Downside: Moving to bigger computer means exporting data, importing (could take a weekend or longer) or using more DBMS admin skills.

Traditional hosting (people your age): One computer, enough RAM to hold all data regularly accessed, SSD storage big enough for the entire database, logs, etc., possibly with a magnetic disk for storage backups going back to the dawn of time. A 1 TB SSD drive for a desktop computer is now (January 2015) about $440.

Virtual hosting: This is what you're doing right now. The virtual machine (VM) has one "disk", which on my desktop computer right now is C:\cygwin\home\Philip\VirtualBox VMs\three-day-rdbms_default_1421369255990_87357.vmdk (4.4 GB file). Downside: slow; Upside: easy to move to bigger computer. (How slow? VMware did a study with, presumably, the best-tuned virtual machine ever to run Oracle, and found that Oracle RDBMS throughput dropped about 20 percent compared to native hardware. More typical is a friend's ecommerce site where the VM was configured to use a maximum of 16 GB of RAM and installed by itself on a computer with 32 GB of RAM.)

Virtual hosting, real-world server: in a high-load installation the VM could have multiple files assigned, each of which was on a separate physical hard drive, would then look like a separate disk to the VM. Could even give the VM actual physical disks, but then moving to a new computer isn't so straightforward (what if the new machine has a different number of disks?)

Idiot-proof yet modern and hip: Amazon RDS! Pick a size, a type of storage (disk or flash memory (SSD)), and an RDBMS, then let Amazon keep it all running, upgraded, backed up, etc. Amazon can also set up the system for automatic failover to a hot standby system. Amazon currently supports MySQL, Oracle, Microsoft SQL Server, Postgres, and … Aurora (November 2014 innovation from Amazon, a "drop-in replacement for MySQL"). Note that Google has a competitive service, though limited to MySQL.

Modern, hip, and like an Asian martial arts weapon (i.e., you are more likely to hurt yourself than an opponent): Containerization (see docker.com). Instead of running MySQL as an operating system process ("Traditional hosting") or in a virtual machine that is larded up with a full copy of Unix you run MySQL in a "container" that is easy to move from one computer to another but smaller and faster to start up than a VM. What about storage? If you're a true containerization believer then MySQL can use a separate "storage container" when it wants to write to a hard drive. How does this work out when people try it? We did a quick Google search and found "When I run the benchmark on the native machine, I get 779.5 transactions per second. When I run the benchmark in a mysql container, I get 336 transactions/second."

Pricing Out Amazon RDS

Consider a small online community with 100,000 active users. Each user has uploaded about 200 digital images, each of them 3 MB in size. That's 600 MB of data per user or 60,000,000 MB of data = 60 TB or 60,000 GB. Amazon Aurora charges, as of January 2015, 10 cents per GB per month, so that's $6000 per month for storage. If we assume  200 million requests of the database per month, at 20 cents per million requests that's an additional $40/month. Let's assume that we want the largest instance, which 32 virtual CPUs and 244 GB of RAM. That's $1.44 per hour on a reserved basis = $1050.

What if we were to push the images into Amazon S3 "standard storage"? Now the cost of storage falls to about 3 cents per month per GB. So it will cost about $1800 per month to host the images, but we will have to work harder to ensure integrity and atomicity to avoid, for example, deleting a user but leaving behind some of his or her images. Now that we have a simple database of text and numbers perhaps we want to indulge in Oracle. Assume a 50 GB database. Suppose that we think this can be served by an "xlarge" instance, which has 15 GB of RAM, because the rest of the database is on SSD. That will cost us $1,000 per month, including the Oracle license, plus 11.5 cents per month per GB = $5.75(!).

Lecture: The Web on Small Screens

Traditional web: create a one-size-fits-all layout, optimized for a screen width of N pixels where N is the average width of your users' screens, but that works for slightly narrower screens (users must scroll horizontally) and slightly wider screens (empty lanes on the left and right sides).

Less common: use HTML <table width="100%">, which stretches and shrinks to fit each viewer's screen, and test the layout on a variety of screens and browser widths.  Problems: images distort as you scale up and down; fonts sizes don't change; <table> for anything but tabular data frowned on by CSS brigade.

Smartphone screens are so much smaller than desktop screens that "designing for the middle" no longer works.  Initially, the displays simply didn't have enough pixels (the iPhone 1 shipped at 320 pixels wide).  Smartphone makers solved this problem by rendering the page for a larger display and cropping the view to the user's screen.  To view the other parts of the page, users would scroll right and down, and later zoom in and out to locate where they wanted to scroll to.  Eventually, smartphone pixel densities reached the point where full desktop-oriented designs could be rendered on smartphone screens.

Question: what should the default zoom be?  Apple's solution: 980 pixels.  Good: renders most web pages fully even if they're not designed for mobile.  Bad: traditional web pages are unusable on phones.

The other part of Apple's solution: let the web page owner indicate that they support mobile layouts by setting the default zoom explicitly, via the viewport tag.

Lecture: The Web Across 10x Dynamic Range (aka Responsive Design)

A basic invocation of the viewport tag looks like this:

<meta name="viewport" content="width=device-width">

Adding this tag to the <head> section of an html page indicates to the browser that your page will render on a variety of screen sizes, and that the browser shouldn't emulate a 980 pixel-wide display.

But how do you actually make this phone-friendly rendering happen?  There are two components to creating web pages that display across a variety of screen sizes.  The first and harder one is the design component: determining the most important information to display on smaller screens, determining what additional enriching information can be displayed on larger screens and finally the specifics of the layout at each size.

The second and easier component is to use cascading style sheets (CSS) to express how to display and hide elements on various-sized display media.  CSS defines a syntax for media queries that evaluate to true or false for a particular display medium.  Here is an example of a media query

      @media (max-width: 20cm) {

        .hide-on-smaller-browsers {

          display: none;

        }

      }

This CSS block causes any item having the "hide-on-smaller-browsers" class to be hidden on screens smaller than 20cm wide.  You can see this in action by visiting this page and adjusting your browser width up and down.  More generally: style declarations inside a media query that evaluates to true are obeyed; style declarations inside a media query that evaluates to false are ignored.

Web pages that satisfy these design goals and use CSS media queries to achieve them across a variety of media are called responsive.   More about the design motivation, plus an introduction to media queries, may be found in this article by Ethan Marcotte which gave the technique its name.

Postscript: What is a pixel?

It turns out that getting CSS to render the same way across a wide range of devices, even at the same physical screen size, is trickier than it may seem.  For more read http://alistapart.com/article/a-pixel-identity-crisis.

Exercise: A basic iPhone application

With a few extra tags you can transform a Web page into a "native-like" iPhone application.  Start by creating a copy of your server side Android scripts to

http://localhost/php/iphone/status-updates.php?for_user_id=45 (example) and

http://localhost/php/iphone/post-new-status.php

If necessary, adjust the html layout of status-updates.php to 320x460 pixels.  Next, create or locate a 57x57 pixel icon in either png or gif format (a Google image search on "57x57 icon" is fruitful) on your server at

http://localhost/php/iphone/basic.png (or basic.gif as appropriate)

and add the following link element to the <HEAD> section of your page:

<link rel="apple-touch-icon" href="/php/iphone/basic.png"/>

To hide the various Safari controls when you page launches, also add the following to the <HEAD>:

<meta name="apple-mobile-web-app-capable" content="yes" />

<meta name="apple-mobile-web-app-status-bar-style" content="black" />

Test your application by launching it in Chrome.

Bonus: copy your scripts and install your data model on a publically viewable, database-backed web server, and invite your friends to try it on their iPhones.  They will first need to point their browser to your page, and then select "Add to Home Page" from the menu.  The page will then appear like a native iPhone app.

More:

Discussion: Is everything that we learned pointless/worthless?

The last question from a class member in January 2011 was "What about this NoSQL stuff I keep hearing about?" For him and everyone else, we've prepared http://philip.greenspun.com/teaching/three-day-rdbms/beyond-the-rdbms