Published using Google Docs
Homework 4 [PDS Fall 2013]
Updated automatically every 5 minutes

Practical Data Science, Fall 2013

Homework 4: Relational Databases and SQL

This homework requires access to the social e-commerce database specified in the class. For all questions, submit your SQL query and at most the first 10 rows returned by the query.

Analytics with SQL

  1. How many shops are there?
  2. What is the average price over all listings? What is the average “price” across all transactions?
  3. What is the average individual price of each listing purchased (note that the price field in the transactions table is the total price for the transaction; you need to control for quantity). How does this compare to the average listing price?
  4. Remove listings with a price or quantity of 0 and recompute the average price. How does this compare to the average price of each listing purchased?
  5. What are the 5 most expensive listings?
  6. How many listings has each user purchased? (Explain your interpretation(s))

Optional:

  1. Compute the distribution of how many users purchase different numbers of listings (# listings purchased vs. # users with that many purchases). You can ignore users with 0 purchases.   (Could you plot this? Hint: click export)
  2. Compute the number of users with each gender.
  3. Among the users with purchases, compute the number of users with each gender.
  4. If you needed to combine information from two or more tables and then use Python to perform further analytics on the result, are you now prepared to do so?  Explain briefly.

Dealing with Data: Parsing Text and Extracting Information

Consider the html document representing the course roster web page for last year’s class located here. Because this is the same html that is rendered by your favorite web browser, you can load it up and see what it looks like (probably something like file→open file). You will be using this raw, semi-structured data for the following tasks:

  1. Within the document, there are several student IDs (the column actually is titled E-mail). Extract these IDs from the html and print them to a file, one per line.
  2. Constrain your search to print only those students with four letters in their last names or less. How many students were removed?
  3. For every student in the class, in addition to extracting their student ID, extract their name. Present the results by printing out, one student per line:

                first (and middle) name [tab] last name [tab] student id

Optional:

  1. The E-mail column seems misnamed. Create a new html document that replaces all student ids in this field with student email addresses.