Data Just Right

A Practical Introduction to Data Science Skills

Michael Manoochehri

UCB iSchool MIMS Alum, 2010

Google, Inc / Google Cloud Platform, Data

Author of upcoming book “Data Just Right”


So, where are we? Why is this all so confusing?

Perspectives on Data Science

Collecting and Serving Large Datasets

Building out the Data Pipeline Pattern

Telling the Story: Narrative Tools

Case Studies

Future Trends in Data Tools

A push toward everything “In the Cloud”

My Suggestions for becoming a “Data Scientist”

So, where are we? Why is this all so confusing?

“a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.” - danah boyd & Kate Crawford (Critical Questions for Big Data)

“Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.” - George Dyson

Rapid innovation in the data technology space is the result of open source technology, commodity hardware, accessible cloud computing, major success stories in data-driven applications, innovations in visualization techniques, and a feeling of opportunity for human creativity to produce novel insights.

The cliffhanger: What do we need to invest in?

Perspectives on Data Science

Despite the (IMO) unfortunate term, defining what a “Data Scientist” does is important: it helps organizations identify and shape this increasingly important role.

Hilary Mason: Getting Started with Data Science

“Data scientists do three fundamentally different things: math, code (and engineer systems), and communicate.” - Hilary Mason

DJ Patil: What is a Data Scientist?

“It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data... If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity.”

Sean Taylor: Scientists make their own Data

“Many social/digital scientists are reluctant to invest in making data because it’s much more costly and risky than analyzing data you already have available.” - Sean Taylor

Collecting and Serving Large Datasets

Example 1: Hosting and sharing large amounts of data

It’s not as easy as just throwing a bunch of files into a cloud storage provider.

Example: Library of Congress Twitter Archive

Goal: Understand who the audience is, what your budget dictates,

Example 2: Collecting large amounts of data

Example needs:

This versus that versus this versus that

Confusion around use cases tends to result in re-inventing the wheel and in incredibly niche technology. Example: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris

Building out the Data Pipeline Pattern

A common pattern in this space is to collect data in a schema-less, non-relational database and to run fast queries over it in a read-only analytics database with a structured schema, with a MapReduce transformation step in the middle.

Because of the way open-source applications are developed, there is a lot of “mismatch” between the tools needed to build this pipeline pattern.

Collecting data: Highly available, scalable, partition-tolerant database

Processing and transforming data

Fast, aggregate query tool for asking quick queries about your data
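The three stages above can be sketched in miniature using only the Python standard library. This is a toy illustration under stated assumptions, not a production design: the in-memory list of JSON strings stands in for a schema-less collection store, the map and reduce functions stand in for a MapReduce job, and SQLite stands in for a structured analytics database. All event fields, function names, and the table name are hypothetical.

```python
import json
import sqlite3
from collections import defaultdict

# Stage 1 (collect): schema-less JSON events; fields vary from record to record.
# These sample events are made up for illustration.
RAW_EVENTS = [
    '{"user": "ana", "action": "click", "ms": 120}',
    '{"user": "bo", "action": "view"}',
    '{"user": "ana", "action": "view", "ms": 80, "referrer": "search"}',
    '{"user": "bo", "action": "click", "ms": 200}',
]


def map_event(raw):
    """Map step: turn one raw event into a (key, value) pair."""
    event = json.loads(raw)
    key = (event["user"], event["action"])
    # Events without a timing field contribute 0 milliseconds.
    return key, (1, event.get("ms", 0))


def reduce_events(pairs):
    """Reduce step: sum event counts and milliseconds per key."""
    totals = defaultdict(lambda: [0, 0])
    for key, (count, ms) in pairs:
        totals[key][0] += count
        totals[key][1] += ms
    return totals


def load(totals):
    """Stage 3 (serve): load reduced rows into a fixed-schema table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity "
        "(user TEXT, action TEXT, events INTEGER, total_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        [(user, action, count, ms) for (user, action), (count, ms) in totals.items()],
    )
    conn.commit()
    return conn


def events_per_user(conn):
    """A quick aggregate query over the structured table."""
    query = "SELECT user, SUM(events) FROM activity GROUP BY user ORDER BY user"
    return conn.execute(query).fetchall()


# Stage 2 (transform) sits between collection and serving.
conn = load(reduce_events(map(map_event, RAW_EVENTS)))
print(events_per_user(conn))
```

The split mirrors the pattern: the collection side accepts records of any shape, while the query side only ever sees a fixed set of columns produced by the transformation step.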

Telling the Story: Narrative Tools

Tools for Collaborative Mathematical and Statistical Analysis

R (versus?) Python

  1. R: Well established in the stats and math worlds, most popular - great for exploring datasets, not ideal for building robust applications
  2. Python: A general-purpose programming language with a growing set of statistics libraries

Recommendation: For practical data applications, explore the growing, if perhaps incomplete, set of Python-based tooling: NumPy, SciPy, Matplotlib, pandas, and especially IPython
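The libraries named above are third-party packages. As a dependency-free sketch of the same explore-as-you-go style, the snippet below uses only the standard library's statistics module to summarize a small dataset. The sample values and the summarize helper are hypothetical and exist only for illustration.

```python
import statistics

# Hypothetical sample: response times in milliseconds, with one outlier.
response_ms = [120, 80, 200, 95, 110, 3000, 105]


def summarize(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(response_ms)
print(summary)
```

Here the single outlier pulls the mean up to 530 while the median stays at 110, the kind of quick check that makes interactive exploration worthwhile before building anything more robust.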

Visualization tools

Visualization is a rich field with much cultural history (and the potential for telling a dishonest story).

There are a multitude of tools, both commercial and open source, for exploring data visually, such as Tableau, QlikView, Gephi, etc.

Building your own, or graphing as you explore:

Sharing on the web - JavaScript Visualization libraries

Case Studies

//staq: Understanding Players using near-real-time data analytics

Combination of technologies: MemSQL and BigQuery

Support “real-time” analytics with MemSQL, and aggregate analytics with BigQuery, along with a

Visualizing all the Ships in the World

Using a variety of collection and caching techniques to build a robust data processing application

Are there 5 positions in the NBA... or 13?

Topological visualization using clustering

Guardian UK Data Blog

Future Trends in Data Tools

A push toward everything “In the Cloud”


Convergence: Tools that offer a best of many worlds

What will be automated, what will require even more skilled professional... humans?

My Suggestions for becoming a “Data Scientist”

Short term skills:


Moving and asking questions about data

Distributed Data tools

Long term skills:

Dive into statistics