A Practical Introduction to Data Science Skills: Notes and Links

Data Just Right

A Practical Introduction to Data Science Skills

Michael Manoochehri

UCB iSchool MIMS Alum, 2010

Google, Inc / Google Cloud Platform, Data

Author of upcoming book “Data Just Right”

google.com/+MichaelManoochehri

@nTangledMichael

So, where are we? Why is this all so confusing?

Perspectives on Data Science

Collecting and Serving Large Datasets

Building out the Data Pipeline Pattern

Telling the Story: Narrative Tools

Case Studies

Future Trends in Data Tools

A push toward everything “In the Cloud”

My Suggestions for becoming a “Data Scientist”

So, where are we? Why is this all so confusing?

“a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.” - danah boyd & Kate Crawford (Critical Questions for Big Data)

“Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.” - George Dyson

Rapid innovation in the data technology space is the result of open source technology, commodity hardware, accessibility of cloud computing, major success stories in data-driven applications, innovations in visualization techniques, and a feeling of opportunity for human creativity to produce novel insights.

The cliffhanger: What do we need to invest in?

Perspectives on Data Science

Despite the (IMO) unfortunate term, it is important to define what a “Data Scientist” actually does, because the role is becoming increasingly important in organizations.

Hilary Mason: Getting Started with Data Science

“...data scientists do three fundamentally different things: math, code (and engineer systems), and communicate.” - Hilary Mason

DJ Patil: What is a Data Scientist?

“It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data... If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity.”

Sean Taylor: Scientists make their own Data

“Many social/digital scientists are reluctant to invest in making data because it’s much more costly and risky than analyzing data you already have available.” - Sean Taylor

Collecting and Serving Large Datasets

Example 1: Hosting and sharing large amounts of data

It’s not as easy as just throwing a bunch of files into a cloud storage provider.

Example: Library of Congress Twitter Archive

Goal: Understand who the audience is and what your budget dictates.

Should you host files? What format?
Should you build an API for exploring your data?
Should your data be an interactive visualization?

Example 2: Collecting large amounts of data

Example needs:

Collecting data quickly - high performance for collecting
Retrieving a single piece of data quickly (key-value stores)
Greater freedom in querying

This versus that versus this versus that

Confusion around use cases tends to result in re-inventing the wheel and in incredibly niche technology. Example: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris

Building out the Data Pipeline Pattern

A common pattern in this space is to use a schema-less, non-relational database to collect data and a read-only analytics database with a structured schema to run queries quickly over that data, with a MapReduce transformation step in the middle.

Because of the way open-source software gets developed, there is a lot of “mismatch” between the tools needed to build this pipeline pattern.

Collecting data: Highly available, scalable, partition-tolerant database

Example: Redis, memory-based key-value store

Processing and transforming data

Example: Hadoop: Excellent and flexible framework for batch data processing

Fast, aggregate query tool for asking quick queries about your data

Example: BigQuery, Impala, Spark + Shark, etc
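
A minimal sketch of the three pipeline stages above, using only the Python standard library. Everything here is a stand-in and an assumption: a plain list of dicts plays the schema-less collection store, a map/reduce pair of functions plays the transformation step, and an in-memory SQLite table plays the analytics database. The event fields (`user`, `action`) and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Stage 1: collected records with no fixed schema (stand-in for a
# key-value or document store). Extra fields vary per record.
raw_events = [
    {"user": "ann", "action": "click"},
    {"user": "bob", "action": "click", "referrer": "search"},
    {"user": "ann", "action": "purchase", "amount": 12.5},
    {"user": "ann", "action": "click"},
]


def map_event(event):
    """Map step: emit ((user, action), 1) for each raw record."""
    yield (event["user"], event["action"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values seen for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_pipeline(events):
    """Transform raw events into fixed-schema rows and load them for querying."""
    # Stage 2: transformation (map, then reduce).
    pairs = (pair for event in events for pair in map_event(event))
    counts = reduce_counts(pairs)
    # Stage 3: load into a structured table (stand-in for the analytics database).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE user_actions (user TEXT, action TEXT, n INTEGER)")
    db.executemany(
        "INSERT INTO user_actions VALUES (?, ?, ?)",
        [(user, action, n) for (user, action), n in counts.items()],
    )
    return db


db = run_pipeline(raw_events)
rows = db.execute(
    "SELECT user, SUM(n) FROM user_actions GROUP BY user ORDER BY user"
).fetchall()
print(rows)
```

The point of the sketch is the shape, not the tools: loosely structured records go in, a batch step flattens them into a fixed schema, and only then do fast aggregate queries become easy.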

Telling the Story: Narrative Tools

Tools for Collaborative Mathematical and Statistical Analysis

R (versus?) Python

R: Well established and widely used in the stats and math worlds. Great for exploring datasets, not ideal for building robust applications
Python: A general purpose programming language with a growing set of statistics libraries

Recommendation: For practical data applications, explore the growing, if perhaps incomplete, set of Python-based tooling: NumPy, SciPy, Matplotlib, pandas, and especially IPython

Visualization tools

Visualization is a rich field with much cultural history (and the potential for telling a dishonest story).

There is a multitude of tools, both commercial and open source, for exploring data visually, such as Tableau, QlikView, Gephi, etc.

Building your own, or graphing as you explore:

Python - matplotlib
R - ggplot2

Sharing on the web - JavaScript Visualization libraries

Mike Bostock’s open source d3.js (evolved from the protovis project)

Case Studies

//staq: Understanding Players using near-real time data analytics

Combination of technologies: MemSQL and BigQuery

Support “real-time” analytics with MemSQL, and aggregate analytics with BigQuery, along with a

Visualizing all the Ships in the World

Using a variety of collection and caching techniques to build a robust data processing application

Are there 5 positions in the NBA... or 13?

Topological visualization using clustering

Guardian UK Data Blog

Future Trends in Data Tools

A push toward everything “In the Cloud”

Examples:

Amazon Redshift - Data Warehouse
Google BigQuery API (See Dremel)

Convergence: Tools that offer a best of many worlds

Relational features (consistency) in scalable, partition tolerant, non-relational databases
Non-relational databases with SQL interfaces
Hadoop being overloaded
New paradigms in distributed data frameworks (Spark, Mesos, other AMPLab projects).

What will be automated, what will require even more skilled professional... humans?

My Suggestions for becoming a “Data Scientist”

Short term skills:

Code

A working understanding of R
Become fairly proficient in Python and JavaScript
IPython

Moving and asking questions about data

Learn SQL
Learn your way around a UNIX shell: sed, awk, grep, and “pipes”
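
One low-friction way to practice SQL is the `sqlite3` module that ships with Python, since it needs no server. The sketch below assumes a made-up `visits` table with invented numbers; it only illustrates filtering with WHERE and aggregating with GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE visits (city TEXT, day TEXT, visitors INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("Berkeley", "2013-04-01", 120),
        ("Berkeley", "2013-04-02", 80),
        ("Oakland", "2013-04-01", 150),
    ],
)

# Aggregate: total visitors per city, largest first.
totals = conn.execute(
    "SELECT city, SUM(visitors) AS total "
    "FROM visits GROUP BY city ORDER BY total DESC"
).fetchall()

# Filter: rows for a single day, using a bound parameter.
one_day = conn.execute(
    "SELECT city, visitors FROM visits WHERE day = ? ORDER BY city",
    ("2013-04-01",),
).fetchall()

print(totals)
print(one_day)
```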

Distributed Data tools

Run a Hadoop instance locally
Write a Streaming MapReduce job in Python
Build a toy project using a non-relational database
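
As a starting point for the Streaming exercise, here is a rough word-count sketch. Hadoop Streaming passes records to the mapper and reducer as lines of text, with key and value separated by a tab by default, and delivers the reducer's input sorted by key. In this sketch both steps live in one file and `run_locally` (a hypothetical helper) fakes the sort/shuffle in a single process; on a real cluster the mapper and reducer would be separate scripts reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word. Input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["big data big", "data tools"])
for line in result:
    print(line)
```

Testing the logic this way first is cheap; the same thing can be tried in a shell by piping a file through the mapper, `sort`, and the reducer.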

Long term skills:

Dive into statistics

Start with basics, understand correlation, significance
Learn about mathematical models
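
To make "understand correlation" concrete, it helps to compute a Pearson correlation coefficient by hand once before reaching for a library. This is a small sketch with invented numbers (`hours` and `scores` are hypothetical); it does not test significance, which is a separate step.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 3))
```

A value near 1 or -1 means a strong linear relationship, a value near 0 means little linear relationship, and none of it says anything about cause.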

Visualization:

Explore the rich world of data visualization - read, read, read, share, collaborate
Don’t share a data visualization until you have subjective questions about your own work

Finally:

Throw all of the above away, and solve a real data challenge
Help a municipality open a dataset to the public
Help a non-profit answer a question requiring merging of complex data
Check out the iSchool MIMS final projects