Data Just Right
A Practical Introduction to Data Science Skills
Michael Manoochehri
UCB iSchool MIMS Alum, 2010
Google, Inc / Google Cloud Platform, Data
Author of upcoming book “Data Just Right”
google.com/+MichaelManoochehri
@nTangledMichael
So, where are we? Why is this all so confusing?
Collecting and Serving Large Datasets
Building out the Data Pipeline Pattern
Telling the Story: Narrative Tools
A push toward everything “In the Cloud”
My Suggestions for becoming a “Data Scientist”
“a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.” - danah boyd & Kate Crawford (Critical Questions for Big Data)
“Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.” - George Dyson
Rapid innovation in the data technology space is the result of open source technology, commodity hardware, the accessibility of cloud computing, major success stories in data-driven applications, innovations in visualization techniques, and a feeling of opportunity for human creativity to produce novel insights.
The cliffhanger: What do we need to invest in?
Despite the (IMO) unfortunate term, defining what a “Data Scientist” does matters: it helps organizations identify and shape this increasingly important role.
Hilary Mason: Getting Started with Data Science
“...data scientists do three fundamentally different things: math, code (and engineer systems), and communicate.” - Hilary Mason
DJ Patil: What is a Data Scientist?
“It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data... If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity.”
Sean Taylor: Scientists make their own Data
“Many social/digital scientists are reluctant to invest in making data because it’s much more costly and risky than analyzing data you already have available.” - Sean Taylor
Example 1: Hosting and sharing large amounts of data
It’s not as easy as just throwing a bunch of files into a cloud storage provider.
Example: Library of Congress Twitter Archive
Goal: Understand who the audience is, what your budget dictates,
Example 2: Collecting large amounts of data
Example needs:
This versus that versus this versus that
Confusion around use cases tends to result in re-inventing the wheel and in incredibly niche technology. Example: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris
A common pattern in this space: a schema-less, non-relational database collects the data, a read-only analytics database with a structured schema runs fast queries over that data, and a MapReduce transformation step sits in the middle.
Because of the way open-source applications are developed, there is a lot of “mismatch” between the tools needed to build this pipeline pattern.
Collecting data: Highly available, scalable, partition-tolerant database
Processing and transforming data
Fast aggregate query tool for asking quick questions about your data
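The three pipeline stages above can be sketched in miniature. This is only an illustration under stated assumptions, not a production design: the event records, the field names, and the `action_counts` table are hypothetical, a plain Python list stands in for the schema-less collection store, an in-process map and reduce stands in for a MapReduce framework, and an in-memory SQLite database stands in for the analytics database.

```python
import json
import sqlite3
from collections import defaultdict

# Stage 1 (collect): schema-less events, as a document store might hold them.
# These records are hypothetical; note that their fields differ.
raw_events = [
    '{"username": "ann", "action": "click", "ms": 120}',
    '{"username": "bob", "action": "click"}',
    '{"username": "ann", "action": "purchase", "amount": 9.5}',
    '{"username": "ann", "action": "click", "ms": 80}',
]


def map_event(line):
    """Map step: emit ((username, action), 1) for one raw record."""
    doc = json.loads(line)
    yield (doc["username"], doc["action"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_pipeline(lines):
    # Stage 2 (transform): map every record, then reduce by key.
    pairs = (pair for line in lines for pair in map_event(line))
    counts = reduce_counts(pairs)

    # Stage 3 (load): write the result into a table with a fixed schema.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE action_counts (username TEXT, action TEXT, n INTEGER)"
    )
    db.executemany(
        "INSERT INTO action_counts VALUES (?, ?, ?)",
        [(name, action, n) for (name, action), n in counts.items()],
    )
    return db


db = run_pipeline(raw_events)

# A quick aggregate question: how many actions did each user take?
rows = db.execute(
    "SELECT username, SUM(n) FROM action_counts "
    "GROUP BY username ORDER BY username"
).fetchall()
print(rows)
```

The point of the split is that the collection side never has to agree on a schema up front, while the query side only ever sees the structured table that the transformation step produces.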
Tools for Collaborative Mathematical and Statistical Analysis
R (versus?) Python
Recommendation: For practical data applications, explore the growing, if perhaps incomplete, set of Python-based tooling: NumPy, SciPy, Matplotlib, pandas, and especially IPython
Visualization tools
Visualization is a rich field with much cultural history (and the potential for telling a dishonest story).
There are a multitude of tools, both commercial and open source, for exploring data visually, such as Tableau, QlikView, Gephi, etc.
Building your own, or graphing as you explore:
Sharing on the web - JavaScript Visualization libraries
//staq: Understanding Players using near-real-time data analytics
Combination of technologies: MemSQL and BigQuery
Support “real-time” analytics with MemSQL, and aggregate analytics with BigQuery, along with a
Visualizing all the Ships in the World
Using a variety of collection and caching techniques to build a robust data processing application
Are there 5 positions in the NBA... or 13?
Topological visualization using clustering
Examples:
Convergence: Tools that offer a best of many worlds
What will be automated, what will require even more skilled professional... humans?
Short term skills:
Code
Moving and asking questions about data
Distributed Data tools
Long term skills:
Dive into statistics
Visualization:
Finally: