Dream bigger than Data
Dr. Robert Clewley
Principal Data Scientist at FullStory,
a Digital Experience Intelligence platform.
Formerly Mailchimp, Georgia State University.
A Call to Action for future Data Scientists
START A STORY RIGHT AWAY
Gartner: 85% of big data projects fail;
only 20% of analytical insights deliver business value
This talk is about data science
Hint: it's in the name.
You know you're a data scientist if you're applying the scientific method.
Opinions expressed here are my own.
PyData Atlanta born of Dreaming Bigger than Data
A response to corporate, "enterprisey" data science presentations that focus on literally selling technology and figuratively selling a professional practice that is unrealistically polished, patronizingly superior, and usually delivered by a white man in a suit.
It's about storytelling and community building.
It's about science.
This talk is about career growth
This talk is about Broadening Your Mindset
Enablement
Activation
Affordances
Trust
Risk
Falsification
(Politics)
The Age of Information
We've apparently lived in it (in the USA) since the 1960s.
It signals the move away from a society and economy driven by physical industry to one driven by information.
WIRED, 2014: 'Pharmaceutical companies fell in love with "high throughput screening" techniques in the 1990s, as a way of testing out all possible molecular combinations to match a target. It was a bust. Most have now moved back towards a more rational model based around deep understanding, experience and intuition.'
Will all Data Science resemble existing unicorns?
Big companies found low-hanging DS fruit long ago
They have successfully channeled their ML efforts into narrow scopes
They like to show off about it (marketing)
Don't be fooled by the hype (survivorship bias)
There are fewer opportunities to do that in other domains
"Most" data problems are much less well defined and scoped
Future problems are much harder, and even less well scoped
The Age of (too much) Information
Yes, this is about Data Science
DS is at a focal point where it can drive big gains in core capabilities that affect our lives and our future.
The DIKW pyramid is a convenient (and imperfect) way to talk about this.
Impactful things (like decisions) only happen up here, near the top of the pyramid:
more abstraction, more science
Let's do some gardening with data
Goal: optimize for a healthy crop
Data - Information - Knowledge - Wisdom
Data are unprocessed, observed facts*
Raw camera bit stream from your garden
Information involves organizing data and inference from it ("what")
Computer vision measures the shape, color, size of object vs. background, classifying individual tomatoes
Knowledge is insightful, contextual, impactful ("how", "when", "why")
Understanding a correlation between rainfall, sunshine, temperature, and seasonal tomato growth.
Knowing that tomatoes are fruits (application context) and have expected properties (validation).
Wisdom is pragmatic and synthetic ("what if", "what's best")
Is it worth the opportunity cost of relocating the garden next year to improve sun exposure 10%?
Understanding that tomatoes don't belong in a fruit salad.
I should not propose a relocation to my company this year.
Data and information focus on the past; knowledge and wisdom focus on the future.
Data and information live within computers; knowledge and wisdom live within human minds.
Data is, and should be, Boring
But data has affordances: opportunities, possibilities for action ...
The technology is also pretty boring
Eyes on the Prize
If you're worried your job might get automated out of existence, prepare to think much bigger than an individual's DS activities and technology-centric thinking.
This process focuses on individual, mechanical activities
This leaves out the important context of a well-defined problem, stakeholders, objectives, application, change management, etc.
It leaves out the process of achieving knowledge and wisdom.
Enter the Science
Empirical science is our established, rational tool to determine knowledge and provide the basis for wisdom.
If you agree that evidence-based reasoning and critical thinking are crucial properties of a better society, then data science is a great career choice!
You can be at the forefront of creating systems to generate and communicate trustworthy knowledge, understanding, and wisdom.
How do we get there?
A. Logistics and preparation
B. Administration and governance
C. Enablement
D. Activation
Wisdom can only grow from Affordances
Ideally, each type of activity A-D has an associated trustworthy, reliable process, aimed at maximizing value while minimizing the cost and risk of doing data science.
Moving from data to wisdom is all about creating more trust, accountability, and reliability around derivations from the previous layer, through building ever-more-complex affordances.
These become more people-focused further up the DIKW pyramid, hence the dependence on institutional governance and enablement policies and culture.
Affordances without trust
... will build you a house of cards
Building trustworthy affordances is just good computer science: modularity, structure, and well-designed interfaces and abstractions afford progress to build more and achieve more.
2 is much bigger than 1
If your data isn't stored in at least two places, you don't really "have" that data.
If your scientific study hasn't been independently reproduced and corroborated by at least one other, you haven't really created value.
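As a minimal illustration of the "two places" rule, here is a stdlib Python sketch that verifies a backup copy actually matches the primary before you claim to "have" the data. The function names (`sha256_of`, `copies_match`) are mine, not from the talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary: Path, backup: Path) -> bool:
    """True only when both copies exist and their contents are identical."""
    if not (primary.exists() and backup.exists()):
        return False
    return sha256_of(primary) == sha256_of(backup)
```

A scheduled job that alerts when `copies_match` returns False is a small, boring affordance, which is exactly the point.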
A metaphor for building Affordances
Measure computing productivity using early vs. modern computer user interfaces.
E.g. Hex keypad for direct entry of machine code instructions for a single-threaded CPU with no OS.
versus
Many layers of abstractions of multiprocessing applications, windows, buttons, dashboards, kernel services, clients, servers, APIs, etc.
There are countless people hours already invested in vetting and evolving better solutions to support productivity, and those investments continue. Widespread adoption is predicated on reliability, predictability, measurable efficiency, and trustworthiness.
Trust-building is not just a train/test dataset split
Hypothesis about how model works
Demonstration that you tried other approaches
Documented buy-in from stakeholders
Analysis of error margins, tolerances and suitability for UX (or other downstream systems)
Transparency about testing framework
Reproducibility
Opportunities for growth and iteration (esp. by others)
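One item on the list above, analysis of error margins, can be sketched with stdlib-only Python: score across k folds instead of a single train/test split, so accuracy comes with a spread rather than a point value. The toy threshold model and the function names are illustrative assumptions, not from the talk.

```python
import random
import statistics


def kfold_scores(xs, ys, fit, k=5, seed=0):
    """Score a model across k folds instead of one train/test split,
    returning mean accuracy and its spread (a crude error margin)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        model = fit(train_x, train_y)  # fit() returns a predict(x) callable
        correct = sum(model(xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return statistics.mean(scores), statistics.stdev(scores)


def fit_threshold(train_x, train_y):
    """Toy 'model': predict 1 when x is at or above the training mean."""
    cut = statistics.mean(train_x)
    return lambda x: int(x >= cut)
```

The point is not the fold mechanics; it is that the number you report now carries a tolerance, which is what downstream stakeholders actually need. The rest of the list (documentation, buy-in, reproducibility) wraps process around numbers like these.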
Keeping Perspective (Summary)
Algorithms, tools, and hardware are great
but focus on the impact.
Logistics, administration and governance are crucial
but focus on the impact.
Impact depends on activation, an oft-neglected phase of DS.
The rest of this deck isn't finished!
Falsify, not validate
Stop "validating" hypotheses
Start "falsifying" hypotheses
Activating data involves adding more context, but not for an isolated, single use.
Activation isn't about a single use, just as a single successful science experiment isn't sufficient to establish a widely accepted new theory.
Activation in science involves
The latter is an act that can have vastly nonlinear, emergent consequences. It's not just about shouting loudly on social media because, first of all, it's not about you. It's about having put in the care and attention to enable others to validate, test, and apply the data at a greater scale than was possible before.
This is because affordances come from combinations of components, leading to the nonlinear emergence of more possibilities than the sum of the parts. Because those components are fallible human creations, it's normal for them to benefit from critical review and iterative improvement.
Simple example for you
Your activation framework might be "only" a set of scripts, version control, and some shared documents describing the model's design and reporting how the data is processed and how insights are used to solve a particular problem space.
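A hypothetical sketch of one small piece of such a framework: a function that writes a run manifest, so the shared documents can point at the exact code version, data, and parameters behind a result. The function name and manifest fields are my assumptions, not a prescribed format.

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_run_manifest(data_path: Path, params: dict, out_path: Path) -> dict:
    """Record what someone else needs to reproduce this run:
    the code version, the exact data, and the parameters used."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown (not a git checkout)"
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "git_commit": commit,
        "data_sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "params": params,
    }
    out_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Committing the manifest alongside the report is a cheap way to give colleagues the affordance to re-run, challenge, and extend your work.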
Resources