1 of 32

Dream bigger than Data

Dr. Robert Clewley

Principal Data Scientist at FullStory,

a Digital Experience Intelligence platform.

Formerly Mailchimp, Georgia State University.

A Call to Action for future Data Scientists

2 of 32

START A STORY RIGHT AWAY

Gartner: 85% of big data projects fail;

only 20% of analytical insights deliver business value

3 of 32

This talk is about data science

Hint: it's in the name.

You know you're a data scientist if you're applying the scientific method.

Opinions expressed here are

  • as a data scientist
    • not a data engineer or an ML Ops engineer
  • entirely my own

4 of 32

PyData Atlanta born of Dreaming Bigger than Data

A response to corporate, "enterprisey" data science presos that focus on literal selling of technology and figurative selling of a professional practice that's unrealistically polished, patronizingly superior, and usually delivered by a white man in a suit.

It's about storytelling and community building.

It's about science.

5 of 32

This talk is about career growth

  • What are later-career options for data scientists?
  • How do you want to have a lasting impact on the world?
  • Data science should focus on the science
    • It should not focus on the data or the tools
    • Don't be stuck as a "data technician"
  • Things you can't learn at school (yet)
  • Take away a practical framework to try yourself

6 of 32

This talk is about Broadening Your Mindset

Enablement

Activation

Affordances

Trust

Risk

Falsification

(Politics)

7 of 32

The Age of Information

We've apparently lived in it (in the USA) since the 1960s.

It signals the move away from a society and economy driven by physical industry to one driven by information.

WIRED, 2014: ‘Pharmaceutical companies fell in love with “high throughput screening” techniques in the 1990s, as a way of testing out all possible molecular combinations to match a target. It was a bust. Most have now moved back towards a more rational model based around deep understanding, experience and intuition.’

8 of 32

Will all Data Science resemble existing unicorns?

Big companies found low-hanging DS fruit long ago

They have successfully channeled their ML on narrow scopes

They like to show off about it (marketing)

Don't be fooled by the hype (survivorship bias)

There are fewer opportunities to do that in other domains

"Most" data problems are much less well defined and scoped

Future problems are much harder, and even less well scoped

9 of 32

The Age of (too much) Information

  • We are drowning in data and information.
  • We are losing our ability to focus our attention (individually and collectively) on important things.
    • Like the environment, social justice, ...
  • We are terrible at communicating.
  • We are terrible at organizing complex things.
  • We are terrible at focusing on long-term (esp. latent) issues vs. short-termism.

10 of 32

Yes, this is about Data Science

DS is at a focal point where it can drive big gains in core capabilities that affect our lives and our future.

The DIKW pyramid is a convenient (and imperfect) way to talk about this.

Impactful things (like decisions) only happen up here

More abstractions, more science

11 of 32

Let's do some gardening with data

Goal: optimize for a healthy crop

12 of 32

Data - Information - Knowledge - Wisdom

Data are unprocessed, observed facts*

Raw camera bit stream from your garden

Information involves organizing data and inference from it ("what")

Computer vision measures the shape, color, size of object vs. background, classifying individual tomatoes

Knowledge is insightful, contextual, impactful ("how", "when", "why")

Understanding a correlation between rainfall, sunshine, temperature, and seasonal tomato growth.

Knowing that tomatoes are fruits (application context) and have expected properties (validation).

Wisdom is pragmatic and synthetic ("what if", "what's best")

Is it worth the opportunity cost of relocating the garden next year to improve sun exposure 10%?

Understanding that tomatoes don't belong in a fruit salad.

I should not propose a relocation to my company this year.

Future focus

Past focus

13 of 32

Data - Information - Knowledge - Wisdom

Within human minds

Within computers

14 of 32

Data is, and should be, Boring

  • In a literal sense, data is passive
  • It is the least important part of doing data science
  • The technology is also pretty boring
    • People spend way too much time fussing about technology
    • Hammers, nails ...

15 of 32

Data is, and should be, Boring

But data has affordances: opportunities, possibilities for action ...

16 of 32

The technology is also pretty boring

  • Courses and certifications mostly teach tools and how to contribute individually
    • They are biased to well-established techniques and what's "trendy"
    • There's more to life than model training and deployment pipelines
    • If all you have is a hammer, everything looks like a nail
  • Focus your future learning in SWE, math, stats, PM, and SCIENCE
    • Don't spend all your time riding the hype train over the finer points of data lakes vs. warehouses vs. meshes

17 of 32

Eyes on the Prize

  • Data Science is about evidence-based decisions, action, and IMPACT
  • The application and context of data is everything
  • This is inherently more focused on humans than computers

If you're worried your job might get automated out of existence, prepare to think much bigger than an individual's DS activities and technology-centric thinking.

18 of 32

This process focuses on individual, mechanical activities

  • Data preparation
  • Exploratory data analysis
  • Data representation and transformation
  • Algorithmics, computation, programming
  • Modeling and stats
  • Visualization, presentation, communication, documentation

This leaves out the important context of a well-defined problem, stakeholders, objectives, application, change management, etc.

It leaves out the process of achieving knowledge and wisdom.

19 of 32

Enter the Science

Empirical science is our established, rational tool to determine knowledge and provide the basis for wisdom.

If you agree that evidence-based reasoning and critical thinking are crucial properties of a better society, then data science is a great career choice!

You can be at the forefront of creating systems to generate and communicate trustworthy knowledge, understanding, and wisdom.

20 of 32

How do we get there?

A. Logistics and preparation

  1. Instrumentation and data collection
  2. Data storage (data at rest)
  3. Data transfer and processing (data in motion)

B. Administration and governance

  1. Ownership and control
  2. Security, privacy
  3. Lifecycle and change management

C. Enablement

  1. Internal accessibility
  2. Documentation
  3. Coordination of R&D with stakeholders
  4. Alignment with mission

D. Activation

  1. Validation, resilience, risk management
  2. Reproducibility, accountability, trust
  3. Wider accessibility
  4. Metrics for success
  5. Scientific process

21 of 32

Wisdom can only grow from Affordances

Ideally, there is a trustworthy, reliable process associated with each type of activity A-D, which aims to maximize value while minimizing the cost and risk associated with doing data science.

Everything about moving from data to wisdom is about creating more trust, accountability, and reliability around derivations from the previous layer, through building ever more complex affordances.

These become more people-focused further up the DIKW pyramid, hence the dependence on institutional governance and enablement policies and culture.

22 of 32

Affordances without trust

... will build you a house of cards

23 of 32

This is just good Computer Science

Modularity, structure, well-designed interfaces and abstractions

=

affords progress to build more, achieve more

24 of 32

2 is much bigger than 1

If your data isn't stored in at least two places, you don't really "have" that data.

If your scientific study hasn't been independently reproduced and corroborated by at least one other, you haven't really created value.
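The first claim is easy to check mechanically by comparing checksums of the two copies. A minimal sketch, assuming both copies are reachable as local file paths (the helper names `sha256_of` and `copies_match` are hypothetical, not from the talk):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary: Path, backup: Path) -> bool:
    """True only if both copies exist and have identical contents."""
    if not (primary.is_file() and backup.is_file()):
        return False
    return sha256_of(primary) == sha256_of(backup)
```

A missing or diverged second copy returns False, which is the "you don't really have that data" case.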

25 of 32

A metaphor for building Affordances

Measure computing productivity using early vs. modern computer user interfaces.

E.g. Hex keypad for direct entry of machine code instructions for a single-threaded CPU with no OS.

versus

Many layers of abstractions of multiprocessing applications, windows, buttons, dashboards, kernel services, clients, servers, APIs, etc.

There are countless person-hours already invested in vetting and evolving better solutions to support productivity, and those investments continue. Widespread adoption is predicated on reliability, predictability, measurable efficiency, and trustworthiness.

26 of 32

Trust-building is not just a train/test split of the dataset

Hypothesis about how the model works

Demonstration that you tried other approaches

Documented buy-in from stakeholders

Analysis of error margins, tolerances and suitability for UX (or other downstream systems)

Transparency about testing framework

Reproducibility

Opportunities for growth and iteration (esp. by others)
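The error-margin and reproducibility items can be made concrete with a seeded bootstrap. A minimal sketch, assuming per-example model errors are already available as a list (the function name `bootstrap_interval` and the sample numbers are illustrative assumptions, not from the talk):

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    A fixed seed makes the interval reproducible by anyone re-running it.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up absolute errors, for illustration only.
errors = [0.8, 1.1, 0.4, 2.3, 0.9, 1.7, 0.6, 1.2, 0.5, 1.0]
low, high = bootstrap_interval(errors)
print(f"mean error {statistics.fmean(errors):.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Reporting the interval next to the point estimate lets downstream users judge whether the tolerance suits their system.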

27 of 32

Keeping Perspective (Summary)

Algorithms, tools, and hardware are great

but focus on the impact.

Logistics, administration and governance are crucial

but focus on the impact.

Impact depends on activation, an oft-neglected phase of DS.

28 of 32

The rest of this deck isn't finished!

29 of 32

Falsify, not validate

Stop "validating" hypotheses

Start "falsifying" hypotheses
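In practice, falsifying means stating a prediction that the data could contradict, then making an honest attempt to break it. A minimal sketch using a permutation test on the garden example, assuming made-up yield numbers and a hypothetical helper `permutation_p_value` (neither is from the talk):

```python
import random
import statistics


def permutation_p_value(treated, control, n_permutations=5000, seed=0):
    """One-sided permutation test for mean(treated) > mean(control).

    Returns the fraction of label shufflings whose difference in means is at
    least as large as the observed one. A large value means chance alone can
    explain the observed difference, so the prediction is not supported.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_treated]) - statistics.fmean(pooled[n_treated:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Made-up tomato yields (kg per plant), for illustration only.
sunny_bed = [3.1, 2.8, 3.4, 3.0, 3.3, 2.9]
shady_bed = [2.6, 2.9, 2.5, 2.7, 3.0, 2.4]

# Prediction that could fail: sunnier beds yield more.
p = permutation_p_value(sunny_bed, shady_bed)
print(f"p = {p:.3f}")
```

A small p does not prove the hypothesis. It only means the claim survived one attempt to refute it, and others should be able to try again.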

30 of 32

Activating data involves adding more context, but not in an isolated, single use.

Activation isn't about a single use, just like a single working science experiment isn't a sufficient result to ensure a widely-accepted new theory.

Activation in science involves

  • creating and openly distributing the means to reproduce the experiment, and
  • multiplying the potential to include and empower a whole community of people to not only repeat the experiment but to understand its limitations and do better.

The latter is an act that can have vastly nonlinear, emergent consequences. It's not only about shouting loudly on social media because, first of all, it's not about you. It's about having put in the care and attention to enable others to validate, test, and apply the data at a greater scale than was possible before.

This is because affordances come from combinations of components that lead to the nonlinear emergence of more possibilities than those in the sum of the parts. Because those components are fallible human creations, it's normal for those to benefit from critical review and iterative improvement.

31 of 32

Simple example for you

Your activation framework might be "only" a set of scripts, version control, and some shared documents describing the design of the model, how the data is processed, and how the insights are used to solve a particular problem.
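As a starting point, one such script might record what was run, on which data, and with which settings, so that someone else can repeat it. A minimal sketch, assuming a JSON manifest saved next to the results (the function `write_manifest` and its field names are hypothetical):

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(data_path: Path, params: dict, out_path: Path) -> dict:
    """Record the inputs needed to reproduce a run and save them as JSON."""
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command": sys.argv,
        "data_file": data_path.name,
        # Hash of the input data, so a later run can confirm it used the same file.
        "data_sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "params": params,
    }
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Committing the manifest to version control alongside the scripts gives reviewers a fixed reference for what produced each result.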

32 of 32

Resources