1 of 29

Data quality: prevention is better than the cure

Andrew Jones

Principal Engineer | Author | Google Developer Expert

All opinions are my own

2 of 29

Improve data quality at the source, so we can…

Deliver value from data faster and cheaper

3 of 29

64% of organisations think that big data and analytics are the way to deliver competitive advantage…

…yet only 1 in 5 are using it to deliver increased revenue

Source: Nash Squared

5 of 29

Hey #engineering, anyone know where I can find this data?

Sorry all, an upstream schema change broke this morning's run

IF(date < 2022, v1_pricing, v2_pricing)

We can’t even do BI well, what makes them think we can do AI?!

6 of 29

Working with poor-quality data is

time-consuming and expensive

Poor data quality costs organisations an average of $12.9 million a year

Source: Gartner

7 of 29

1-10-100 rule of data quality

George Labovitz and Yu Sang Chang, 1992

8 of 29

Failure - $100

  • Data is not accessible or usable
  • We give up! Opportunity cost
  • Unable to meet our strategic goals

9 of 29

Remediation - $10

  • Alerts and observability, often when data lands in production
  • Implementing workarounds downstream in increasingly complex/expensive ETL
  • Regular data issues erode trust in your data

10 of 29

Prevention - $1

  • “Shift-left”, so issues are caught at the source
  • Data incidents have reduced impact/cost
  • Prevent common issues from occurring
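
As a minimal sketch of what "shift-left" could look like in practice, the data generator checks each record against an agreed schema before publishing it, so a malformed record never reaches downstream consumers. The field names, the `EXPECTED_FIELDS` schema, and the `validate_record` and `publish` helpers are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical schema agreed between the data generator and its consumers.
EXPECTED_FIELDS = {"order_id": str, "amount": float, "order_date": date}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields outside the schema are flagged too, so changes are made explicitly.
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems


def publish(record: dict, sink: list) -> bool:
    """Append the record to the sink only if it passes validation at the source."""
    if validate_record(record):
        return False  # rejected before it can land downstream
    sink.append(record)
    return True
```

Here the check runs in the generator's own code path, so the issue surfaces where it is cheapest to fix rather than in a downstream ETL job.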

13 of 29

Encourage collaboration

  • Bring data consumers and generators closer together
  • Clearly articulate the value of your data-driven applications
  • Incentivise generators to provide data that meets requirements

14 of 29

Explicit interface

  • Generators explicitly provide data through an interface
  • Data provided with clear expectations (SLOs)
  • Data generators own the interface and are responsible for it
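
One way to picture such an interface is as a small, explicit description of the data that names its owner and states an expectation consumers can check. The sketch below is illustrative only: the `DataContract` class, its fields, the `orders` example, and the one-hour freshness SLO are assumptions, not a format taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataContract:
    """Hypothetical description of a dataset exposed through an explicit interface."""

    name: str
    owner: str                 # the generating team responsible for the interface
    schema: dict               # field name -> type name
    freshness_slo: timedelta   # maximum acceptable age of the latest data

    def is_fresh(self, last_updated: datetime, now: datetime) -> bool:
        """Check whether the data currently meets its freshness SLO."""
        return now - last_updated <= self.freshness_slo


# Example contract with assumed values.
orders_contract = DataContract(
    name="orders",
    owner="checkout-team",
    schema={"order_id": "string", "amount": "decimal"},
    freshness_slo=timedelta(hours=1),
)
```

Because the owner and the SLO are written down alongside the schema, a consumer knows whom to talk to and what to expect before building on the data.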

23 of 29

Prevention is better than the cure

24 of 29

UP NEXT: Book giveaway!

Thanks!

25 of 29

What percentage of AI projects fail to deliver?

Source: Gartner

26 of 29

85%

That’s a lot! How many are due to poor-quality data?

Source: Gartner

27 of 29

How much of an organisation's data is actually used?

Source: Seagate

28 of 29

32%

That leaves 68% of your data incurring costs, both monetary and in increased risk, without generating any value.

Maybe we should think about quality, more than quantity? 🤔

Source: Seagate

29 of 29

📙 https://data-contracts.com

🔖 https://andrew-jones.com

👥 in/andrewrhysjones

Questions?