1 of 71

Saving Data to the DWeb: a Primer and a Practical Perspective

Kelsey Breseman

@ifoundtheme

@envirodgi

2 of 71

“Cool URIs don’t change”

- Tim Berners-Lee, 1998

w3.org/Provider/Style/URI

@ifoundtheme | 2

3 of 71

“When someone follows a link and it breaks, they generally lose confidence in the owner of the server.”

- Tim Berners-Lee, 1998

4 of 71

What if the owner of the server

is your government?

5 of 71

envirodatagov.org/publications

6 of 71

https://climate.dot.gov as of Mar 1, 2017

7 of 71

https://climate.dot.gov as of Jan 27, 2020

8 of 71

If you rely on e.g. a *.gov page for certain information but the page gets taken down,

how do you know where to look for an archive?

9 of 71

If you find an archive,

how do you know it’s exactly what was originally there?

10 of 71

Is the archive any safer than the original?

11 of 71

What happens when files are shared on the decentralized web?

// text-based slides by Dana: datatogether.org/posts/05_dweb_primer/

Image-based version by Kelsey Breseman, based on slides by Dana Breseman; EDGI; Data Together

12 of 71

  1. Initializing

(what we have)

13 of 71

14 of 71

Let’s put some files on the decentralized web!

15 of 71

File           File Size
summary.txt    3.5 KB
img_254.png    249.1 KB
survey.csv     510.0 KB

16 of 71

2. Chunking

(standardizing)

17 of 71

18 of 71

Break up files into manageable chunks

19 of 71

File           Chunks (max size = 256 KB*)
summary.txt    3.5 KB
img_254.png    249.1 KB
survey.csv     255.0 KB + 255.0 KB (510 KB total)

*256 KB is approximately correct for IPFS

20 of 71

Chunking: Why?

  • No chunk is bigger than 256 KB, a standard size (or a different size, depending on the protocol)
  • You know whether a given number of chunks will fit on your hardware
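
A minimal Python sketch of fixed-size chunking, assuming a 256 KiB limit. The helper name `chunk_bytes` and the zero-filled stand-in files are hypothetical and are not taken from any protocol's implementation. Note that a plain fixed-size split of a 510 KB file yields one full 256 KB chunk and a 254 KB remainder, rather than the two even 255 KB chunks drawn on the previous slide.

```python
CHUNK_SIZE = 256 * 1024  # assumed limit: 256 KiB, roughly the IPFS default


def chunk_bytes(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive pieces of at most `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


# Stand-ins for the three example files (contents are placeholder zero bytes).
files = {
    "summary.txt": bytes(3584),       # 3.5 KB
    "img_254.png": bytes(255078),     # about 249.1 KB
    "survey.csv": bytes(510 * 1024),  # 510.0 KB
}

for name, data in files.items():
    sizes = [len(chunk) for chunk in chunk_bytes(data)]
    print(name, sizes)
```

Joining the chunks back together in order reproduces the original file, so chunking loses nothing.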

21 of 71

3. Hashing (cryptographic)

22 of 71

52ed879e 70f71d92 6eb69570 08e03ce4 ca6945d3

23 of 71

Give each chunk a unique identifier

24 of 71

Hashing is

one-way

You can’t run the function backward to get the input

25 of 71

A hash has

fixed size

It doesn’t matter how long the input is, the output will be the same size

Input (doesn’t have to be text!) -> Hash Function -> Output (Hash)

Input                              Output (Hash)
Fox                                DFCD3454 BBEO788A 751A696C 24D97009 CA992D17
The red fox runs across the ice    52ED879E 70F71D92 6EB69570 08E03CE4 CA6945D3

26 of 71

A hash is

deterministic

The same input will always produce the same output

Input (doesn’t have to be text!) -> Hash Function -> Output (Hash)

Input    Output (Hash)
Fox      DFCD3454 BBEO788A 751A696C 24D97009 CA992D17
Fox      DFCD3454 BBEO788A 751A696C 24D97009 CA992D17

27 of 71

A hash is

collision-resistant

A different input will, for all practical purposes, produce a different output

Input -> Hash Function -> Output (Hash)

Input                               Output (Hash)
The red fox runs across the ice     52ED879E 70F71D92 6EB69570 08E03CE4 CA6945D3
The red fox walks across the ice    4604281 935C7FB0 9158585A B94AE214 26EB3CEA
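
The three properties above can be checked with a short Python sketch. It assumes SHA-256 from the standard library. The slides do not say which hash function produced the values shown, so the digests printed here will not match them.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


short = digest(b"Fox")
runs = digest(b"The red fox runs across the ice")
walks = digest(b"The red fox walks across the ice")

# Fixed size: 64 hex characters (256 bits) regardless of input length.
print(len(short), len(runs))

# Deterministic: hashing the same input again gives the same output.
print(digest(b"Fox") == short)

# Collision-resistant: changing one word gives a different digest.
print(runs != walks)
```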

28 of 71

File                     Hash
summary.txt              f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
img_254.png              4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
survey.csv (2 chunks)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
                         22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

29 of 71

Hashing: Why?

  • File verification: any change anywhere in a file will result in a different hash
  • File identification: a given hash will always return the same file
  • Efficiency: even a large file can be identified with a short text string
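
As a rough illustration of file verification, the sketch below identifies a chunk by its hash and rejects a copy that has been altered. SHA-256, the function names `chunk_id` and `verify_chunk`, and the sample bytes are all assumptions made for this example.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by the hex SHA-256 digest of its bytes (assumed hash)."""
    return hashlib.sha256(chunk).hexdigest()


def verify_chunk(chunk: bytes, expected_id: str) -> bool:
    """Return True only if the chunk still hashes to the ID it was requested by."""
    return chunk_id(chunk) == expected_id


original = b"id,value\n1,41\n"  # placeholder contents, not real survey data
published_id = chunk_id(original)

tampered = original.replace(b"41", b"14")

print(verify_chunk(original, published_id))
print(verify_chunk(tampered, published_id))
```

The ID stays the same length whether the chunk is a few bytes or the full 256 KB.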

30 of 71

4. Merkle DAG

(hash together)

31 of 71

f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8

baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0

3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

32 of 71

Make a hash that refers to all of the chunks as a group

33 of 71

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

34 of 71

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

35 of 71

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
  4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

36 of 71

Concatenation of the two pair hashes:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
  4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

37 of 71

Top-level hash:
  baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

Concatenation of the two pair hashes:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
  4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

38 of 71

[Figure: the completed Merkle DAG, with the same hashes as slide 37 and top-level hash baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f]

39 of 71

Merkle DAG: Why? all the benefits of hashing…

  • File identification: one hash to find related data
  • Efficiency: the same length ID even for lots of files
  • File verification: any change anywhere in any file will result in a different hash
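
The construction walked through on the preceding slides (hash each chunk, concatenate the hashes in pairs, hash each concatenation, repeat until one hash is left) can be sketched in Python as follows. SHA-224 is an assumption, chosen only because its 56-character hex digests are the same length as the hashes on the slides. The chunk contents are placeholders, and IPFS's real DAG node format is more involved than this simplified pairing.

```python
import hashlib


def h(data: bytes) -> str:
    """Hex SHA-224 digest (an assumed hash function for this sketch)."""
    return hashlib.sha224(data).hexdigest()


def merkle_root(chunks: list[bytes]) -> str:
    """Hash chunks, then repeatedly hash concatenated pairs until one hash remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [h(chunk) for chunk in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(h("".join(pair).encode("ascii")))
            else:
                # An unpaired hash is carried up unchanged (one of several conventions).
                next_level.append(pair[0])
        level = next_level
    return level[0]


# Placeholder contents standing in for the four chunks on the slides.
chunks = [b"summary.txt", b"img_254.png", b"survey.csv chunk 1", b"survey.csv chunk 2"]
root = merkle_root(chunks)
print(root)

# Changing any one chunk changes the top-level hash.
changed = chunks[:3] + [b"survey.csv chunk 2, modified"]
print(merkle_root(changed) != root)
```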

40 of 71

File verification

Top-level hash:
  baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

Concatenation of the two pair hashes:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c    (summary.txt + img_254.png)
  4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3    (survey.csv chunks 1 + 2)

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

Chunk hashes:
  summary.txt             f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png             4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)    3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  survey.csv (chunk 2)    22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

41 of 71

File verification

Top-level hash (changed):
  754c1be0c376fb5dc9f57381d1efe01e668d68b91c7f7cd83b3f9f34

Concatenation of the two pair hashes (changed):
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15cb096f9266421cb102a8f1f7ac907cdb34d7f08c6ba0203e1d328b438

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c    (summary.txt + img_254.png)
  b096f9266421cb102a8f1f7ac907cdb34d7f08c6ba0203e1d328b438    (survey.csv chunks 1 + 2, changed)

Concatenated pairs of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08004ea48ea78f96b567e3eeeb034f952370a1a6d6e230123151d9ee2d    (changed)

Chunk hashes:
  summary.txt                       f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png                       4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
  survey.csv (chunk 1)              3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
  Modified survey.csv (chunk 2)     004ea48ea78f96b567e3eeeb034f952370a1a6d6e230123151d9ee2d

42 of 71

Merkle DAG: Why? all the benefits of hashing, and more!

  • File identification: one hash to find related data
  • Efficiency: the same length ID even for lots of files
  • File verification: any change anywhere in any file will result in a different hash
    • Partial verification if you don’t want all the files (use the top-level hashes)
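
A hedged Python sketch of partial verification for the four-chunk example: someone who only wants summary.txt and img_254.png can check them against a trusted top-level hash using just the hash of the survey.csv branch, without fetching the survey chunks. SHA-224, the function names, and the file contents are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> str:
    """Hex SHA-224 digest (an assumed hash function for this sketch)."""
    return hashlib.sha224(data).hexdigest()


def combine(left: str, right: str) -> str:
    """Hash the concatenation of two hex hashes, as in the diagrams."""
    return h((left + right).encode("ascii"))


def verify_left_branch(summary: bytes, image: bytes,
                       survey_branch_hash: str, trusted_root: str) -> bool:
    """Check two files against the root without downloading the survey chunks."""
    left = combine(h(summary), h(image))
    return combine(left, survey_branch_hash) == trusted_root


# Publisher side (placeholder contents): build the whole tree once.
summary, image = b"summary text", b"image bytes"
survey_1, survey_2 = b"survey rows 1", b"survey rows 2"
survey_branch = combine(h(survey_1), h(survey_2))
root = combine(combine(h(summary), h(image)), survey_branch)

# Downloader side: has two files, one branch hash, and the trusted root.
print(verify_left_branch(summary, image, survey_branch, root))
print(verify_left_branch(summary, b"edited image", survey_branch, root))
```

The downloader needs one extra 56-character hash in place of the whole survey.csv file.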

43 of 71

Partial verification

Top-level hash:
  baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

Concatenation of the two pair hashes:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3

Hashes of the concatenated pairs:
  84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c    (summary.txt + img_254.png)
  4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3    (survey.csv branch, chunks not shown)

Concatenated pair of chunk hashes:
  f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0

Chunk hashes:
  summary.txt    f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
  img_254.png    4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0

44 of 71

5. Accessing

(finding files)

45 of 71

Access the files by requesting their URI

46 of 71

“What you need to do is to have the web server look up a persistent URI in an instant and return the file, wherever your current crazy file system has it stored away at the moment.”

- Tim Berners-Lee, 1998

w3.org/Provider/Style/URI

47 of 71

On IPFS, the URI is the top-level hash

48 of 71

baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

49 of 71

?

/ipfs/baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

50 of 71

?

/ipfs/baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

51 of 71

3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

52 of 71

IPFS: Why?

  • Because files are found by their hash, any modification to the file = a new URI
  • Therefore IPFS is best for static content: One URI will always yield the exact same content.
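
The idea that the address is derived from the content can be shown with a toy in-memory store in Python. This is not the IPFS API: the class `ContentStore` is hypothetical, SHA-256 hex digests stand in for real IPFS content identifiers, and nothing here touches a network.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key for a block is the hash of its bytes."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hash on read so a corrupted block is never returned silently.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored block does not match its address")
        return data


store = ContentStore()
v1 = store.put(b"report, first edition")
v2 = store.put(b"report, second edition")

# An edited file gets a new address; the old address still yields the old bytes.
print(v1 != v2)
print(store.get(v1))
```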

53 of 71

On Dat, the URI is the read key for the directory

54 of 71

?

dat://afcda09c58405e822a207d29010c8aa0e768d3a7d915e11570ac4cef8c31c7eb

55 of 71

3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

56 of 71

baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c

22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94

57 of 71

Dat: Why?

  • Because files are found by their read key, you will always find the most current content the author has made available from the working directory
  • Therefore Dat is best for changing content: The content may change, but the URI will not.
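
For contrast with hash-based addressing, here is a toy Python sketch of a fixed address whose content can change. It is only an illustration: the class `MutableArchive` is hypothetical, the key is just random bytes, and the cryptographic signing that real Dat uses to prove an update comes from the author is left out entirely. The resulting `dat://` string mimics the shape of the address on the slide and does not resolve to anything.

```python
import secrets


class MutableArchive:
    """Toy archive with a fixed address and an append-only version history."""

    def __init__(self) -> None:
        # Stand-in for a read key: 32 random bytes shown as 64 hex characters.
        self.key = secrets.token_hex(32)
        self._versions: list[dict[str, bytes]] = []

    @property
    def url(self) -> str:
        return "dat://" + self.key

    def publish(self, files: dict[str, bytes]) -> int:
        """Append a new snapshot and return its version number (starting at 1)."""
        self._versions.append(dict(files))
        return len(self._versions)

    def latest(self) -> dict[str, bytes]:
        if not self._versions:
            raise LookupError("nothing published yet")
        return self._versions[-1]


archive = MutableArchive()
address = archive.url
archive.publish({"survey.csv": b"first week of readings"})
archive.publish({"survey.csv": b"first and second week of readings"})

# The address is unchanged, but it now leads to the newer content.
print(archive.url == address)
print(archive.latest()["survey.csv"])
```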

58 of 71

6. Retrieving

(rehosting)

59 of 71

baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f

60 of 71

When you download the data, you automatically re-host it, sharing files (or parts of files) with peers across the network...

61 of 71

...becoming a node and strengthening the decentralized web
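
A small Python simulation of re-hosting, assuming a toy `Node` class (hypothetical) in which every peer keeps a verified copy of whatever it downloads. Real networks add peer discovery, caching policies, and transport, none of which is modeled here.

```python
import hashlib


def address_of(data: bytes) -> str:
    """Address a block by its hex SHA-256 digest (assumed hash)."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """Toy peer that keeps, and can serve, every block it has fetched."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        address = address_of(data)
        self.blocks[address] = data
        return address

    def fetch(self, address: str, peers: list["Node"]) -> bytes:
        """Get a block from any peer that has it, verify it, and keep a copy."""
        for peer in peers:
            data = peer.blocks.get(address)
            if data is not None and address_of(data) == address:
                self.blocks[address] = data  # the downloader now hosts it too
                return data
        raise LookupError("no peer could supply a valid copy")


publisher, reader, latecomer = Node("publisher"), Node("reader"), Node("latecomer")
address = publisher.add(b"air quality readings")

reader.fetch(address, [publisher])
publisher.blocks.clear()  # the original host goes offline

# The data is still available because the reader re-hosted it.
print(latecomer.fetch(address, [publisher, reader]))
```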

62 of 71

envirodatagov.org/publications

63 of 71

Air quality data collected daily at a few of the sites in Deer Park, near Houston, Texas

64 of 71

Firefighters battle a petrochemical fire at the Intercontinental Terminals Company, March 18, 2019, in Deer Park, Texas.

David J. Phillip/AP

65 of 71

“Data were considered to be at risk unless they had a dedicated plan to not be at risk.”

Mayernik, M.S., Breseman, K., Downs, R.R., Duerr, R., Garretson, A., Hou, C.-Y. (Sophie), Environmental Data and Governance Initiative (EDGI) and Earth Science Information Partners (ESIP) Data Stewardship Committee, 2020. Risk Assessment for Scientific Data. Data Science Journal, 19(1), p.10. DOI: http://doi.org/10.5334/dsj-2020-010

66 of 71

Data risk for PDFs stored in a traditional archive

Risk factor                     Severity of risk  Likelihood of occurrence  Time to recover  Impact on user  Degree of control
Lack of use                     Low               High                      Low              Low             High
Loss of knowledge               Med               High                      High             High            Med
Catastrophes                    High              Low                       High             High            Low
Dependence on service provider  Med               Med                       Med              Med             Med
Political interference          Med               Low                       Med              High            Low
Format obsolescence             High              Low                       High             High            Med
Hardware breakdown              High              Med                       High             High            Low

Selected risk factors from the data risk matrix proposed in: Mayernik, M., Breseman, K., Downs, R. R., Duerr, R., Garretson, A., & Hou, S. (2020, January 9). Risk Assessment for Scientific Data. http://doi.org/10.5334/dsj-2020-010

67 of 71

Data risk for structured data stored on IPFS (Qri)

Risk factor                     Severity of risk  Likelihood of occurrence  Time to recover  Impact on user  Degree of control
Lack of use                     High              Med                       Low              High            Med
Loss of knowledge               High              High                      Med              High            Med
Catastrophes                    Low               Low                       Low              Low             Low
Dependence on service provider  Low               Low                       Low              Low             Med
Political interference          Low               Low                       Low              Low             Low
Format obsolescence             High              Med                       High             High            Med
Hardware breakdown              Low               Low                       Low              Low             Low

Selected risk factors from the data risk matrix proposed in: Mayernik, M., Breseman, K., Downs, R. R., Duerr, R., Garretson, A., & Hou, S. (2020, January 9). Risk Assessment for Scientific Data. http://doi.org/10.5334/dsj-2020-010

68 of 71

Single-point-of-failure risks are down; new-technology risks are up

69 of 71

The best way to strengthen the decentralized web is to use it.

70 of 71

Thanks!

Kelsey Breseman

@ifoundtheme

envirodatagov.org
@EnviroDGI

datatogether.org
@Data_Together

71 of 71

Sources

Image Credits

Kelly Wilkinson

Tsumisu (Pixabay)

Illustrade (Pixabay)

goodfreephotos.com

Fontawesome
