Saving Data to the DWeb: a Primer and a Practical Perspective
Kelsey Breseman
@ifoundtheme
@envirodgi
“Cool URIs don’t change”
Tim Berners-Lee, 1998
w3.org/Provider/Style/URI
“When someone follows a link and it breaks, they generally lose confidence in the owner of the server.”
Tim Berners-Lee, 1998
What if the owner of the server
is your government?
envirodatagov.org/publications
https://climate.dot.gov as of Mar 1, 2017
https://climate.dot.gov as of Jan 27, 2020
If you rely on e.g. a *.gov page for certain information but the page gets taken down,
how do you know where to look for an archive?
If you find an archive,
how do you know it’s exactly what was originally there?
Is the archive any safer than the original?
What happens when files are shared on the decentralized web?
// text-based slides by Dana: datatogether.org/posts/05_dweb_primer/
Image-based version by Kelsey Breseman, based on slides by Dana Breseman; EDGI; Data Together
(what we have)
Let’s put some files on the decentralized web!
File | File Size |
summary.txt | 3.5 KB |
img_254.png | 249.1 KB |
survey.csv | 510.0 KB |
2. Chunking
(standardizing)
Break up files into manageable chunks
File | Chunks (max size = 256 KB*) |
summary.txt | 3.5 KB |
img_254.png | 249.1 KB |
survey.csv (510 KB), chunk 1 | 255.0 KB |
survey.csv (510 KB), chunk 2 | 255.0 KB |
*256KB is approximately correct for IPFS
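The chunking step can be sketched in a few lines of Python. This is a minimal illustration, not how IPFS itself is implemented: the 256 KiB limit, the `chunk_bytes` helper, and the zero-filled stand-in files are all assumptions made for the example. Note that a fixed-size chunker fills each chunk up to the limit, so the last chunk is simply whatever remains rather than an even split.

```python
from typing import Iterator

CHUNK_SIZE = 256 * 1024  # assumed chunk limit in bytes (256 KiB)


def chunk_bytes(data: bytes, size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive slices of `data`, each at most `size` bytes long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


# Stand-ins for the three example files: the contents are made up,
# only the approximate sizes follow the table above.
files = {
    "summary.txt": bytes(3_500),
    "img_254.png": bytes(249_100),
    "survey.csv": bytes(510_000),
}

chunks_per_file = {name: list(chunk_bytes(data)) for name, data in files.items()}
for name, chunks in chunks_per_file.items():
    print(name, [len(c) for c in chunks])
```

The two small files fit in one chunk each, while the survey file is split in two.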
Chunking: Why?
3. Hashing
(cryptographic)
52ed879e 70f71d92 6eb69570 08e03ce4 ca6945d3
Give each chunk a unique identifier
Hashing is one-way
You can’t run the function backward to get the input
A hash has fixed size
It doesn’t matter how long the input is; the output will be the same size
Input (doesn’t have to be text!) → Hash Function → Output (Hash)
Fox → DFCD3454 BBEA788A 751A696C 24D97009 CA992D17
The red fox runs across the ice → 52ED879E 70F71D92 6EB69570 08E03CE4 CA6945D3
A hash is deterministic
The same input will always produce the same output
Input (doesn’t have to be text!) → Hash Function → Output (Hash)
Fox → DFCD3454 BBEA788A 751A696C 24D97009 CA992D17
Fox → DFCD3454 BBEA788A 751A696C 24D97009 CA992D17
A hash is collision-resistant
In practice, a different input will produce a different output: finding two inputs with the same hash is computationally infeasible
Input → Hash Function → Output (Hash)
The red fox runs across the ice → 52ED879E 70F71D92 6EB69570 08E03CE4 CA6945D3
The red fox walks across the ice → 46042841 935C7FB0 9158585A B94AE214 26EB3CEA
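The fixed-size, deterministic, and collision-resistant properties can be checked with Python's standard `hashlib`. This is a sketch under one assumption: SHA-1 is used only because its 40-hex-character digests have the same length as the examples on the slides. The exact digests depend on the hash function and byte encoding, so they are not guaranteed to match the slide values. SHA-1 is also no longer considered collision-resistant against deliberate attacks; IPFS defaults to SHA-256.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of `data` as uppercase hex."""
    return hashlib.sha1(data).hexdigest().upper()


short = sha1_hex(b"Fox")
runs = sha1_hex(b"The red fox runs across the ice")
walks = sha1_hex(b"The red fox walks across the ice")

# Fixed size: a 3-byte input and a 31-byte input give digests of equal length.
print(len(short), len(runs))

# Deterministic: hashing the same input again gives the same digest.
print(sha1_hex(b"Fox") == short)

# A one-word change in the input gives a different digest.
print(runs != walks)
```

The one-way property cannot be demonstrated by running code; it is a statement about what nobody knows how to compute efficiently.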
File | Hash |
summary.txt | f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8 |
img_254.png | 4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0 |
survey.csv (chunk 1) | 3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08 |
survey.csv (chunk 2) | 22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94 |
Hashing: Why?
4. Merkle DAG
(hash together)
Chunk hashes:
f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94
Hash for the group:
baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f
Make a hash that refers to all of the chunks as a group
Step 1: hash each chunk
summary.txt → f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
img_254.png → 4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
survey.csv (chunk 1) → 3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
survey.csv (chunk 2) → 22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94
Step 2: join neighboring hashes, then hash each joined pair
f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0 → 84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94 → 4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3
Step 3: join the two resulting hashes, then hash again
84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3 → baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f (top-level hash)
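The bottom-up construction can be sketched as follows. Several things here are assumptions for illustration: SHA-224 is chosen only because its 56-hex-character digests match the length of the hashes on the slides, pairs are combined by concatenating their hex strings as the slides depict, and the chunk contents are made up, so the digests will not match the slide values. Real IPFS nodes link to their children in a structured format rather than hashing a bare concatenation.

```python
import hashlib
from typing import List


def h(data: bytes) -> str:
    """Hex digest used for both chunks and joined pairs (SHA-224 is an assumption)."""
    return hashlib.sha224(data).hexdigest()


def merkle_root(leaf_hashes: List[str]) -> str:
    """Hash neighboring pairs level by level until one top-level hash remains."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = list(leaf_hashes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(h((pair[0] + pair[1]).encode()))
            else:
                # An unpaired last hash is carried up unchanged
                # (one of several possible conventions).
                next_level.append(pair[0])
        level = next_level
    return level[0]


# Made-up chunk contents standing in for the four chunks above.
chunks = [b"summary", b"image", b"survey part 1", b"survey part 2"]
leaves = [h(c) for c in chunks]
root = merkle_root(leaves)
print(root)
```

With four leaves this performs exactly the three steps above: four chunk hashes, two pair hashes, one top-level hash.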
Merkle DAG: Why? all the benefits of hashing…
File verification
Unchanged branch (summary.txt, img_254.png):
summary.txt → f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
img_254.png → 4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0 → 84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
Original survey.csv:
survey.csv (chunk 1) → 3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
survey.csv (chunk 2) → 22ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94
3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d0822ba8185f0c757e8cb295588b4c5b6b2be765ca30a4689289bba6c94 → 4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3
84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3 → baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f (top-level hash)
Modified survey.csv (chunk 2):
survey.csv (chunk 1) → 3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08
survey.csv (chunk 2, modified) → 004ea48ea78f96b567e3eeeb034f952370a1a6d6e230123151d9ee2d
3850d19ba67da3e7e3731d7fed2ea7198f808827b4d8d64976170d08004ea48ea78f96b567e3eeeb034f952370a1a6d6e230123151d9ee2d → b096f9266421cb102a8f1f7ac907cdb34d7f08c6ba0203e1d328b438
84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15cb096f9266421cb102a8f1f7ac907cdb34d7f08c6ba0203e1d328b438 → 754c1be0c376fb5dc9f57381d1efe01e668d68b91c7f7cd83b3f9f34 (top-level hash no longer matches)
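A small sketch shows why a change to one chunk is visible at the top of the tree. The hash function (SHA-224), the hex-concatenation rule, and the chunk contents are assumptions chosen to mirror the slides, not the actual IPFS or Dat implementation.

```python
import hashlib
from typing import Sequence


def h(data: bytes) -> str:
    """Hex digest for chunks and joined pairs (SHA-224 is an assumption)."""
    return hashlib.sha224(data).hexdigest()


def root_of_four(chunks: Sequence[bytes]) -> str:
    """Top-level hash for exactly four chunks, combined pairwise as on the slides."""
    a, b, c, d = (h(chunk) for chunk in chunks)
    left = h((a + b).encode())
    right = h((c + d).encode())
    return h((left + right).encode())


original = [b"summary", b"image", b"survey part 1", b"survey part 2"]
trusted_root = root_of_four(original)  # published alongside the data

# Someone alters only the last chunk.
tampered = original[:3] + [b"survey part 2 (edited)"]

print(root_of_four(original) == trusted_root)  # True
print(root_of_four(tampered) == trusted_root)  # False
```

Because every parent hash depends on its children, the modified chunk changes its own hash, its parent's hash, and the top-level hash, while the untouched branch keeps the same hashes.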
Merkle DAG: Why? all the benefits of hashing, and more!
Partial verification
summary.txt → f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e8
img_254.png → 4ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0
f4d6a47e90c3336fb7d8f7b8418fef43681c37554fbdb644064802e84ed9b9fa34f0c98ba0abeaf5fce632d2a229547ee69cda3a1024cec0 → 84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c
84f43e7df855b9db71b1e0845b5f478f84fb8de9b5906774ec72f15c4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3 → baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f (top-level hash)
(The survey.csv chunks are not needed; only their combined hash 4cdf565d3ac3fe2b5ff646fa41dfb63959b5180dd6e0d981102987b3 is used.)
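Partial verification can be sketched as checking one chunk against a trusted top-level hash using only the sibling hashes along its branch. The helper names (`join`, `verify_branch`), the SHA-224 choice, the hex-concatenation rule, and the chunk contents are all assumptions for illustration.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> str:
    """Hex digest for chunks (SHA-224 is an assumption)."""
    return hashlib.sha224(data).hexdigest()


def join(left: str, right: str) -> str:
    """Hash of two hex digests concatenated left-to-right."""
    return h((left + right).encode())


def verify_branch(chunk: bytes, path: List[Tuple[str, str]], trusted_root: str) -> bool:
    """Recompute the top-level hash from one chunk plus its sibling hashes.

    `path` lists (side, sibling_hash) from the leaf upward, where side says
    whether the sibling sits to the "left" or "right" of the current node.
    """
    current = h(chunk)
    for side, sibling in path:
        current = join(sibling, current) if side == "left" else join(current, sibling)
    return current == trusted_root


# Publisher side: build the full tree once.
h0, h1, h2, h3 = (h(c) for c in [b"summary", b"image", b"survey part 1", b"survey part 2"])
root = join(join(h0, h1), join(h2, h3))

# Reader side: holds only the first chunk, its neighbor's hash,
# and the combined hash of the other branch.
path = [("right", h1), ("right", join(h2, h3))]
ok = verify_branch(b"summary", path, root)
bad = verify_branch(b"summary (edited)", path, root)
print(ok, bad)  # True False
```

The reader never downloads the survey chunks, yet can still confirm that the chunk in hand belongs under the trusted top-level hash.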
5. Accessing
(finding files)
Access the files by requesting their URI
“What you need to do is to have the web server look up a persistent URI in an instant and return the file, wherever your current crazy file system has it stored away at the moment.”
Tim Berners-Lee, 1998
w3.org/Provider/Style/URI
On IPFS, the URI is the top-level hash
/ipfs/baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f
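Addressing content by its hash can be sketched as a toy store in which the address is derived from the bytes themselves. The `ContentStore` class is hypothetical, and the address here is a bare SHA-224 hex digest after `/ipfs/` to mirror the slide; real IPFS addresses are content identifiers (CIDs) with their own encoding.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy content-addressed store: blocks are filed under the hash of their bytes."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store `data` and return its address."""
        digest = hashlib.sha224(data).hexdigest()
        self._blocks[digest] = data
        return "/ipfs/" + digest  # address shape borrowed from the slide

    def get(self, address: str) -> bytes:
        """Look up an address and check the bytes still match it."""
        digest = address.rsplit("/", 1)[-1]
        data = self._blocks[digest]  # raises KeyError if nothing is stored there
        if hashlib.sha224(data).hexdigest() != digest:
            raise ValueError("stored bytes do not match the requested hash")
        return data


store = ContentStore()
address = store.put(b"air quality readings")
print(address)
print(store.get(address))
```

Because the address is computed from the content, the same bytes always get the same address, and any copy can be checked against the address that was requested.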
IPFS: Why?
On Dat, the URI is the read key for the directory
dat://afcda09c58405e822a207d29010c8aa0e768d3a7d915e11570ac4cef8c31c7eb
Dat: Why?
6. Retrieving
(rehosting)
baa3260b631c5cb383d7355da21d8d5adcc18301c35268aa7891419f
When you download the data, you automatically re-host it, sharing files (or parts of files) with peers across the network, becoming a node and strengthening the decentralized web.
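Re-hosting on download can be sketched with a few in-memory peers. The `Peer` class and its methods are hypothetical, and there is no real networking or peer discovery here; the point is only that a downloader keeps a verified copy and can then serve it to others.

```python
import hashlib
from typing import Dict, Iterable, Optional


def digest(data: bytes) -> str:
    """Hex digest identifying a block (SHA-224 is an assumption)."""
    return hashlib.sha224(data).hexdigest()


class Peer:
    """Toy network node that stores blocks by hash and serves them on request."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = digest(data)
        self.blocks[key] = data
        return key

    def serve(self, key: str) -> Optional[bytes]:
        return self.blocks.get(key)

    def download(self, key: str, peers: Iterable["Peer"]) -> bytes:
        """Fetch a block from the first peer whose copy matches the hash."""
        for peer in peers:
            data = peer.serve(key)
            if data is not None and digest(data) == key:
                self.blocks[key] = data  # keeping the copy makes this peer a host too
                return data
        raise LookupError(f"no reachable peer has {key}")


origin, alice, bob = Peer("origin"), Peer("alice"), Peer("bob")
key = origin.add(b"survey.csv chunk")
alice.download(key, [origin])
recovered = bob.download(key, [alice])  # the origin is not consulted at all
print(recovered)
```

In this sketch the data stays retrievable even if the original host disappears, as long as at least one peer that downloaded it remains reachable.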
envirodatagov.org/publications
Air quality data collected daily at a few of the sites in Deer Park, near Houston, Texas
Firefighters battle a petrochemical fire at the Intercontinental Terminals Company, March 18, 2019, in Deer Park, Texas.
David J. Phillip/AP
“Data were considered to be at risk unless they had a dedicated plan to not be at risk.”
Mayernik, M.S., Breseman, K., Downs, R.R., Duerr, R., Garretson, A., Hou, C.-Y. (Sophie), Environmental Data and Governance Initiative (EDGI) and Earth Science Information Partners (ESIP) Data Stewardship Committee, 2020. Risk Assessment for Scientific Data. Data Science Journal, 19(1), p.10. DOI: http://doi.org/10.5334/dsj-2020-010
Data risk for PDFs stored in a traditional archive
Risk factor | Severity of risk | Likelihood of occurrence | Time to recover | Impact on user | Degree of control |
Lack of use | Low | High | Low | Low | High |
Loss of knowledge | Med | High | High | High | Med |
Catastrophes | High | Low | High | High | Low |
Dependence on service provider | Med | Med | Med | Med | Med |
Political interference | Med | Low | Med | High | Low |
Format obsolescence | High | Low | High | High | Med |
Hardware breakdown | High | Med | High | High | Low |
Selected risk factors from the data risk matrix proposed in: Mayernik, M., Breseman, K., Downs, R. R., Duerr, R., Garretson, A., & Hou, S. (2020, January 9). Risk Assessment for Scientific Data. http://doi.org/10.5334/dsj-2020-010
Data risk for structured data stored on IPFS (Qri)
Risk factor | Severity of risk | Likelihood of occurrence | Time to recover | Impact on user | Degree of control |
Lack of use | High | Med | Low | High | Med |
Loss of knowledge | High | High | Med | High | Med |
Catastrophes | Low | Low | Low | Low | Low |
Dependence on service provider | Low | Low | Low | Low | Med |
Political interference | Low | Low | Low | Low | Low |
Format obsolescence | High | Med | High | High | Med |
Hardware breakdown | Low | Low | Low | Low | Low |
Selected risk factors from the data risk matrix proposed in: Mayernik, M., Breseman, K., Downs, R. R., Duerr, R., Garretson, A., & Hou, S. (2020, January 9). Risk Assessment for Scientific Data. http://doi.org/10.5334/dsj-2020-010
Single-point-of-failure risks are down; new-technology risks are up
The best way to strengthen the decentralized web is to use it.
Thanks!
Kelsey Breseman
@ifoundtheme
envirodatagov.org @EnviroDGI
datatogether.org @Data_Together
Sources
The center of this presentation is based on a collaboration with Dana Breseman at https://datatogether.org/posts/05_dweb_primer/ using the following sources:
IPFS intro: https://docs.ipfs.io
Dat intro: https://docs.datproject.org/
Dat summary: https://dev.to/jaybeekeeper/learning-about-dat-protocol-and-decentralization--1ghi
Chunking: https://medium.com/textileio/whats-really-happening-when-you-add-a-file-to-ipfs-ae3b8b5e4b0f
Hashing: https://komodoplatform.com/cryptographic-hash-function/
Merkle trees: https://medium.com/byzantine-studio/blockchain-fundamentals-what-is-a-merkle-tree-d44c529391d7
Merkle trees: https://taravancil.com/blog/how-merkle-trees-enable-decentralized-web/
Dat structure: https://datprotocol.github.io/how-dat-works/
Dat diagrams: https://hackmd.io/-hJeNqjkS4WlMJ0P5PRw-A
Thanks to @RangerMauve for being an excellent reviewer and general resource on Dat
Image Credits
Kelly Wilkinson
Tsumisu (Pixabay)
Illustrade (Pixabay)
goodfreephotos.com
Fontawesome