1 of 34

The Decentralized Web

What it is, what’s useful in the archival space, and current status and challenges

Kelsey Breseman & Rob Brackett
EDGI & Data Together
October 2019

2 of 34

70%

of Internet traffic is directly influenced by Google and Facebook

Staltz, “The Web Began Dying in 2014, Here’s How,” 2017

3 of 34

Great power

Great irresponsibility

4 of 34

A question of trust: who should hold your data?

EDGI: Who should hold environmental data that belong to the public?

& how can they be collected, held, and distributed justly?

5 of 34

The internet is fragile

...as we’ve learned

  • If there’s only one place it’s kept, it can be taken down (even by accident!)
  • If the link breaks, you can’t find it anymore
  • Scientists have local copies of public datasets, but:
    • validity unprovable
    • not (typically) shared

6 of 34

Data requires continuous stewardship.

Who is able to hold & distribute data?

Who is not?

What data is well-kept, and for whom?

7 of 34

How do we steward data?

“Applied to information, stewardship focuses on assuring accuracy, validity, security, management, and preservation of information holdings.”

Dawes. “Stewardship and usefulness: Policy principles for information-based transparency.” Government Information Quarterly, 2010.

Content-based addressing (CIDs, hashes, cryptographic signatures, etc.)

LOCKSS: “Lots of Copies Keeps Stuff Safe” (a principle + eponymous software & org)

8 of 34

Lots of copies:

peer-to-peer (P2P)

sharing

9 of 34

Lots of Copies Keeps Stuff Safe (LOCKSS)

  • Every time you look at something on the internet, you are temporarily downloading a copy of it. What if you kept it?
  • What if your use of data was stewardship of that data?
  • What if by using public data, you became one of many distributed holders of that data (hosting it in a way that makes it findable by its content-based address)?

10 of 34

Kind of like

...Ever worried about downloading a virus from one of these?

If the internet is peer-based, we’re going to need a way to trust our content.

11 of 34

Provable Validity

“How do I know this is really the NASA data?”

(if it’s no longer at nasa.gov, or has been archived elsewhere)

12 of 34

Digital fingerprinting

“Is this the right content?”

  • The content is static, not updating
  • There isn’t a trusted authority who is holding the content
  • I need to be certain that this content is an exact match with some other copy (held by someone else, or at some other point in time)

Digital signatures

“Is this content trustworthy?”

  • The content can change, but can only be changed by the holder of the page’s key
  • I need to be certain that the changes can come only from that authority

13 of 34

Content-based addressing:

I’m getting what I asked for

14 of 34

The address of the content is built from the content

Everything is algorithmically scrambled into this hash, so if you change even a comma of the content, the address is different

Example: IPFS’s “CID” content-based address

Contents of “hello.txt”:

Hello world!

Adding file to IPFS and getting back its CID:

➜ ipfs add hello.txt

> QmXgBq2xJKMqVo8jZdziyudNmnbiwjbpAycy5RbfDBoJRM

Contents of “hello.txt” (the same file, modified):

Hello, world!

Adding file to IPFS and getting back its CID:

➜ ipfs add hello.txt

> QmeeLUVdiSTTKQqhWqsffYDtNvvvcTfJdotkNyi1KDEJtQ
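
The same effect can be reproduced with an ordinary hash function. This is a minimal sketch, assuming plain SHA-256 over the raw bytes. It is not how IPFS derives a CID (a real CID also encodes which hash function was used and how the file was packaged), and `content_address` is a hypothetical helper name.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hex SHA-256 digest, used here as a stand-in content address."""
    return hashlib.sha256(data).hexdigest()


original = content_address(b"Hello world!")
modified = content_address(b"Hello, world!")  # the same text with one comma added

print(original)
print(modified)
print("same address?", original == modified)
```

Running it prints two unrelated-looking digests: a one-character change gives a completely different address.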

15 of 34

The content’s address is made from the content itself

If you ask by content rather than by location, you will always get back the exact thing you asked for

...or nothing, if no one has it online

https:// | frijol.gitbooks.io | /climate-change/content/
how I’m asking | who I’m asking | where to look

/ipfs/ | Qc98f2c0ee40323148c99285a83c1a80d2179a454dcbc7d3393dc52cc146f47
how I’m asking | what I’m asking for (from anyone): the hash

16 of 34

Content addressing means you get what you asked for, even if you don’t know or trust the source.
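
A rough sketch of why the source doesn’t need to be trusted: whoever asks can re-hash whatever comes back and reject anything that doesn’t match. This assumes SHA-256 hex digests as addresses and models peers as plain dictionaries. `fetch` and the peer names are hypothetical, and no real network protocol is involved.

```python
import hashlib
from typing import Dict, List, Optional


def content_address(data: bytes) -> str:
    """Stand-in content address: hex SHA-256 of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def fetch(address: str, peers: List[Dict[str, bytes]]) -> Optional[bytes]:
    """Ask each peer for `address`; accept only bytes that hash back to it."""
    for peer in peers:
        candidate = peer.get(address)
        if candidate is not None and content_address(candidate) == address:
            return candidate  # verified: exactly what was asked for
    return None  # nobody reachable holds a valid copy


report = b"example public dataset"
wanted = content_address(report)

dishonest_peer = {wanted: b"tampered data"}  # claims to hold it, but lies
honest_peer = {wanted: report}
empty_peer: Dict[str, bytes] = {}

print(fetch(wanted, [dishonest_peer, honest_peer]) == report)
print(fetch(wanted, [dishonest_peer, empty_peer]))
```

The first call skips the tampered copy and returns the real one. The second finds no valid copy and returns nothing.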

Challenges:

  • How do you know which hash corresponds to the data you’re looking for (e.g. NOAA-produced data)? You still need a trusted source to tell you which hash to ask for
  • Changes to content would change the hash (so you can trust nobody changed it). How do you get updates?
  • You no longer need the data to be kept at a .gov address to prove its validity, but you need it to be kept somewhere!

17 of 34

Key-based addressing:

This content is from who it says it’s from

18 of 34

A “keypair”: two keys that only work with each other

If it’s encrypted with one of the keys, it can only be decrypted with the other (and vice versa).

If you make one of those keys public but keep the other one private to just you, you can encrypt something using the private key & send it along with the public key, so anyone can decrypt it but only you could have encrypted it. (This is a “digital signature.”)
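
To make the keypair idea concrete, here is a toy sketch using textbook RSA, where signing really is “encrypt a hash of the message with the private key.” Everything here is an assumption for illustration: the two primes, the helper names `digest`, `sign`, and `verify`, and the lack of padding. It is far too small and simplified to be secure, and real systems may use different signature schemes entirely.

```python
import hashlib

# Toy keypair built from two known Mersenne primes. Insecure: illustration only.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                          # modulus, shared by both keys
E = 65537                          # public exponent: (E, N) is the public key
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """'Encrypt' the message digest with the private key (D, N)."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """'Decrypt' the signature with the public key (E, N) and compare digests."""
    return pow(signature, E, N) == digest(message)


message = b"example dataset, version 1"
signature = sign(message)

print(verify(message, signature))                        # untouched message
print(verify(b"example dataset, version 2", signature))  # altered message
```

Anyone holding the public key can run `verify`, but producing a signature that passes requires the private exponent.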

19 of 34

The content’s address is its public key

If the content’s signature can be verified with the public key, only someone with the private key could have put it there. (This check is done automatically.)

https:// | frijol.gitbooks.io | /climate-change/content/
how I’m asking | who I’m asking | where to look

dat:// | b3c98f2c0ee40323148c99285a83c1a80d2179a454dcbc7d3393dc52cc146f47
how I’m asking | what I’m asking for (from anywhere)

20 of 34

Key-based addressing means you know who made the data, even if you don’t know or trust who gave it to you.

Challenges:

  • How do you know that the data is the same as it was yesterday? You still need a way to refer to historical data.
  • You no longer need the data to be kept at a .gov address to prove its validity, but you need it to be kept somewhere!

21 of 34

Tying that together

  • With content addressing, you can trust that the content is what it says it is, but you need a trusted source to tell you which hash to ask for
  • You can trust the source with key-based addressing. Can that point to directories of content addresses?
  • If you have that working, now you need everything to be massively distributed
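
One way the first two pieces might fit together, sketched under heavy assumptions: a publisher signs a small directory (“manifest”) that maps names to content addresses, and a reader checks the signature first and the content hash second. The toy RSA keypair, the JSON manifest format, and the names `publish` and `lookup` are all hypothetical. None of this is the actual design of IPFS, Dat, or any other project named in this deck.

```python
import hashlib
import json
from typing import Dict, Optional, Tuple

# Toy RSA keypair from two known Mersenne primes. Insecure: illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537                # (E, N) is the public key
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def address_of(data: bytes) -> str:
    """Stand-in content address: hex SHA-256 of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def _digest(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def publish(files: Dict[str, bytes]) -> Tuple[bytes, int]:
    """Publisher: build a directory of content addresses and sign it."""
    directory = {name: address_of(body) for name, body in files.items()}
    manifest = json.dumps(directory, sort_keys=True).encode("utf-8")
    return manifest, pow(_digest(manifest), D, N)


def lookup(name: str, manifest: bytes, signature: int,
           store: Dict[str, bytes]) -> Optional[bytes]:
    """Reader: check who signed the directory, then check the content hash."""
    if pow(signature, E, N) != _digest(manifest):
        raise ValueError("manifest was not signed by this key")
    address = json.loads(manifest)[name]
    data = store.get(address)  # `store` stands in for any untrusted peer
    if data is None or address_of(data) != address:
        return None
    return data


files = {"data.csv": b"example dataset, version 1"}
manifest, signature = publish(files)
store = {address_of(body): body for body in files.values()}

print(lookup("data.csv", manifest, signature, store))
```

Publishing an update would mean signing a new manifest with the same key, while old content addresses keep pointing at the old bytes.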

All of these are works in progress.

22 of 34

Pitfalls you will still have:

  • What if the thing you’re looking for is obscure?
  • What if whoever has it goes offline?
  • What if it’s a really big dataset? Is copying it a waste of energy? Who has that much spare storage?

23 of 34

The internet is fragile.

The decentralized web (the new Internet?)
...has different problems

  • Validity is provable!
  • Better P2P shareability!
  • But it depends on:
    • A network effect: lots of peers, online, w/storage
    • Good usability
    • Designed-in attention to ethics, values, justice

24 of 34

What is the DWeb, so far?

Who’s building this new internet?

25 of 34

I think you mean

“new internets”

26 of 34

Protocol Labs (IPFS, Filecoin), QRI, Dat, Secure Scuttlebutt, CKAN…

27 of 34

Lots of development in this space!

Challenges:

  • Most of this is pretty new/not production-ready
  • Lack of compatibility between protocols
  • A lot of projects are just a few unpaid open source devs
  • The other ones have profit motives

Justice & equity concerns:

  • Homogeneity among developers (and early adopters of the tech) → may not consider important impacts of early design choices on communities not represented
  • Costs associated with storage & greater upload needs

28 of 34

How is EDGI involved?

And, how might we be?

29 of 34

Storing datasets on IPFS

Via a Data Together node hosted by QRI

30 of 34

Data Together Reading Group

Hosting values-focused discussions for DWeb creators

06/2018 Decentralized Web

07/2018 Ownership

08/2018 Commons

09/2018 Centralization

09/2018 Privacy

10/2018 Justice

04/2019 Knowledge Commons

05/2019 Civics

06/2019 Alternatives to Capitalism

08/2019 Stewardship

11/2019 Decentralization

31 of 34

Fostering connections

Bringing together DWeb developers & data managers from different projects

  • Partner with S2AC to map out data rescue projects
  • Connect with PEGI
  • Vice-chair ESIP data stewardship committee
  • Partner with Protocol Labs, Filecoin, QRI, (Dat?) through Data Together
  • Introduce them to each other

32 of 34

Testing DWeb technologies

Testing archival use cases and giving feedback to the technology creators

33 of 34

...and more?

34 of 34

Learn more & get involved
