1 of 43

Jefferson Bailey, Director, Web Archiving, Internet Archive @jefferson_bail | jefferson@archive.org

Abbie Grotke, Web Archiving Team Lead, Library of Congress @agrotke | abgr@loc.gov

Mark Phillips, Associate Dean for Digital Libraries, UNT Libraries @vphill | mark.phillips@unt.edu

Harvesting Democracy: Archiving Federal Government Web Content at End of Term

AALL | July 17, 2016

2 of 43

it all began a long, long time ago, in a faraway place

https://flic.kr/p/4N2jHU

https://flic.kr/p/4JNkLE

3 of 43

original end of term web archive partners

for 2008/2012 - all IIPC & NDIIPP/NDSA partners

4 of 43

extant gov web archiving efforts

Capture, Preservation, & Access

  • LOC: .gov, election, other
  • GPO: agency sites, often ephemeral
  • NARA: congressional web harvest every 2 years
  • IA: global & curated crawls
  • Agency-level: NIH/NLM, DOE, DOL, HHS, CMS, others, using AIT or comm tools
  • UNT & Others: Topical .gov collecting

Community Efforts

  • Federal Web Archiving Group
    • most of those at left plus other feds
  • Research Initiatives
    • academic
    • NGO or watchdog
  • Citizen Driven
    • grassroots efforts
  • End of Term
    • focused but large-scale multi-institutional project

5 of 43

goals of the end of term project

  • work collaboratively to preserve public U.S. Government websites
  • document federal agencies’ presence on the web during the end of Presidential terms
  • enhance the existing research collections of the partner institutions
  • raise awareness about the need for preservation
  • engage with researchers and subject experts

6 of 43

eot collaborative distribution of work

  • IA: crawling, preservation, access, full-text search
  • LC: crawling, preservation, data transfers
  • UNT: nomination tool development, crawling, nomination mgmt, preservation, access
  • CDL: web portal, metadata
  • GPO: URL nomination, outreach
  • All: URL contributions, outreach, project management
  • Others: URLs, education

Some variance of roles between 2008 & 2012 (and for 2016)

7 of 43

major funding brought to you by….

https://flic.kr/p/8uMXjb

8 of 43

no one

https://flic.kr/p/8uMXjb

9 of 43

defining the “government web presence”

Stanford WebBase Project

2004 crawl list of URLs

10 of 43

and people like you!

11 of 43

.gov websites proliferate like invasive species

12 of 43

and yes, invasivespecies.gov once existed

13 of 43

some are non-public or unlisted

14 of 43

“web waste” & preservation mentalities

15 of 43

end of term web archive

16 of 43

affiliated efforts

17 of 43

eot extent

In Internet Archive

  • EOT 2008
    • ~3,000 seeds
    • ~102m URLs (~160m total across partners)
    • 17.95 TB (compressed)
    • multiple crawls & duplication
  • EOT 2012
    • ~5,500 seeds
    • ~45m URLs (~120m total across partners)
    • 18.60 TB (compressed)
    • more focused crawls & deduped

  • Similar data sizes, but 2012 had fewer URLs
  • 2012 notable for media richness, uniqueness, density
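
One rough way to read the two bullets above is to divide the compressed size by the URL count for the Internet Archive portion of each crawl. The sketch below does that arithmetic; it assumes decimal units (1 TB = 10^12 bytes) and treats the approximate "~" figures on the slide as exact, so the results are only indicative. The function and variable names are illustrative.

```python
# Internet Archive figures as given on the slide (both are approximate).
EOT_IA = {
    "EOT 2008": {"urls": 102_000_000, "compressed_tb": 17.95},
    "EOT 2012": {"urls": 45_000_000, "compressed_tb": 18.60},
}


def avg_kb_per_url(compressed_tb: float, urls: int) -> float:
    """Average compressed kilobytes per captured URL.

    Assumes 1 TB = 1e12 bytes and 1 KB = 1e3 bytes.
    """
    return compressed_tb * 1e12 / urls / 1e3


if __name__ == "__main__":
    for crawl, stats in EOT_IA.items():
        kb = avg_kb_per_url(stats["compressed_tb"], stats["urls"])
        print(f"{crawl}: about {kb:.0f} KB per URL")
```

Under those assumptions the 2012 crawl averages a bit more than twice as many compressed bytes per URL as 2008, which fits the point about media richness and density, though duplication in 2008 and deduplication in 2012 also affect both counts.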

18 of 43

eot stats 2008 and 2012

19 of 43

EOT2008 and EOT2012 Crawling Schedule

20 of 43

TLD Breakdown

EOT2008 - 241 unique TLDs

EOT2012 - 251 unique TLDs

225 common TLDs

21 of 43

Domain Name Breakdown

EOT2008 – 87,889 unique domains

EOT2012 – 186,214 unique domains

30,066 common domains

22 of 43

Subdomain Breakdown

EOT2008 – 140,939 unique subdomains

EOT2012 – 352,679 unique subdomains

50,155 common subdomains
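
The TLD, domain, and subdomain breakdowns each give two per-crawl counts and a count of items common to both. As a small worked example, the remaining figures (items seen in only one crawl, the combined total, and the share of the combined total that is shared) follow from set arithmetic. The helper below is an illustrative sketch using only the counts from these slides; its names are not from the project.

```python
from typing import NamedTuple


class Overlap(NamedTuple):
    only_2008: int   # present in EOT2008, absent from EOT2012
    only_2012: int   # present in EOT2012, absent from EOT2008
    union: int       # distinct items across both crawls
    jaccard: float   # common / union


def overlap(n_2008: int, n_2012: int, n_common: int) -> Overlap:
    """Derive overlap figures from two set sizes and their intersection size."""
    union = n_2008 + n_2012 - n_common
    return Overlap(n_2008 - n_common, n_2012 - n_common, union, n_common / union)


# (EOT2008 count, EOT2012 count, common count) as reported on the slides.
COUNTS = {
    "TLDs": (241, 251, 225),
    "domains": (87_889, 186_214, 30_066),
    "subdomains": (140_939, 352_679, 50_155),
}

if __name__ == "__main__":
    for level, (a, b, common) in COUNTS.items():
        o = overlap(a, b, common)
        print(f"{level}: {o.only_2008} only in 2008, {o.only_2012} only in 2012, "
              f"{o.union} total, {o.jaccard:.1%} shared")
```

By this arithmetic the TLD sets overlap heavily (225 of 267), while only 30,066 of the 244,037 distinct domains appear in both crawls.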

23 of 43

EOT2008-EOT2012 – TLD biggest change

24 of 43

Largest change by percent

25 of 43

.gov & .mil biggest change - 2008-2012

26 of 43

Top 30 .gov & .mil domains present in 2008, missing in 2012

27 of 43

Top 30 .gov & .mil domains new in 2012

28 of 43

researcher access to .gov

Researchers: PoliSci, Comms, Legal, Informatics, CS

Project: Mining ~100TB of .gov data

Pros: Data w/ services, subsidized cluster, collaborative structure, some R&D

Cons: Low up-take, tech hurdles, resource constraints

Lessons Learned: Researcher use of “big data” from web archives produces challenges of scale, processing, expertise, and familiarity with context and provenance.

29 of 43

researcher access to .gov

  • WAT Datasets (Web Archive Transformation): Key Metadata from Every Resource
  • LGA Datasets (Longitudinal Graph Analysis): What Links to What over Time
  • WANE Datasets (Web Archive Named Entities): Names of People, Places, Organizations

Web Archive Datasets (via platform, disk, APIs, whatever)

30 of 43

researcher access to .gov

31 of 43

wbm beta access to .gov

https://web-beta.archive.org

32 of 43

wbm beta access to .gov (ppt/pdf)

https://waybacksearch.archivelab.org:8091

33 of 43

eot 2016

34 of 43

eot 2016: more partners!

Federal Government Web Archiving Working Group

35 of 43

rough timeframe for 2016 project

2016

  • July 2016: Recruitment of subject experts/nominators to help identify additional websites for prioritized crawling. Today is the kickoff!
  • September 2016: Bookend (baseline) crawl of government web domains begins.
  • Fall 2016: Partners will crawl various aspects of government domains at varying frequencies, depending on selection policies/interests. Team will determine strategy for crawling prioritized websites.
  • November - February 2016-17: Crawl of prioritized websites, continued crawls of bulk lists.

2017

  • January 2017: Focused crawls will be conducted as needed during this period, particularly around Inauguration Day.
  • Spring or Summer 2017: Bookend crawl of all seeds, plus additional crawl of prioritized websites as determined by team.

36 of 43

eot 2016 opportunities

  • Expand Acquisition
    • distribute crawling
    • deploy new tech
    • build web archiving capacity
  • Nomination and Annotation
    • community engagement
    • contributed seed lists
    • educational opportunities
  • Researcher Engagement
    • notable longitudinal breadth
    • good periodicity for data-mining
    • growing community of interest
  • More Partners!

37 of 43

eot 2016 strategies

  • Potential Project Strategies
    • distributed crawling – deduped/replay?
    • coordinated outreach – affiliate communities?
    • more listserv & project interest
    • researcher access – datasets and hosts?
  • Access & Preservation
    • updated portal w/ FTS for all 3 eots
    • single replay WB
    • distributed preservation?

38 of 43

eot challenges

  • Same ol’ web challenges
    • complexity of content
    • volume & proliferation
    • “you get what you get” w/ little cataloging or QA
  • Distribution of work
    • more partners = more project/partner mgmt
    • contributed seed lists
  • Resource constraints
    • the “it isn’t anyone’s actual job” problem
    • tech, time limitations & scale of data
    • funding = ☹

39 of 43

eot 2016 content

  • Content
    • 7,000+ social media accounts (scrape of gov SM registry API): 44% FB, 37% TW, 10% YT
    • ~6,000 known seeds (via gov data, WB, FOIA)
    • ??? of gov on non-gov domains/seeds
    • more crowdsourced, curatorial nominations

gov,dontserveteens)

gov,dot)

gov,dot,adfs)

gov,dot,fastlane)

gov,dot,fhwa)

gov,dot,fhwa,borderplanning)

gov,dot,fhwa,collaboration)

gov,dot,fhwa,efl)

gov,dot,fhwa,environment)

gov,dot,fhwa,fhwapap04)

gov,dot,fhwa,flh)

gov,dot,fhwa,international)

gov,dot,fhwa,mutcd)

gov,dot,fhwa,nhi)

gov,dot,fhwa,ops)

gov,dot,fhwa,safety)

gov,dot,fhwa,wfl)

gov,dot,fhwa,wwwcf)

gov,dot,fmcsa)

gov,dot,fmcsa,ai)

gov,dot,fmcsa,cms)

gov,dot,fmcsa,csa)

gov,dot,fmcsa,csa2010)

gov,dot,fmcsa,li-public)

gov,dot,fmcsa,mrb)

gov,dot,fmcsa,nrcme)

gov,dot,fmcsa,safer)

gov,dot,fra)

gov,dot,fra,safetydata)

gov,dot,fta)

gov,dot,fta,transit-safety)

gov,dot,isddc)

gov,dot,its)

gov,dot,its,benefitcost)

gov,dot,its,pcb)

gov,dot,its,standards)

gov,dot,marad)

gov,dot,nhtsa)

gov,dot,nhtsa,www-esv)

gov,dot,nhtsa,www-fars)

gov,dot,nhtsa,www-nrd)

gov,dot,nhtsa,www-odi)

gov,dot,oig)

gov,dot,ost,airconsumer)

gov,dot,ost,dotcr)

gov,dot,ost,dothr)

gov,dot,ost,testimony)

gov,dot,phmsa)

gov,dot,phmsa,npms)

gov,dot,phmsa,opsweb)

gov,dot,phmsa,primis)

gov,house,bobbyscott)

gov,house,brown)

gov,house,castor)

gov,house,chrissmith)

gov,house,chu)

gov,house,clerk)

gov,house,cole)

gov,house,cummings)

gov,house,delbene)

gov,house,denham)

gov,house,desjarlais)

gov,house,docs)

gov,house,donovan)

gov,house,duckworth)

gov,house,edworkforce)

gov,house,energycommerce)

gov,house,farr)

gov,house,flores,rsc)

gov,house,foreignaffairs)

gov,house,foreignaffairs,democrats)

gov,house,fosteryouthcaucus-karenbass)

gov,house,gabbard)

gov,house,gosar)

gov,house,grothmanforms)

gov,house,gutierrez)

gov,house,heck)

gov,house,history)

gov,house,homeland)

gov,house,issa)

gov,house,jones)

gov,house,jordan)

gov,house,lee)

gov,house,lgbt-polis)

gov,house,messer)

gov,house,mulvaney)

gov,house,naturalresources)

gov,house,norton)

gov,house,oversight)

gov,house,oversight,democrats)

gov,house,paulgosar)

gov,house,perry)

gov,house,peteking)

gov,house,quigley)

gov,house,resourcescommittee)

gov,house,rules)

gov,house,scalise)

gov,house,scalise,rsc)

gov,house,schiff)

gov,house,science)

gov,house,sensenbrenner)

gov,house,smallbusiness)

gov,house,timryan)

gov,ems)

gov,energy)

gov,energy,afdc)

gov,energy,betterbuildingssolutioncenter)

gov,energy,buildingdata)

gov,energy,catalyst)

gov,energy,eere)

gov,energy,eere,apps1)

gov,energy,eere,apps2)

gov,energy,etec)

gov,energy,fossil)

gov,energy,genomicscience)

gov,energy,hss)

gov,energy,hydrogen)

gov,energy,nnsa)

gov,energy,pi)

gov,energy,science)

gov,energy,ssl)

gov,energycodes)

gov,energysavers)

gov,energystar)

gov,enfield-ct)

gov,ennistx)

gov,enterpriseal)

gov,eop)

gov,epa)

gov,epa,archive)

gov,epa,blog)

gov,epa,cfpub)

gov,epa,cumulis)

gov,epa,developer)

gov,epa,gispub4)

gov,epa,iaspub)

gov,epa,nepis)

gov,epa,ofmpub)

gov,epa,semspub)

gov,epa,water)

gov,epa,yosemite)

gov,epa,yosemite1)

gov,erie)

gov,erie,gis1)

gov,erie,gis2)

gov,erieco)

gov,erieco,engage)

gov,eriecountypa)

gov,essexct)

gov,eugene-or)

gov,eugene-or,ceapps)

gov,eugene-or,pdd)

gov,eulesstx)

gov,exeternh)

gov,fcc)

gov,fcc,apps)

gov,fcc,appsdemo)

gov,fcc,consumercomplaints)

gov,fcc,esupport)

gov,fcc,fjallfoss)

gov,fcc,hraunfoss)

gov,fcc,licensing)

gov,fcc,reboot)

gov,fcc,stations)

gov,fcc,transition)

gov,fcc,wireless)

gov,fcc,wireless2)

gov,fda)

gov,fda,accessdata)

gov,fda,blogs)

gov,fdic)

gov,fdicig)

gov,fdlp)

gov,fdlp,purl)

gov,fdsys)

gov,fec)

gov,fec,docquery)

gov,fec,eqs)

gov,federalregister)

gov,federalreserve)

gov,federalreserve,oig)

gov,federalreserveconsumerhelp)

gov,fedshirevets)

gov,feedthefuture)

gov,fema)

gov,fema,asd)

gov,fema,beta)

gov,fema,careers)

gov,fema,citizencorps)

gov,fema,community)

gov,fema,emilms)

gov,fema,gis)

gov,fema,hazards)

gov,fema,m)

gov,fema,msc)

gov,fema,ndms)

gov,fema,training)

gov,fema,usfa)

gov,fema,usfa,apps)

gov,ferc)

gov,ferndalemi)

gov,ffiec)

gov,ffiec,ithandbook)

gov,fgdc)

gov,fhfa)

gov,fido,xml)
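
The entries above look like host names written in a reversed, comma-separated form with a trailing ")" (the SURT-style ordering common in web crawling tools), which makes subdomains sort next to their parent domain. The slide does not name the format, so treat that reading as an assumption. A minimal sketch for turning such entries back into ordinary host names and tallying them per two-label domain; the function names are illustrative:

```python
from collections import Counter
from typing import Iterable


def surt_to_hostname(entry: str) -> str:
    """Convert an entry such as 'gov,dot,fhwa)' to 'fhwa.dot.gov'."""
    labels = entry.strip().rstrip(")").split(",")
    return ".".join(reversed(labels))


def hosts_per_domain(entries: Iterable[str]) -> Counter:
    """Count entries under each two-label domain (e.g. 'dot.gov')."""
    counts: Counter = Counter()
    for entry in entries:
        labels = entry.strip().rstrip(")").split(",")
        if len(labels) < 2:
            continue  # skip blank or malformed lines
        counts[f"{labels[1]}.{labels[0]}"] += 1
    return counts


if __name__ == "__main__":
    sample = ["gov,dot)", "gov,dot,fhwa)", "gov,dot,fhwa,safety)", "gov,epa)"]
    for entry in sample:
        print(entry, "->", surt_to_hostname(entry))
    print(hosts_per_domain(sample))
```

Grouping on the first two labels is a simplification that suits .gov hosts like these; it would not handle multi-part public suffixes.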

40 of 43

eot 2016: new and improved nomination tool

41 of 43

eot 2016: how you can help

  • any and all nominations welcome
  • we need particular help with:
    • judicial branch websites
    • government content on non-government domains (.com, .edu, etc.)
    • important content or subdomains on very large websites (such as NASA.gov) that might be related to current Presidential policies
    • Social media

42 of 43

further information and the form to submit: http://digital2.library.unt.edu/nomination/eth2016

43 of 43

going forward

THANKS!

  • Crawl it All!
    • Community opportunity for more distributed crawling and acquisition methods
  • Access it All!
    • Unified portal and search indices
    • New access models, user groups, analytical tools
  • Preserve it All!
    • Take our WARCs and datasets, please!
  • Join the Fun of it All!
    • Email: eot2016@archive.org (or any of us)