Web archiving software comparison
For more information about this spreadsheet, please visit https://github.com/archivers-space/research/tree/master/web_archiving
Column groups: General information | Users & developer features | "Out of the box" website crawling capabilities | Advanced data harvesting capabilities | Archive management features | Other notes | Date/version checked
Columns: Name (link to home page) | Open source? | Source repo | Operating system(s) | Primary dev. language | Target audience | CLI | GUI | WUI | Library API | Network API | Extensibility framework | Parallel crawling | Scheduled crawling | Crawl storage format(s) | Capture raw responses | Follow links | URL filtering | Advanced filtering | Extract links from JavaScript | Run JavaScript | Handle React | Extract links from Flash | Run Flash | Targeted scraping | Manual form interaction | Auto form interaction | Extract file data | Browse | Playback | Full-text search | Notable users | Notes and comments | Evaluation date | Version examined
Whole site archiving systems
Archive-It | ½ | n/a | web app | Java | enterprise | ✖︎✖︎✖︎✖︎ | ARC, WARC | ✖︎✖︎✖︎✖︎ | Run by IA. Used by hundreds of institutions. | Paid service. Core software is Heritrix, with some extensions. | 2017-05-21
Brozzler | Apache | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | IA | Uses Chromium to fetch pages. Uses warcprox and pywb. | 2017-07-14 | 1.b11
Crawler | GitHub | Lin, Win, macOS | PHP | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | MySQL | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Made by the FCC. | Bare bones crawler. License not stipulated. Purpose unclear. | 2017-06-01 | 2012-06-04
Crawler4j | Apache | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎✖︎✖︎ | Java classes | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-23
Crawljax | Apache | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎ | Plug-ins | limited | log file; plug-ins can do more | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Several papers written about the implementation. | 2017-06-01 | 3.5
grab-site | MIT | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎ | RPC | Python | ✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Uses wpull internally. | 2017-05-15
gecco | MIT | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎✖︎✖︎ | Java classes | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Code comments are in Chinese. | 2017-06-01
Heritrix | Apache+ | GitHub | Lin | Java | user | ✖︎✖︎ | JMX | Java classes | ✖︎ | ARC, WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Used by IA. | 2017-05-15
HTTrack | GPL | GitHub | Lin, Win | C | user | ✖︎✖︎ | C callbacks | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-07-03
ItSucks | GPL | SF | Lin, Win, macOS | Java | user | ✖︎✖︎ | Java classes | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-15
NetarchiveSuite | LGPL | GitHub | Lin | Java | user, enterprise | ✖︎✖︎ | Java classes | ARC, WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | ; | Netarkivet at The Royal Library of Denmark | Uses Heritrix for crawling. | 2017-07-14
Nutch | Apache | Apache | Lin, Win, macOS | Java | user, enterprise | ✖︎✖︎ | Plug-ins | several db options | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-06-01
Octoparse | ✖︎ | ✖︎ | Win | .NET | user, enterprise | ✖︎✖︎✖︎ | REST | ✖︎ | database, CSV, Excel, files on disk | ✖︎✖︎ | Seems to run in the cloud, but there's a downloadable console or something. | 2017-09-03 | 6.4.3
simplecrawler | BSD | GitHub | Lin, Win, macOS | Node.js | user | ✖︎✖︎✖︎ | Node modules, ES6+ | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-09-03 | 1.1.5
Squidwarc | GPLv3 | GitHub | Lin, macOS | Node.js | user, enterprise | ✖︎✖︎✖︎ | Node modules, ES6+ | ✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | ; | ✖︎ | A high fidelity archival crawler that uses Chrome or Chrome Headless | 2017-07-21 | d4ca0b8
StormCrawler | Apache | GitHub | Lin, Win, macOS | Java | user, enterprise | ✖︎✖︎✖︎ | Java classes | limited | several db options | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Several companies, apparently. | https://github.com/DigitalPebble/storm-crawler/wiki/Presentations | 2017-06-01
WAIL (Electron) | GPLv3 | GitHub | Lin, Win, macOS | Node.js (Electron) | user | ✖︎✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Uses Chrome Browser and Heritrix for crawling. Pywb, Twitter Monitoring And Automatic Archival | 2017-07-19 | 1.2.0-Beta2
WAIL (py) | GPLv3 | GitHub | Win, macOS | Python | user | ✖︎*✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | OpenWayback and Heritrix, cf. WAIL (Electron) | 2017-07-19 | v0.2016.07.09
WebRecorder.io | Apache | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Interactive, high-fidelity web archiving tool | 2017-07-03 | eccea96
wget | GPL | Savannah | Lin, Win, macOS | C | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | WARC, files on disk | (a) | limited | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | (a) It can show HTTP responses but it does not store them. | 2017-05-17 | 1.19
wpull | GPL | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎ | plug-ins, scripts | ✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-20 | 2.01

Single page snapshot/archiving systems
Archive.is | ✖︎ | n/a | web app | n/a | user | ✖︎✖︎✖︎ | REST | ✖︎✖︎ | web page | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎ | Very good quality page captures. | 2017-05-15
curl | MIT | GitHub | Lin, Win, macOS | C | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎ | Standard with many operating systems | 2017-05-21
FreezePage | ✖︎ | n/a | web app | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | web page | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Seems to be free for use, but not open source. | 2017-06-25
Paparazzi! | ✖︎ | n/a | macOS | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | PDF, PNG, JPG, TIFF | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎ | mhucka uses this all the time | Very good quality full-page captures. | 2017-05-24
Perma.cc | MIT + GPL | GitHub | web app | Python | user, enterprise | ✖︎✖︎✖︎ | REST | Django | ✖︎✖︎ | WARC, PDF, PNG | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎ | 2017-05-24
WARCreate | GPLv3 | GitHub | Chrome extension | JavaScript | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-07-14
webkit2png | MIT | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | PNG | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-27

Data scraping systems
Bixo | Apache | GitHub | Lin, Win, macOS, EMR | Java | user, enterprise | ✖︎✖︎✖︎ | Java classes | ✖︎ | files | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Author seems to be creator of Krugle. Uses Apache Nutch, Hadoop, Tika, others. | 2017-06-25
import.io | ✖︎ | n/a | web app | n/a | user, enterprise | ✖︎✖︎✖︎ | REST | ✖︎ | JSON, CSV, Gdocs, Tableau | ✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎✖︎ | Offers edu & charity discounts. Has a pretty active user forum. | 2017-05-24
morph.io | Affero | GitHub | Lin, Win, macOS, Docker, web app | Ruby | user, enterprise | ✖︎ | Plug-ins (supports many langs) | limited | SQLite | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Offers cloud-based scraping. Active user forum. | 2017-05-29
Portia | BSD | GitHub | Lin, Win, macOS, Docker | Python | user | ✖︎✖︎ | Python | files, MySQL, git | ✖︎✖︎✖︎✖︎✖︎ | Has visual scraping definition editor. | 2017-05-29
Web Scraper Plus+ | ✖︎ | n/a | Win | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | files, user db | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Can extract data & store in database. May no longer be supported. | 2017-05-23

– not yet categorized –
Abot
AbotX
Andjing
Anemone
Aperture
Apifier
Arachnid (Java)
Arachnid (PHP)
arachniweb
Arale
ArchiveBot
ARCOMEM
ASPseek
Bingo!
blekko | ✖︎
CCBot
cl-web-crawler
CrawlBot | ✖︎ | web app
crawler.js
crawwwler
DataparkSearch
DeepArc
DeepVaccum | ✖︎ | macOS
Django Dynamic Scraper
dryscrape
EIS Archiver
Ex-Crawler
F(b)arc
Gungho
Hounder
html-snapshots
html2warc
HyperSpider (JS)
icrawler
iwebcrawler
jedi-crawler
Jspider
JWAT
Knowlesys
LARM
Lassie
Lentil
METIS
Miru
mnoGoSearch
mummif.it
Newspaper
Norconex HTTP Collector
NutchWAX
OpenWayback
OpenWebSpider
OutbackCDX
PageFreezer
PageVault
Panscient
ParseHub