For more information about this spreadsheet, please visit https://github.com/archivers-space/research/tree/master/web_archiving
General information | Users & developer features | "Out of the box" website crawling capabilities | Advanced data harvesting capabilities | Archive management features | Other notes | Date/version checked
Name (link to home page) | Open source? | Source repo | Operating system(s) | Primary dev. language | Target audience | CLI | GUI | WUI | Library API | Network API | Extensibility framework | Parallel crawling | Scheduled crawling | Crawl storage format(s) | Capture raw responses | Follow links | URL filtering | Advanced filtering | Extract links from JavaScript | Run JavaScript | Handle React | Extract links from Flash | Run Flash | Targeted scraping | Manual form interaction | Auto form interaction | Extract file data | Browse | Playback | Full-text search | Notable users | Notes and comments | Evaluation date | Version examined
Whole site archiving systems
Archive-It | ½ | n/a | web app | Java | enterprise | ✖︎✖︎✖︎✖︎ | ARC, WARC | ✖︎✖︎✖︎✖︎ | Run by IA. Used by hundreds of institutions. | Paid service. Core software is Heritrix, with some extensions. | 2017-05-21
Brozzler | Apache | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | IA | Uses Chromium to fetch pages. Uses warcprox and pywb. | 2017-07-14 | 1.b11
Crawler | GitHub | Lin, Win, macOS | PHP | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | MySQL | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Made by the FCC. | Bare bones crawler. License not stipulated. Purpose unclear. | 2017-06-01 | 2012-06-04
Crawler4j | Apache | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎✖︎✖︎ | Java classes | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-23
Crawljax | Apache | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎ | Plug-ins | limited | log file; plug-ins can do more | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Several papers written about the implementation. | 2017-06-01 | 3.5
grab-site | MIT | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎ | RPC | Python | ✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Uses wpull internally. | 2017-05-15
gecco | MIT | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎✖︎✖︎ | Java classes | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Code comments are in Chinese. | 2017-06-01
Heritrix | Apache+ | GitHub | Lin | Java | user | ✖︎✖︎ | JMX | Java classes | ✖︎ | ARC, WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Used by IA. | 2017-05-15
HTTrack | GPL | GitHub | Lin, Win | C | user | ✖︎✖︎ | C callbacks | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-07-03
ItSucks | GPL | SF | Lin, Win, macOS | Java | user | ✖︎✖︎ | Java classes | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-15
NetarchiveSuite | LGPL | GitHub | Lin | Java | user, enterprise | ✖︎✖︎ | Java classes | ARC, WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Netarkivet at The Royal Library of Denmark | Uses Heritrix for crawling. | 2017-07-14
Nutch | Apache | Apache | Lin, Win, macOS | Java | user, enterprise | ✖︎✖︎ | Plug-ins | several db options | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-06-01
Octoparse | ✖︎ | ✖︎ | Win | .NET | user, enterprise | ✖︎✖︎✖︎ | REST | ✖︎ | database, CSV, Excel, files on disk | ✖︎✖︎ | Seems to run in the cloud, but there also appears to be a downloadable console. | 2017-09-03 | 6.4.3
PageFreezer | ✖︎ | ✖︎ | web app | n/a | user, enterprise | ✖︎✖︎✖︎✖︎ | web pages | ✖︎✖︎✖︎✖︎✖︎✖︎ | EDGI web monitors use it | 2017-10-04
simplecrawler | BSD | GitHub | Lin, Win, macOS | Node.js | user | ✖︎✖︎✖︎ | Node modules, ES6+ | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-09-03 | 1.1.5
Squidwarc | GPLv3 | GitHub | Lin, macOS | Node.js | user, enterprise | ✖︎✖︎✖︎ | Node modules, ES6+ | ✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | A high-fidelity archival crawler that uses Chrome or Chrome Headless | 2017-07-21 | d4ca0b8
StormCrawler | Apache | GitHub | Lin, Win, macOS | Java | user, enterprise | ✖︎✖︎✖︎ | Java classes | limited | several db options | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Several companies, apparently. | https://github.com/DigitalPebble/storm-crawler/wiki/Presentations | 2017-06-01
WAIL (Electron) | GPLv3 | GitHub | Lin, Win, macOS | Node.js (Electron) | user | ✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Uses Chrome Browser and Heritrix for crawling. Pywb, Twitter Monitoring And Automatic Archival | 2018-12-13 | 1.2.0-Beta2
WAIL (py) | MIT | GitHub | Win, macOS | Python | user | ✖︎* | ✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | OpenWayback and Heritrix, cf. WAIL (Electron) | 2018-12-13 | v0.2016.07.09
WebMagic | Apache | GitHub | Lin, Win, macOS | Java | user | ✖︎✖︎✖︎✖︎ | Java | ✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Intended as a programming framework, not an end-user app. Comments are in Chinese. | 2017-10-04 | 0.7.3
WebRecorder.io | Apache | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Interactive, high-fidelity web archiving tool | 2017-07-03 | eccea96
wget | GPL | Savannah | Lin, Win, macOS | C | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | WARC, files on disk | limited | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Use option --save-headers to save HTTP headers | 2017-05-17 | 1.19
wpull | GPL | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎ | plug-ins, scripts | ✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-20 | 2.01
Single page snapshot/archiving systems
Archive.is | ✖︎ | n/a | web app | n/a | user | ✖︎✖︎✖︎ | REST | ✖︎✖︎ | web page | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎ | Very good quality page captures. | 2017-05-15
curl | MIT | GitHub | Lin, Win, macOS | C | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | files on disk | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎ | Standard with many operating systems | 2017-05-21
FreezePage | ✖︎ | n/a | web app | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | web page | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Seems to be free to use, but not open source. | 2017-06-25
Paparazzi! | ✖︎ | n/a | macOS | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | PDF, PNG, JPG, TIFF | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎ | mhucka uses this all the time | Very good quality full-page captures. | 2017-05-24
Perma.cc | MIT + GPL | GitHub | web app | Python | user, enterprise | ✖︎✖︎✖︎ | REST | Django | ✖︎✖︎ | WARC, PDF, PNG | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | limited | ✖︎ | 2017-05-24
WARCreate | MIT | GitHub | Chrome extension | JavaScript | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | WARC | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-07-14
webkit2png | MIT | GitHub | Lin, Win, macOS | Python | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | PNG | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-05-27
Data scraping systems
Bixo | Apache | GitHub | Lin, Win, macOS, EMR | Java | user, enterprise | ✖︎✖︎✖︎ | Java classes | ✖︎ | files | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Author seems to be the creator of Krugle. Uses Apache Nutch, Hadoop, Tika, and others. | 2017-06-25
import.io | ✖︎ | n/a | web app | n/a | user, enterprise | ✖︎✖︎✖︎ | REST | ✖︎ | JSON, CSV, Gdocs, Tableau | ✖︎ | limited | ✖︎✖︎✖︎✖︎✖︎✖︎ | Offers edu & charity discounts. Has a pretty active user forum. | 2017-05-24
iRobotSoft.com | ✖︎ | ✖︎ | Win | n/a | user, enterprise | ✖︎✖︎✖︎✖︎ | files on disk, user db | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | 2017-10-04 | 2.8.2
morph.io | Affero | GitHub | Lin, Win, macOS, Docker, web app | Ruby | user, enterprise | ✖︎ | Plug-ins (supports many langs) | limited | SQLite | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Offers cloud-based scraping. Active user forum. | 2017-05-29
Portia | BSD | GitHub | Lin, Win, macOS, Docker | Python | user | ✖︎✖︎ | Python | files, MySQL, git | ✖︎✖︎✖︎✖︎✖︎ | Has a visual scraping definition editor. | 2017-05-29
WebScraper.io (fork) | LGPLv3 | GitHub | Chrome extension | JavaScript | user | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | CSV, CouchDB | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | A version of this is used by WebScraper.io (commercial) | This is a fork; the original (by Martins Balodis) has not changed since 2014. | 2017-10-04 | 0.3.1
Web Scraper Plus+ | ✖︎ | n/a | Win | n/a | user | ✖︎✖︎✖︎✖︎✖︎✖︎ | files, user db | ✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎✖︎ | Can extract data & store in database. May no longer be supported. | 2017-05-23
– not yet categorized –
80legs.com
Abot
AbotX
Andjing
Anemone
Aperture
Apifier
Arachnid (Java)
Arachnid (PHP)
arachniweb
Arale
ArchiveBot
ARCOMEM
ASPseek
Bingo!
blekko | ✖︎
CCBot
cl-web-crawler
CrawlBot | ✖︎ | web app
crawler.js
crawwwler
DataparkSearch
DeepArc
DeepVacuum | ✖︎ | macOS
Django Dynamic Scraper
dryscrape
EIS Archiver
Ex-Crawler
F(b)arc
Gungho
Hounder
html-snapshots
html2warc
HyperSpider (JS)
icrawler
iwebcrawler
jedi-crawler
Jspider
JWAT
Knowlesys
LARM
Lassie
Lentil
LinkGrabber
METIS
Miru
mnoGoSearch
mummif.it
Newspaper
Norconex HTTP Collector
NutchWAX
OpenWayback