PDF URLs SHEET

Script to download all URLs in a text file recursively from the Wayback Machine
One script per PDF, depending on the number of URLs extracted.

#!/bin/bash
# This script takes a file populated with a list of URLs as input and substitutes each URL for $line.
# The file of URLs is read by the loop below, via input.file.
# The loop starts with "while read -r line". Here, line refers to each line (a URL) in your file.
# The loop then runs wayback_machine_downloader ... "$line", which runs the command against the URL on that line of your file.
while read -r line
do
    wayback_machine_downloader -a -d "$line"_IA -c1 --only "/\.(sit|sea|img|smi|hqx|bin|pkg|cpt)$/i" "$line"
done < /Users/admin/Desktop/URLs.txt

# Add to the end of the "done" line if you want a log file:
# | tee -a log.txt

# ************* The final line, "done < input.file", is what provides your file to the loop. Replace input.file with the actual name of your file.

$ chmod a+x foo.sh to make the shell script executable
Additional info: https://archive.org/details/github.com-hartator-wayback-machine-downloader_-_2017-06-05_10-33-01
If you see many EOF errors in stdout, consider reducing the number of concurrent downloads. The -c flag sets that number: a value such as -c6 can be reduced to -c1, as in the script above, to reduce pressure on archive.org's servers and thereby increase download success. Also, the URLs in the text file should not end with a /, as this may result in a child _IA folder rather than the _IA suffix on the destination download folder.
PDFs are being downloaded on a case-by-case basis. Reason: once a Macintosh-focused site is found, its PDFs would be relevant to the project.

Key:
- Title = The name of the book or document in PDF format
- foo.txt = The name of the local file holding the URLs for download
- pdfx -v path_to_pdf_file > foo.txt = extract URLs

PDFx
https://github.com/metachris/pdfx

Web URLs SHEET

>> Extract URLs & Corresponding Anchor Text
The following code works across browsers and extracts URLs along with their anchor text.

// $$('a') is the developer-console shorthand for document.querySelectorAll('a').
var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
    console.log("#" + i + " > " + urls[i].innerHTML + " >> " + urls[i].href);
}

>> Extract Links with their anchor text (For Chrome & Firefox) - Styled version
If you are using Chrome or Firefox, use the following code for a styled version of the same.

var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
    // Each %c applies the next CSS string to the text that follows it.
    console.log("%c#" + i + " > %c" + urls[i].innerHTML + " >> %c" + urls[i].href, "color:red;", "color:green;", "color:blue;");
}

>> Extract URLs Only
If you want to extract just the links without the anchor text, use the following code.

var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
    console.log(urls[i].href);
}

>> Extract External URLs Only
External links are the ones that point outside the current domain. If you want to extract the external URLs only, this is the code to use.

var links = $$('a');
// Walk the list backwards; i >= 0 so that the first link on the page is checked too.
for (var i = links.length - 1; i >= 0; i--) {
    if (links[i].host !== location.host) {
        console.log(links[i].href);
    }
}
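The host comparison used for external links can also be tried outside the browser. The following is a minimal sketch, assuming a JavaScript runtime that provides the standard URL class (for example Node.js); the function name externalLinks and the sample URLs are made up for illustration and are not part of the original sheet.

```javascript
// Hypothetical helper (not from the original sheet): keep only the hrefs
// whose host differs from the host of the page they were found on.
function externalLinks(hrefs, pageUrl) {
  const pageHost = new URL(pageUrl).host;
  return hrefs.filter(function (href) {
    try {
      // Resolve relative hrefs against the page URL, as a browser would.
      return new URL(href, pageUrl).host !== pageHost;
    } catch (e) {
      return false; // skip hrefs that cannot be parsed as URLs
    }
  });
}

// Sample data, invented for illustration.
const found = externalLinks(
  ["/about", "https://example.org/a.sit", "https://archive.org/details/x"],
  "https://example.org/index.html"
);
console.log(found);
```

With this sample data only the archive.org entry is kept. In a browser console the same function could be fed with Array.from(document.querySelectorAll('a')).map(function (a) { return a.href; }) and location.href.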

>> Extract URLs with a specific extension
If you would like to extract links with a particular extension, paste the following code into the console and pass the extension, wrapped in quotes, to the getLinksWithExtension() function. Note that the code extracts links from HTML anchor tags only (<a></a>), not from other tags such as script or image tags.

function getLinksWithExtension(extension) {
    // Select every <a> element whose href attribute ends with the given text.
    var links = document.querySelectorAll('a[href$="' + extension + '"]'),
        i;

    for (i = 0; i < links.length; i++) {
        // Logs the matching element; use links[i].href to log only the URL.
        console.log(links[i]);
    }
}
getLinksWithExtension('sit') // change sit to any extension
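The selector a[href$="sit"] matches any href that ends in those letters, so a link such as /visit would match, while TOOL.SIT or file.sit?mirror=2 would not. The following is a minimal sketch of a stricter check on the path of the URL, assuming a runtime with the standard URL class; the name hasExtension and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: true when the URL's path ends in ".<extension>",
// ignoring letter case and any query string or fragment.
function hasExtension(href, extension) {
  let path;
  try {
    path = new URL(href).pathname;
  } catch (e) {
    return false; // not an absolute URL
  }
  return path.toLowerCase().endsWith("." + extension.toLowerCase());
}

// Sample data, invented for illustration.
const sample = [
  "https://example.org/files/game.sit",
  "https://example.org/visit",
  "https://example.org/files/TOOL.SIT?mirror=2",
];
const matches = sample.filter(function (href) {
  return hasExtension(href, "sit");
});
console.log(matches);
```

Here the first and third sample URLs are kept and /visit is dropped. In the console, the check could be applied to each links[i].href, which the browser always reports as an absolute URL.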

Method 1
Recursive PDF URL extraction across a directory, writing stdout to a text file:
find /path/to/folder -type f -name '*.pdf' -exec pdfx -v {} \; > foo_IA.txt

Method 2
Or navigate to a folder with PDFs and …
for f in *.pdf; do pdfx -v "$f"; done
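A URL list meant for a downloader is easier to work with once trailing slashes, blank lines and duplicates are gone. The following is a minimal sketch of such a clean-up, assuming the list has already been reduced to one URL per line and that a JavaScript runtime such as Node.js is at hand; the name tidyUrlList and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: tidy a newline-separated list of URLs.
function tidyUrlList(text) {
  const seen = new Set();
  const result = [];
  for (const raw of text.split(/\r?\n/)) {
    // Trim whitespace, then drop any trailing slashes.
    const url = raw.trim().replace(/\/+$/, "");
    // Skip blank lines and entries that were already seen.
    if (url !== "" && !seen.has(url)) {
      seen.add(url);
      result.push(url);
    }
  }
  return result;
}

// Sample data, invented for illustration.
const tidy = tidyUrlList(
  "http://example.com/mac/\nhttp://example.com/mac\n\nhttp://example.org/files\n"
);
console.log(tidy.join("\n"));
```

The two example.com entries collapse into one because they differ only by the trailing slash.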