ABCDEFGHIJKLMNOPQRS
1
ABSTRACT
2
Spreadsheet and script used to automate web scraping of structured data
creating a loop with a numeric parameter variable
3
4
SHEETSDESCRIPTIONSTATUS
5
READMEIntroduction to the template functionality and examplesCreated by the template
6
AdvancedAdvanced execution parameters at the beginning we can leave the default valuesCreated by the template
7
ParametersThis is the main sheet where we need to inform the input parameters and where the script writes information about the execution statusCreated by the template
8
LogsInformation about errors on the script executionCreated/Deleted by the script
9
AuxSheet used to retrieve each dataset with the ImportXML functionCreated/Deleted by the script
10
DataSheet where the script accumulates the data from each ImportXML function. This is our output sheetCreated/Deleted by the script
11
TestSheet used when we are testing the scraping parametersCreated/Deleted by the script
12
13
SCRIPT FUNCTIONS (Custom Menu "Scraping" and buttons inside the parameters sheet)DESCRIPTION
14
Test Scraping ParametersUsed to test that the URL parameters and XPATH works in a simple case.
Use this before launching the complete scraping execution
15
Scrape DataOnce the test has given us the expected results.
Use this function to launch the complete scraping process
16
Reset ProjectUse this function to restart the project from zero again
17
Reset Project keeping dataUse this function to reset the parameters to start a new project
It does not delete the Data Sheet so if we want to keep it, we need to rename it so that it does not get overwritten by the next execution
18
19
SCRAPE DATA DETAILED DESCRIPTIONFURTHER RESOURCES
20
1. It constructs a loop for scraping web URLs varying a numeric parameter on each execution
21
2. It uses the importXML google spreadsheet function to get the data from each constructed URLhttp://www.youtube.com/watch?v=EXhmF9rjqP4
22
3. It uses a XPATH query to retrieve the desired data inside each URL pageXPATH
23
4. It copys the values obtained to a new sheet whete we will accumulate the data from each iteration
24
5. It uses time-based triggers to launch multiples executions of the script if needed (max. script execution time)Time-Based Triggers
25
6. It gives an overview of the status of the execution inside the "Parameters sheet"
26
7. It can send an email with a notification of the result of the execution of the script (error or successful)Send Emails from spreadsheet
27
8. Use the Parameters sheet to enter your scraping project information
28
9. Use the Advanced sheet if you want to tune some of the execution parameters (Email notifications, ...)
29
10. Use the Scraping Menu or the buttons inside the Parameters sheet to test and scrape the data
30
31
URL CREATION OPTIONSEXAMPLE USE
32
A. If the value parameter name is informed
33
SITE + PREFIX_URL + VALUE_PARAM_NAME + "=" + iteration_assigned_value + SUFFIX_URLhttp://medicalboard.co.ke/online-services/retention/?currpage=1
34
B. If the value parameter name is not informed
35
SITE + PREFIX_URL + iteration_assigned_value + SUFFIX_URLhttp://www.formula1.com/results/season/2013/
36
37
38
USE CASESDETAILS
39
1. Retrieve all the data from a web page where the page number is being used for paginationpage attribute going 1,2,3,....
40
2. Retrieve data for many years in one go!! year inside the URLyear going 2010,2011,2012,...
41
3. Retrieve all the data from pages where they move the start element from page to pagestart attribute going 10,20,30...
42
43
EXAMPLE USE CASE 1: Get All the doctors
44
URLhttp://medicalboard.co.ke/online-services/retention/
45
TARGETRetrieve the results of all the pages
46
URL FIRST ITERATIONhttp://medicalboard.co.ke/online-services/retention/?currpage=1
47
URL SECOND ITERATIONhttp://medicalboard.co.ke/online-services/retention/?currpage=2
48
49
EXAMPLE USE CASE 1 INPUT PARAMETERS
50
SITE (*)PREFIX URLVALUE PARAM NAMESUFFIX URLStart valueStep valueStop valueXPATH to data (*)
51
http://medicalboard.co.ke/online-services/retention/?currpage11170//table[@class='zebra']//tr
52
53
54
EXAMPLE USE CASE 2: Get all the formula1 history results
55
URLhttp://www.formula1.com/results/season/2013/
56
TARGETRetrieve the results of all the years
57
URL FIRST ITERATIONhttp://www.formula1.com/results/season/1950/
58
URL SECOND ITERATIONhttp://www.formula1.com/results/season/1951/
59
60
EXAMPLE USE CASE 2 INPUT PARAMETERS
61
SITE (*)PREFIX URLVALUE PARAM NAMESUFFIX URLStart valueStep valueStop valueXPATH to data (*)
62
http://www.formula1.com/results/season//195012013//table[@class='raceResults']//tr
63
64
65
EXAMPLE USE CASE 3: Get the first 100 results from a search in google
66
URLhttp://www.google.com/search?hl=en&gl=US&q=google+apps+script
67
TARGETRetrieve all the results
68
URL FIRST ITERATIONhttp://www.google.com/search?hl=en&gl=US&q=google+apps+script&start=0
69
URL SECOND ITERATIONhttp://www.google.com/search?hl=en&gl=US&q=google+apps+script&start=10
70
71
EXAMPLE USE CASE 3 INPUT PARAMETERS
72
SITE (*)PREFIX URLVALUE PARAM NAMESUFFIX URLStart valueStep valueStop valueXPATH to data (*)
73
http://www.google.com/search?hl=en&gl=US&q=google+apps+script&start010100//h3[@class='r']/a/@href
74
75
DISCLAIMER
76
1. This script is intended to help non-coders to overcome the need to manually enter importXML or importHTML for each variation of a desired Dataset
77
2. I do not use this script to scrape my datasets, I use ruby+mechanize+nokogiri to do so....
78
3. The Google Apps Script code written for this scraping functionality is public so feel free to base on it to adapt it to other possible use cases....if you mentioned the source that would be awesome
79
4. This is a alpha version so if this script fails you can try to reach me for help on twitter
80
5. The script needs several permissions in order to be able to run:
81
- Send emails is only used to send the script result notification
82
- Access external services is used by UrlFetchApp to check if the URL we are scraping is valid
83
- Execute when you are not present because of the time-based triggers
84
- Create spreadsheets, is used to create new sheets for the script execution results
85
86
AUTHOR
87
Juan Elosua
88
89
MORE INFORMATION
90
Scraping Loop detailed explanation
91
92
93
94
95
96
97
98
99
100