A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ABSTRACT | ||||||||||||||||||
2 | Spreadsheet and script used to automate web scraping of structured data creating a loop with a numeric parameter variable | ||||||||||||||||||
3 | |||||||||||||||||||
4 | SHEETS | DESCRIPTION | STATUS | ||||||||||||||||
5 | README | Introduction to the template functionality and examples | Created by the template | ||||||||||||||||
6 | Advanced | Advanced execution parameters at the beginning we can leave the default values | Created by the template | ||||||||||||||||
7 | Parameters | This is the main sheet where we need to inform the input parameters and where the script writes information about the execution status | Created by the template | ||||||||||||||||
8 | Logs | Information about errors on the script execution | Created/Deleted by the script | ||||||||||||||||
9 | Aux | Sheet used to retrieve each dataset with the ImportXML function | Created/Deleted by the script | ||||||||||||||||
10 | Data | Sheet where the script accumulates the data from each ImportXML function. This is our output sheet | Created/Deleted by the script | ||||||||||||||||
11 | Test | Sheet used when we are testing the scraping parameters | Created/Deleted by the script | ||||||||||||||||
12 | |||||||||||||||||||
13 | SCRIPT FUNCTIONS (Custom Menu "Scraping" and buttons inside the parameters sheet) | DESCRIPTION | |||||||||||||||||
14 | Test Scraping Parameters | Used to test that the URL parameters and XPATH works in a simple case. Use this before launching the complete scraping execution | |||||||||||||||||
15 | Scrape Data | Once the test has given us the expected results. Use this function to launch the complete scraping process | |||||||||||||||||
16 | Reset Project | Use this function to restart the project from zero again | |||||||||||||||||
17 | Reset Project keeping data | Use this function to reset the parameters to start a new project It does not delete the Data Sheet so if we want to keep it, we need to rename it so that it does not get overwritten by the next execution | |||||||||||||||||
18 | |||||||||||||||||||
19 | SCRAPE DATA DETAILED DESCRIPTION | FURTHER RESOURCES | |||||||||||||||||
20 | 1. It constructs a loop for scraping web URLs varying a numeric parameter on each execution | ||||||||||||||||||
21 | 2. It uses the importXML google spreadsheet function to get the data from each constructed URL | http://www.youtube.com/watch?v=EXhmF9rjqP4 | |||||||||||||||||
22 | 3. It uses a XPATH query to retrieve the desired data inside each URL page | XPATH | |||||||||||||||||
23 | 4. It copys the values obtained to a new sheet whete we will accumulate the data from each iteration | ||||||||||||||||||
24 | 5. It uses time-based triggers to launch multiples executions of the script if needed (max. script execution time) | Time-Based Triggers | |||||||||||||||||
25 | 6. It gives an overview of the status of the execution inside the "Parameters sheet" | ||||||||||||||||||
26 | 7. It can send an email with a notification of the result of the execution of the script (error or successful) | Send Emails from spreadsheet | |||||||||||||||||
27 | 8. Use the Parameters sheet to enter your scraping project information | ||||||||||||||||||
28 | 9. Use the Advanced sheet if you want to tune some of the execution parameters (Email notifications, ...) | ||||||||||||||||||
29 | 10. Use the Scraping Menu or the buttons inside the Parameters sheet to test and scrape the data | ||||||||||||||||||
30 | |||||||||||||||||||
31 | URL CREATION OPTIONS | EXAMPLE USE | |||||||||||||||||
32 | A. If the value parameter name is informed | ||||||||||||||||||
33 | SITE + PREFIX_URL + VALUE_PARAM_NAME + "=" + iteration_assigned_value + SUFFIX_URL | http://medicalboard.co.ke/online-services/retention/?currpage=1 | |||||||||||||||||
34 | B. If the value parameter name is not informed | ||||||||||||||||||
35 | SITE + PREFIX_URL + iteration_assigned_value + SUFFIX_URL | http://www.formula1.com/results/season/2013/ | |||||||||||||||||
36 | |||||||||||||||||||
37 | |||||||||||||||||||
38 | USE CASES | DETAILS | |||||||||||||||||
39 | 1. Retrieve all the data from a web page where the page number is being used for pagination | page attribute going 1,2,3,.... | |||||||||||||||||
40 | 2. Retrieve data for many years in one go!! year inside the URL | year going 2010,2011,2012,... | |||||||||||||||||
41 | 3. Retrieve all the data from pages where they move the start element from page to page | start attribute going 10,20,30... | |||||||||||||||||
42 | |||||||||||||||||||
43 | EXAMPLE USE CASE 1: Get All the doctors | ||||||||||||||||||
44 | URL | http://medicalboard.co.ke/online-services/retention/ | |||||||||||||||||
45 | TARGET | Retrieve the results of all the pages | |||||||||||||||||
46 | URL FIRST ITERATION | http://medicalboard.co.ke/online-services/retention/?currpage=1 | |||||||||||||||||
47 | URL SECOND ITERATION | http://medicalboard.co.ke/online-services/retention/?currpage=2 | |||||||||||||||||
48 | |||||||||||||||||||
49 | EXAMPLE USE CASE 1 INPUT PARAMETERS | ||||||||||||||||||
50 | SITE (*) | PREFIX URL | VALUE PARAM NAME | SUFFIX URL | Start value | Step value | Stop value | XPATH to data (*) | |||||||||||
51 | http://medicalboard.co.ke/ | online-services/retention/? | currpage | 1 | 1 | 170 | //table[@class='zebra']//tr | ||||||||||||
52 | |||||||||||||||||||
53 | |||||||||||||||||||
54 | EXAMPLE USE CASE 2: Get all the formula1 history results | ||||||||||||||||||
55 | URL | http://www.formula1.com/results/season/2013/ | |||||||||||||||||
56 | TARGET | Retrieve the results of all the years | |||||||||||||||||
57 | URL FIRST ITERATION | http://www.formula1.com/results/season/1950/ | |||||||||||||||||
58 | URL SECOND ITERATION | http://www.formula1.com/results/season/1951/ | |||||||||||||||||
59 | |||||||||||||||||||
60 | EXAMPLE USE CASE 2 INPUT PARAMETERS | ||||||||||||||||||
61 | SITE (*) | PREFIX URL | VALUE PARAM NAME | SUFFIX URL | Start value | Step value | Stop value | XPATH to data (*) | |||||||||||
62 | http://www.formula1.com/ | results/season/ | / | 1950 | 1 | 2013 | //table[@class='raceResults']//tr | ||||||||||||
63 | |||||||||||||||||||
64 | |||||||||||||||||||
65 | EXAMPLE USE CASE 3: Get the first 100 results from a search in google | ||||||||||||||||||
66 | URL | http://www.google.com/search?hl=en&gl=US&q=google+apps+script | |||||||||||||||||
67 | TARGET | Retrieve all the results | |||||||||||||||||
68 | URL FIRST ITERATION | http://www.google.com/search?hl=en&gl=US&q=google+apps+script&start=0 | |||||||||||||||||
69 | URL SECOND ITERATION | http://www.google.com/search?hl=en&gl=US&q=google+apps+script&start=10 | |||||||||||||||||
70 | |||||||||||||||||||
71 | EXAMPLE USE CASE 3 INPUT PARAMETERS | ||||||||||||||||||
72 | SITE (*) | PREFIX URL | VALUE PARAM NAME | SUFFIX URL | Start value | Step value | Stop value | XPATH to data (*) | |||||||||||
73 | http://www.google.com/ | search?hl=en&gl=US&q=google+apps+script& | start | 0 | 10 | 100 | //h3[@class='r']/a/@href | ||||||||||||
74 | |||||||||||||||||||
75 | DISCLAIMER | ||||||||||||||||||
76 | 1. This script is intended to help non-coders to overcome the need to manually enter importXML or importHTML for each variation of a desired Dataset | ||||||||||||||||||
77 | 2. I do not use this script to scrape my datasets, I use ruby+mechanize+nokogiri to do so.... | ||||||||||||||||||
78 | 3. The Google Apps Script code written for this scraping functionality is public so feel free to base on it to adapt it to other possible use cases....if you mentioned the source that would be awesome | ||||||||||||||||||
79 | 4. This is a alpha version so if this script fails you can try to reach me for help on twitter | ||||||||||||||||||
80 | 5. The script needs several permissions in order to be able to run: | ||||||||||||||||||
81 | - Send emails is only used to send the script result notification | ||||||||||||||||||
82 | - Access external services is used by UrlFetchApp to check if the URL we are scraping is valid | ||||||||||||||||||
83 | - Execute when you are not present because of the time-based triggers | ||||||||||||||||||
84 | - Create spreadsheets, is used to create new sheets for the script execution results | ||||||||||||||||||
85 | |||||||||||||||||||
86 | AUTHOR | ||||||||||||||||||
87 | Juan Elosua | ||||||||||||||||||
88 | |||||||||||||||||||
89 | MORE INFORMATION | ||||||||||||||||||
90 | Scraping Loop detailed explanation | ||||||||||||||||||
91 | |||||||||||||||||||
92 | |||||||||||||||||||
93 | |||||||||||||||||||
94 | |||||||||||||||||||
95 | |||||||||||||||||||
96 | |||||||||||||||||||
97 | |||||||||||||||||||
98 | |||||||||||||||||||
99 | |||||||||||||||||||
100 |