Advanced Web Scraping
Samples at: https://github.com/esagara/advanced-web-scraping
Web servers generally handle two types of requests:
GET requests
POST requests
But don’t worry: we can find this data using tools built into most major browsers.
Chrome has the JavaScript Console.
Firefox has the developer tools.
Safari and Internet Explorer also have versions.
Understanding how to use these tools is key to scraping the more difficult sites.
POST requests
To scrape sites that require form data, you need to know what data they want and how they want it.
For this next part we will be looking at:
http://bit.ly/1Bo25K1
ASP.NET websites
ASP.NET is a Microsoft web framework. Navigating through pages requires submitting a hidden form. The form elements and corresponding values can be found using a web inspector or by viewing the source code of the page.
If you look in the source, you can see the form at work.
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
Most web inspectors allow you to look at what form data is being submitted to the server and its response.
ASP.NET forms have some strange hidden fields:
__VIEWSTATE
__EVENTTARGET
__EVENTARGUMENT
__EVENTVALIDATION
You will need values for at least some of these fields to scrape a page.
__VIEWSTATE
__VIEWSTATE is a base64-encoded string, so it will not make sense to the naked eye. The server uses it to track properties of the page that can change dynamically - such as when a user interacts with something on the page. It may not always be needed for a scrape, but I suggest including it anyway.
"/wEPDwUKMTAwNTQwNzU3MA9kFgJmD2QWAgIDD2QWAgIBD2QWAgIbD2QWAmYPZBYoAgQPDxYCHgRUZXh0BSggKEphbnVhcnkgMSwgMjAxNCB0byBTZXB0ZW1iZXIgMzAsIDIwMTQpZGQCBw8PFgIfAAURIERlY2VtYmVyIDMsIDIwMTRkZAIJDxYCHgdPbkNsaWNrBWdqYXZhc2NyaXB0OnJldHVybiBmbk9wZW5NZW51SGVscFBERignUGRmL1BoeXNpY2lhbiBEaXNjbG9zdXJlIEhlbHBfV2hvIGlzIGluY2x1ZGVkIGluIHRoZSByZXBvcnQucGRmJyk7ZAILDxYCHwEFaGphdmFzY3JpcHQ6cmV0dXJuIGZuT3Blbk1lbnVIZWxwUERGKCdQZGYvUGh5c2ljaWFuIERpc2Nsb3N1cmUgSGVscF9XaGF0IGlzIGluY2x1ZGVkIGluIHRoZSByZXBvcnQucGRmJyk7ZAINDxYCHwEFcGphdmFzY3JpcHQ6cmV0dXJuIGZuT3Blbk1lbnVIZWxwUERGKCdQZGYvQXN0cmFaZW5lY2EgUGh5c2ljaWFuIERpc2Nsb3N1cmUgUmVwb3J0X0hvdyB0byBzZWFyY2ggdGhlIHJlcG9ydC5wZGYnKTtkAg8PFgIfAQVRamF2YXNjcmlwdDpyZXR1cm4gZm5PcGVuTWVudUhlbHBQREYoJ1BkZi9GQVFzUGh5c2ljaWFuRGlzY2xvc3VyZVJlcG9ydHNQREYucGRmJyk7ZAIRDw8WAh4HVmlzaWJsZWhkZAITDw8WAh8AZWRkAhUPFgIeCWlubmVyaHRtbGVkAhcPFgIfAmgWAmYPZBYCZg9kFgICAw8QDxYGHg1EYXRhVGV4dEZpZWxkBQhSb3dfdGV4dB4ORGF0YVZhbHVlRmllbGQFCFJvd190ZXh0HgtfIURhdGFCb3VuZGcWAh4IT25DaGFuZ2UFD2ZuTG9hZFN0YXR1cygpOxAVAwIxMAIyMAI1MBUDAjEwAjIwAjUwFCsDA2dnZxYBAgFkAhkPFgIfAmhkAlUPFgIfAmhkAlsPEA8WBh8EBQpSZXBvcnRZZWFyHwUFAklEHwZnZBAVBAQyMDE0CzIwMTMgQW5udWFsCzIwMTIgQW5udWFsCzIwMTEgQW5udWFsFQQcMjAxNF4zMC1TRVAtMjAxNF4wMy1ERUMtMjAxNBwyMDEzXjMxLURFQy0yMDEzXjIyLU9DVC0yMDE0HDIwMTJeMzEtREVDLTIwMTJeMDktT0NULTIwMTMcMjAxMV4zMS1ERUMtMjAxMV4xMS1ERUMtMjAxMhQrAwRnZ2dnZGQCXw8PZBYEHgpPbktleVByZXNzBRxyZXR1cm4gdHh0VmFsaWRhdGlvbihldmVudCk7HglPbktleURvd24FF3JldHVybiBmbktleUVudGVyKCcxJyk7ZAJjDw9kFgQfCAUccmV0dXJuIHR4dFZhbGlkYXRpb24oZXZlbnQpOx8JBRdyZXR1cm4gZm5LZXlFbnRlcignMScpO2QCZw8PZBYEHwgFHHJldHVybiB0eHRWYWxpZGF0aW9uKGV2ZW50KTsfCQUXcmV0dXJuIGZuS2V5RW50ZXIoJzEnKTtkAmsPD2QWBB8IBRxyZXR1cm4gdHh0VmFsaWRhdGlvbihldmVudCk7HwkFF3JldHVybiBmbktleUVudGVyKCcxJyk7ZAJvDxAPFgYfBAUIU1RBVEVfSUQfBQUIU1RBVEVfSUQfBmcWAh8JBRVyZXR1cm4gZm5LZXlFbnRlcigxKTsQFTUKLS1TZWxlY3QtLQJBSwJBTAJBUgJBWgJDQQJDTwJDVAJEQwJERQJGTAJHQQJISQJJQQJJRAJJTAJJTgJLUwJLWQJMQQJNQQJNRAJNRQJNSQJNTgJNTwJNUwJNVAJOQwJORAJORQJOSAJOSgJOTQJOVgJOWQJPSAJPSwJPUgJQQQJQUgJSSQJTQwJTRAJUTgJUWAJVVAJWQQJWVAJXQQJXSQJXVgJXWRU1AAJBSwJBTAJBUgJBWgJDQQJDTwJDVAJEQwJERQJGTAJHQQJISQJJQQJJRAJJTAJJTgJLUwJLWQJMQQJNQQJNRAJNRQJNSQJNTgJNTwJNUwJNVAJOQwJORAJORQJOSAJOSgJOTQJOVgJOWQJPSAJPSwJPUgJQQQJQUgJSSQJTQwJTRAJUTgJUWAJVVAJWQQJWVAJXQQJXSQJXVgJXWRQrAzVnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2RkAnMPD2QWBB8IBRxyZXR1cm4gdHh0VmFsaWRhdGlvbihldmVudCk7HwkFF3JldHVybiBmbktleUVudGVyKCcxJyk7ZAJ3Dw9kFgQfCAUccmV0dXJuIHR4dFZhbGlkYXRpb24oZXZlbnQpOx8JBRdyZXR1cm4gZm5LZXlFbnRlcignMScpO2RkHHsVwY8/FzwZ2aAcBIVcjv81rxg="
__EVENTTARGET
In simple terms, the event target tells the server what function you want it to execute when submitting the form. An example could be moving on to the next page of a database.
Many times this will have an empty string as a value, indicating you do not need to pass a value to the form. I capture it anyway, just to be safe.
__EVENTARGUMENT
This is the value passed on to the function named by the __EVENTTARGET field. An example would be the next page number. Again, not always needed.
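For example, the "next page" link on an ASP.NET results grid typically calls __doPostBack with a control name and a page argument. To reproduce that click in a scraper, you would set these two fields in your payload accordingly; the control name below is purely illustrative.

payload = {
    # Hypothetical control name; find the real one in the page's __doPostBack() calls.
    "__EVENTTARGET": "ctl00$ContentPlaceHolder1$gvResults",
    "__EVENTARGUMENT": "Page$2",  # ask the grid for its second page of results
}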
__EVENTVALIDATION
This feature was added in ASP.NET 2.0 and is a security precaution to prevent unauthorized requests. I still see this blank quite a bit, but look for it if your scraper is breaking.
The other goodies
There are quite a few other form values that can be found on an ASP.NET page. In most cases these are the controls - the stuff we use to navigate or alter a web page to display the results we want.
For example, using those form fields gives us a way to page through the database.
Making a scraper work on ASP.NET
Both Ruby and Python can handle POST requests provided you have the right libraries. In fact, you may already be using them.
Python: Requests, BeautifulSoup4
Ruby: Mechanize, Nokogiri
The scrape in broad strokes (a minimal Python sketch follows the steps):
Determine what parameters are required to make a POST request.
Create a payload with the values you want to be passed to the server. This process varies depending on the programming language.
Submit the payload along with the URL to either Mechanize in Ruby or Requests in Python - again this varies by programming language.
If you are dealing with a paginated web page like in our example, collect the new form values from the server response using your HTML parser and add them to your payload, overwriting old values. Re-submit and parse out the results. Rinse and repeat.
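Here is a minimal sketch of those steps in Python, using Requests and BeautifulSoup. The URL and the paging control name are placeholders; find the real ones with your web inspector.

import requests
from bs4 import BeautifulSoup

URL = "http://example.com/SearchResults.aspx"  # placeholder ASP.NET page
session = requests.Session()

def hidden_value(soup, name):
    # Return a hidden input's value, or an empty string if the field is absent.
    tag = soup.find("input", {"name": name})
    return tag.get("value", "") if tag else ""

# Step 1: GET the page once to collect the initial hidden form values.
soup = BeautifulSoup(session.get(URL).text, "html.parser")

for page in range(2, 6):  # walk through a few pages of results
    # Step 2: build the payload from the current page's hidden fields.
    payload = {
        "__VIEWSTATE": hidden_value(soup, "__VIEWSTATE"),
        "__EVENTVALIDATION": hidden_value(soup, "__EVENTVALIDATION"),
        # Placeholder control name; the real one appears in the page's __doPostBack() calls.
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$gvResults",
        "__EVENTARGUMENT": "Page${}".format(page),
    }

    # Step 3: submit the payload with a POST request.
    response = session.post(URL, data=payload)

    # Step 4: parse the results, then reuse the new hidden values on the next pass.
    soup = BeautifulSoup(response.text, "html.parser")
    # ... pull the table rows you care about out of `soup` here ...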
Complex GET requests
Scrapes using GET requests can be fairly straightforward. Sometimes it’s only a matter of incrementing a page number.
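In that simple case, the scrape is little more than a loop over a page-number parameter. A minimal sketch in Python, assuming a hypothetical site paginated with a ?page= query string:

import requests

BASE_URL = "http://example.com/results?page={}"  # placeholder paginated URL

for page in range(1, 6):
    response = requests.get(BASE_URL.format(page))
    # ... parse each page of results here ...
    print(page, response.status_code)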
Our example: http://bit.ly/1pw9mg8
But not always...
So how do we get at the data?
The key is to look at how information is being passed back and forth from the server. We could probably do the same thing by digging through the JavaScript, but I found looking at how the server communicates with the browser to be easier.
How do we do that?
In your browser open the web inspector while on the map page. Select the tab that says “Network.” Refresh the page to see how the browser is communicating with the server.
It should look something like this:
A lot of this is fairly standard and does not provide strong clues to what is happening. So we scroll down until we start seeing requests made to a URL with the word ‘data’ in it (this was a lucky guess on my part when I first looked at it).
What is happening
The outage map gets its data from a JSON object. That object is requested using a specific date and time included in the URL path to the JSON.
So what do we need to do?
We need to find what date and time to include in the path. This is done with a request to a separate URL. We can see this happening in the Network tab as well. This time we look for a file with the word ‘metadata’ in it.
That request returns an XML file containing, among other things, a directory tag.
So how do we scrape this?
It’s actually two scrapes.
The first scrape
Using whichever language you prefer, generate a timestamp. Then append that timestamp onto the end of the URL to get the metadata:
http://outagecenter.pseg.com/data/interval_generation_data/metadata.xml?timestamp=
Using the HTML/XML parser of your choice (Nokogiri in Ruby or BeautifulSoup in Python), parse out the directory tag to get the value to be inserted into the URL for the second scrape:
http://outagecenter.pseg.com/data/interval_generation_data/<DATA DIRECTORY HERE>/data.js?timestamp=<TIMESTAMP HERE>
Conduct the final scrape and use the JSON parser for either Ruby or Python to read the result, then write it out to CSV or import it into a database. The data returned looks like this (a combined sketch of both scrapes follows the JSON):
{"file_title":"data","file_data":{"total_customers":98,"total_served":2178738,"total_outages":10,"date_generated":"Mar 2, 5:15 PM","overwritten_etr":"n","overwritten_ca":"n","servlet_interval":900,"update_wording":"Information is updated every 15 minutes.","storm_mode":"n"}}
The key to these scrapes is the web inspector. You should use it to:
Find out what data your browser is sending to the server, and in what form.
Find out what data the server sends back, and from which URLs.
Doing those two things should solve most challenges presented by complex scrapes.