1 of 59

Web Scraping With Python

An introduction to the best Python scraping libraries

Katharine Jarmul / @kjam

PyCon 2014

2 of 59

Why Scrape the Web?

  • Vast source of information
  • Automate simple or complex tasks
  • Keep up with news, friends, family without checking numerous sites
  • Super fun!
  • Learn more python!!

3 of 59

Copyright / Permission

If you plan to use the content you scrape for any sort of distribution or to post it online, please inform yourself about the copyright laws and permissions that apply to the work you are disseminating.

In many cases, you can also scrape important information like the original author / creator / photographer and links back to the initial source -- both of which help to promote a more positive environment and establish the requirements for fair use.

4 of 59

Copyright / Permission (Cont.)

Whatever you choose to use scraping for, you should inform yourself about the media laws that relate to these practices and follow them where they apply to your desired result.

http://www.copyright.gov/help/faq/faq-fairuse.html

5 of 59

Don’t Be Evil

The powers of web scraping and automation are vast and can be used for dubious purposes. Please don’t use these to spam people or web sites. Please don’t use these to try to take down sites. Use your best judgment and make sure you don’t overstep real or perceived boundaries.

6 of 59

Let’s Get a Web Page!

7 of 59

Wait! What are we doing?!

  • Whenever you use a new library, read the docs! (They are SUPER helpful: http://docs.python.org/library/urllib2.html)
  • Whenever we explore new libraries, let's do some IPython (or plain Python) interaction
  • Fire up a terminal! >:)
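
Before the quiz on the next slide, here’s a minimal sketch of fetching a page with urllib2 (Python 2; in Python 3 the equivalent lives in urllib.request). The URL is just the one from the first quiz question:

    import urllib2

    # Fetch the page and read the raw HTML into a string
    response = urllib2.urlopen('http://pyladies.com')
    html = response.read()

    print(response.getcode())        # HTTP status code, e.g. 200
    print(len(html))                 # how many bytes came back
    print('python' in html.lower())  # is the word 'python' anywhere in the markup?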

8 of 59

urllib/urllib2 POP QUIZ

Keep those terminals open!

  • Is the word 'python' on the pyladies homepage (http://pyladies.com)?
  • Does google.com have an image? (hint: img tag!)
  • What are the first 10 characters on python.org?
  • Bonus: Find your name on a web page!

9 of 59

Let’s get more complex

  • urllib / urllib2 / urllib3 are great for very simple requests, but they don’t help us interpret responses in a meaningful way
  • Usually you want to parse data the way your EYES would -- for that you need a more capable library.
  • We will cover: BeautifulSoup, LXML, XPath and Selenium

10 of 59

But first… what’s in a web page?

  • HTML and CSS are your friends
  • Developer Tools / Firebug are your weapons
  • Let’s investigate with a favorite page of mine:
    • Visit downtownla.com > play > happy hours

11 of 59

[Screenshot: inspecting the happy hours listings with developer tools. A paragraph with a class, nested in a table within a table; most of the info is in other paragraphs just below the title.]

12 of 59

BeautifulSoup

  • One of the first Python libraries used for scraping
  • Ported to be Python 3 compatible
  • Very powerful, very simple, very few requirements

13 of 59

Some introductory BS functions

  • find_all(‘a’) == find every link (<a> tag) in the document, returned as a list
  • find(‘title’) == find the first title element in the document
  • get(‘href’) == get the href attribute value from an element on the page
  • (element).text == retrieve the text associated w/ that element (i.e. the text contents of the tag)

14 of 59

Beautiful Soup: Isolating HTML
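
The original slide shows a screenshot; here is a rough sketch of the same idea. The URL, tag names, and class name below are illustrative stand-ins, not the exact ones from the demo:

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://www.downtownla.com/').read()  # placeholder URL
    soup = BeautifulSoup(html)

    # Isolate pieces of the document by tag or by attribute
    title = soup.find('title')                    # first <title> element
    links = soup.find_all('a')                    # every link on the page, as a list
    blurbs = soup.find_all('p', class_='blurb')   # paragraphs with a (made-up) class

    print(title)
    print(len(links))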

15 of 59

Beautiful Soup: Extracting Content
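
Again, the slide itself is a screenshot; a sketch of pulling content out of the soup built above:

    # Extract text and attribute values from the elements we isolated
    first_link = soup.find('a')
    print(first_link.get('href'))    # the link target
    print(first_link.text)           # the visible link text

    # Walk every paragraph and print its text content
    for paragraph in soup.find_all('p'):
        print(paragraph.text)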

16 of 59

Scraping Lexicon: Family Trees

Siblings? Children? Parents? Ancestors? What makes HTML a family?

HTML, XML and numerous other markup languages have a syntax that allows for family tree relationships. We can think of them as nodes within nodes. Let’s take a closer look.

17 of 59

BeautifulSoup: Family Functions

  • findChildren == find all children of this element in a list form
  • findChild == find the first child of this element
  • findParent == find the first parent of this elem
  • findPreviousSibling == find the closest previous sibling (only one)
  • findNextSiblings == return a list of all the next siblings
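
A small sketch of walking the tree from a current element. The HTML is made up, and BeautifulSoup 4 also spells these methods find_parent, find_previous_sibling, and so on:

    from bs4 import BeautifulSoup

    html = """
    <div class="content">
      <h1>A Header</h1>
      <p id="current">The current element.</p>
      <p>A following paragraph.</p>
    </div>
    """
    soup = BeautifulSoup(html)
    current_elem = soup.find('p', id='current')

    print(current_elem.find_parent('div'))            # the enclosing div (parent / ancestor)
    print(current_elem.find_previous_sibling('h1'))   # the header just before it
    print(current_elem.find_next_sibling('p'))        # the paragraph that follows
    print(current_elem.find_parent('div').find_all('p'))  # the parent's <p> children, as a list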

18 of 59

[Diagram: a sample HTML tree labeled relative to the current element -- child / first child / descendant, parent / ancestor, previous sibling, next sibling, other descendants, and a node that is not a parent.]

19 of 59

We are Family: POP QUIZ!

  • To begin, open IPython from inside the workshop folder and type: %run family_tree.py → Now you have current_elem in your IPython session!
  • From the current element, how can you find the text of the header?
  • From the current element, how can you find the Lorem Ipsum text?
  • Can you get to the style sheet information?

20 of 59

Some other fun BS commands

  • .contents = show all children, including text not in tags
  • .(tag) = find the first matching tag beneath this element (i.e. my_div.p returns the first paragraph inside that div)
  • .next_element / .previous_element = iterate the family tree regardless of relationship
  • .stripped_strings = generator to obtain all the strings stripped of extra whitespace that are child elements
  • .find_all([‘p’,’div’]) = use find all but with a list of elements you’d like to match
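
A few of these in action, reusing a parsed soup object (whatever page you happen to be working with):

    body = soup.body

    print(body.contents)                    # all direct children, including bare strings
    print(body.p)                           # first <p> found beneath the body
    for text in body.stripped_strings:      # generator of whitespace-stripped strings
        print(text)
    print(soup.find_all(['p', 'div']))      # every <p> and <div> on the page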

21 of 59

BS: Final Quiz!

  • What are two different ways to get the text of the title on the page?
  • In one line of code, can you print all the text in the main body of the page?
  • How many divs are on the page (using BS!)?
  • Can you print the text of just the LI elements on the page?

22 of 59

LXML: An Introduction

  • lxml.de
  • Advanced library with more complex tools for cleaning and parsing web pages
  • Lots of dependencies (less portable)
  • Scalable / flexible

23 of 59

LXML ToolKit (HTML library)

    • cssselect(‘div.content’) == find all div elements that have the class attribute “content” (GREAT tool for front end devs or others who know about CSS)
    • find_class(‘nav’) == find all elements with this class or find an elem w/in this element that matches this class
    • text_content() == text w/in this elem and its descendants
    • clean_html() == strip out unwanted content (scripts, embeds, etc.) and tidy up the page’s markup

24 of 59

LXML: Select a portion of the page
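
The slide shows a screenshot; roughly, selecting a portion of a page with lxml looks like this (URL and class names are illustrative):

    import urllib2
    from lxml import html

    page = urllib2.urlopen('http://python.org').read()
    tree = html.fromstring(page)

    # Select portions of the page with CSS selectors or by class
    content_divs = tree.cssselect('div.content')   # made-up class name
    nav_elems = tree.find_class('nav')             # made-up class name

    print(len(content_divs))
    print(len(nav_elems))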

25 of 59

LXML: Extracting Content
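
Continuing from that tree, a sketch of extracting content from one of the selected elements:

    if content_divs:
        div = content_divs[0]
        print(div.text_content())        # all text inside this element and its descendants
        for link in div.cssselect('a'):
            print(link.get('href'))      # attribute access works like BeautifulSoup's get()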

26 of 59

LXML: Some other functions

  • (elem).iterlinks() == an iterator that finds all descendant links and returns them as (element, attribute, link, pos) tuples
  • get == same functionality as BS
  • iterdescendants() == an iterator that will find all descendants
  • (elem).attrib == a dict-like mapping of the element’s attributes
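
For example, still using the tree from the earlier sketch:

    # iterlinks() yields (element, attribute, link, pos) tuples for every link on the page
    for element, attribute, link, pos in tree.iterlinks():
        print(link)

    first_div = tree.find('.//div')          # first div, ElementTree-style
    if first_div is not None:
        print(first_div.attrib)              # dict-like view of the element's attributes
        for descendant in first_div.iterdescendants():
            print(descendant.tag)            # walk every descendant of that div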

27 of 59

LXML: Pop Quiz!

  • Use the source of python.org
  • How many paragraphs are on the page?
  • What is the text content of the div with the class “shrubbery”? What are the links in that same div?
  • What is the text in the code elements?
  • Bonus: Are there any forms?

28 of 59

XPATH: Fast Parsing with LXML

XPath is a query language that lets you address an HTML document as an easy-to-parse tree of XML-like nodes. LXML has a built-in XPath engine, and several other libraries ship their own. It is pretty easy to learn and very portable, and with a little bit of regex it can be a supremely powerful scraper. Let’s take a closer look.

29 of 59

XPATH Basics

Remember the family tree? XPATH uses a similar language.

  • /body/p would match all paragraphs in the body tag
  • //p/a would match every link inside a paragraph, anywhere in the document
  • ../p would match all paragraphs in the parent element

30 of 59

XPATH Next Steps

  • @ will single out the attribute (/a/@href)
  • * / @* are wildcards that match any sub-element or any attribute of the element you are parsing
  • text() / comment() / node() match any of those elements within the current element (//p/text() would then match all text of all child paragraphs)
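
Putting those pieces together in lxml (the HTML snippet here is made up just to show the expressions):

    from lxml import html

    tree = html.fromstring("""
    <body>
      <p>Intro text with a <a href="http://example.com">link</a>.</p>
      <p class="note">A second paragraph.</p>
    </body>
    """)

    print(tree.xpath('//p'))            # all paragraph elements, anywhere in the document
    print(tree.xpath('//p/a/@href'))    # href attributes of links inside paragraphs
    print(tree.xpath('//p/text()'))     # text nodes of every paragraph
    print(tree.xpath('//p/@*'))         # any attribute of any paragraph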

31 of 59

Parsing with XPATH

Let’s explore with IPython!

  • %run xpath_intro.py
  • now we have an LXML tree called simple_tree
  • simple_tree.xpath(‘//p’)
  • simple_tree.xpath(‘//p/text()’)
  • what are the direct children of simple_tree?

32 of 59

LXML and XPath

  • Every LXML element has numerous ways to utilize XPath
    • find
    • findall
    • xpath
  • Use whichever makes the most sense to you or the one that is the most explicit
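
A quick side-by-side of those three, reusing the made-up tree from the sketch two slides back (simple_tree from xpath_intro.py works the same way):

    first_p = tree.find('.//p')       # ElementTree-style: first match (or None)
    all_ps = tree.findall('.//p')     # ElementTree-style: list of matches
    also_ps = tree.xpath('//p')       # full XPath: the most flexible (attributes, text(), ...)

    print(first_p.text)
    print(len(all_ps) == len(also_ps))    # same elements found either way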

33 of 59

XPATH Pop Quiz!

  • Can you show all of the text that is in lists on the page?
  • Can you find all of the attributes on the page?
  • Can you find any links on the page?
  • Can you get to the style sheet information?
  • Extra credit: //elem[@attr=“foo”] will match elements where that attribute is equal to foo. Find me just the divs that have class ‘contentblock’.
  • Extra extra: try the above without the [] and using / to separate the @attr. What kind of response do you get?

34 of 59

BS / LXML Scraping Limitations

  • Not every site has HTML that can be parsed with BS / LXML
      • DOM-loaded content (rendered by JavaScript)
      • Really broken HTML / XML
      • Proprietary / Login-required *can* be difficult (depends on how the login flow works)
      • JS form interaction

35 of 59

What to do?

  • Learn node.js ….. ORRRR….
  • SELENIUM!
  • Selenium is an open-source browser automation project (written in Java) with Python bindings that let you drive browser-based interactions and navigate websites after the DOM loads. It is easy to learn and manipulate and is often used for QA and testing.

36 of 59

Selenium: Getting Started

Selenium comes with Mozilla Firefox support built in, as long as you have that browser installed. Everyone here should already have that set up. Let’s try opening a browser and taking a look at it.

In IPython, let’s %run start_selenium.py
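
The script is in the workshop repo; stripped down, opening a browser with Selenium looks roughly like this:

    from selenium import webdriver

    # Open a Firefox window driven by Selenium (requires Firefox to be installed)
    browser = webdriver.Firefox()
    browser.get('http://python.org')

    print(browser.title)      # title of the loaded page
    # browser.quit()          # close the window when you're done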

37 of 59

Selenium: Basic Functions

  • get(‘url’): go to the url
  • elem.click(): click on the element you have selected
  • Element properties:
    • location: x and y coordinates
    • parent: parent element (might be browser / driver depending on how it was accessed)
    • tag_name: what is the tag of the element (eg. ‘a’)
    • text: get the text of this element and its children

38 of 59

Selenium: A Demo

Selenium is great for interactions that require pesky bits of JavaScript or real user input. Let’s take a look at a script that can email me Netflix’s latest recommendations from my Watch Instantly queue. This is information I can’t easily scrape via any other library, and it is not available via the API.

39 of 59

Selenium: A Closer Look

40 of 59

Selenium: Basic Functions

  • find_element(s) …
    • by_link_text(‘foo’): find the link where the text is foo
    • by_partial_link_text: only a part of the text needs to be identified (think ‘contains’)
    • by_css_selector: just like with lxml css
    • by_tag_name: ‘a’ for the first link or all links
    • by_xpath: more practice for your XPath skills!
    • by_class_name: CSS-related, but this finds elements of any type that share the same class
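
A few of those locator strategies in action, with the browser opened earlier. The selectors are illustrative, and newer Selenium releases prefer find_element(By.TAG_NAME, 'a') over these by_* helpers:

    first_link = browser.find_element_by_tag_name('a')
    all_links = browser.find_elements_by_tag_name('a')
    nav_links = browser.find_elements_by_css_selector('div.navigation a')
    same_links = browser.find_elements_by_xpath('//div[@class="navigation"]//a')

    print(first_link.text)
    print(len(all_links))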

41 of 59

Selenium: Ghost Type the Page

  • send_keys
    • for any text field you can find that element and then send it keys.
    • elem.send_keys(‘myPassword’)
    • elem.send_keys(‘here’s more and then I’m pressing enter’, Keys.RETURN)
    • there are many keys available (http://selenium-python.readthedocs.org/api.html#module-selenium.webdriver.common.keys)
    • you can also call elem.clear() to wipe the entered text.
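
For example, filling in a (made-up) search box and submitting it:

    from selenium.webdriver.common.keys import Keys

    search_box = browser.find_element_by_css_selector('input[type="text"]')
    search_box.clear()                      # wipe anything already in the field
    search_box.send_keys('web scraping')
    search_box.send_keys(Keys.RETURN)       # press Enter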

42 of 59

Selenium: A Closer Look

43 of 59

Selenium: Scrolling and Moving

  • Moving around a page can prove tricky, so patience is key.
    • ActionChains provide a way of stringing together one or more actions and then performing them. As seen in our script, we can:
      • ActionChains(browser).move_by_offset(x, y)
      • ActionChains(browser).move_to_element(elem)
      • ActionChains(browser).move_to_element_with_offset(elem, x, y)
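
Note that a chain does nothing until you call perform() on it. A small sketch (the target element is arbitrary):

    from selenium.webdriver.common.action_chains import ActionChains

    elem = browser.find_element_by_tag_name('a')
    ActionChains(browser).move_to_element(elem).click(elem).perform()

    # Moves can also be expressed as pixel offsets from the current position
    ActionChains(browser).move_by_offset(0, 200).perform()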

44 of 59

Selenium: Waaaait for it

Waits are a way to make sure the page and the DOM are all properly loaded.

  • Explicit Waits: You can tell the browser to wait for a particular element (or other condition) for up to 10 seconds.
  • Implicit Waits: Tell the driver to poll the DOM for up to a set amount of time (e.g. 10 seconds) whenever it looks for an element that isn’t immediately present.

45 of 59

Selenium: A Closer Look

46 of 59

Selenium: More on Waits (EC)

  • To use Explicit Waits properly, you’ll dive into expected conditions. These offer a plethora of options for specifying the kind of element behavior you are expecting. The documentation is very straightforward: http://selenium-python.readthedocs.org/waits.html#explicit-waits (there is also further writing on each in the class notes)
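
A sketch of both kinds of wait (the element id is made up; see the docs linked above for the full list of expected conditions):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    # Implicit wait: poll the DOM for up to 10 seconds whenever an element isn't found right away
    browser.implicitly_wait(10)

    # Explicit wait: block until a specific condition holds, or time out after 10 seconds
    elem = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.ID, 'content'))
    )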

47 of 59

Selenium: Other Functions

  • browser.execute_script(‘window.close()’): execute any JavaScript on a loaded page
  • browser.save_screenshot(‘foo.png’): take a screenshot of a page
  • browser.switch_to_alert(): handle pop ups
  • browser.forward() / browser.back(): handle navigation

48 of 59

Selenium: POP QUIZ

  • Can you search Google for ‘Selenium’?
  • Can you then return a list of the top 5 search results?
  • Can you print out a list of the text of the top 5 search results?
  • Can you click on the first search result you receive?

49 of 59

CSV Parsing

Parsing CSVs with Python is actually very easy! Using the standard library’s csv module (DictReader), we can turn a CSV with a header row into very usable dictionaries. Let’s take a look.

50 of 59

CSV Parsing: A Closer Look
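
The slide shows a screenshot; the core of it is a few lines with csv.DictReader (file name and columns are illustrative):

    import csv

    # Each row becomes a dictionary keyed by the header row
    with open('data.csv') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            print(row)    # e.g. {'name': 'Ada', 'city': 'London'}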

51 of 59

CSV Parsing: More Advanced

  • If you need to use a different delimiter, you can pass it as an argument to your reader (e.g. DictReader(document, delimiter=‘\t’))
  • If you don’t have headers, you can use a plain csv.reader to step through each row as a list.

52 of 59

XLSX Parsing

If you’re handling user uploads or spreadsheets with separate sheets and macros, sometimes a CSV is too primitive. You can use several libraries to help parse XLSX; openpyxl is widely used, and xlrd (which the following slides reference) works as well. Let’s take a look.

53 of 59

XLSX Parsing: A Closer Look
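
Again a screenshot in the original; a minimal sketch using xlrd, the library behind xldate_as_tuple on the next slide (note that recent xlrd releases only read .xls, so use an older version or openpyxl for .xlsx):

    import xlrd

    book = xlrd.open_workbook('data.xlsx')       # file name is illustrative
    sheet = book.sheet_by_index(0)               # or book.sheet_by_name('Sheet1')

    for row_num in range(sheet.nrows):
        print(sheet.row_values(row_num))         # each row as a list of cell values

    # Dates come back as floats; convert with the workbook's datemode
    date_cell = sheet.cell_value(1, 0)           # made-up position of a date cell
    print(xlrd.xldate_as_tuple(date_cell, book.datemode))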

54 of 59

XLSX Parsing: More Advanced

  • When possible, convert to CSVs as they are more standard and easy to use. I’ve also included a great script for that.
  • Dates are stored as floats. Use xlrd’s xldate_as_tuple to convert them.
  • XLRD has ways to investigate macros and formulas (see the Name class).

55 of 59

JSON Parsing

JSON maps directly onto native Python data types (dictionaries and lists), so it’s incredibly easy to ingest and manipulate JSON with Python. Let’s take a look.

56 of 59

JSON Parsing
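
This slide was a code screenshot; the gist, with a made-up JSON string and an illustrative file name:

    import json

    # Parse JSON from a string into plain Python objects
    data = json.loads('{"name": "kjam", "talks": ["scraping", "pandas"]}')
    print(data['name'])
    print(data['talks'][0])

    # ...or load it straight from a file or API response
    with open('data.json') as f:
        records = json.load(f)

    # And go back out to a JSON string when you're done
    print(json.dumps(data, indent=2))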

57 of 59

JSON Parsing: More advanced

Depending on what you’d like to do, you might find helper libraries useful when dealing with complex JSON APIs. For example, there are libraries dedicated to making Twitter, Facebook, and Google API requests that essentially handle the JSON heavy lifting for you. Don’t reinvent the wheel (just know how to use it!).

58 of 59

Parting Advice

I hope you’ve found this tutorial informative and clear! The GitHub repository can easily be forked so you can use these initial example scripts as a starting point for many other fun explorations into scraping. Feel free to tweet your forks to me or message me on any channel if you run into issues. Remember: most of programming is banging your head against a wall and not giving up. So make sure it’s a soft wall, and feel free to reach out whenever you hit bumps. <3

59 of 59

Questions?

  • I’m always available on Twitter (@kjam) or Freenode (kjam)
  • Please fill out the reviews to help make future versions of this tutorial better! I appreciate your candid and honest feedback!