Web Scraping With Python
An introduction to the best Python scraping libraries
Katharine Jarmul / @kjam
PyCon 2014
Why Scrape the Web?
Copyright / Permission
If you plan to distribute the content you scrape or post it online, please inform yourself about the copyright laws and permissions that apply to the work you are disseminating.
In many cases, you can also scrape important information like the original author / creator / photographer and links back to the initial source -- both of which help to promote a more positive environment and establish the requirements for fair use.
Copyright / Permission (Cont.)
Whatever you choose to use scraping for, you should inform yourself about the media law that relates to these practices and follow it when it applies to your desired result.
http://www.copyright.gov/help/faq/faq-fairuse.html
Don’t Be Evil
The powers of web scraping and automation are vast and can be used for dubious purposes. Please don’t use these to spam people or web sites. Please don’t use these to try to take down sites. Use your best judgment and make sure you don’t overstep real or perceived boundaries.
Let’s Get a Web Page!
Wait! What are we doing?!
urllib/urllib2 POP QUIZ
Keep those terminals open!
Let’s get more complex
But first… what’s in a web page?
paragraph with a class!
it’s in a table in a table...
it looks like most of the info is in other paragraphs just below the title
BeautifulSoup
Some introductory BS functions
Beautiful Soup: Isolating HTML
Beautiful Soup: Extracting Content
Scraping Lexicon: Family Trees
Siblings? Children? Parents? Ancestors? What makes HTML a family?
HTML, XML and numerous other markup languages have a syntax that allows for family tree relationships. We can think of them as nodes within nodes. Let’s take a closer look.
BeautifulSoup: Family Functions
current element
child / first child / descendant
parent / ancestor
previous sibling
not a parent ...
descendant / child
next sibling
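The relationships labelled above can be tried out on a tiny document. This is a minimal sketch, assuming the bs4 package is installed; the markup and variable names are made up for illustration and are not the tutorial's own example page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to illustrate the relationships.
html = (
    "<div id='story'>"
    "<h2>Title</h2>"
    "<p class='lead'>First paragraph</p>"
    "<p>Second paragraph</p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")

# current element
current = soup.find("p", class_="lead")

# parent: the <div> that directly encloses the current element
parent = current.parent

# previous / next sibling: tags at the same level, before and after
previous = current.find_previous_sibling()
following = current.find_next_sibling()

# children: only the tags directly inside the parent, not deeper descendants
children = [tag.name for tag in parent.find_all(True, recursive=False)]

print(parent["id"], previous.name, following.get_text(), children)
```

find_previous_sibling() and find_next_sibling() are used here because they skip over bare text nodes and return the neighbouring tags.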
We are Family: POP QUIZ!
Some other fun BS commands
BS: Final Quiz!
LXML: An Introduction
LXML ToolKit (HTML library)
LXML: Select a portion of the page
LXML: Extracting Content
LXML: Some other functions
LXML: Pop Quiz!
XPATH: Fast Parsing with LXML
XPATH is a nifty query language for selecting nodes in an XML tree, and it works just as well on HTML once the page has been parsed. LXML has a built-in XPATH engine, and several other libraries have their own. It is pretty easy to learn and very portable. With a little bit of regex it makes for a supremely powerful scraper. Let’s take a closer look.
XPATH Basics
Remember the family tree? XPATH uses a similar language.
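To see how the path syntax mirrors the family tree, here is a small sketch. It deliberately uses the standard library's xml.etree.ElementTree rather than LXML so that it runs with no extra installs; ElementTree only understands a subset of XPATH and needs well-formed markup, whereas LXML elements accept full expressions through their xpath() method. The markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
doc = ET.fromstring(
    "<html><body>"
    "<div class='post'><h2>First</h2><a href='/one'>read</a></div>"
    "<div class='post'><h2>Second</h2><a href='/two'>read</a></div>"
    "<div class='ad'><a href='/buy'>buy</a></div>"
    "</body></html>"
)

# ".//div" means any <div> descendant; [@class='post'] filters on an attribute;
# "/h2" then steps down to the direct <h2> children of each matching <div>.
titles = [h2.text for h2 in doc.findall(".//div[@class='post']/h2")]

# Same filter, but collect the href attribute of each child link instead.
links = [a.get("href") for a in doc.findall(".//div[@class='post']/a")]

print(titles)
print(links)
```

Each slash is one step down the tree, so the path reads like a list of ancestors followed by the element you want.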
XPATH Next Steps
Parsing with XPATH
Let’s explore with ipython!
LXML and XPath
XPATH Pop Quiz!
BS / LXML Scraping Limitations
What to do?
Selenium: Getting Started
Selenium comes with built-in Mozilla Firefox support, as long as you have that browser installed. Everyone here should already have that set up. Let’s try opening a browser and taking a look at it.
In IPython, let’s %run start_selenium.py
Selenium: Basic Functions
Selenium: A Demo
Selenium is great for pages that depend on pesky bits of JavaScript or on user interactions. Let’s take a look at a script that can email me Netflix’s latest recommendations from my Watch Instantly. This is information I can’t easily scrape via any other library and that is not available via the API.
Selenium: A Closer Look
Selenium: Basic Functions
Selenium: Ghost Type the Page
Selenium: A Closer Look
Selenium: Scrolling and Moving
Selenium: Waaaait for it
Waits are a way to make sure the page and the DOM are all properly loaded.
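The idea behind an explicit wait is simply polling: keep checking a condition until it holds or a timeout passes. Selenium ships its own helper for this (WebDriverWait with its until method), which is what a real script would use. The sketch below only illustrates the mechanism in plain Python; wait_until and page_ready are hypothetical names, and the "page" is a stand-in counter rather than a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if the timeout passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Stand-in for a page that only finishes loading on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return "loaded" if checks["count"] >= 3 else None


status = wait_until(page_ready, timeout=2.0, poll=0.01)
print(status, checks["count"])
```

In a Selenium script the condition would inspect the driver instead, for example checking that a particular element is present before reading it.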
Selenium: A Closer Look
Selenium: More on Waits (EC)
Selenium: Other Functions
Selenium: POP QUIZ
CSV Parsing
Parsing CSVs with Python is actually very easy! Using the standard library’s csv.DictReader, we can turn a CSV with a header row into very usable dictionaries, one per row. Let’s take a look.
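As a quick illustration, the standard library's csv.DictReader uses the first row as keys and yields one dictionary per data row. The data below is hypothetical and is held in memory with io.StringIO so the sketch runs without a file; with a real file you would pass the object returned by open() instead.

```python
import csv
import io

# Hypothetical data standing in for a CSV file on disk.
raw = io.StringIO(
    "title,year,rating\n"
    "Brazil,1985,8.0\n"
    "Alien,1979,8.5\n"
)

reader = csv.DictReader(raw)  # the first row becomes the dictionary keys
rows = list(reader)           # one dictionary per remaining row

print(reader.fieldnames)
print(rows[0]["title"], rows[1]["year"])
```

Note that every value comes back as a string, so numbers such as the year need converting with int() or float() before doing arithmetic on them.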
CSV Parsing: A Closer Look
CSV Parsing: More Advanced
XLSX Parsing
If you’re handling user uploads or spreadsheets with separate sheets and macros, sometimes a CSV is too primitive. You can use several libraries to help parse XLSX, but the most widely used one is openpyxl. Let’s take a look.
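Here is a minimal sketch of reading a workbook with more than one sheet, assuming the openpyxl package is installed. To stay self-contained it first builds a small hypothetical workbook in memory and then loads it back; with a real upload you would pass the file path or file object straight to load_workbook.

```python
import io

from openpyxl import Workbook, load_workbook

# Build a small hypothetical workbook with two sheets.
wb = Workbook()
ws = wb.active
ws.title = "films"
ws.append(["title", "year"])
ws.append(["Brazil", 1985])
notes = wb.create_sheet("notes")
notes["A1"] = "second sheet"

# Save to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# Load it back and read it the way you would read an uploaded file.
loaded = load_workbook(buffer)
sheet_names = loaded.sheetnames
rows = list(loaded["films"].iter_rows(values_only=True))

print(sheet_names)
print(rows)
```

Unlike the csv module, openpyxl hands back typed values, so the year arrives as an integer rather than a string.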
XLSX Parsing: A Closer Look
XLSX Parsing: More Advanced
JSON Parsing
JSON maps directly onto native Python data types -- a JSON object becomes a Python dictionary. Therefore, it’s incredibly easy to ingest and manipulate JSON with Python. Let’s take a look.
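A short sketch with the standard library's json module shows the mapping: objects become dictionaries, arrays become lists, and true/false/null become True/False/None. The payload is hypothetical.

```python
import json

# Hypothetical payload, as it might arrive in an API response body.
raw = (
    '{"user": "example_user", "tags": ["python", "scraping"], '
    '"followers": 42, "verified": false, "bio": null}'
)

data = json.loads(raw)  # JSON text -> Python objects

print(type(data).__name__)                # the top-level object is a dict
print(data["tags"][1])                    # arrays become lists
print(data["verified"], data["bio"])      # false -> False, null -> None

# Going the other way: Python objects -> JSON text.
round_trip = json.dumps(data, sort_keys=True)
print(round_trip)
```

For a file rather than a string, json.load() and json.dump() take an open file object instead.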
JSON Parsing
JSON Parsing: More advanced
Depending on what you’d like to do, you might find helper libraries most useful when dealing with complex JSON requests. For example, there are libraries dedicated to making Twitter, Facebook and Google API requests that essentially handle the JSON heavy lifting for you. Don’t reinvent the wheel (just know how to use it!).
Parting Advice
I hope you’ve found this tutorial informative and clear! The GitHub repository can easily be forked so you can use these initial example scripts as a starting point for many other fun explorations into scraping. Feel free to tweet your forks to me or message me on any channel if you run into issues. Remember: most of programming is banging your head against a wall and not giving up. So make sure it’s a soft wall and feel free to reach out whenever you hit bumps. <3
Questions?