Python Web Scraping 101

~1/28/2012 meeting of Hacks/Hackers @ Sunlight Foundation

Taught by:

Jackie Kazil, jackiekazil@gmail.com

Serdar Tumgoren, zstumgoren@gmail.com

Session Resources/Links

Prelude to your First Scrape

Your First Scrape

Good Programmers Are Lazy

Learning How to Learn

Your Second Scrape

What We Didn’t Cover

Programmer’s Toolkit

How To Keep Learning

Rules of the Road

Session Resources/Links

https://github.com/PythonJournos/LearningPython/tree/master/tutorials/webscraping101

http://staticresults.sos.la.gov/

http://fec.gov/finance/disclosure/efile_search.shtml

Prelude to your First Scrape

Like most programming languages, Python has lots of handy built-in libraries that do the heavy lifting for everyday tasks such as parsing CSV files, working with the file system, and downloading data from the Internet.
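As a small illustration of a built-in library doing the heavy lifting, here is a minimal sketch that parses CSV text with the standard csv module. The read_rows helper and the sample data in the usage note are made up for this example.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text and return one dict per data row, keyed by the header row."""
    # io.StringIO lets csv read from a string as if it were an open file.
    return list(csv.DictReader(io.StringIO(text)))
```

For example, read_rows("name,city\nExample Bank,Springfield\n") gives a list with one row, whose "name" value is "Example Bank". The same DictReader works on a real file opened with open().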

But there are many excellent 3rd-party libraries that make life even easier or provide functionality not baked into Python. To get access to these tools, we need a “package manager” that lets us grab these libraries off the Internet and install them locally.

Have class install setuptools and use easy_install to install pip.

Then use pip to install BeautifulSoup libraries. (Ask and explain what a library is if needed.)

(Alternative setuptools install methods for Windows)
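Once the installs are done, a quick way to confirm they worked is to ask Python whether it can find the libraries. The sketch below is one way to do that, assuming Python 3. The is_installed and report helpers are hypothetical names for this example. Note that the import name is bs4 for BeautifulSoup 4, while the older BeautifulSoup 3 imports as BeautifulSoup.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return find_spec(module_name) is not None


def report(module_names):
    """Map each module name to "installed" or "missing"."""
    return {
        name: "installed" if is_installed(name) else "missing"
        for name in module_names
    }
```

Calling print(report(["bs4", "requests"])) shows at a glance which libraries still need a pip install.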

Your First Scrape

FDIC Failed Banks List

http://www.fdic.gov/bank/individual/failed/banklist.html

Use urllib plus BeautifulSoup to do this scrape

View the source and locate the table

Have class type out the exact same code as you to produce their first scrape. Go over some of the fundamentals of Python (imports, for loop, variables, white space, etc.) once everyone has successfully scraped the page.
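A first scrape along these lines might look like the sketch below. It is not the code used in the session. It assumes Python 3 (urllib.request) and BeautifulSoup 4 (the bs4 package), and it assumes the bank list is the first table on the page, which may no longer be true of the live site. The function names are made up for this example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# URL from the session notes; the page layout may have changed since then.
URL = "http://www.fdic.gov/bank/individual/failed/banklist.html"


def parse_table(html):
    """Return the rows of the first <table> in html as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table holds the bank list
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; collect both.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def scrape_failed_banks(url=URL):
    """Download the page and return its table rows."""
    html = urlopen(url).read()
    return parse_table(html)
```

To try it, loop over scrape_failed_banks() and print each row. Keeping the download and the parsing in separate functions means the parsing can be tested on a saved copy of the page without hitting the site again.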

Good Programmers Are Lazy

Before coding, do some research. There are a ton of libraries out there -- either built-in or 3rd party -- that could do precisely what you’re thinking of coding yourself. You’ll spare yourself the hassle of re-inventing wheels and instead can focus on your actual project.

Have class google “Python http requests” and dig up the 3rd-party “requests” library.

Also visit: http://pypi.python.org/

Question: How do I know I am using the ‘right’ or ‘the best’ library?

Learning How to Learn

Okay, so we have a shiny new library that does exactly what we want -- in this case, provides a clean and easy interface for making HTTP requests. So how do we use it?
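One way to start answering that question is to poke at the library from Python itself: list what it offers, then read the docstrings. The sketch below demonstrates the idea on the built-in json module so it runs without any installs. The same two helpers (hypothetical names) should work on requests or any other module after importing it.

```python
import inspect
import json


def public_names(module):
    """List a module's public attributes (names not starting with an underscore)."""
    return sorted(name for name in dir(module) if not name.startswith("_"))


def summary(obj):
    """Return the first line of an object's docstring, or "" if it has none."""
    doc = inspect.getdoc(obj) or ""
    return doc.splitlines()[0] if doc else ""


# Print a one-line summary for every public function in the module.
for name in public_names(json):
    attr = getattr(json, name)
    if inspect.isfunction(attr):
        print(name, "-", summary(attr))
```

In an interactive session, help(json.loads) prints the full documentation for a single function. The library's own website and quickstart guide are usually the next stop.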

Your Second Scrape

Explore the Louisiana election results site to figure out how it’s “constructed”: http://staticresults.sos.la.gov/

Scrape the main page to get the links to the pages for specific election dates, then step through each election page and download its results.
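The link-gathering step can be sketched as below, assuming the requests library and BeautifulSoup 4 are installed. How the Louisiana site actually marks up its election links is not covered in these notes, so this sketch simply collects every link on a page. Narrowing the list down to election pages is left as part of the exploration. The fetch and extract_links names are made up for this example.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# URL from the session notes.
BASE_URL = "http://staticresults.sos.la.gov/"


def fetch(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on a 404 or a server error
    return response.text


def extract_links(html, base_url):
    """Return every link on the page as an absolute URL, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

A second scrape would then call extract_links(fetch(BASE_URL), BASE_URL), keep only the links that point to election pages, and call fetch on each of those in a for loop. The urljoin call matters because links on a page are often relative (for example "/some/page") and need the site address added before they can be downloaded.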

What We Didn’t Cover

Programmer’s Toolkit

How To Keep Learning

Rules of the Road