Python Web Scraping 101
~1/28/2012 meeting of Hacks/Hackers @ Sunlight Foundation
Jackie Kazil, firstname.lastname@example.org
Serdar Tumgoren, email@example.com
Like most programming languages, Python has lots of handy built-in libraries that do the heavy lifting for everyday tasks such as parsing CSV files, working with the file system, and downloading data from the Internet.
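As a quick illustration of a built-in library doing the heavy lifting, here is a minimal sketch that parses CSV data with the standard csv module. The bank names and columns are made up for the example, and io.StringIO stands in for a real file on disk.

```python
import csv
import io

# Made-up sample data standing in for a CSV file on disk.
sample = io.StringIO(
    "bank,city,state\n"
    "First Example Bank,Springfield,IL\n"
    "Second Example Bank,Portland,OR\n"
)

# csv.DictReader uses the header row as keys, so each row becomes a dict.
rows = list(csv.DictReader(sample))

for row in rows:
    print(row["bank"], "-", row["state"])
```

With a real file, the same DictReader call would typically wrap open("banks.csv", newline="") instead of the StringIO object (the file name here is hypothetical).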
But there are many excellent 3rd-party libraries that make life even easier or provide functionality not baked into Python. To get access to these tools, we need a “package manager” that lets us grab these libraries off the Internet and install them locally.
Have class install setuptools and use easy_install to install pip.
Then use pip to install the BeautifulSoup library. (Ask and explain what a library is if needed.)
(Alternative setuptools install methods for Windows)
FDIC Failed Banks List
Use urllib plus BeautifulSoup to do this scrape
View the source and locate the table
Have class type out the exact same code as you to produce their first scrape. Go over some of the fundamentals of Python (imports, for loops, variables, whitespace, etc.) once everyone has successfully scraped the page.
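A first scrape along these lines might look like the sketch below. It assumes Python 3 and the bs4 package (installed with pip as beautifulsoup4); the function names and the sample HTML are invented for illustration, and the real FDIC page's markup may differ, so view the source before relying on it.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with urllib and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def table_rows(html):
    """Return the rows of the first table on the page as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up HTML standing in for the real page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Bank Name</th><th>City</th><th>State</th></tr>
  <tr><td>First Example Bank</td><td>Springfield</td><td>IL</td></tr>
  <tr><td>Second Example Bank</td><td>Portland</td><td>OR</td></tr>
</table>
"""

for row in table_rows(SAMPLE_HTML):
    print(", ".join(row))
```

To run it against the live list, pass the page's address to fetch_html and hand the result to table_rows. This small example covers the fundamentals worth discussing afterward: imports, functions, variables, a for loop, and indentation.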
Before coding, do some research. There are a ton of libraries out there -- either built-in or 3rd party -- that could do precisely what you’re thinking of coding yourself. You’ll spare yourself the hassle of re-inventing wheels and instead can focus on your actual project.
Have class google “Python http requests” and dig up the 3rd-party “requests” library.
Also visit: http://pypi.python.org/
Question: How do I know I am using the ‘right’ or ‘the best’ library?
Okay, so we have a shiny new library that does exactly what we want -- in this case, provides a clean and easy interface for making HTTP requests. So how do we use it?
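A minimal sketch of basic usage, assuming the requests package has been installed with pip. The function name fetch_page and the 10-second timeout are choices made for this example, not anything the library requires.

```python
import requests


def fetch_page(url, timeout=10):
    """Fetch a URL and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text


# Example call (needs a network connection, so it is left commented out):
# html = fetch_page("http://staticresults.sos.la.gov/")
# print(html[:200])
```

Compared with urllib, the appeal is that a single call returns an object with the status code, headers, and decoded text all in one place.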
Explore the Louisiana election results site to figure out how it’s “constructed”: http://staticresults.sos.la.gov/
Scrape the main page, get the links to pages for specific election dates, then go to the election pages to download, then step through each
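The link-gathering step could be sketched as follows, assuming Python 3 and the bs4 package. The sample index page and its file names are made up; the real site's markup may be structured differently, so explore it first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> on the page, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin turns relative hrefs into full URLs based on the page's address.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Made-up index page standing in for the real main page.
SAMPLE_INDEX = """
<ul>
  <li><a href="election-a.htm">Election A</a></li>
  <li><a href="/archive/election-b.htm">Election B</a></li>
</ul>
"""

links = extract_links(SAMPLE_INDEX, "http://staticresults.sos.la.gov/")
for link in links:
    print(link)
```

Each URL in links can then be fetched in turn and parsed the same way, which is the "step through each" part of the plan.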