CheatSheet_ScrapingWebsites

SCRAPING WEBSITES

install bs4 (if not done yet)

installs BeautifulSoup

HTML (HyperText Markup Language)

mostly describes how a page is structured and laid out (position and shape); the actual information sits inside the tags

tags

<name key=value>

….content….

opening tag; key = value → "attributes"

</name>

closing tag

example: <b> hello! </b>
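
A minimal sketch of how the parts of a tag show up in BeautifulSoup, assuming bs4 is installed. The snippet and its attribute values (id, title) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with two attributes (key=value pairs)
html = '<b id="greeting" title="demo"> hello! </b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("b")
print(tag.name)    # name of the tag: b
print(tag.attrs)   # all attributes as a dict
print(tag["id"])   # value of one attribute: greeting
```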

<ul>

      <li> one </li>

      <li> two </li>

      <li> three </li>

</ul>

<ul> is the parent tag

the <li> tags are children of the parent tag and siblings of each other
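
The parent / children / sibling relations of the list above can be walked with BeautifulSoup roughly like this. A sketch, assuming bs4 is installed; the HTML is written without whitespace between tags so that no whitespace text nodes show up among the children.

```python
from bs4 import BeautifulSoup

# Same list as above, written compactly (no whitespace text nodes)
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul")
first_li = ul.find("li")

children = [child.name for child in ul.children]  # direct children of <ul>
parent_name = first_li.parent.name                # parent of the first <li>
next_li = first_li.find_next_sibling("li")        # sibling right after it

print(children)        # ['li', 'li', 'li']
print(parent_name)     # ul
print(next_li.string)  # two
```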

<div>

      <div>

            <a> hi! </a>

            <b> wow </b>

      </div>

</div>

the inner div, a, and b are all descendants of the first div
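
A short sketch of the difference between children and descendants, assuming bs4 is installed. `.descendants` also yields the text inside tags, so the example keeps only Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><div><a>hi!</a><b>wow</b></div></div>"
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div")  # the first (outer) div

# children: only one level down
direct_children = [c.name for c in outer.children]
# descendants: every level below; text nodes are filtered out
descendant_names = [d.name for d in outer.descendants if isinstance(d, Tag)]

print(direct_children)   # ['div']
print(descendant_names)  # ['div', 'a', 'b']
```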

<a>    </a>

anchor tag, for links
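
The link target of an anchor tag lives in its href attribute. A sketch, assuming bs4 is installed; the URL is a placeholder. `.get("href")` returns None when the attribute is missing, whereas `a["href"]` would raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one anchor with a link, one without
html = '<p><a href="https://example.com/page1">first</a> <a>no link</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = [a.get("href") for a in soup.find_all("a")]
print(links)  # ['https://example.com/page1', None]
```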

html.parser

parser argument for BeautifulSoup(); the parsed object has functions that return tags

result set (list of tags), e.g. from .find_all()
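
A minimal sketch of where "html.parser" goes and what a result set looks like, assuming bs4 is installed. The HTML string stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li></ul>"  # stand-in for a real page

# "html.parser" is Python's built-in HTML parser, passed by name
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")   # result set: behaves like a list of tags
print(type(items).__name__)   # ResultSet
print(len(items))             # 2
print(items[0])               # <li>one</li>
```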

methods to investigate tags

.string

gives the text content of the tag (an attribute, so no parentheses); None if the tag contains more than one child
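
A sketch of reading tag content, assuming bs4 is installed. In BeautifulSoup, `.string` is read as an attribute (no parentheses) and is None when the tag holds more than one child; `.get_text()` is shown as an alternative for that case.

```python
from bs4 import BeautifulSoup

html = "<div><b>hello!</b><p>one <i>two</i></p></div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.find("b").string         # tag with exactly one text child
mixed = soup.find("p").string          # <p> holds text AND an <i> tag
all_text = soup.find("p").get_text()   # joins all text inside <p>

print(single)    # hello!
print(mixed)     # None
print(all_text)  # one two
```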

.find()

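finds the first descendant of that tag matching the requirement given in (); returns None if nothing matches

AttributeError: 'NoneType' object has no attribute 'find' → the element you were looking for was not present in every loop iteration (an earlier .find() returned None)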
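
One way to avoid that error is to check for None before using the result of .find(). A sketch, assuming bs4 is installed; the class names item and price are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second item has no price element
html = """
<div class="item"><span class="price">10</span></div>
<div class="item"></div>
<div class="item"><span class="price">30</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

prices = []
for item in soup.find_all("div", class_="item"):
    span = item.find("span", class_="price")
    if span is None:          # element missing in this iteration
        prices.append(None)   # record the gap instead of crashing
        continue
    prices.append(span.string)

print(prices)  # ['10', None, '30']
```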

.find_all()

finds all descendants matching the requirement; returns a result set (list of tags), empty if nothing matches
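
A few common ways to state the requirement in .find_all(), sketched under the assumption that bs4 is installed. The class names fruit and veg are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="fruit">apple</li>
  <li class="veg">carrot</li>
  <li class="fruit">pear</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = [li.string for li in soup.find_all("li")]                 # by tag name
fruits = [li.string for li in soup.find_all("li", class_="fruit")]    # by CSS class
first_two = [li.string for li in soup.find_all("li", limit=2)]        # stop after 2
no_match = soup.find_all("table")                                     # empty, not None

print(all_items)      # ['apple', 'carrot', 'pear']
print(fruits)         # ['apple', 'pear']
print(first_two)      # ['apple', 'carrot']
print(len(no_match))  # 0
```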

From website to SQL

1) Set up a database and connect to it

2) Create tables

3) Insert

for all of these, see CheatSheet psql
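
The three steps can be sketched end to end as follows. This is NOT the psql setup from the other cheat sheet: to stay runnable without a database server, it uses Python's built-in sqlite3 with an in-memory database as a stand-in for PostgreSQL. The table name items and column label are hypothetical, and the HTML string stands in for a downloaded page.

```python
import sqlite3
from bs4 import BeautifulSoup

# Scrape: collect one row per <li> (HTML is a stand-in for a real page)
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
rows = [(li.get_text(),) for li in soup.find_all("li")]

# 1) Set up a database and connect to it
#    (assumption: in-memory SQLite instead of a PostgreSQL server)
conn = sqlite3.connect(":memory:")

# 2) Create tables (hypothetical table and column names)
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")

# 3) Insert, using placeholders rather than string formatting
conn.executemany("INSERT INTO items (label) VALUES (?)", rows)
conn.commit()

stored = [r[0] for r in conn.execute("SELECT label FROM items ORDER BY id")]
print(stored)  # ['one', 'two', 'three']
```

With PostgreSQL the overall flow is the same, but the connection call and the placeholder syntax depend on the driver used.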