How to build your own crawler, and why you should give it a try
Jess Peck | Local SEO Guide
slideshare.net/JessPeck2/
@jessthebp
ABOUT ME:
I’m Jess.
I like SEO.
I work at LSG.
Follow along at home!
What is a crawler and what are we doing here?
01.
Crawlers… crawl
We all know how Google works already
Crawl
Render
Index
SEOs use crawlers for all sorts of reasons
find broken links
analyze SERPs
check schema
WE LOVE PREBUILT CRAWLERS
But sometimes, there are things pre-built crawlers just can’t do…
And some things might be a little out of scope…
So sometimes you gotta do it yourself
Oh god. We have to code.
HOW TO BUILD A CRAWLER
02.
FIRST THINGS FIRST: What do we want to do?
Definitely not against the TOS
This is the “crawling” bit
2. CRAWL THE TOP 10
Scrub those tags outta here
3. CLEAN THE RESULTS
By the power of Python…
4. ANALYZE THE CONTENT
If I didn’t have experience, I’d do some research here:
What programming language do I want to use?
How do I want the output to look?
What is the input going to look like?
Where do I want to run this, and how?
Input: keyword
Output: page on SERPs, entities on page, pages that page links to
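One way to pin down that input and output before writing any crawling code is a pair of small data classes. This is only a sketch: the names `PageResult` and `CrawlOutput`, their fields, and the example URLs are made up for illustration and are not from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    """One page found on the SERP (hypothetical structure)."""
    url: str
    entities: list[str] = field(default_factory=list)        # entities found on the page
    outbound_links: list[str] = field(default_factory=list)  # pages this page links to


@dataclass
class CrawlOutput:
    """Everything collected for a single input keyword."""
    keyword: str
    pages: list[PageResult] = field(default_factory=list)


# Placeholder values, just to show the shape of the data.
output = CrawlOutput(keyword="local seo")
output.pages.append(
    PageResult(
        url="https://example.com/guide",
        entities=["Google"],
        outbound_links=["https://example.com/about"],
    )
)
print(output)
```

Writing the shape down first makes it easier to decide later whether the results belong in a CSV or a database.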
Where do we start?
I’m gonna use a library I’m already familiar with
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Selenium to drive the controls.
So we have our order
“Analyze” Google
Get results
Analyze results
HARK! A SERP
I know some stuff about Python scripting already.
If I didn’t, I would ask the following questions:
How do I set up Beautiful Soup? How do I set up Python?
Do I need ChromeDriver? What does that mean? What version do I need? What folder should I keep it in?
First: Read a Google SERP
Second: Go to a result page
Third: Find the links on the page
Fourth: Follow those links!
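The "find the links" step can be sketched with Beautiful Soup alone. To keep the example runnable offline, it parses a hard-coded HTML string instead of a live page; in the real script that HTML would come from the browser Selenium is driving. The function name `find_links`, the sample markup, and the example URLs are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Stand-in for the HTML of a result page.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/post">A post</a>
  <a href="#top">Back to top</a>
  <a>No href here</a>
</body></html>
"""


def find_links(html, base_url):
    """Return absolute URLs for every followable <a href> on the page, in order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):  # skips <a> tags without href
        href = anchor["href"]
        if href.startswith("#"):
            continue  # in-page anchors lead nowhere new
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in links:
            links.append(absolute)
    return links


links = find_links(SAMPLE_HTML, "https://example.com/guide")
print(links)
```

Following the links is then a loop over `links`, fetching each URL and running the same function on what comes back.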
Let’s run a script from the command line
Ah! An Error!
Common sense: What does it seem like the problem is?
Google it: someone has done this before!
READ THE DOCS: The docs have your answers! Probably!
The fix here: update ChromeDriver to match the installed Chrome version.
Now we have data: what do we do with it?
First clean it:
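A minimal sketch of what "cleaning" can mean here, assuming the goal is to scrub the tags and keep the readable text. The function name `clean_html` and the sample markup are illustrative, not the code from the talk.

```python
from bs4 import BeautifulSoup

RAW_HTML = """
<html><head><style>p {color: red;}</style></head>
<body><script>console.log("hi");</script>
<h1>Local   SEO</h1>
<p>Crawlers <b>crawl</b> pages</p></body></html>
"""


def clean_html(html):
    """Strip tags, scripts, and styles, then collapse runs of whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # soup(...) is shorthand for find_all
        tag.decompose()  # remove the tag and everything inside it
    text = soup.get_text(separator=" ")
    return " ".join(text.split())


cleaned = clean_html(RAW_HTML)
print(cleaned)
```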
Then send it to a CSV:
df.to_csv('out.csv')
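The `df.to_csv('out.csv')` call assumes a pandas DataFrame named `df`. A small end-to-end sketch, with made-up rows and column names, might look like this:

```python
import pandas as pd

# Placeholder rows; in the real script these come from the crawl.
rows = [
    {"keyword": "local seo", "url": "https://example.com/guide", "link_count": 12},
    {"keyword": "local seo", "url": "https://example.org/post", "link_count": 7},
]

df = pd.DataFrame(rows)
df.to_csv("out.csv", index=False)  # index=False leaves the row numbers out of the file

# Read the file back to confirm what was written.
print(pd.read_csv("out.csv"))
```

Passing `index=False` is a choice, not a requirement; without it pandas writes an extra unnamed column holding the row index.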
Databases, CSVs, and how to choose
You should use databases if you have a team, if you’re going to handle big data, and if you have a budget for it!
You should use CSVs if you want to be flexible, need an easy-to-convert format, or don’t have the time/energy/patience for a database.
Set up your code to be reusable, so if you start with one you can move to the other
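One way to keep that door open is to hide storage behind a tiny interface, so the crawler only ever calls `save`. The sketch below uses the standard library's `csv` and `sqlite3` modules; the class names, the `results` table, and the two-column layout are all assumptions for illustration.

```python
import csv
import sqlite3
from typing import Protocol

Row = dict[str, str]
FIELDS = ["keyword", "url"]


class ResultStore(Protocol):
    """Anything with a save() method can act as storage."""
    def save(self, rows: list[Row]) -> None: ...


class CsvStore:
    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, rows: list[Row]) -> None:
        # Overwrites the file with a header row plus one line per result.
        with open(self.path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)


class SqliteStore:
    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, rows: list[Row]) -> None:
        # Appends the rows to a "results" table, creating it if needed.
        conn = sqlite3.connect(self.path)
        try:
            conn.execute("CREATE TABLE IF NOT EXISTS results (keyword TEXT, url TEXT)")
            conn.executemany(
                "INSERT INTO results (keyword, url) VALUES (:keyword, :url)", rows
            )
            conn.commit()
        finally:
            conn.close()


def store_results(store: ResultStore, rows: list[Row]) -> None:
    """The crawler calls this and never needs to know which backend is in use."""
    store.save(rows)


rows = [{"keyword": "local seo", "url": "https://example.com/guide"}]
store_results(CsvStore("results.csv"), rows)
store_results(SqliteStore("results.db"), rows)
```

Moving from CSV to a database then means changing one constructor call rather than rewriting the crawler.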
Time to analyze the results
Different ways to analyze:
APIs - send a request to a URL and use the code/resources behind it; an API is the interface to someone else’s service or library
Libraries - pull in other people’s code
DIY - build it yourself!
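As a taste of the DIY route, here is a deliberately simple term counter built only from the standard library. It is a stand-in, not the analysis from the talk: the function name `top_terms`, the tiny stopword list, and the sample sentence are all made up for illustration, and real entity analysis needs far more than word counts.

```python
import re
from collections import Counter

# A very small, hand-picked stopword list (assumption for the example).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def top_terms(text, limit=3):
    """Return the most frequent non-stopword terms as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(limit)


sample = "The crawler crawls the page. The page links to another page."
print(top_terms(sample))
```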
Let’s use a library for analysis:
Awesome results?
So why did we do this?
03.
Current industry tooling is great…
But doing it yourself is great too.
Technical SEO is all about understanding how machines work
And it is easier to understand how Google works when you understand how crawlers work
Nothing gets you to understand Google better than slamming your head against some code
Developers are flawed.
Machines are flawed.
Understanding how your tools are flawed can help you compensate for that.