1 of 60

How to build your own crawler, and why you should give it a try

Jess Peck | Local SEO Guide

slideshare.net/JessPeck2/

@jessthebp

2 of 60

ABOUT ME:

I’m Jess.

I like SEO.

I work at LSG.

3 of 60

Follow along at home!

4 of 60

What is a crawler and what are we doing here?

01.

5 of 60

Crawlers… crawl

6 of 60

We all know how Google works already

Crawl

Render

Index

7 of 60

SEOs use crawlers for all sorts of reasons

find broken links

analyze SERPs

check schema

8 of 60

WE LOVE PREBUILT CRAWLERS

9 of 60

But sometimes, there are things pre-built crawlers just can’t do…

11 of 60

And some things might be a little out of scope…

12 of 60

So sometimes you gotta do it yourself

13 of 60

Oh god. We have to code.

14 of 60

HOW TO BUILD A CRAWLER

02.

15 of 60

FIRST THINGS FIRST: What do we want to do?

Definitely not against the TOS

  1. GET GOOGLE RESULTS

16 of 60

FIRST THINGS FIRST: What do we want to do?

This is the “crawling” bit

2. CRAWL THE TOP 10

17 of 60

FIRST THINGS FIRST: What do we want to do?

Scrub those tags outta here

3. CLEAN THE RESULTS

18 of 60

FIRST THINGS FIRST: What do we want to do?

By the power of Python…

4. ANALYZE THE CONTENT

19 of 60

If I didn’t have experience, I’d do some research here

20 of 60

What programming language do I want to use?

21 of 60

How do I want the output to look?

22 of 60

What is the input going to look like?

23 of 60

Where do I want to run this, and how?

24 of 60

Input: keyword

Output: page on SERPs, entities on page, pages that page links to
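One way to pin down that input and output before writing any crawling code is a small record type. This is only a sketch: the class name and every field name are hypothetical, chosen to mirror the slide, and `position` is an extra field assumed here for the page's rank.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageResult:
    """One page found on the SERP for a keyword (all names are hypothetical)."""
    keyword: str    # the input
    url: str        # page on the SERP
    position: int   # assumed extra field: 1-based rank on the SERP
    entities: List[str] = field(default_factory=list)        # entities on page
    outbound_links: List[str] = field(default_factory=list)  # pages that page links to


result = PageResult(keyword="local seo", url="https://example.com/guide", position=1)
result.outbound_links.append("https://example.com/about")
print(result)
```

Writing the shape down first makes it easier to decide later whether each row goes to a CSV or a database.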

25 of 60

Where do we start?

26 of 60

I’m gonna use a library I’m already familiar with

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Selenium automates the browser: it loads pages and drives the controls.

27 of 60

So we have our order

“analyze” Google

Get results

analyze results

28 of 60

HARK! A SERP

29 of 60

I know some stuff about Python scripting already.

If I didn’t, I would ask the following questions:

30 of 60

How do I set up Beautiful Soup? How do I set up Python?

31 of 60

Do I need ChromeDriver? What does that mean? What version do I need? What folder should I keep it in?

32 of 60

First: Read a Google SERP

33 of 60

Second: Go to a result page

34 of 60

Third: find the links on the page
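A rough sketch of the link-finding step. The talk uses Beautiful Soup for this (where the lookup would be `soup.find_all("a", href=True)`); the version below swaps in Python's built-in `html.parser` so it runs without installing anything. The class name, function name, and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page anchors and non-page schemes.
        if href and not href.startswith(("#", "javascript:", "mailto:")):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def find_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample_html = """
<p>Read the <a href="/docs">docs</a> or visit
<a href="https://example.org/blog">the blog</a>.
<a href="#top">Back to top</a></p>
"""
links = find_links(sample_html, "https://example.com/page")
print(links)
```

Resolving relative links with `urljoin` matters because the next step needs full URLs it can actually request.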

35 of 60

Fourth: Follow those links!
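Following links is, at its core, a queue plus a set of pages already seen. Here is a minimal sketch of that loop, assuming a `fetch_links` function that returns the URLs found on a page. In this example it is faked with a dictionary, so no real requests are made and all names are hypothetical.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=10):
    """Breadth-first crawl; fetch_links(url) must return a list of URLs on that page."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# A tiny fake "web" standing in for real HTTP requests.
fake_site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}
order = crawl("https://example.com/", lambda url: fake_site.get(url, []))
print(order)
```

The `max_pages` cap is there so a crawl of a real site cannot run forever; the `seen` set stops pages that link to each other from looping.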

36 of 60

Let’s run a script from the command line
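For a script to take a keyword from the command line, `argparse` from the standard library is one common option. This is a sketch only: the argument names and the `--max-pages` flag are assumptions for illustration, not the talk's actual script.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Crawl the top results for a keyword.")
    parser.add_argument("keyword", help="search term to look up")
    parser.add_argument("--max-pages", type=int, default=10,
                        help="how many result pages to crawl")
    # argv=None makes argparse read the real command line (sys.argv).
    return parser.parse_args(argv)


# Passing a list here simulates typing the arguments in a terminal.
args = parse_args(["local seo", "--max-pages", "5"])
print(args.keyword, args.max_pages)
```

Saved as a file (say, a hypothetical `crawler.py`) and calling `parse_args()` with no arguments, this would be run as `python crawler.py "local seo" --max-pages 5`.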

37 of 60

Ah! An Error!

38 of 60

Common sense: What does it seem like the problem is?

39 of 60

Google it: someone has done this before!

40 of 60

READ THE DOCS: The docs have your answers! Probably!

41 of 60

The fix here: update ChromeDriver.

42 of 60

Now we have data: what do we do with it?

43 of 60

First clean it:

Then send it to a csv

df.to_csv('out.csv')
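The `df.to_csv('out.csv')` call on the slide is pandas. As a dependency-free sketch of the same clean-then-export idea, the example below strips tags with the built-in `html.parser` and writes rows with the `csv` module. The class, function, and column names are assumptions for illustration, and the sample HTML is invented.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


rows = [
    {"url": "https://example.com/",
     "text": clean_html("<h1>Hi</h1><script>x=1</script><p>Some  text</p>")},
]

buffer = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=["url", "text"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```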

44 of 60

Databases, CSVs, and how to choose

You should use databases if you have a team, if you’re going to handle big data, and if you have a budget for it!

45 of 60

Databases, CSVs, and how to choose

You should use CSVs if you want to be flexible, need an easy-to-convert format, or don’t have the time, energy, or patience for a database

46 of 60

Databases, CSVs, and how to choose

Set up your code to be reusable, so if you start with one you can move to the other
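One way to keep that option open is to hide storage behind a single `save()` method, so the crawler never knows which backend it is talking to. This is a minimal sketch with hypothetical class, table, and column names; SQLite stands in for "a database" only because it ships with Python.

```python
import csv
import io
import sqlite3


class CsvStore:
    """Writes rows as CSV to any file-like object."""

    def __init__(self, handle):
        self.writer = csv.writer(handle)
        self.writer.writerow(["url", "text"])  # header row

    def save(self, url, text):
        self.writer.writerow([url, text])


class SqliteStore:
    """Writes rows to a SQLite table; ':memory:' keeps it in RAM for the demo."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, text TEXT)")

    def save(self, url, text):
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (url, text))
        self.conn.commit()


def store_results(store, results):
    # The crawler only ever calls save(), so either backend works.
    for url, text in results:
        store.save(url, text)


results = [("https://example.com/", "Hi there"), ("https://example.com/a", "More text")]

csv_buffer = io.StringIO()
store_results(CsvStore(csv_buffer), results)

db = SqliteStore()
store_results(db, results)
row_count = db.conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(row_count)
```

Starting with `CsvStore` and moving to a database later then means changing one line where the store is created, not the crawler itself.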

47 of 60

Time to analyze the results

48 of 60

Different ways to analyze:

APIs - send a request to a URL and use the code or resources behind it; an API is the interface to someone else’s service or library

Libraries - pull in other people’s code

DIY - build it yourself!
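As a taste of the DIY route, here is a very small term-frequency count built from the standard library. It is a much simpler stand-in for the entity analysis the talk is aiming at, not an entity extractor; the stopword list, function name, and sample text are all made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def top_terms(text, n=2):
    """Return the n most frequent non-stopword terms as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


page_text = "Local SEO is the practice of local search. Local search is growing."
terms = top_terms(page_text)
print(terms)
```

Running the same function over each cleaned page from the top results gives a rough picture of which terms the ranking pages have in common.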

49 of 60

Let’s use a library for analysis:

50 of 60

Awesome results?

51 of 60

So why did we do this?

03.

52 of 60

Current industry tooling is great…

53 of 60

Current industry tooling is great…

But doing it yourself is great too.

54 of 60

Technical SEO is

all about understanding how machines work

55 of 60

And it is easier to understand how Google works when you understand how crawlers work

56 of 60

Technical SEO is all about understanding how machines work

Nothing gets you to understand Google better than slamming your head against some code

57 of 60

Developers are flawed.

58 of 60

Machines are flawed.

59 of 60

Understanding how your tools are flawed can help you compensate for that.

60 of 60

THANKS!

Does anyone have any questions?

@jessthebp

jessbpeck.com
