1 of 50

Religious Nonprofit Organization

  • Located in Texas, runs two local religious centers
  • Website built in Wordpress
  • 6000 visits/month
  • Plan to switch to Google Analytics to get a better sense of user data

Goal:

  • Determine where this web traffic is coming from
  • Increase this web traffic through the production of content tailored to my audience
  • Monetize this web traffic through ad revenue

2 of 50

3 of 50

Meet Your Audience

4 of 50

Data from Incapsula

Bots made up the majority of web traffic until 2015.

Year - Bot Traffic

2012 - 51%

2013 - 61.5%

2014 - 56%

2015 - 48.5%

===

Current 2015 Breakdown

Humans - 51.5%

Bad Bots - 29.0%

Good Bots - 19.5%

5 of 50

6 of 50

How To Manually Block A Bot (don’t do this)

On an Apache server, each directory can have a .htaccess file. This file controls which IPs can and cannot view your site. You can ban by IP or by certain User-Agent strings.

I applied a blanket IP ban to all possible Russian IPs.

The bots just changed their IP addresses.

(These bots could be using proxies, or could have hijacked other people’s computers...honestly, it didn’t really matter how they came back.

I could also block by User-Agent, but again, that’s even more trivial to change.)

7 of 50

Example of .htaccess

order allow,deny
deny from 123.45.6.7
deny from 012.34.5
allow from all

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw
RewriteRule ^.* - [F,L]

8 of 50

9 of 50

Bots Are Becoming More Advanced

They can use JavaScript (Google Analytics uses JavaScript to track visitors).

They can perform the same interactions as humans can (filling out forms and pressing buttons).

They can break less-advanced CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) using OCR. Or they can rely on humans to break the CAPTCHAs for them.

They can be identified, but it requires people to pay attention to subtle clues (such as bounce rates).

And even if you do decide to identify and block them, they can just come back.

10 of 50

Bots Are Becoming More Advanced

To truly deal with the bot threat, you need to automate away the process of “discovery”. You have to algorithmically determine whether a visitor is a human or a bot, and algorithmically update your “block” list.

Not only does it take some time to set up, but you’re probably going to get some errors...

  • False Positives (accidentally blocking humans)
  • False Negatives (accidentally letting a bot through)

Of course, you can always pay a software firm to detect and block the bots for you, but where’s the fun in that?

11 of 50

My “Solution”?

I was targeting a specific audience, the community of Texas.

I assumed that the bots weren’t specifically pretending to be Texans (but were instead picking their impersonations at random), so I simply filtered all non-Texas traffic out of my Google Analytics reports.

If a few ‘bot visits’ get through, I’ll just accept the ‘dirty data’.

This works, but it would not be a scalable solution if I were running a more “international” website. (For example, I have a personal website that is intended to advertise myself and is not targeted to any geographic region. I can’t just filter out every person who is not from Nebraska.)

12 of 50

Our Schedule

  • Learn About The Different Types of “Bots”
  • Hear My Suggestions On How To Deal With Each of Them
  • Learn About “Project Honeypot”, an open-source project that is responsible for identifying and blocking some of these bots
  • Learn about “Rack::Attack”, a Ruby gem that can programmatically ban bots based on criteria you define

13 of 50

Types of Bots

14 of 50

15 of 50

The Bots

“Good” Bots

  • Crawlers
  • Website Scan Tools

“Bad” Bots

  • Scrapers
  • DDoS Bots
  • Hacking Tools
  • Ad Bots
  • Spammers

16 of 50

Good Bots (3 slides)

17 of 50

The “Good Bots”

Website Scan Tools analyze websites and make reports to their users. (Example: responsiveness, text summaries, keywords, etc.)

Crawlers visit a website and store information about it. Then they follow the links on that website and keep on “crawling” to other sites.

Examples of Crawlers:

  • Archive.org
  • RSS Feeds
  • Search Engines
  • Chatbots
  • Sites that Use The “Open Graph” Meta-Tags

Googlebot (picture from Google’s Search Engine Optimization Starter Guide, written by Google)

18 of 50

Dealing with Crawlers

  • Generally, you want to pander to crawlers, such as Googlebot or Archive.org, since they will provide more exposure to your website.
    • This means adhering to the crawlers’ quirks (or exploiting them if you happen to prefer “Black Hat SEO”).
  • However, you can block certain bots by modifying “robots.txt”, either to block all bots or to name the specific crawlers you want to block (such as Googlebot or Archive.org). You can also specify which parts of the site are off-limits (see the sketch at the end of this slide).
    • Robots.txt is a voluntary standard...bots don’t have to read it or adhere to it. Legitimate crawlers, like Googlebot, will follow robots.txt, though.
    • Some evil bots check robots.txt first to see which parts of your site you don’t want them to go to…

Crawlers are interested in gathering information from all sites, not just the most popular ones.
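A minimal robots.txt sketch (the paths and the blocked crawler here are placeholders, not from the original example): it asks every well-behaved bot to stay out of a hypothetical /drafts/ directory, and asks one specific crawler to stay away from the whole site.

# robots.txt -- placed at the site root
User-agent: *
Disallow: /drafts/

# Block one specific crawler from everything
User-agent: Googlebot
Disallow: /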

19 of 50

Dealing with Website Scan Tools

You were the one that authorized the scan, right?

20 of 50

Bad Bots (14 slides)

21 of 50

Scrapers

Scrapers load the HTML directly and then parse it for valuable information to transmit back to their human masters. Scrapers tend to search for:

  • Email Addresses (for spamming)
  • Prices on e-commerce/insurance websites (to better compete against them)
  • Content (to repost on other websites)

Many people build APIs to discourage the use of scrapers (since scrapers can hurt a website’s performance).

22 of 50

Dealing with Scrapers (Email)

Make your email easy for a human to read, but difficult for a robot to copy. There are many ways of doing this, from least-effective to most-effective:

  1. example@REMOVETHISexample.com
  2. exampleATexampleDOTcom
  3. example&#64;example&#46;com (the @ and . replaced with HTML character entities; a browser renders this as example@example.com)
  4. The entire address written in HTML character entities (still renders as example@example.com)
  5. <script type='text/javascript'>var a = new Array('example@','example.','com');document.write("<a href='mailto:"+a[0]+a[1]+a[2]+"'>"+a[0]+a[1]+a[2]+"</a>");</script>

23 of 50

Dealing with Scrapers (Content, Prices)

  • I don’t know.
  • I mean, you could just find specific bots and block them...but they’ll just find another way...
  • You could try to change the markup of the site regularly (so that the people who are running the scraper bots have to spend time reprogramming their bots), or use CAPTCHAs, or stuff everything into images...
  • But anything that is online can already be copied and pasted manually by a human being. And anything that annoys bots could annoy humans too...
  • Possibly a better option is to engage in damage control: reporting scrapers to Google (or suing them if you are a big enough company).
    • Or you can just ignore them, like I do.

24 of 50

DDoS Bots

If too many people visit your website at once, your server goes down. (A human version of this effect is sometimes called “Slashdotting”.)

Why would people want to shut down your website with a DDoS?

  • They run a rival business and want to harm you by forcing your site offline (and potentially causing you to lose data).
  • They want to extort money from you.
  • They don’t like what you said and want to silence you.

50% of all bad bot traffic comes from DDoS Bots (“impersonators”, according to Incapsula). Script kiddies, DDoS-for-hire services, state actors...they can all launch DDoS attacks.

25 of 50

Dealing with DDoS Bots

Truly stopping DDoS bots would require a dedicated team of people who can intercept traffic and decide whether it is ‘legitimate’ before allowing it to proceed…

Cloudflare and Incapsula may be examples of firms that can help analyze web traffic and limit DDoS attacks. (Note that I never actually tried them, so I don’t know whether they are actually good. Sorry.)

You could also hope that you don’t offend any possible DDoS attacker (though DDoS attacks are fairly cheap to purchase online), and then clean up the mess afterwards.

26 of 50

Hacking Tools

They scan your website, trying to find possible vulnerabilities that can then be exploited by a hacker.

Things They May Attempt To Do:

  • Bruteforce logins
  • SQL Injections
  • Taking advantage of vulnerabilities in CMS software (like Wordpress)
  • Site Defacement

27 of 50

In March 2015, pro-ISIS hackers scanned Wordpress websites looking for anyone using the “FancyBox for Wordpress” plugin, then injected an iframe into the vulnerable sites they found. (The vulnerability had already been discovered and patched a month earlier.)

28 of 50

Dealing with Hacking Tools

As long as you keep up to date with current security practices and keep your CMS/framework updated at all times, you should be “safe” (although this doesn’t protect you against zero-day attacks).

It can take a long time for a bot to bruteforce a password that is randomly generated by a password manager, for instance. It’s usually easier for the bot to give up and move on to a more appealing target. (Some rough math follows.)
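A rough back-of-the-envelope sketch in Ruby (my own illustrative numbers, not from the original slides) of why random passwords defeat online brute force:

# Keyspace math for a 12-character password drawn from ~94 printable ASCII characters.
guesses_needed = 94 ** 12               # ~4.8 * 10**23 possible passwords
guess_rate     = 1_000                  # assume a very generous 1,000 online guesses per second
seconds        = guesses_needed / guess_rate
years          = seconds / (60.0 * 60 * 24 * 365)
puts years                              # on the order of 10**13 years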

29 of 50

Ad Bots

They visit websites and interact with ads (clicking on display ads or watching video ads). They have to continually improve their tactics to avoid detection by ad agencies and appear to be human.

According to White Ops, Inc., global advertisers lost $6.3 billion to these bots this year (out of roughly $40 billion in total ad spending).

30 of 50

31 of 50

Dealing with Ad Bots

If you are an unethical content publisher that relies on ad revenue, you’ll love Ad Bots. You may even be running the bots (or hiring someone else to run those bots). However, advertisers will stop working with you once they find out.

If you are paying for the ads, be suspicious. Engage in continuous fraud monitoring, determine where the traffic is coming from at all times, use better metrics (NOT clicks and views), prepare to spend more money for ‘real’ traffic, and be careful about using “third-party traffic brokers” and programmatic ad buying.

Some clever traffic brokers are able to sell advertisers a mixture of ‘real’ traffic, ‘bot’ traffic, and ‘incentivized’ traffic (paying people in developing countries to see your site).

32 of 50

Spammers

Their goal is to post more and more content on other people’s websites...content that people don’t want.

These spammers hope to post backlinks to their own websites. A human may click on them, but the backlinks are really meant for search engines. (The hope is that Googlebot will see those backlinks and rank the spammers’ sites higher in search results.)

Spam is decreasing due to Google’s attempts to penalize spam backlinks.

33 of 50

Comment Spam

34 of 50

Dealing with Spam

  • Personally, I block all comments. I don’t really see the point of moderating comments, and I seem to detest human-written comments in general.
    • You can always outsource commenting over to a third-party like Disqus or Social Media Websites (Twitter, Facebook, Hacker News).
  • Constant moderation of messages also helps (as does rejecting anything that seems suspicious or spammy).
    • Consider rejecting all comments with links (see the sketch after this list).
    • Make sure to moderate all spammy comments, even if they don’t have links. Some spammers post semi-legitimate comments first to “test” the site before moving on to full-fledged spam attacks.
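A minimal sketch of “reject all comments with links”, written as a Rails-style model validation (the Comment class, the body column, and the URL pattern are my assumptions, not part of the original slides):

class Comment < ActiveRecord::Base
  validate :no_links_allowed

  private

  # Reject any comment body that appears to contain a URL.
  def no_links_allowed
    if body.to_s =~ %r{https?://|www\.}i
      errors.add(:body, "may not contain links")
    end
  end
end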

35 of 50

Project Honeypot

36 of 50

What is a Honeypot?

An attempt to misdirect a malicious actor into doing something that reveals it to be malicious. We offer ‘honey’ to bait a robot into revealing itself.

Robots are good at filling out forms with arbitrary data (that is how spamming works). So why not take advantage of that by giving them a form field that only they would fill out?

37 of 50

A Basic Honeypot

<input id="real_email" type="text" name="real_email" size="25"value="" /><input id="test_email" type="text" name="email" size="25" value=""/><style>#test_email {

display: none;�}

</style>

If a user enters a value for “test_email”, we assume that the user is not a human and discard the form submission. (A server-side sketch of this check appears at the end of this slide.)

SOURCE: https://solutionfactor.net/blog/2014/02/01/honeypot-technique-fast-easy-spam-prevention/
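A minimal server-side check for the hidden field above, sketched with Ruby/Sinatra (the route name is made up; the field names match the HTML snippet):

require 'sinatra'

post '/subscribe' do
  # A human never sees the hidden "email" field, so it should arrive empty.
  if params['email'].to_s.strip != ''
    halt 200                 # pretend it worked, but silently drop the bot's submission
  end
  # ...process params['real_email'] normally for human visitors...
  "Thanks for subscribing!"
end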

38 of 50

A More Complex Honeypot

  1. We place a link on our website that is invisible to humans. A robot browsing the site will still see the link, though, and click on it (see the snippet after this list).
  2. The robot is directed to a page (https://www.projecthoneypot.org/honey_pot_example.php) with a unique email address that is specifically generated for that IP. Sometimes, we may generate form fields instead.
  3. The robot fills out the form or sends an email, thereby confirming that it is a robot. We can then monitor how many forms it fills out or how many emails it sends to determine its threat level.
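A sketch of what such a hidden link might look like (the inline CSS is just one way to hide it; Project Honeypot’s installed scripts generate their own hidden links):

<!-- Humans never see or click this, but a naive crawler still follows it. -->
<a href="https://www.projecthoneypot.org/honey_pot_example.php"
   style="display: none;">Do not follow this link</a>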

39 of 50

Monitoring the Robot

Knowing the threat level of a robot can be helpful for those who want to avoid ‘false positives’.

Just because a robot is using an infected computer right now doesn’t mean that a human won’t use the infected computer in the future.

At the same time, a robot can pose an immediate threat (sending lots of spam messages) that could merit a ban of a month, or even a permanent one.

I personally prefer having a low tolerance for “threat”, and can accept some ‘false positives’.

40 of 50

Bots That “Project Honeypot” Targets

  • Comment Spammers
  • Email Harvesters (Scrapers)

Does not deal with DDoS bots, Hacking Tools, or Ad Bots

41 of 50

HTTP:BL

The real glory of Project Honeypot is that it is a “distributed system”: it gathers information about robots by using a variety of different honeypots throughout the Internet. The data it acquires is accessible to webmasters, so they can programmatically look up IP addresses and determine whether a user is ‘suspicious’ enough to warrant blocking.

However, to access HTTP:BL, you need an API Key and “active” status. To become “active”, you must make an account and then either install a honeypot, add a QuickLink to a honeypot, donate an “MX entry” (the ‘fake’ email addresses for the honeypots), or refer friends to join.

Example of data collected on an IP address: https://www.projecthoneypot.org/ip_91.200.12.7
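HTTP:BL itself is just a DNS lookup: you query your key plus the reversed IP under dnsbl.httpbl.org, and the last three octets of the answer encode days since last activity, threat score, and visitor type. A rough Ruby sketch (the key is a placeholder, and error handling is minimal):

require 'resolv'

HTTPBL_KEY = "yourapikeyhere"   # placeholder -- use your own Project Honeypot key

# Returns a hash of HTTP:BL data for the IP, or nil if the IP is not listed.
def httpbl_lookup(ip)
  reversed = ip.split('.').reverse.join('.')
  answer   = Resolv.getaddress("#{HTTPBL_KEY}.#{reversed}.dnsbl.httpbl.org")
  _always_127, days, threat, type = answer.split('.').map(&:to_i)
  { days_since_last_activity: days, threat_score: threat, visitor_type: type }
rescue Resolv::ResolvError
  nil
end

# Example: treat anything with a threat score above 25 as suspicious.
result = httpbl_lookup("91.200.12.7")
puts "suspicious visitor" if result && result[:threat_score] > 25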

42 of 50

HTTP:BL

Examples of Open Source Projects That Connect To HTTP:BL:

WordPress Plugin - https://github.com/WP-http-BL/httpbl

(Currently removed from the Wordpress plugin directory because a security vulnerability was discovered and nobody was maintaining the plugin)

Ruby Gem - https://github.com/cmaxw/project-honeypot

Note that you must have an API Key from Project Honeypot to be able to use HTTP:BL.

43 of 50

Rack::Attack

44 of 50

Rack::Attack

Ruby gem created by “Kickstarter” (yes, the company).

  • Identify bad behavior, and block people based on their bad behavior

It is aimed at scrapers and brute-force attackers, since the company has to spend money providing ‘service’ to these robots (and would rather save that money by blocking the bad traffic instead).

https://github.com/kickstarter/rack-attack

45 of 50

class Rack::Attack

  # Handle "Repeated Requests"
  # Throttle all requests by IP (60rpm)
  # Key: "rack::attack:#{Time.now.to_i/:period}:req/ip:#{req.ip}"
  throttle('req/ip', :limit => 300, :period => 5.minutes) do |req|
    req.ip # unless req.path.start_with?('/assets')
  end

  # Stopping "Brute Force" Login Attempts
  throttle('logins/ip', :limit => 5, :period => 20.seconds) do |req|
    if req.path == '/login' && req.post?
      req.ip
    end
  end

  throttle("logins/email", :limit => 5, :period => 20.seconds) do |req|
    if req.path == '/login' && req.post?
      # return the email if present, nil otherwise
      req.params['email'].presence
    end
  end

end
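Rack::Attack is plain Rack middleware, so it also has to be inserted into the application’s middleware stack. A sketch for a Rails app (the blocked IP is a placeholder; the deny-list method is named blocklist in recent versions of the gem and blacklist in older ones):

# config/application.rb
config.middleware.use Rack::Attack

# config/initializers/rack_attack.rb -- deny-list a single known-bad IP outright
Rack::Attack.blocklist('block a known bad IP') do |req|
  req.ip == '123.45.67.89'
end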

46 of 50

Is That Enough?

47 of 50

“When the smartest bad guys figure out how to fool you, they don't tell you you're beaten. What you see instead looks like victory: fraud numbers going down! Then you're only beating the dumb crooks with no sense of what you're missing. This is a game where losing can actually look like winning. So the top action item is to reject complacency.”---Michael Tiffany, CEO of White Ops

Source: http://www.adweek.com/news/advertising-branding/whats-being-done-rein-7-billion-ad-fraud-169743

48 of 50

Should you trust anything any more?

The comment is very generic...it could have applied to any blog post.

Fra Stra also favorited this post…

49 of 50

50 of 50

There is no innocence, only degrees of guilt. -- Warhammer 40K: Dawn of War