Overview of Analytics Infrastructure
July 2014
Hadoop and More
Outline
History
Analytics Cluster now(ish)
Infrastructure and Technology overview
Developing in MediaWiki Vagrant
History
Data Sources
MediaWiki databases
Queryable slaves are already available for analysts; this works (mostly) great!
webrequest logs
A log line for every WMF HTTP request
This can peak at 200,000 requests per second (e.g. during the 2014 World Cup Final)
udp2log
Works great, but ...
Doesn't scale: every udp2log instance must see every network packet.
Works only for simple use cases and lower-traffic scenarios.
What we want
ALL webrequest logs saved for easy and fast analysis.
Analytics Cluster
Hadoop for regular batch processing of logs
Hive tables for easy SQL querying of webrequest logs
Analytics Cluster
Hue
Hadoop
Hive
Kafka
Camus
Oozie
HDFS
MapReduce
Analytics Cluster
Apache Hadoop
(distributed filesystem + framework for distributed computation) == Hadoop
Code is sent to data, rather than data retrieved and then processed locally.
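The "send code to the data" model can be illustrated with a toy, single-process sketch of MapReduce (function names here are ours, not Hadoop's): a mapper emits key/value pairs from local records, a shuffle groups them by key, and a reducer aggregates each group.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper over local records, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Run the reducer over each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy job: count requests per uri_host from (hypothetical) log records.
logs = [
    {"uri_host": "en.wikipedia.org"},
    {"uri_host": "de.wikipedia.org"},
    {"uri_host": "en.wikipedia.org"},
]
mapper = lambda rec: [(rec["uri_host"], 1)]
reducer = lambda key, values: sum(values)
counts = reduce_phase(shuffle(map_phase(logs, mapper)), reducer)
print(counts)  # {'en.wikipedia.org': 2, 'de.wikipedia.org': 1}
```

In real Hadoop the mapper and reducer run in parallel on the nodes that hold each block of the input, which is what makes shipping the code cheaper than shipping the data.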
How to get the top referrals for an article?
Apache Hive
Maps an SQL-like language (HiveQL) into Hadoop MapReduce jobs.
SELECT
  SUBSTR(referer, 30) AS source, COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  uri_path = '/wiki/London'
  AND uri_host = 'en.wikipedia.org'
  AND referer LIKE 'http://en.wikipedia.org/wiki/%'
  AND http_status = 200
  AND webrequest_source = 'text'
  AND year = 2014 AND month = 7 AND day = 14 AND hour = 14
GROUP BY
  SUBSTR(referer, 30)
ORDER BY hits DESC
LIMIT 50;
Apache Kafka
Kafka is a reliable, performant, distributed pub/sub log buffer.
Broker nodes act as log buffer for incoming data.
No single masters, all brokers are peers.
Horizontally scalable.
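The log-buffer idea can be sketched in a few lines (this toy class is ours, not Kafka's API): producers append to an append-only partition log, and each consumer tracks its own offset, so consumption can safely lag behind production.

```python
class PartitionLog:
    """Toy single-partition log: append-only, consumers track their own offsets."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        """Producers append; the broker returns the message's offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        """Consumers read from their last committed offset onward."""
        return self.messages[offset:offset + max_messages]

log = PartitionLog()
for line in ["GET /wiki/London", "GET /wiki/Paris", "GET /wiki/Berlin"]:
    log.append(line)

# A periodic consumer (e.g. Camus) resumes from its last committed offset.
committed_offset = 1
batch = log.read(committed_offset)
print(batch)  # ['GET /wiki/Paris', 'GET /wiki/Berlin']
```

Because the log is durable and offsets belong to consumers, a consumer that runs only every 10 minutes loses nothing; it just reads a bigger batch.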
~200,000 messages per second (~30 MB per second), consumed into HDFS every 10 minutes
Camus
LinkedIn's open-source Kafka-to-HDFS pipeline.
Launches a MapReduce job to do distributed loads.
Stores data in time-bucketed directories, based on the timestamp inside each message's content.
Data guaranteed to be in the proper time bucketed HDFS directory. e.g. a request on 2014-07-15 00:00:00 will be in .../2014/07/15/00, and not accidentally in .../2014/07/14/23.
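The bucketing rule above can be sketched as a small helper (the function name and base path are ours, for illustration): the HDFS directory is derived from the request's own timestamp, not from when the message happened to be consumed.

```python
from datetime import datetime

def time_bucket_path(base, timestamp):
    """Map a request timestamp to its hourly HDFS directory."""
    return timestamp.strftime(f"{base}/%Y/%m/%d/%H")

# A request stamped 2014-07-15 00:00:00 lands in .../2014/07/15/00,
# even if it is consumed as part of the 2014-07-14 23:xx batch.
path = time_bucket_path("/wmf/data/raw/webrequest", datetime(2014, 7, 15, 0, 0, 0))
print(path)  # /wmf/data/raw/webrequest/2014/07/15/00
```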
Oozie
Hadoop job scheduler.
Allows complex workflows to be chained together.
Jobs launched based on the existence of new datasets, not simply on a time-based schedule (e.g. hourly).
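For a sense of what dataset-triggered scheduling looks like, here is a rough Oozie coordinator fragment (app name, paths, and dates are ours, for illustration): the workflow fires each hour only once that hour's webrequest directory exists and carries a `_SUCCESS` done-flag.

```xml
<coordinator-app name="webrequest-hourly" frequency="${coord:hours(1)}"
                 start="2014-07-15T00:00Z" end="2014-08-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="webrequest" frequency="${coord:hours(1)}"
             initial-instance="2014-07-15T00:00Z" timezone="UTC">
      <!-- One directory per hour, as written by Camus -->
      <uri-template>hdfs:///wmf/data/raw/webrequest/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The job waits until the current hour's dataset instance exists -->
    <data-in name="input" dataset="webrequest">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///wmf/oozie/webrequest-hourly-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```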
Hue
Web GUI for interacting with Hadoop, Hive, Oozie, etc.
Analytics Cluster
Developing in...
Analytics cluster is thoroughly puppetized and works easily in Labs and in Vagrant.
Developing in Labs
Labs is slightly more complicated right now, as you need to add puppet group roles to your project.
ToDo: make this easier to use in any Labs Project and document usage on Wikitech. :)
Developing in MediaWiki Vagrant
ToDo: Include sample webrequest Hive table in Vagrant.
Thanks! Questions?