Overview of Analytics Infrastructure
July 2014
Hadoop and More
Outline
History
Analytics Cluster now(ish)
Infrastructure and Technology overview
Developing in MediaWiki Vagrant
History
Data Sources
MediaWiki databases
Queryable slaves are already available for analysts; this works (mostly) great!
webrequest logs
A log line for every WMF HTTP request
This can peak at 200,000 requests per second (e.g. during the 2014 World Cup Final)
udp2log
Works great, but ...
Doesn't scale: every udp2log instance must see every network packet.
Works only for simple use cases and lower-traffic scenarios.
What we want
ALL webrequest logs saved for easy and fast analysis.
Analytics Cluster
Hadoop for regular batch processing of logs
Hive tables for easy SQL querying of webrequest logs
Analytics Cluster
Hue
Hadoop
Hive
Kafka
Camus
Oozie
HDFS
MapReduce
Analytics Cluster
Apache Hadoop
(distributed filesystem + framework for distributed computation) == Hadoop
Code is sent to data, rather than data retrieved and then processed locally.
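The "send code to the data" model can be illustrated with a toy, single-process sketch of MapReduce (function names here are ours, not Hadoop's): a mapper emits key/value pairs from local records, a shuffle groups them by key, and a reducer aggregates each group.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper over local records, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Run the reducer over each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy job: count requests per uri_host from (hypothetical) log records.
logs = [
    {"uri_host": "en.wikipedia.org"},
    {"uri_host": "de.wikipedia.org"},
    {"uri_host": "en.wikipedia.org"},
]
mapper = lambda rec: [(rec["uri_host"], 1)]
reducer = lambda key, values: sum(values)
counts = reduce_phase(shuffle(map_phase(logs, mapper)), reducer)
print(counts)  # {'en.wikipedia.org': 2, 'de.wikipedia.org': 1}
```

In real Hadoop the mapper and reducer run in parallel on the nodes that hold each block of the input, which is what makes shipping the code cheaper than shipping the data.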
How to get the top referrals for an article?
Apache Hive
Maps an SQL-like language (HiveQL) into Hadoop MapReduce jobs.
SELECT
  SUBSTR(referer, 30) AS source, COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  uri_path = '/wiki/London'
  AND uri_host = 'en.wikipedia.org'
  AND referer LIKE 'http://en.wikipedia.org/wiki/%'
  AND http_status = 200
  AND webrequest_source = 'text'
  AND year = 2014 AND month = 7 AND day = 14 AND hour = 14
GROUP BY
  SUBSTR(referer, 30)
ORDER BY hits DESC
LIMIT 50;
Apache Kafka
Kafka is a reliable, performant, distributed pub/sub log buffer.
Broker nodes act as log buffer for incoming data.
No single masters, all brokers are peers.
Horizontally scalable.
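The log-buffer idea can be sketched in a few lines (this toy class is ours, not Kafka's API): producers append to an append-only partition log, and each consumer tracks its own offset, so consumption can safely lag behind production.

```python
class PartitionLog:
    """Toy single-partition log: append-only, consumers track their own offsets."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        """Producers append; the broker returns the message's offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        """Consumers read from their last committed offset onward."""
        return self.messages[offset:offset + max_messages]

log = PartitionLog()
for line in ["GET /wiki/London", "GET /wiki/Paris", "GET /wiki/Berlin"]:
    log.append(line)

# A periodic consumer (e.g. Camus) resumes from its last committed offset.
committed_offset = 1
batch = log.read(committed_offset)
print(batch)  # ['GET /wiki/Paris', 'GET /wiki/Berlin']
```

Because the log is durable and offsets belong to consumers, a consumer that runs only every 10 minutes loses nothing; it just reads a bigger batch.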
~200,000 messages per second (~30 MB per second), consumed into HDFS every 10 minutes
Camus
LinkedIn's open-source Kafka-to-HDFS pipeline.
Launches a MapReduce job to do distributed loads.
Stores data in time-bucketed directories, based on the timestamp inside each message's content.
Data guaranteed to be in the proper time bucketed HDFS directory. e.g. a request on 2014-07-15 00:00:00 will be in .../2014/07/15/00, and not accidentally in .../2014/07/14/23.
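The bucketing rule above can be sketched as a small helper (the function name and base path are ours, for illustration): the HDFS directory is derived from the request's own timestamp, not from when the message happened to be consumed.

```python
from datetime import datetime

def time_bucket_path(base, timestamp):
    """Map a request timestamp to its hourly HDFS directory."""
    return timestamp.strftime(f"{base}/%Y/%m/%d/%H")

# A request stamped 2014-07-15 00:00:00 lands in .../2014/07/15/00,
# even if it is consumed as part of the 2014-07-14 23:xx batch.
path = time_bucket_path("/wmf/data/raw/webrequest", datetime(2014, 7, 15, 0, 0, 0))
print(path)  # /wmf/data/raw/webrequest/2014/07/15/00
```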
Oozie
Hadoop job scheduler.
Allows complex workflows to be chained together.
Jobs launched based on the existence of new datasets, not simply on a time-based schedule (e.g. hourly).
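For a sense of what dataset-triggered scheduling looks like, here is a rough Oozie coordinator fragment (app name, paths, and dates are ours, for illustration): the workflow fires each hour only once that hour's webrequest directory exists and carries a `_SUCCESS` done-flag.

```xml
<coordinator-app name="webrequest-hourly" frequency="${coord:hours(1)}"
                 start="2014-07-15T00:00Z" end="2014-08-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="webrequest" frequency="${coord:hours(1)}"
             initial-instance="2014-07-15T00:00Z" timezone="UTC">
      <!-- One directory per hour, as written by Camus -->
      <uri-template>hdfs:///wmf/data/raw/webrequest/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The job waits until the current hour's dataset instance exists -->
    <data-in name="input" dataset="webrequest">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///wmf/oozie/webrequest-hourly-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```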
Hue
Web GUI for interacting with Hadoop, Hive, Oozie, etc.
Analytics Cluster
Developing in...
Analytics cluster is thoroughly puppetized and works easily in Labs and in Vagrant.
Developing in Labs
Labs is slightly more complicated right now, as you need to add puppet group roles to your project.
ToDo: make this easier to use in any Labs Project and document usage on Wikitech. :)
Developing in MediaWiki Vagrant
ToDo: Include sample webrequest Hive table in Vagrant.
Thanks! Questions?