1 of 19

Wikipedia Page Views

W251 Final Project

Jordan, Dave, Matt, Utthaman

2 of 19

Overview & Goals

Dataset

  • 2½ years' worth of hourly Wikipedia page view data
  • ~3.5TB of uncompressed data
  • Roughly 33 billion rows of daily data, aggregated by page name & language

Goals

  • Aggregate page view data by day
  • Store it in a NoSQL Cassandra cluster
  • Use Spark to query the data and find interesting results
  • Visualize the results in Tableau

3 of 19

Raw Data

Raw File Structure:

  • Space delimited
  • First column is the language
  • Second column is the article title
  • Third column is the number of page views in the hour
  • Fourth column is a total response size field, which is always 0 in these dumps

Sample file:

en Barabási–Albert_model 2 0
en Barachiel 5 0
en Barachois,_Quebec 1 0
en Barachois_(band) 1 0
en Barachois_Pond_Provincial_Park 1 0
en Barack_(disambiguation) 1 0
en Barack_Obama 296 0
en Barack_Obama's_farewell_address 1 0
en Barack_Obama,_Sr 1 0
en Barack_Obama,_Sr. 5 0
en Barack_Obama_"Hope"_poster 10 0
en Barack_Obama_"Joker"_poster 1 0
en Barack_Obama_Academy_of_International_Studies_6-12 3 0
en Barack_Obama_Democratic_Club_of_Upper_Manhattan 1 0
en Barack_Obama_Presidential_Center 18 0
en Barack_Obama_Sr 4 0
en Barack_Obama_Sr. 22 0
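
Each line splits on whitespace into its fields. A minimal parsing sketch in Python (the function name and the decision to drop the trailing size column are our own, not part of any project script):

    # Parse one line of a raw hourly pageviews file.
    # Fields: language code, article title (underscores instead of spaces),
    # hourly view count, and a trailing size column that is always 0.
    def parse_line(line):
        parts = line.split()
        if len(parts) < 3:
            return None  # skip malformed lines
        return parts[0], parts[1], int(parts[2])

    # parse_line("en Barack_Obama 296 0") -> ("en", "Barack_Obama", 296)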

4 of 19

Cloud Architecture

Main Cluster

4 Virtual Machines

wiki1, wiki2, wiki3, wiki4

Initially, all were set up as:

  • Dual core, 8GB RAM, 25GB & 1000GB disks

Later upgraded to help with ingestion:

  • Quad core, 32GB RAM, 25GB & 1000GB disks

Connected via SSH & private IPs on the same subnet

1TB disks formatted for Cassandra use

Used as the primary servers for the Cassandra and Spark clusters, and where the querying scripts were run

Auxiliary Machines

3 Virtual Machines

wikistorage1, wikistorage2, wikistorage3

Set up as:

  • Dual core, 8GB RAM, 25GB & 1000GB disks

Used as staging nodes for the raw data files

Handled the preprocessing workload and the transfer to the main cluster

5 of 19

Technologies Used

Spark

Spark 1.6 Used

View Cluster:

http://198.23.108.53:8080/

Cassandra

Cassandra 2.2 Used

4 Node Setup (wiki1, wiki2, wiki3, wiki4)

Replication Factor = 2

PySpark - Cassandra Connector

https://github.com/TargetHolding/pyspark-cassandra

Cassandra-Loader

https://github.com/brianmhess/cassandra-loader

Microsoft Excel

Data Visualization

Other Trialed Software

  • Tableau
  • R
  • SQLite
  • Postgres

Full Cluster Setup Instructions

On GitHub (requires access)

https://github.com/jordankupersmith/w251_final_project/cluster_setup_and_node_information.md

6 of 19

Database Structure

7 of 19

Preprocessing

Preprocessing consists of building a single daily file that aggregates the counts from each hourly file.

Doing this with a Python dictionary used roughly 5.5GB of RAM and took about 13 minutes per 24 data files (1 day of pageview data).

Four servers were provisioned, each with 8GB RAM and 2 CPUs, resulting in a total runtime of just under 48 hours.
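
A minimal sketch of that dictionary-based aggregation, assuming 24 gzipped hourly files per day; the paths, output format (date, language, page, views), and function name are illustrative rather than the actual preprocessing script:

    import gzip
    from collections import defaultdict

    def aggregate_day(hourly_paths, out_path, date_str):
        """Sum 24 hourly pageview files into one daily file."""
        daily = defaultdict(int)
        for path in hourly_paths:
            with gzip.open(path, "rt", errors="replace") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 3:
                        continue
                    daily[(parts[0], parts[1])] += int(parts[2])  # key: (language, page)
        with open(out_path, "w") as out:
            for (language, page), views in daily.items():
                out.write("%s %s %s %d\n" % (date_str, language, page, views))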

8 of 19

Preprocessed Files & Transfer to Cluster

Following preprocessing on the individual wikistorage nodes, the compressed files were transferred and distributed equally amongst the 4 main cluster nodes.

Files were then unzipped in batches of ~100.

After unzipping, a cassandra_loader.py script was run to ingest each file into the Cassandra cluster (replication factor = 2).

Files were moved to a /success folder following ingestion.

The /success folder was periodically cleared to free up disk space. (A sketch of this per-node loop follows below.)
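
A rough sketch of that per-node loop; the directory layout and the idea that cassandra_loader.py takes the file path as an argument are assumptions based on the description above:

    import glob
    import shutil
    import subprocess

    BATCH_SIZE = 100
    gz_files = sorted(glob.glob("/data/daily/*.gz"))  # illustrative staging path

    for i in range(0, len(gz_files), BATCH_SIZE):
        batch = gz_files[i:i + BATCH_SIZE]
        subprocess.check_call(["gunzip"] + batch)   # unzip the batch in place
        for gz in batch:
            daily_file = gz[:-3]                    # path without the ".gz" suffix
            # Ingest into the Cassandra cluster via the loader script.
            subprocess.check_call(["python", "cassandra_loader.py", daily_file])
            # Park ingested files in /success; the folder is cleared periodically.
            shutil.move(daily_file, "/success/")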

833 total daily files

Date range: May 2015 - July 2017

Each file:

  • Uncompressed size: ~1-1.5GB
  • Rows: ~25 million per file

9 of 19

Ingestion into Cassandra

Ingesting such a large amount of data into Cassandra came with many challenges and learnings.

Ingestion Attempt #1:

  • Use the Cassandra CQL shell and its native COPY command
  • Worked well for a single file (though ingestion was slow, ~30 mins/file)
  • When looped, ingestion would hang after the first file completed

Ingestion Attempt #2:

  • Use PySpark to load each file into memory as an RDD
  • Use the pyspark-cassandra connector package to save the RDD as a Cassandra table (sketched below)
  • Ran into challenges operationalizing this
  • Gave up on this approach
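
For reference, a sketch of what Attempt #2 looked like, following the pyspark-cassandra README; the keyspace, table, column names, and file path are illustrative:

    # Run via spark-submit with the pyspark-cassandra package on the classpath.
    from pyspark import SparkConf
    from pyspark_cassandra import CassandraSparkContext

    conf = SparkConf() \
        .setAppName("wiki-rdd-ingest") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = CassandraSparkContext(conf=conf)

    # Turn each preprocessed line (date, language, page, views) into a dict row.
    def to_row(line):
        view_date, language, page, views = line.split(" ")
        return {"view_date": view_date, "language": language,
                "page": page, "views": int(views)}

    rdd = sc.textFile("/data/daily/pageviews-20150501").map(to_row)

    # saveToCassandra(keyspace, table) is provided by pyspark-cassandra.
    rdd.saveToCassandra("wiki", "daily_views")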

Ingestion Attempt #3:

  • Use the cassandra-loader add-on
  • Wrote a Python script to loop through all raw uncompressed files on a single machine and pass each one to cassandra-loader (see the sketch below)
  • Worked well, though ingestion time per file was still 15-30 mins

Parallelized this across all 4 wiki nodes
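
The core of a cassandra_loader.py-style script might look like the sketch below; the -f, -host, -schema, and -delim flags follow the cassandra-loader README, while the keyspace, table, and column names are illustrative (the per-file loop itself was sketched on the previous slide):

    import subprocess
    import sys

    # Illustrative target table; columns must match the order in the daily files.
    SCHEMA = "wiki.daily_views(view_date, language, page, views)"

    def ingest(path, host="wiki1"):
        # cassandra-loader flags: -f input file, -host Cassandra contact point,
        # -schema target keyspace.table(columns), -delim field delimiter.
        subprocess.check_call([
            "cassandra-loader",
            "-f", path,
            "-host", host,
            "-schema", SCHEMA,
            "-delim", " ",
        ])

    if __name__ == "__main__":
        ingest(sys.argv[1])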

10 of 19

Cassandra Tables

Ingested Tables in Cassandra

Example Table Structure
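
The actual table definitions were presented as an image on this slide. As a stand-in, here is a hypothetical daily page view table created through the Python cassandra-driver; the keyspace name, column names, and key layout are assumptions rather than the project's real schema (only the replication factor of 2 comes from the deck):

    from cassandra.cluster import Cluster

    session = Cluster(["wiki1"]).connect()  # contact point is illustrative

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS wiki
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
    """)

    # Hypothetical structure: one row per (language, page) per day.
    session.execute("""
        CREATE TABLE IF NOT EXISTS wiki.daily_views (
            language  text,
            page      text,
            view_date text,
            views     bigint,
            PRIMARY KEY ((language, page), view_date)
        )
    """)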

11 of 19

Machine Reliability & Node Utilization

Node Name    Disk Use*
wiki1        81% (784GB)
wiki2        68% (663GB)
wiki3        60% (585GB)
wiki4        53% (516GB)

* as of Aug 22

Ingestion maxed out all 4 machines for over a week (and is still not complete).

Ingestion scripts and Cassandra services would occasionally fail during ingestion and had to be restarted manually.

1TB Disk Utilization on each node

12 of 19

Machine Utilization

After a day of successful ingestions, it became apparent that the machines needed to be upgraded to increase speed.

Bumped each node up to quad core and 32GB RAM.

Usage charts for wiki1 (during ingestion and during querying) are shown here:

13 of 19

Query Structure

  • Queries were performed through Spark
  • Spark RDDs proved error-prone and difficult to debug
  • Spark DataFrames: Dataset API and Spark SQL
    • Intuitive
    • Joins are straightforward
    • Optimized execution engine

Example Query:

SELECT Agg.language, Agg.total + Temp1.viewCount AS total
FROM aggTable AS Agg
JOIN temp_table AS Temp1
  ON Agg.language = Temp1.language
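
A sketch of how a query like this could be run from PySpark 1.6, assuming the Cassandra tables are exposed as DataFrames through the Spark-Cassandra connector's DataFrame source; keyspace, table, and column names follow the example above and are otherwise illustrative:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf() \
        .setAppName("wiki-query") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    def cassandra_df(table, keyspace="wiki"):
        # DataFrame source provided by the Spark-Cassandra connector.
        return sqlContext.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

    cassandra_df("aggtable").registerTempTable("aggTable")
    cassandra_df("temp_table").registerTempTable("temp_table")

    result = sqlContext.sql("""
        SELECT Agg.language, Agg.total + Temp1.viewCount AS total
        FROM aggTable AS Agg
        JOIN temp_table AS Temp1
          ON Agg.language = Temp1.language
    """)
    result.show()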

14 of 19

Query Structure 1: Example 1

Tableau Chart or other

https://en.wikipedia.org/wiki/List_of_Wikipedias

15 of 19

Query Structure 2: Example 1

Tableau Chart or other

16 of 19

Query Structure 3: Example 1

Tableau Chart or other

17 of 19

Query Structure 3: Example 2

Tableau Chart or other

18 of 19

Query Structure 4: Example 1

19 of 19

Challenges & Learnings

CQL

  • Doesn't support JOIN, GROUP BY, foreign keys, ORDER BY, etc.
  • Had to rely on Spark for all of the data processing
  • CQL was used primarily for SELECT statements and to verify that tables had been written to Cassandra (see the sketch below)
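
For example, a quick verification like the following (shown through the Python cassandra-driver rather than cqlsh, to keep the examples in one language; keyspace, table, and column names are illustrative):

    from cassandra.cluster import Cluster

    session = Cluster(["wiki1"]).connect("wiki")

    # Spot-check that rows made it into the table; CQL is fine for simple SELECTs.
    for row in session.execute("SELECT language, page, views FROM daily_views LIMIT 5"):
        print(row.language, row.page, row.views)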

Spark

  • Spark queries take a long time to execute
  • Creating and working with RDDs from Cassandra tables (sketched below)
    • Map and Reduce
    • Union of RDDs
    • Writing back to Cassandra
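
A sketch of that RDD workflow using pyspark-cassandra; the table and column names are illustrative, and the dict-style row access is an assumption about the connector's Row type rather than a documented guarantee:

    from pyspark import SparkConf
    from pyspark_cassandra import CassandraSparkContext

    conf = SparkConf() \
        .setAppName("wiki-rdd-query") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = CassandraSparkContext(conf=conf)

    # Read two (illustrative) tables back as RDDs.
    may_views = sc.cassandraTable("wiki", "views_2015_05")
    jun_views = sc.cassandraTable("wiki", "views_2015_06")

    # Union the RDDs, then map/reduce to total views per language.
    totals = may_views.union(jun_views) \
        .map(lambda row: (row["language"], row["views"])) \
        .reduceByKey(lambda a, b: a + b)

    # Write the aggregate back to Cassandra as dict rows.
    totals.map(lambda kv: {"language": kv[0], "total": kv[1]}) \
        .saveToCassandra("wiki", "aggtable")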

Queries

  • Struggled with the pyspark-cassandra connection (the Table -> RDD -> Map/Reduce -> Table pipeline didn't work)
  • Ultimately used both Spark SQL and Spark DataFrame syntax in combination
  • Very challenging to execute queries reliably

Data Visualization

  • For big data analysis, Tableau cannot be connected directly to the Cassandra DB
  • A Spark-Cassandra connector would have to be used
  • SQLite has no remote connectivity drivers
  • Postgres has many port and version compatibility issues
  • Ultimately used Excel, since the query results were small