1 of 19

Wikipedia Page Views

W251 Final Project

Jordan, Dave, Matt, Utthaman

2 of 19

Overview & Goals

Dataset

  • 2½ years' worth of hourly Wikipedia page view data
  • ~3.5TB of uncompressed data
  • Roughly 33 billion rows of daily data, aggregated by page name & language

Goals

  • Aggregate page view data by day
  • Store it in a NoSQL Cassandra cluster
  • Use Spark to query the data and find interesting results
  • Visualize the results in Tableau

3 of 19

Raw Data

Raw File Structure:

  • Space delimited
  • First column is the language
  • Second column is the article title
  • Third column is the number of page views in the hour
  • Fourth column is a total response size field, which is always 0 in these dumps

Sample file:

en Barabási–Albert_model 2 0
en Barachiel 5 0
en Barachois,_Quebec 1 0
en Barachois_(band) 1 0
en Barachois_Pond_Provincial_Park 1 0
en Barack_(disambiguation) 1 0
en Barack_Obama 296 0
en Barack_Obama's_farewell_address 1 0
en Barack_Obama,_Sr 1 0
en Barack_Obama,_Sr. 5 0
en Barack_Obama_"Hope"_poster 10 0
en Barack_Obama_"Joker"_poster 1 0
en Barack_Obama_Academy_of_International_Studies_6-12 3 0
en Barack_Obama_Democratic_Club_of_Upper_Manhattan 1 0
en Barack_Obama_Presidential_Center 18 0
en Barack_Obama_Sr 4 0
en Barack_Obama_Sr. 22 0
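
Each line splits on whitespace into its fields. A minimal parsing sketch in Python (the function name and the decision to drop the trailing size column are our own, not part of any project script):

    # Parse one line of a raw hourly pageviews file.
    # Fields: language code, article title (underscores instead of spaces),
    # hourly view count, and a trailing size column that is always 0.
    def parse_line(line):
        parts = line.split()
        if len(parts) < 3:
            return None  # skip malformed lines
        return parts[0], parts[1], int(parts[2])

    # parse_line("en Barack_Obama 296 0") -> ("en", "Barack_Obama", 296)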

4 of 19

Cloud Architecture

Main Cluster

4 Virtual Machines

wiki1, wiki2, wiki3, wiki4

Initially, all were set up as:

  • Dual core, 8GB RAM, 25GB & 1000GB disks

Later upgraded to help with ingestion:

  • Quad core, 32GB RAM, 25GB & 1000GB disks

Connected via SSH & private IPs on the same subnet

1TB disks formatted for Cassandra use

Used as the primary servers for the Cassandra and Spark clusters, and where the querying scripts were run

Auxiliary Machines

3 Virtual Machines

wikistorage1, wikistorage2, wikistorage3

Set up as:

  • Dual core, 8GB RAM, 25GB & 1000GB disks

Used as staging nodes for the raw data files

Handled the preprocessing workload and the transfer to the main cluster

5 of 19

Technologies Used

Spark

Spark 1.6 Used

View Cluster:

http://198.23.108.53:8080/

Cassandra

Cassandra 2.2 Used

4 Node Setup (wiki1, wiki2, wiki3, wiki4)

Replication Factor = 2

PySpark - Cassandra Connector

https://github.com/TargetHolding/pyspark-cassandra

Cassandra-Loader

https://github.com/brianmhess/cassandra-loader

Microsoft Excel

Data Visualization

Other Trialed Software

  • Tableau
  • R
  • SQLite
  • Postgres

Full Cluster Setup Instructions

On GitHub (requires access)

https://github.com/jordankupersmith/w251_final_project/cluster_setup_and_node_information.md

6 of 19

Database Structure

7 of 19

Preprocessing

Preprocessing consists of building a single daily file that aggregates the counts from each hourly file.

Doing this with a Python dictionary used roughly 5.5GB of RAM and took about 13 minutes per 24 data files (1 day of pageview data).

Four servers were provisioned, each with 8GB RAM and 2 CPUs, resulting in a total runtime of just under 48 hours.
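
A minimal sketch of that dictionary-based aggregation, assuming 24 gzipped hourly files per day; the paths, output format (date, language, page, views), and function name are illustrative rather than the actual preprocessing script:

    import gzip
    from collections import defaultdict

    def aggregate_day(hourly_paths, out_path, date_str):
        """Sum 24 hourly pageview files into one daily file."""
        daily = defaultdict(int)
        for path in hourly_paths:
            with gzip.open(path, "rt", errors="replace") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 3:
                        continue
                    daily[(parts[0], parts[1])] += int(parts[2])  # key: (language, page)
        with open(out_path, "w") as out:
            for (language, page), views in daily.items():
                out.write("%s %s %s %d\n" % (date_str, language, page, views))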

8 of 19

Preprocessed Files & Transfer to Cluster

Following preprocessing on the individual wikistorage nodes, the compressed files were transferred and distributed equally amongst the 4 main cluster nodes.

Files were then unzipped in batches of ~100.

After unzipping, a cassandra_loader.py script was run to ingest each file into the Cassandra cluster (replication factor = 2).

Files were moved to a /success folder following ingestion.

The /success folder was periodically cleared to free up disk space. (A sketch of this per-node loop follows below.)
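
A rough sketch of that per-node loop; the directory layout and the idea that cassandra_loader.py takes the file path as an argument are assumptions based on the description above:

    import glob
    import shutil
    import subprocess

    BATCH_SIZE = 100
    gz_files = sorted(glob.glob("/data/daily/*.gz"))  # illustrative staging path

    for i in range(0, len(gz_files), BATCH_SIZE):
        batch = gz_files[i:i + BATCH_SIZE]
        subprocess.check_call(["gunzip"] + batch)   # unzip the batch in place
        for gz in batch:
            daily_file = gz[:-3]                    # path without the ".gz" suffix
            # Ingest into the Cassandra cluster via the loader script.
            subprocess.check_call(["python", "cassandra_loader.py", daily_file])
            # Park ingested files in /success; the folder is cleared periodically.
            shutil.move(daily_file, "/success/")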

833 total daily files

Date range: May 2015 - July 2017

Each file:

  • Uncompressed size: ~1-1.5GB
  • Rows: ~25 million per file

9 of 19

Ingestion into Cassandra

Ingesting such a large amount of data into Cassandra came with many challenges and learnings.

Ingestion Attempt #1:

  • Use the Cassandra CQL shell and its native COPY command
  • Worked well for a single file (though ingestion was slow, ~30 mins/file)
  • When looped, ingestion would hang after the first file completed

Ingestion Attempt #2:

  • Use PySpark to load each file into memory as an RDD
  • Use the pyspark-cassandra connector package to save the RDD as a Cassandra table (sketched below)
  • Ran into challenges operationalizing this
  • Gave up on this approach
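
For reference, a sketch of what Attempt #2 looked like, following the pyspark-cassandra README; the keyspace, table, column names, and file path are illustrative:

    # Run via spark-submit with the pyspark-cassandra package on the classpath.
    from pyspark import SparkConf
    from pyspark_cassandra import CassandraSparkContext

    conf = SparkConf() \
        .setAppName("wiki-rdd-ingest") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = CassandraSparkContext(conf=conf)

    # Turn each preprocessed line (date, language, page, views) into a dict row.
    def to_row(line):
        view_date, language, page, views = line.split(" ")
        return {"view_date": view_date, "language": language,
                "page": page, "views": int(views)}

    rdd = sc.textFile("/data/daily/pageviews-20150501").map(to_row)

    # saveToCassandra(keyspace, table) is provided by pyspark-cassandra.
    rdd.saveToCassandra("wiki", "daily_views")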

Ingestion Attempt #3:

  • Use the cassandra-loader add-on
  • Wrote a Python script to loop through all raw uncompressed files on a single machine and pass each one to cassandra-loader (see the sketch below)
  • Worked well, though ingestion time per file was still 15-30 mins

Parallelized this across all 4 wiki nodes
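
The core of a cassandra_loader.py-style script might look like the sketch below; the -f, -host, -schema, and -delim flags follow the cassandra-loader README, while the keyspace, table, and column names are illustrative (the per-file loop itself was sketched on the previous slide):

    import subprocess
    import sys

    # Illustrative target table; columns must match the order in the daily files.
    SCHEMA = "wiki.daily_views(view_date, language, page, views)"

    def ingest(path, host="wiki1"):
        # cassandra-loader flags: -f input file, -host Cassandra contact point,
        # -schema target keyspace.table(columns), -delim field delimiter.
        subprocess.check_call([
            "cassandra-loader",
            "-f", path,
            "-host", host,
            "-schema", SCHEMA,
            "-delim", " ",
        ])

    if __name__ == "__main__":
        ingest(sys.argv[1])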

10 of 19

Cassandra Tables

Ingested Tables in Cassandra

Example Table Structure
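
The actual table definitions were presented as an image on this slide. As a stand-in, here is a hypothetical daily page view table created through the Python cassandra-driver; the keyspace name, column names, and key layout are assumptions rather than the project's real schema (only the replication factor of 2 comes from the deck):

    from cassandra.cluster import Cluster

    session = Cluster(["wiki1"]).connect()  # contact point is illustrative

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS wiki
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
    """)

    # Hypothetical structure: one row per (language, page) per day.
    session.execute("""
        CREATE TABLE IF NOT EXISTS wiki.daily_views (
            language  text,
            page      text,
            view_date text,
            views     bigint,
            PRIMARY KEY ((language, page), view_date)
        )
    """)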

11 of 19

Machine Reliability & Node Utilization

Node Name    Disk Use*
wiki1        81% (784GB)
wiki2        68% (663GB)
wiki3        60% (585GB)
wiki4        53% (516GB)

* as of Aug 22

Ingestion maxed out all 4 machines for over a week (and is still not complete).

Ingestion scripts and Cassandra services would occasionally fail during ingestion and had to be restarted manually.

1TB Disk Utilization on each node

12 of 19

Machine Utilization

After a day of successful ingestions, it became apparent that the machines needed to be upgraded to increase speed.

Bumped each node up to quad core and 32GB RAM.

Usage charts for wiki1 (during ingestion and during querying) are shown here:

13 of 19

Query Structure

  • Queries were performed through Spark
  • Spark RDDs proved error-prone and difficult to debug
  • Spark DataFrames: Dataset API and Spark SQL
    • Intuitive
    • Joins are straightforward
    • Optimized execution engine

Example Query:

SELECT Agg.language, Agg.total + Temp1.viewCount AS total
FROM aggTable AS Agg
JOIN temp_table AS Temp1
  ON Agg.language = Temp1.language
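
A sketch of how a query like this could be run from PySpark 1.6, assuming the Cassandra tables are exposed as DataFrames through the Spark-Cassandra connector's DataFrame source; keyspace, table, and column names follow the example above and are otherwise illustrative:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf() \
        .setAppName("wiki-query") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    def cassandra_df(table, keyspace="wiki"):
        # DataFrame source provided by the Spark-Cassandra connector.
        return sqlContext.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

    cassandra_df("aggtable").registerTempTable("aggTable")
    cassandra_df("temp_table").registerTempTable("temp_table")

    result = sqlContext.sql("""
        SELECT Agg.language, Agg.total + Temp1.viewCount AS total
        FROM aggTable AS Agg
        JOIN temp_table AS Temp1
          ON Agg.language = Temp1.language
    """)
    result.show()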

14 of 19

Query Structure 1: Example 1

Tableau Chart or other

https://en.wikipedia.org/wiki/List_of_Wikipedias

15 of 19

Query Structure 2: Example 1

Tableau Chart or other

16 of 19

Query Structure 3: Example 1

Tableau Chart or other

17 of 19

Query Structure 3: Example 2

Tableau Chart or other

18 of 19

Query Structure 4: Example 1

19 of 19

Challenges & Learnings

CQL

  • Doesn't support JOIN, GROUP BY, foreign keys, ORDER BY, etc.
  • Had to rely on Spark for all of the data processing
  • CQL was used primarily for SELECT statements and to verify that tables had been written to Cassandra (see the sketch below)
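
For example, a quick verification like the following (shown through the Python cassandra-driver rather than cqlsh, to keep the examples in one language; keyspace, table, and column names are illustrative):

    from cassandra.cluster import Cluster

    session = Cluster(["wiki1"]).connect("wiki")

    # Spot-check that rows made it into the table; CQL is fine for simple SELECTs.
    for row in session.execute("SELECT language, page, views FROM daily_views LIMIT 5"):
        print(row.language, row.page, row.views)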

Spark

  • Spark queries take a long time to execute
  • Creating and working with RDDs from Cassandra tables (sketched below)
    • Map and Reduce
    • Union of RDDs
    • Writing back to Cassandra
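
A sketch of that RDD workflow using pyspark-cassandra; the table and column names are illustrative, and the dict-style row access is an assumption about the connector's Row type rather than a documented guarantee:

    from pyspark import SparkConf
    from pyspark_cassandra import CassandraSparkContext

    conf = SparkConf() \
        .setAppName("wiki-rdd-query") \
        .set("spark.cassandra.connection.host", "wiki1")
    sc = CassandraSparkContext(conf=conf)

    # Read two (illustrative) tables back as RDDs.
    may_views = sc.cassandraTable("wiki", "views_2015_05")
    jun_views = sc.cassandraTable("wiki", "views_2015_06")

    # Union the RDDs, then map/reduce to total views per language.
    totals = may_views.union(jun_views) \
        .map(lambda row: (row["language"], row["views"])) \
        .reduceByKey(lambda a, b: a + b)

    # Write the aggregate back to Cassandra as dict rows.
    totals.map(lambda kv: {"language": kv[0], "total": kv[1]}) \
        .saveToCassandra("wiki", "aggtable")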

Queries

  • Struggled with the pyspark-cassandra connection (the Table -> RDD -> Map/Reduce -> Table pipeline didn't work)
  • Ultimately used both Spark SQL and Spark DataFrame syntax in combination
  • Very challenging to execute queries reliably

Data Visualization

  • For big data analysis, Tableau cannot be connected directly to the Cassandra DB
  • A Spark-Cassandra connector would have to be used
  • SQLite has no remote connectivity drivers
  • Postgres has many port and version compatibility issues
  • Ultimately used Excel, since the query results were small