Wikipedia Page Views
W251 Final Project
Jordan, Dave, Matt, Utthaman
Overview & Goals
Dataset
Goals
Raw Data
Raw File Structure:
en Barabási–Albert_model 2 0
en Barachiel 5 0
en Barachois,_Quebec 1 0
en Barachois_(band) 1 0
en Barachois_Pond_Provincial_Park 1 0
en Barack_(disambiguation) 1 0
en Barack_Obama 296 0
en Barack_Obama's_farewell_address 1 0
en Barack_Obama,_Sr 1 0
en Barack_Obama,_Sr. 5 0
en Barack_Obama_"Hope"_poster 10 0
en Barack_Obama_"Joker"_poster 1 0
en Barack_Obama_Academy_of_International_Studies_6-12 3 0
en Barack_Obama_Democratic_Club_of_Upper_Manhattan 1 0
en Barack_Obama_Presidential_Center 18 0
en Barack_Obama_Sr 4 0
en Barack_Obama_Sr. 22 0
Sample file:
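Each record is space-delimited: project code, page title, hourly view count, and a bytes field (shown as 0 in the sample above). A minimal parser for one record (field names are our own labels, not from the dump spec):

```python
def parse_line(line):
    """Parse one pageview record: '<project> <title> <count> <bytes>'."""
    # Titles use underscores instead of spaces, so a plain split is safe.
    project, title, count, size = line.strip().split(" ")
    return project, title, int(count), int(size)
```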
Cloud Architecture
Main Cluster
4 Virtual Machines
wiki1, wiki2, wiki3, wiki4
Initially, all were set up as:
Later upgraded to help with ingestion:
Connected via SSH & Private IPs on same subnet
1TB Disks formatted for Cassandra Use
Used as primary servers for the Cassandra and Spark clusters, and where the querying scripts were run
Auxiliary Machines
3 Virtual Machines
wikistorage1, wikistorage2, wikistorage3
Set up as:
Used as Staging Nodes for the Raw Data files
Included Preprocessing Workload + Transfer to Main Cluster
Technologies Used
Spark
Spark 1.6 Used
View Cluster:
Cassandra
Cassandra 2.2 Used
4 Node Setup (wiki1, wiki2, wiki3, wiki4)
Replication Factor = 2
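A keyspace matching this setup might be defined as follows (the keyspace name and SimpleStrategy are assumptions — the actual cluster may use NetworkTopologyStrategy):

```sql
-- hypothetical keyspace; replication factor 2 as noted above
CREATE KEYSPACE IF NOT EXISTS wiki
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
```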
PySpark - Cassandra Connector
https://github.com/TargetHolding/pyspark-cassandra
Cassandra-Loader
https://github.com/brianmhess/cassandra-loader
Microsoft Excel
Data Visualization
Other Trialed Software
Tableau
R
SQLite
Postgres
Full Cluster Setup Instructions
On GitHub (requires access)
https://github.com/jordankupersmith/w251_final_project/cluster_setup_and_node_information.md
Database Structure
Preprocessing
Preprocessing consists of building a single daily file that aggregates the counts from each hourly file.
Doing this with a Python dictionary used roughly 5.5GB of RAM and took about 13 minutes per 24 data files (1 day of pageview data).
Four servers were provisioned, each with 8GB RAM and 2 CPUs, resulting in a total runtime of just under 48 hours.
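The dictionary-based aggregation described above can be sketched as follows (function and variable names are our own; the real script also handled compressed input and file I/O):

```python
from collections import defaultdict

def aggregate_daily(hourly_files):
    """Sum view counts from 24 hourly files into one daily dict keyed by (project, title)."""
    daily = defaultdict(int)
    for lines in hourly_files:          # each element: an iterable of raw record lines
        for line in lines:
            project, title, count, _ = line.strip().split(" ")
            daily[(project, title)] += int(count)
    return daily
```

Keeping the whole day's counts in one in-memory dict is what drove the ~5.5GB RAM footprint noted above.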
Preprocessed Files & Transfer to Cluster
Following preprocessing on the individual wikistorage nodes, the compressed files were transferred and distributed equally among the 4 main cluster nodes
Files were then unzipped in batches of ~100
After unzipping, a cassandra_loader.py script was run to ingest each file into the Cassandra cluster (replication factor 2)
Files were moved to a /success folder following ingestion
The /success folder was periodically cleared to free up disk space
833 total daily files
Date Range:
May 2015 - July 2017
Each File
Uncompressed Size:
~1 - 1.5GB
# of Rows:
~ 25 million rows / file
Ingestion into Cassandra
Ingesting such a large amount of data into Cassandra came with many challenges & learnings
Ingestion Attempt #1:
Ingestion Attempt #2:
Ingestion Attempt #3:
Parallelized this across all 4 wiki nodes
Cassandra Tables
Ingested Tables in Cassandra
Example Table Structure
Machine Reliability & Node Utilization
Node Name | Disk Use |
wiki1 | 81% (784GB) |
wiki2 | 68% (663GB) |
wiki3 | 60% (585GB) |
wiki4 | 53% (516GB) |
* as of Aug 22
Ingestion maxed out all 4 machines for over a week (and is still not complete).
Ingestion scripts and Cassandra services would occasionally fail during ingestion.
These processes had to be restarted manually.
1TB Disk Utilization on each node
Machine Utilization
After a day of successful ingestion, it became apparent that the machines needed to be upgraded to increase speed
Bumped each node up to a quad-core CPU and 32GB RAM
Usage Charts for wiki1 shown here:
Ingestion
Querying
Query Structure
SELECT Agg.language, Agg.total + Temp1.viewCount AS total
FROM aggTable AS Agg
JOIN temp_table AS Temp1
  ON Agg.language = Temp1.language
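In dictionary form, the join above amounts to adding one day's per-language view counts into the running aggregate. A minimal sketch of that logic (note that, unlike a strict inner join, this keeps languages present on only one side):

```python
def update_totals(agg, daily):
    """Merge one day's per-language view counts into the running aggregate.

    Mirrors: SELECT Agg.language, Agg.total + Temp1.viewCount AS total ... JOIN ...
    """
    return {lang: agg.get(lang, 0) + daily.get(lang, 0)
            for lang in set(agg) | set(daily)}
```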
Example Query:
Query Structure 1: Example 1
Tableau Chart or other
https://en.wikipedia.org/wiki/List_of_Wikipedias
Query Structure 2: Example 1
Tableau Chart or other
Query Structure 3: Example 1
Tableau Chart or other
Query Structure 3: Example 2
Tableau Chart or other
Query Structure 4: Example 1
Challenges & Learnings
CQL
Spark
Queries
(Table -> RDD -> Map/Reduce -> Table didn't work)
Data Visualization