Column-family Stores, Cassandra
Lecture 7 of NoSQL Databases (PA195)�
David Novak, FI, Masaryk University, Brno
Agenda
Column-family Stores: Basics
Agenda
Data Model: Column
value: "Martin",
timestamp: 12345667890 }
Data Model: Row
// row
"martin-fowler" : {
firstName: "Martin",
lastName: "Fowler",
location: "Boston"
}
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Data Model: Column Family
user_id (row key) | column key | column key | ... |
column value | column value | ... | |
1 | login | first_name | ... |
honza | Jan | ... | |
4 | login | age | ... |
david | 35 | ... | |
5 | first_name | last_name | ... |
Irena | Holubová | ... | |
... | | | |
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Data Model: Column Family (2)
// row (cols from the same CF)
"martin-fowler" : {
firstName: "Martin",
lastName: "Fowler",
location: "Boston",
activ: "true" }
}
{ // row (columns from a CF)
"pramod-sadalage" : {
firstName: "Pramod",
lastName: "Sadalage",
lastVisit: "2012/12/12"
}�
source: Sadalage & Fowler: NoSQL Distilled, 2012
Data Model: Super Column Family
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Super Column Family: Example
user_id (row key) | super column key | super column key | ... | ||||
subcolumn key | subcolumn key | ... | subcolumn key | subcolumn key | ... | ... | |
subcolumn value | subcolumn value | ... | subcolumn value | subcolumn value | ... | | |
1 | home_address | work_address | | ||||
city | street | ... | city | street | ... | | |
Brno | Krásná 5 | ... | Praha | Pracovní 13 | ... | | |
4 | home_address | temporary_address | | ||||
city | street | ... | city | PSČ | | | |
Plzeň | sady Pětatřicátníků 35 | ... | Praha | 111 00 | | | |
... | | | | | | | |
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Super Column Family: Example (2)
{ // row
”Cath”: {
”username”: { ”firstname”: ”Cath”, ”lastname”: ”Yoon” },
”address”: { ”city”: ”Seoul”, ”postcode”: ”1234” }
}
// row
”Terry”: {
”username”: { ”firstname”: ”Terry”, ”lastname”: ”Cho” },
”account”: { ”bank”: ”Hana”, ”accounted”: 1234 },
”preferences”: { ”color”: ”blue”, ”style”: ”simple” }
}
}
Data Model: Interpretation 1
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Data Model: Interpretation 2
Map<RowKey, Map<ColumnKey, ColumnValue>>�
Map<RowKey, Map<SuperColumnKey, � Map<ColumnKey, ColumnValue>>>�
Example: Visualization
source: Sadalage & Fowler: NoSQL Distilled, 2012
Column Family Stores
Agenda
Column Family Stores: Features
Representatives
Ranked list: http://db-engines.com/en/ranking/wide+column+store
BigTable
“A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map. “
(row:string, column:string, time:int64) → string
http://en.wikipedia.org/wiki/BigTable
BigTable: Example
”com.ccn.www”
column family
row key row
column column column column column
“contents:html” “param:lang” “param:enc” “a:cnnsi.com” “a:ihned.cz”
column names
<html>...
<html>...
t2
t6
t8
<html>...
EN
UTF-8
CNN.com
CNN
t2
t2
t3
t7
HBase
“Open source, non-relational, distributed database modeled
after Google's BigTable. “
“If you have hundreds of millions or billions of rows, then HBase is a good candidate. “
http://hbase.apache.org/
Cassandra
http://cassandra.apache.org/
Agenda
Cassandra 1.0: Data Model
http://cassandra.apache.org/
Cassandra 1.0: Data Model (2)
static
dynamic
Cassandra 1.0: Column Families
CREATE COLUMNFAMILY Fish (key blob PRIMARY KEY);
CREATE COLUMNFAMILY FastFoodEatings (user text PRIMARY KEY)
WITH comparator=timestamp AND default_validation=int;
CREATE COLUMNFAMILY MonkeyTypes (
key uuid PRIMARY KEY,
species text, alias text,
population varint
) WITH comment='Important biological records'
AND read_repair_chance = 1.0;
Cassandra 1.0: Column Families (2)
Cassandra 1.0: Data Manipulation
create column family users
with key_validation_class = Int32Type
and comparator = UTF8Type
and default_validation_class = UTF8Type;
// set column values in row with key 7
set users[7]['login'] = utf8('honza');
set users[7]['name'] = utf8('Jan Novák');
set users[7]['email'] = utf8('jan@novak.name');
set users[13]['login'] = utf8('fantomas');
set users[13]['name'] = utf8('incognito');
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Cassandra 1.0: Data Manipulation (2)
get users[7]['login'];
=> (name=login, value=honza, timestamp=1429268223462000)
get users[13];
=> (name=login, value=fantomas, timestamp=1429268224554000)
=> (name=name, value=incognito, timestamp=1429268224555000)
list users;
RowKey: 7
=> (name=email, value=jan@novak.name, timestamp=14292682...)
=> (name=login, value=honza, timestamp=1429268223462000)
=> (name=name, value=Jan Novák, timestamp=1429268223471000)
-------------------
RowKey: 13
=> (name=login, value=fantomas, timestamp=1429268224554000)
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Cassandra: Sparse Tables
http://www.datastax.com/documentation/cql/3.1/
Working with Tables
CREATE TABLE users (
user_id int PRIMARY KEY,
login text,
name text,
email text );
INSERT INTO users (user_id, login, name)
VALUES (3, 'honza', 'Jan Novák');
SELECT * FROM users;
user_id | email | login | name
---------+-------+-------+-----------
3 | null | honza | Jan Novák
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Tables: Dynamic Columns
Tables: Dynamic Columns (2)
CREATE TABLE users (
login text PRIMARY KEY,
name text,
emails set<text>, // column of type “set”
profile map<text, text> // column of type “map”
)
INSERT INTO users (login, name, emails, profile)
VALUES ( 'honza', 'Jan Novák', { 'honza@novak.cz' }, � { 'colorschema': 'green', 'design': 'simple' }
);
UPDATE users
SET emails = emails + { 'jn@firma.cz' }
WHERE login = 'honza';
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Dynamic Columns: Another Way
CREATE TABLE mytable (
row_id int, column_name text, column_value text,
PRIMARY KEY (row_id, column_name)
);�
INSERT INTO mytable (row_id, column_name, column_value) VALUES ( 3, 'login', 'honza' );
INSERT INTO mytable (row_id, column_name, column_value) VALUES ( 3, 'name', 'Jan Novák');
INSERT INTO mytable (row_id, column_name, column_value) VALUES ( 3, 'email', 'honza@novak.cz');
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Cassandra: Working with Data
Data Sharding in Columnar Systems
System | Terminology |
BigTable | tablets |
HBase | regions |
Cassandra | partitions |
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
Data Sharding in Cassandra
CREATE TABLE tab ( a int, b text, c text, d text,
PRIMARY KEY ( a, b, c)
);
CREATE TABLE tab ( a int, b text, c text, d text,
PRIMARY KEY ( (a, b), c)
);
Data Sharding in Cassandra (2)
CREATE TABLE mytable (
row_id int, column_name text, column_value text,
PRIMARY KEY (row_id, column_name) );
Data Replication
BigTable: Architecture
HBase: Architecture
http://bigdatariding.blogspot.cz/2013/10/hbase-overview-of-architecture-and-data.html
Local Persistence
Local Persistence
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
write op.
read op.
main memory
disk
Write Ahead Log
SSTable files
Compaction, Consolidation
I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
main memory
disk
SSTable files
merge the data
delete records
files + free disk space
Cassandra: Querying
Cassandra Query Language (CQL)
SELECT <selectExpr>
FROM [<keyspace>.]<table>
[WHERE <clause>]
[ORDER BY <clustering_colname> [DESC]]
[LIMIT m];
SELECT column_name, column_value
FROM mytable
WHERE row_id=3
ORDER BY column_value;
CQL: Limitations on “Where” Part
... WHERE row_id IN (3, 4, 5)
... WHERE row_id=3 AND column_name='login'
SELECT * FROM mytable � WHERE column_name IN ('login', 'name') ALLOW FILTERING;
CREATE TABLE mytable (
row_id int,
column_name text,
column_value text,
PRIMARY KEY
(row_id, column_name)
);
CQL: Limitations on “Where” Part (1)
SELECT login FROM users
WHERE emails CONTAINS 'jn@firma.cz';
SELECT * FROM users
WHERE profile CONTAINS KEY 'colorschema';
Indexes
CREATE INDEX ON users (emails);
Queries: HBase and Cassandra
http://www.infoworld.com/article/2610656/database/big-data-showdown--cassandra-vs--hbase.html
Transactions
INSERT INTO users (login, name, emails) �VALUES ('honza', 'Jan Novák', { 'honza@novak.cz' })
IF NOT EXISTS;��UPDATE mytable SET column_value = 'honza@firma.cz'
WHERE row_id = 3 AND column_name = 'email'
IF column_value = 'honza@firm.cz';
Summary
Questions?
Please, ANY questions?�
References