Faster conclusions using in-memory columnar SQL and machine learning
Hortonworks – May 3, 2016
Apache Arrow
DREMIO
Who
Wes McKinney
Jacques Nadeau
Arrow in a Slide
Calcite | Cassandra | Deeplearning4j | Drill | Hadoop | HBase | Ibis | Impala | Kudu | Pandas | Parquet | Phoenix | Spark | Storm | R
Agenda
Purpose
Overview
Focus on CPU Efficiency
Traditional Memory Buffer
Arrow Memory Buffer
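As a rough illustration of the contrast between the two buffers, the sketch below stores the same three rows once in a row-oriented buffer and once in per-column buffers. The table, its `id` and `score` fields, and the function names are hypothetical and chosen only for this example; this is not Arrow's actual layout or API.

```python
import struct
from array import array

# Hypothetical table with two 32-bit integer columns: id and score.
rows = [(1, 10), (2, 20), (3, 30)]

# Row-oriented buffer: the values of one row sit next to each other
# (id, score, id, score, ...).
row_buffer = b"".join(struct.pack("<ii", rid, score) for rid, score in rows)

# Column-oriented buffers: all values of one column sit next to each other.
id_column = array("i", (rid for rid, _ in rows))
score_column = array("i", (score for _, score in rows))


def sum_scores_row_oriented(buf: bytes) -> int:
    """Sum the score field; every id is decoded and skipped along the way."""
    return sum(score for _, score in struct.iter_unpack("<ii", buf))


def sum_scores_columnar(column: array) -> int:
    """Sum the score field by scanning one contiguous run of values."""
    return sum(column)


print(sum_scores_row_oriented(row_buffer), sum_scores_columnar(score_column))
```

Both functions return the same total. The columnar version reads only the bytes it needs, which is the access pattern that caches and SIMD instructions favor.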
High Performance Sharing & Interchange
Today
With Arrow
Shared Need > Open Source Opportunity
“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” - Impala Team
“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” - Spark Team
In Memory Representation
Columnar data
persons = [{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}, {
  name: 'joe',
  iq: 100,
  addresses: [
    {number: 4, street: 'ccc'},
    {number: 5, street: 'dddd'},
    {number: 2, street: 'f'}
  ]
}]
Simple Example: persons.iq
Simple Example: persons.addresses.number
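The two examples can be sketched in plain Python, assuming the `persons` data from the "Columnar data" slide. `persons.iq` becomes one flat array of values. `persons.addresses.number` becomes a flat array of values plus a list-offset array that records where each person's addresses begin and end. The function names are hypothetical, and validity bitmaps are left out for brevity.

```python
persons = [
    {"name": "wes", "iq": 180,
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "joe", "iq": 100,
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 2, "street": "f"}]},
]


def flatten_iq(people):
    """persons.iq: one value per person, stored back to back."""
    return [p["iq"] for p in people]


def flatten_address_numbers(people):
    """persons.addresses.number: flat values plus list offsets.

    offsets[i] and offsets[i + 1] delimit the values of person i.
    """
    offsets = [0]
    values = []
    for p in people:
        for addr in p["addresses"]:
            values.append(addr["number"])
        offsets.append(len(values))
    return offsets, values


iq_column = flatten_iq(persons)
list_offsets, number_values = flatten_address_numbers(persons)


def address_numbers_of(i):
    """Recover the address numbers of person i from the flat buffers."""
    return number_values[list_offsets[i]:list_offsets[i + 1]]


print(iq_column)          # [180, 100]
print(list_offsets)       # [0, 2, 5]
print(number_values)      # [2, 3, 4, 5, 2]
```

Reading one person's nested values is then a slice between two offsets, with no pointer chasing through per-row objects.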
Columnar data
Language Bindings
Java: Creating Dynamic Off-heap Structures
Programmatic Construction
// Simplified slide code: obtain a writer for the record.
FieldWriter w = getWriter();
// Scalar fields.
w.varChar("name").write("wes");
w.integer("iq").writeInt(180);
// Nested field: a list of maps.
ListWriter list = w.list("addresses");
list.startList();
MapWriter map = list.map();
// First address.
map.start();
map.integer("number").writeInt(2);
map.varChar("street").write("a");
map.end();
// Second address.
map.start();
map.integer("number").writeInt(3);
map.varChar("street").write("bb");
map.end();
list.endList();
JSON Representation
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Java: Memory Management (& NVMe)
RPC & IPC
Common Message Pattern
Schema Negotiation
Dictionary Batch (0..N Batches)
Record Batch (1..N Batches)
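The message pattern can be read as a simple ordering rule. The sketch below assumes that "0..N" refers to dictionary batches and "1..N" to record batches, and that all dictionary batches arrive before the first record batch. `MessageKind` and `is_valid_stream` are hypothetical names for this illustration, not part of any Arrow library.

```python
from enum import Enum


class MessageKind(Enum):
    SCHEMA = "schema"
    DICTIONARY_BATCH = "dictionary_batch"
    RECORD_BATCH = "record_batch"


def is_valid_stream(kinds):
    """Check: one schema, then 0..N dictionary batches, then 1..N record batches."""
    kinds = list(kinds)
    # The stream must open with schema negotiation.
    if not kinds or kinds[0] is not MessageKind.SCHEMA:
        return False
    # Skip the optional run of dictionary batches.
    i = 1
    while i < len(kinds) and kinds[i] is MessageKind.DICTIONARY_BATCH:
        i += 1
    # Everything that remains must be record batches, and at least one.
    remaining = kinds[i:]
    return len(remaining) >= 1 and all(
        k is MessageKind.RECORD_BATCH for k in remaining
    )


S, D, R = (MessageKind.SCHEMA, MessageKind.DICTIONARY_BATCH,
           MessageKind.RECORD_BATCH)
print(is_valid_stream([S, D, R, R, R]))  # True
print(is_valid_stream([S, R]))           # True
print(is_valid_stream([S, D]))           # False: no record batch
```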
Record Batch Construction
Schema Negotiation
Dictionary Batch
Record Batch
Record Batch
Record Batch
data header (describes offsets into data)
name (bitmap)
name (offset)
name (data)
iq (bitmap)
iq (data)
addresses (bitmap)
addresses (list offset)
addresses.number (bitmap)
addresses.number
addresses.street (bitmap)
addresses.street (offset)
addresses.street (data)
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Each box is contiguous memory, entirely contiguous on wire
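A minimal sketch of how the buffers named on this slide could be built for the single 'wes' record. It assumes little-endian 32-bit integers and offsets and one validity bit per value, and it omits the alignment and padding a real implementation would add. The helper names and the `(label, offset, length)` header tuples are hypothetical; this illustrates the idea rather than Arrow's exact binary format.

```python
import struct


def validity_bitmap(values):
    """Pack one bit per value (1 = present), least-significant bit first."""
    bits = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def int32_buffer(values):
    """Pack integers as little-endian 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


def string_buffers(strings):
    """Return (offset buffer, data buffer) for variable-length strings."""
    offsets, data = [0], bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return int32_buffer(offsets), bytes(data)


# The single record from the slide, already split into columns.
names, iqs = ["wes"], [180]
address_lists = [[(2, "a"), (3, "bb")]]

# Flatten the nested list: child values plus list offsets.
list_offsets, numbers, streets = [0], [], []
for addrs in address_lists:
    for number, street in addrs:
        numbers.append(number)
        streets.append(street)
    list_offsets.append(len(numbers))

name_offsets, name_data = string_buffers(names)
street_offsets, street_data = string_buffers(streets)

buffers = [
    ("name (bitmap)", validity_bitmap(names)),
    ("name (offset)", name_offsets),
    ("name (data)", name_data),
    ("iq (bitmap)", validity_bitmap(iqs)),
    ("iq (data)", int32_buffer(iqs)),
    ("addresses (bitmap)", validity_bitmap(address_lists)),
    ("addresses (list offset)", int32_buffer(list_offsets)),
    ("addresses.number (bitmap)", validity_bitmap(numbers)),
    ("addresses.number", int32_buffer(numbers)),
    ("addresses.street (bitmap)", validity_bitmap(streets)),
    ("addresses.street (offset)", street_offsets),
    ("addresses.street (data)", street_data),
]

# The body is one contiguous run of bytes; the header records where
# each buffer starts and how long it is.
body = b"".join(buf for _, buf in buffers)
header, pos = [], 0
for label, buf in buffers:
    header.append((label, pos, len(buf)))
    pos += len(buf)
```

Because the header only stores offsets and lengths into one contiguous body, a receiver can locate any buffer without copying or parsing the others.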
RPC & IPC: Moving Data Between Systems
RPC
IPC
Real World Examples
Real World Example: Python With Spark or Drill
Real World Example: Feather File Format for Python and R
R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
More on Feather
Feather File: array 0 | array 1 | array 2 | ... | array n - 1 | METADATA
libfeather (C++ library)
Rcpp: R data.frame
Cython: pandas DataFrame
Feather: the good and not-so-good
What’s Next
Get Involved