1 of 37

JSON sucks, let’s talk binary!

David Bruant

bdxio - October 16th 2015 - ENSEIRB-MATMECA

2 of 37

Overview

  1. (too long) Introduction
    1. “JSON sucks”
    2. Guess my number between 1 and 8
  2. Representing information efficiently
    • Motivations
    • Why does JSON suck?
  3. (aside) The code we write and the wire

4) Protocol Buffer (protobuf)

    • Introduction, benefits
    • Protobuf tricks
    • Example
    • Limitations
    • Protobuf on the web

5) Other equivalent protocols

6) With infinite resources...

3 of 37

(too long) Introduction

4 of 37

Two devs walk into a bar...

  • Jean-Marc Leroux
    • https://twitter.com/promethe42
    • Gave a talk here (ENSEIRB) for TTFX on Minko
    • At the bar afterwards “JSON sucks” �(maybe not an actual quote)
    • Told me about Protocol Buffers
    • (thanks again :-) )

5 of 37

In this very classroom (almost)

  • Philippe Duchon
    • Professor here (ENSEIRB)
    • Information theory class
    • Started the class with a riddle:�I’m thinking of a number between 1 and 8.�What is the minimum number of independent yes/no questions �you can ask to be sure of the number I'm thinking of?

6 of 37

Riddle

… so? how many?

7 of 37

(this slide left blank intentionally)

(hat tip Janet Switcher for this joke)

8 of 37

Riddle - answer

  • 3 questions
  • Finding a number between 1 and 8 (or 0 and 7) requires 3 bits of information (or bits of entropy)

9 of 37

Riddle 2 - biaised choice

  • 5 can be picked with probabilty 93%, the seven other with 1% (and not 12.5%)

  • (1*1*93+7*4*1)/100 = 1,21 questions

10 of 37

Representing information efficiently

11 of 37

Motivation

PERFORMANCE

12 of 37

Motivation

  • Data is expensive (in perf and $$$) on mobile
  • CPU costs (encoding/decoding) is much cheaper than transfer costs

13 of 37

Why does JSON suck

My position

  • JSON
    • {"lng":-0.5805,"lat":44.84044}
    • 30 bytes
    • Explicit, inefficient/wasteful representation of numbers
  • Binary
    • 0xa6 0x9b 0x14 0xbf 0x9c 0x5c 0x33 0x42
    • 8 bytes (4 for lng, then 4 for lat)
    • Implicit, very compact

14 of 37

(aside) The code we write and the wire

15 of 37

The code and the wire

01001001001101010101111010101

Machine 1

Machine 2

JSON.stringify(obj) => send

receive => JSON.parse(str)

encode(obj) => send

receive => decode(buf)

(30 bytes)

(8 bytes)

16 of 37

Protocol Buffer

17 of 37

Protocol buffer - Origins

  • Invented by Google for microservices
  • Machine written in different languages (C, Python, Java), they need to communicate with something more efficient than JSON
  • Forward and backward compatbility so machines can be updated separately

18 of 37

Protocol buffer - Because binary protocols are hard

https://en.wikipedia.org/wiki/IPv4#Header

19 of 37

Protocol buffer - Introduction

// message description

message measure {

float lng = 1;

float lat = 2;

repeated int32 values = 4 [packed=true];

}

function measureEncode(obj){

var b = new Buffer(...)

// read obj, write in b

return b;

}

function measureDecode(b){

var obj = {};

// read b, write in obj

return obj;

}

(protobuf wire protocol)

20 of 37

Protocol Buffer - Introduction

  • ProtocolBuffer(S: schema) : {� encode(obj<S>) : buffer� decode(buffer) : obj<S>�}
  • The Protocol Buffer wire protocol is defined byte by byte by the Protocol Buffer spec.

21 of 37

Protocol buffer tricks - Binary 101

  • 1 bit : 0 or 1
  • 1 byte (octet) : 01001010 (0x8a)
  • hexadecimal notation
  • a | b (binary or)
    • 0011 |�0110�------�0111
  • a & b (binary and)
    • 0011 & �0110�------�0010
  • a << n (shift)
    • 1101 << 3 = 1101000
    • (~ *2^n)

0 => 0

...

9 => 9

a => 10

b => 11

c => 12

d => 13

e => 14

f => 15

22 of 37

Protocol buffer tricks - varint

00000001

0000001

1

10101100 00000010

0000010 0101100

300

10101100 10000000 00010001

0010001 0000000 0101100

278572

  • Until first bit = 0, read next byte. Accumulate all last 7 bits in reverse order
  • Protobuf loves small integers!

23 of 37

Protocol buffer tricks - encoding a list

  • List size, then list elements
  • (or one special character denoting the end of the sequence like \0 in C, but this caused enough security issues)

24 of 37

Protocol buffer tricks - types

Type

Meaning

0

varint

1

64-bit

2

Length-delimited

...

(other types)

https://developers.google.com/protocol-buffers/docs/encoding#structure

First byte of a field is

(field_number << 3) | wire_type

25 of 37

Protocol buffer tricks - example

message Test4 {� repeated int32 d = 4 [packed=true];�}

encodeTest4({d: [3, 270, 86942]}) :

22 // tag (field number 4, wire type 2)�06 // payload size (6 bytes)�03 // first element (varint 3)�8E 02 // second element (varint 270)�9E A7 05 // third element (varint 86942)

https://developers.google.com/protocol-buffers/docs/encoding#packed-repeated-fields

26 of 37

Protocol buffer - limitations

  • Field number on the wire (backward and forward compat useless for the web)
  • Need to parse the entire buffer before being able to extract data (like JSON)
  • Need to pre-process data beforehand
  • Useless for big strings

27 of 37

Protocol buffer - limitations - date

  • Need to decide accuracy (ms, s, min, h ? etc.)
  • Need to decide the start date (to make the integer small)
  • … anyway, pre-processing

28 of 37

Protocol buffer - limitations - delta encoding

  • Problem : array of lots of small integers that are close in value. Order doesn’t matter.
  • Solution
    • Sort the array
    • Encode the first value, then only the delta to the previous value in 4 bits.
    • If delta is > 15, say 0b1111 and next byte is a full value
    • Put in resulting buffer “bytes” field
  • This encoding beat JSON+gzip in our use case
  • https://github.com/anthill/pheromon-codecs/blob/master/signalStrengths/delta-encode.js

29 of 37

Protobuf on the web

  • https://github.com/mafintosh/protocol-buffers
  • For both Node.js and web. It works!
  • Use protobuf (structure) + gzip (repetitive patterns)
  • Super important for HTTP POST (no built-in compression!)

30 of 37

Other equivalent ideas

  • Thrift
  • Avro
    • Send the schema with the message
  • Cap’n’proto
    • no varint (can read a field without parsing the whole message)
  • FlatBuffer (Google)
    • No JS implementations !?
  • (big pipe)

31 of 37

With infinite resources...

32 of 37

Making protobuf easier/better

  • Writing schemas is boring
    • NEED a function (any language) that takes a JSON object (or several) and outputs the schema
  • In Node.js, the package uses Buffer
    • NEED to use TypedArray so it works better in both web and Node
  • In Node.js, the package generates both encode/decode functions from the scheme. Only need one on either side, though
    • NEED code generation (for client-side code size purposes)

33 of 37

Can we do better than protobuf and the others?

// message description

message measure {

float lng = 1;

float lat = 2;

repeated int32 values = 4 [packed=true];

}

function measureEncode(obj){

var b = new Buffer(...)

// read obj, write in b

return b;

}

function measureDecode(b){

var obj = {};

// read b, write in obj

return obj;

}

(protobuf wire protocol)

34 of 37

Can we do better than protobuf and the others?

Constraints

- backward compat

- forward compat

- lazy random read

- ....

function encode(obj){

var b = new Buffer(...)

// read obj, write in b

return b;

}

function decode(b){

var obj = {};

// read b, write in obj

return obj;

}

create wire protocol

Message description

- language (string)

- Date

- some encoding (like delta-encoding, zip)

- ….

Nobody care about this!

35 of 37

Announcing Protocol Bruant

36 of 37

… no, just kidding :-)

37 of 37

Thanks!