iPhone 5 website teardown: How Apple compresses video using JPEG, JSON, and <canvas>

by @dbloom (comments on Hacker News)

Apple announced the iPhone 5 last week. And what Apple product launch would be complete without a new web presence?

Every new Apple web presence does new things using the web platform. A prominent example is the iPhone 4 retina loupe. Other noticeable examples include the spinning battery life clocks and strange parallax scrolling effects for the new iPad (ugh).

It turns out that Apple's biggest new feature in the iPhone 5 website, however, is a bit further beneath the surface:

The "Design" page for iPhone 5 includes an auto-playing video of the device being unlocked. But there's no <video> element, just a <canvas>. And if you check the "Network" tab of the inspector, you won't find any video there either.

But you will find some strange JPEG images:

Why would Apple do something so absurd!? They co-created the H.264 standard and wrote Safari, so clearly they know how to do this the "easy way": use a <video> element.

But the <video> element just won't work here. <video> elements can only be played fullscreen on iPhone, which kind of ruins the desired effect of this inline video.

And on desktop computers, Apple's website needs to work in all major browsers, but Firefox and Opera don't support H.264 (and there's no way that Apple would be willing to offer a WebM or Theora fallback).

Apple used to solve this problem by sending a separate JPEG image for each frame of video and switching between them. You can see this in action on the Retina MacBook Pro "Features" page, which loads about 5MB of JPEG images (using lots of separate HTTP requests) just for that 2-second effect.

For iPhone 5, they came up with a new approach that doesn't require a separate image for each frame. The animation only takes 1MB and a few HTTP requests, a significant improvement over the old approach.

Here are the files that the video is encoded in:

The actual logic for presenting the video is in ac_flow.js, which I recommend perusing in the Chrome Web Inspector with "Pretty Print" turned on (click "{ }" in the lower left corner). I'm not going to walk through the code itself though, so feel free to just keep reading and not bother.

The video compression Apple is using works by only updating parts of the image that have changed since the previous frame. "unlock_001.jpg" and "unlock_002.jpg" contain the updated parts of the image. The "unlock_manifest.json" file specifies how the updated parts are positioned.

Here's a snippet of "unlock_manifest.json":

JPEG encodes images in 8x8 blocks, so Apple wisely uses this size in their compression as well (see "blockSize" in the JSON). This prevents artifacts caused by unrelated image fragments sharing the same block. (The JPEGs use 4:4:4 chroma sampling, so the chroma blocks are also 8x8.)
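To see what block alignment means in practice, here is a minimal sketch of how an encoder could grow a changed region outward until it sits on the 8x8 grid. This is an illustration only: the function name snapToBlockGrid and the rectangle shape are assumptions, and nothing here is taken from Apple's tooling.

```javascript
// Hypothetical encoder-side helper: grow a changed rectangle outward so that
// every edge lands on a multiple of blockSize (8 for these JPEGs).
function snapToBlockGrid(rect, blockSize = 8) {
  const left = Math.floor(rect.x / blockSize) * blockSize;
  const top = Math.floor(rect.y / blockSize) * blockSize;
  const right = Math.ceil((rect.x + rect.width) / blockSize) * blockSize;
  const bottom = Math.ceil((rect.y + rect.height) / blockSize) * blockSize;
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// A 10x4 pixel change starting at (13, 5) touches a 2x2 group of blocks.
const snapped = snapToBlockGrid({ x: 13, y: 5, width: 10, height: 4 });
console.log(snapped); // { x: 8, y: 0, width: 16, height: 16 }
```

Because every copied region is a whole number of blocks, no block in the JPEG ever has to hold pixels from two unrelated parts of the picture.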

"imagesRequired" indicates that there are two images that need to load for the animation to start. These images, "unlock_001.jpg" and "unlock_002.jpg", are basically treated as one continuous stream of 8x8 blocks, read left to right, then top to bottom. (When I say "JPEG stream", this is what I'm referring to.)
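Treating the images as a stream boils down to a bit of index arithmetic. The sketch below is not Apple's code; it assumes the images are concatenated in load order and that each one is a whole number of blocks wide and tall. The name locateBlock and the image sizes in the example are made up.

```javascript
// Map a block's index in the combined "JPEG stream" to an image and a pixel
// offset inside it. imageSizes lists each image's pixel size in load order.
function locateBlock(blockIndex, imageSizes, blockSize = 8) {
  let remaining = blockIndex;
  for (let i = 0; i < imageSizes.length; i++) {
    const columns = Math.floor(imageSizes[i].width / blockSize);
    const rows = Math.floor(imageSizes[i].height / blockSize);
    const blocksInImage = columns * rows;
    if (remaining < blocksInImage) {
      return {
        image: i,
        x: (remaining % columns) * blockSize,
        y: Math.floor(remaining / columns) * blockSize,
      };
    }
    remaining -= blocksInImage; // skip past this image entirely
  }
  throw new RangeError("block index is past the end of the stream");
}

// Hypothetical sizes: two 32x16 images hold 4 x 2 = 8 blocks each.
const sizes = [{ width: 32, height: 16 }, { width: 32, height: 16 }];
console.log(locateBlock(5, sizes)); // { image: 0, x: 8, y: 8 }
console.log(locateBlock(9, sizes)); // { image: 1, x: 8, y: 0 }
```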

The "frames" array looks like base64, but here's the kicker: it is never decoded into 8-bit bytes. Each base64 character is instead treated as a 6-bit value, so in effect the size of a "byte" in this format is 6 bits. Here's how Apple is partially decoding the base64:
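To make the idea concrete, here is a minimal sketch (an illustration, not Apple's code). It assumes the standard base64 alphabet and most-significant-digit-first ordering, both of which are consistent with the "AAxAC" example in this article. The name base64DigitsToInt is made up.

```javascript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Read a string of base64 characters as a single base-64 number,
// most significant digit first, so each character contributes 6 bits.
function base64DigitsToInt(digits) {
  let value = 0;
  for (const ch of digits) {
    const digit = BASE64_ALPHABET.indexOf(ch);
    if (digit === -1) throw new Error(`not a base64 character: ${ch}`);
    value = value * 64 + digit;
  }
  return value;
}

console.log(base64DigitsToInt("AAx")); // 49
console.log(base64DigitsToInt("AC")); // 2
```

Skipping the usual base64-to-bytes step means the script never has to deal with values that straddle byte boundaries: every field is simply a whole number of characters.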

Each frame is made up of multiple 5-byte instructions. The first 3 bytes of the instruction encode the location to paint the update on the target <canvas>. The last 2 bytes say how many blocks to read. For example, the first instruction of the first frame, "AAxAC", says to read 2 blocks ("AC"... remember, these are base64 bytes) from the JPEG stream and paint them at position 49 ("AAx") on the <canvas>.
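The instruction layout described above can be sketched in a few lines. This is a hedged illustration rather than the logic in ac_flow.js: parseFrame, toInt, and the field names position and count are invented for the example, and the second instruction in the sample string is made up.

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Interpret base64 characters as 6-bit digits, most significant first.
const toInt = (digits) =>
  [...digits].reduce((value, ch) => value * 64 + ALPHABET.indexOf(ch), 0);

// Split a frame string into 5-character instructions:
// 3 characters of canvas position, then 2 characters of block count.
function parseFrame(frame) {
  if (frame.length % 5 !== 0) {
    throw new Error("frame length must be a multiple of 5");
  }
  const instructions = [];
  for (let i = 0; i < frame.length; i += 5) {
    instructions.push({
      position: toInt(frame.slice(i, i + 3)),
      count: toInt(frame.slice(i + 3, i + 5)),
    });
  }
  return instructions;
}

// "AAxAC" is the instruction quoted above; "ABAAB" is made up for illustration.
console.log(parseFrame("AAxACABAAB"));
// [ { position: 49, count: 2 }, { position: 64, count: 1 } ]
```

With 6 bits per character, the 3-character position can address 64^3 = 262,144 locations and the 2-character count can describe runs of up to 4,095 blocks.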

Note that there is no capability provided for reusing JPEG blocks from the JPEG stream. The JPEG blocks can only be used once each, in order. This adds some potential redundancy to the JPEG, but it makes the manifest smaller and the format simpler.
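The "use once, in order" rule means a decoder only needs a single cursor into the JPEG stream. Here is a minimal sketch of that bookkeeping under some loud assumptions: the position is taken to be a block index on the canvas in left-to-right, top-to-bottom order, a run of blocks is assumed to continue in that same order, and the stream is simplified to a single image. planFrame and its parameters are hypothetical names.

```javascript
// Turn one frame's instructions into a list of block copies. `cursor` is the
// index of the next unread block in the JPEG stream; it only moves forward.
function planFrame(instructions, cursor, canvasColumns, streamColumns, blockSize = 8) {
  const copies = [];
  for (const { position, count } of instructions) {
    for (let i = 0; i < count; i++) {
      const src = cursor++; // each stream block is consumed exactly once
      const dst = position + i; // assumed: runs continue in raster order
      copies.push({
        sx: (src % streamColumns) * blockSize,
        sy: Math.floor(src / streamColumns) * blockSize,
        dx: (dst % canvasColumns) * blockSize,
        dy: Math.floor(dst / canvasColumns) * blockSize,
      });
    }
  }
  return { copies, cursor };
}

// Hypothetical geometry: a canvas 48 blocks wide, a stream image 4 blocks wide.
const plan = planFrame([{ position: 49, count: 2 }], 0, 48, 4);
console.log(plan.copies);
// [ { sx: 0, sy: 0, dx: 8, dy: 8 }, { sx: 8, sy: 0, dx: 16, dy: 8 } ]
console.log(plan.cursor); // 2
```

In a browser, each planned copy would map onto one call like ctx.drawImage(img, sx, sy, 8, 8, dx, dy, 8, 8), and the returned cursor would be passed in when planning the next frame.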

You can tell that some frames have more instructions than others. Most of the early frames are very small (the shiny "Slide to unlock" effect). The frames that update the entire image are a little bigger, but still not especially large because they are just a few instructions that each specify a long run of blocks. The biggest frames in the manifest are those that update many small parts of the image. If you zoom out on the JSON, it turns into a nice graph:

The "longest" frames are likely the ones where everything has finished animating except the icons, causing a lot of gaps between blocks copied from the JPEG stream.

So that's how Apple is encoding video in JPEG.

But it's not just for video...

That's right, Apple is using this compression strategy to recreate QTVR (QuickTime VR). When you drag the earbuds, you're actually scrubbing a video of the earbuds rotating. And the video is using the same JS/JSON/<canvas> compression technique.

So how well does it work? Well... not so well:

What's going on? Very little time is being spent in the <canvas> API ("drawImage"). All of the time is being spent decoding frames and applying diffs.

It turns out that seeking is very expensive in this video format. To decode an individual video frame, we must first decode every single frame before it. After all, the whole point of the format is to encode only the parts that changed, which means that we now have to reconstruct the parts that haven't:
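A toy model makes the cost visible. In the sketch below (an illustration, not Apple's player), a "picture" is just an array of block values and each frame is a list of patches against the previous picture. The class name DeltaPlayer and the restart-from-zero strategy for backward seeks are assumptions.

```javascript
// Toy model: each frame is a list of [position, newValue] patches that only
// make sense on top of the picture produced by all earlier frames.
class DeltaPlayer {
  constructor(frames, blockCount) {
    this.frames = frames;
    this.blockCount = blockCount;
    this.framesApplied = 0; // total work done, for illustration
    this.rewind();
  }

  rewind() {
    this.picture = new Array(this.blockCount).fill(0);
    this.current = -1; // index of the last frame applied
  }

  seek(target) {
    // Diffs only go forward, so moving backward means starting over.
    if (target < this.current) this.rewind();
    while (this.current < target) {
      this.current++;
      for (const [position, value] of this.frames[this.current]) {
        this.picture[position] = value;
      }
      this.framesApplied++;
    }
    return this.picture;
  }
}

const frames = [
  [[0, 1], [1, 1], [2, 1], [3, 1]], // frame 0 paints everything
  [[1, 2]],
  [[2, 3]],
  [[3, 4]],
];
const player = new DeltaPlayer(frames, 4);
player.seek(3);
console.log(player.picture, player.framesApplied); // [ 1, 2, 3, 4 ] 4
player.seek(2); // one frame back: replays frames 0, 1 and 2
console.log(player.picture, player.framesApplied); // [ 1, 2, 3, 1 ] 7
```

Stepping back a single frame costs three frame decodes here, and the penalty grows with how far into the video you are.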

To alleviate the situation somewhat, Apple includes both forward- and backward-playing versions of the video (otherwise, rotating the earbuds backwards would be obscenely slow). Unfortunately, this effectively doubles the file size, and does not solve the problem of not being able to "skip frames" when the user drags the earbuds quickly.

So what's the next thing up Apple's sleeve? How about replacing JSON with something more efficient for binary data?

It looks like Apple is working on a new version of this script that encodes the manifest in a PNG image, instead of base64 strings in JSON. Because PNG is lossless and <canvas> support is pretty good, it turns out to be a good container for shipping lots of bytes to the browser and reading them back, even when the data isn't actually an image. It will be interesting to see how this PNG format affects the performance and file size of this animation technique.
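How Apple lays the bytes out inside the PNG isn't covered here, so the following is only a sketch of the general trick. It assumes, hypothetically, that each pixel carries three payload bytes in its red, green and blue channels and that alpha is kept fully opaque, since canvases store premultiplied colors and non-opaque pixels may not read back exactly. The function name bytesFromPixels is made up.

```javascript
// Recover payload bytes from RGBA pixel data, such as the `data` array
// returned by ctx.getImageData(0, 0, width, height) after drawing the PNG.
// Assumed layout: R, G, B hold payload bytes; alpha is always 255.
function bytesFromPixels(rgba, byteLength) {
  const bytes = new Uint8Array(byteLength);
  let out = 0;
  for (let i = 0; i < rgba.length && out < byteLength; i++) {
    if (i % 4 === 3) continue; // skip the alpha channel
    bytes[out++] = rgba[i];
  }
  return bytes;
}

// Two pixels carrying the five bytes 1..5, plus one byte of padding.
const pixels = new Uint8ClampedArray([1, 2, 3, 255, 4, 5, 0, 255]);
console.log(Array.from(bytesFromPixels(pixels, 5))); // [ 1, 2, 3, 4, 5 ]
```

Compared with the 6-bits-per-character manifest, a layout like this would carry 24 bits per pixel before PNG's own lossless compression is applied.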


And if you found this interesting, you might find working for Cue interesting too.