Marbling It Up

@MarbleItUp has gone public. We will be launching soon on Switch. I wanted to talk a little bit about the game and how it came to be… and share some sweet GIFs (in high framerate black and white for download size reasons).

An extremely early baby shot of the game. Almost everything here – including the code – was later replaced. And that new stuff runs at 60hz on Switch in handheld mode.

Just over a year ago, Mark Frohnmayer came to me with a proposal – let’s build a sweet marble game and put it on Switch. The Switch at that time was a brand new platform, but the potential was immediately obvious to both of us. I agreed to partner with Mark and all of a sudden – after several years away – I was back in the game development world.

I didn’t know it on day one, but making Marble It Up! was going to be a rare privilege. I am incredibly proud of this game – the team, the levels, the visuals and the gameplay are all amazing, and I count myself privileged to be involved. This is really a great game.

As we near release, I feel confident it will be a hit. I am just hoping it will be blockbuster Michael Bay hit, not a indie niche Kevin Smith hit. Anyway, on with the story…

The Team

Like Danny Ocean, we had to get the old crew together. I already had Miha, an awesome coder, on my team from past projects, so he and I got rolling (hah) right away. Meanwhile, we immediately reached out to Alex, original level designer for Marble Blast, and Todd, one of the most talented (and hardest working) artists I know. We also pulled in the incredible Solovox to make an awesome soundtrack and Mike Jones to do the rest of the audio.

Along the way we ran into long time fans of our work on the original Marble Blast series. We brought in the team behind QuiVR, Jonathan and Fletcher, as well as long time community members, speed runners and developers Matan and Conor.

With the team thus completed, we went into full production. Development slowed down a little bit when a few months into development my Loom technology was acquired and I had to devote my focus to a roll (ho ho ho) with the acquiring company. This ended up being short lived and I was able to refocus my attention after about 10 months of distraction (so to speak).

All of our development was self funded. This gave us total control over direction, but also required all of us to really commit to the game. We’re using a studio model for the game itself – it has its own entity that holds the IP and pays out royalties. Maybe I’ll talk about the basic structure we used for that some day down the road.

Development Path

For this project, we used Unity 2017. It comes free with Switch developer access, and it looked like the best path forward on short notice. I briefly considered using Torque, the first commercial game engine I worked on. Because it was long in the tooth, it would require a solid porting effort before development could begin in earnest. We just didn’t have that luxury.

Although older versions of Unity were shaky, it has really matured in recent years and 2017 was easy to work with. Very quickly we had something up, and I started taking little progress GIFs (in black and white so they’d show up in Slack quickly). We got a ball rolling around in very short order:

Very early rolling behavior.

The movement took a while to get right (more on that later). But ProBuilder made it super easy to throw a level together, and with a little work we got elevators going.

One of the first elevators.
And collapsing platforms!


Being able to easily tie any geometry into a mover was a huge benefit for level design. It did require some discipline with prefabs to create consistent reusable elevators, but one-off showpieces like those found in Cog Valley or Staying Alive were super easy to make.

The physics simulation was very detailed, and one of the elements needed for both physics networking was a capacity for rewind. Instead of only allowing for a tick or two of rewind, I generalized to allow for arbitrary rewind. This made level testing MUCH easier:

Check out the hook while my DJ revolves it.

And it also set the stage for two other features, ghost racing and replay. (I love ghost races way more than anyone else on the team, but I’m the lead programmer so they had to ship it.) Being able to watch and race against other players routes, as well as compete against yourself, adds loads of replay value to the game.

What better challenger for you to face…. than YOURSELF!

Not to say everything was wine and roses. We quickly found that Unity’s physics were completely inadequate for our needs. In the end, about half of the game code was for our custom marble physics engine. I feel comfortable claiming that no other marble game comes close to our level of fidelity. Part of the reason we have our own physics is in order to make them unrealistic, too – sometimes bending the laws of physics in juuust the right way makes things way more fun.

An extreme case of gravity surfaces.

We also found that the default Unity renderer was really not appropriate for the game. Shadows provide important depth cues, but Unity’s dynamic shadows were fuzzy and required very careful level design and lighting to be present in all gameplay situations. In the end, a simple projected blob shadow gave us a crisp look and perfect visibility to support the gameplay. It also saved us many man hours tuning levels.

Particles were also a huge pain – especially mesh particles. Unity does horrible slow things to these and cost us massive amounts of performance on the CPU-light Switch hardware. In the end, we had to make small systems to handle the costliest cases in order to meet our performance goal of 60hz in handheld mode.

Without custom rendering, this effect could bring the Switch to its knees.

We also provide a full reflection on the marble. When I implemented this for Marble Blast Ultra, it took me a week or so to modify the scene graph to re-render the level in all six directions. Performance took some care, but ultimately it ran fine using more-or-less the default engine code. This was on the Xbox 360, which has way less RAM, a slower GPU and weaker CPU. Under Unity, on the more powerful Switch, we ended up having to write our own custom CommandBuffer based renderer and our own set of lightweight shaders in order to hit 60hz. This was an unexpected surprise late in development.

In general, Unity was excellent for prototyping and able to be coaxed into viability for production.


Man, I am stoked to see this game go out into the market. It has been a year of my life, some great challenges, some great fun, and hopefully, a game that brings a little bit of joy into the world. Let’s marble it up!

Video Conference Part 4: Making RANS the Ryg way

Last post, we got our networking stack up and running, figured out how to work around firewalls, and saw how our codec performs over a real network link. This motivated us to revisit our compression schemes, which is what we’ll do in today’s post.

We’ll start with some algorithmic improvements to reduce the amount of data we need, then implement an entropy coding technique to reduce our data rate even further.

YUV & Chroma Subsampling

Up to now, we have skipped a key part of JPEG – conversion to YUV and chroma subsampling. DCT works on any kind of data – RGB, YUV, monochrome, etc. The DCT we have been using is for 8×8 blocks. So up to now we have been sending 4 sets of 3 blocks – one each for the red, green, and blue channels in each of the four 8×8 regions in a macroblock. This is a lot of data!

What JPEG and a lot of other formats do is take advantage of limitations in the human visual system to save bandwidth. Specifically, people are much better at detecting changes in brightness then they are in color. So you can send color data at half resolution, while you keep brightness at full resolution. To do this, you must first separate brightness from color, which is accomplished by converting to the YUV color space.


Doing this saves about 50% bandwidth with a very minimal reduction in quality (6 blocks instead of 12 blocks). It brings us down to 4 bits per pixel.

Flat Blocks

Right now the shortest block we can send is around 18 bytes (8 coefficients at 2 bytes each plus 2 more bytes to encode zeroes for the remaining 56 coefficients). So a macroblock will be at least 108 bytes plus overhead. This works out to around 0.4 bits per pixel for even a very simple macroblock.

However, during high motion frames, we need to prioritize quantity over quality. Sending fully detailed macroblocks is costly in CPU and bandwidth for pixels that are blurry and rapidly changing. Further, macroblocks that aren’t updated are very obvious in this sort of situation – so we want to prefer a full frame of updates rather than part of a frame.


To this end, we create a new macroblock encoding type called “flat.” We add a single bit at the start of the macroblock indicating if it’s DCT or flat. If flat, then the format is as follows:

 6 bits - red   color quantized by 4
 6 bits - green color quantized by 4
 6 bits - blue  color quantized by 4

We select this type when the RMS error of the flat color is less than some threshold. The threshold is based on the current total error of all macroblocks, so when we have lots of motion causing lots of error, we are more likely to encode flat macroblocks. Even a very low bandwidth link is capable of sending full frames where every macroblock is flat.


Flat mode has a flat cost of 0.07 bits per macroblock, which is very low. It improves the responsiveness of video feeds quite a bit. There are a whole family of techniques along these lines used by H.264 and other codecs (32×32 or larger macroblocks, special gradient blocks, etc.) that can make a big difference in high motion or scenes with out of focus regions. These techniques do tend to make high motion frames that can’t fit in available bandwidth “blur out” but this is a lot less offensive than partial updates.

Entropy Coding with RANS

We’ve gotten pretty far – down to 4 bits per pixel on an average feed. But there’s one important piece of the JPEG approach we’ve neglected up till now: entropy coding. Right now, all coefficients are sent as 2 byte signed shorts. We’re doing a lot of work to make them close to zero, but it only reduces size if we get adjacent coefficients to zero and RLE gives us a win.

JPEG uses a Huffman coder, which uses fewer bits for numbers closer to zero. Common +1 and -1 coefficients get sent in only a couple of bits, while rare, larger coefficients such as 1000 cost more bits. The balance is a significant reduction in size. JPEG also stores some coefficients relative to previous coefficients to help reduce the average magnitude and thus the bit count.

Because I’ve worked with Huffman codes before, I decided to try something new to me – arithmetic coders. I’ve read about them but never used one directly. I had been wanting to try the RANS coder from @rygorous (found at, so I started implementing one based on his code.

Getting it fully integrated and working was a beast! I lost a couple of days to getting the RANS coder up and running with no issues. A lot of this was my learning curve, as the algorithm is subtle. Because the operation of an arithmetic coder is fairly opaque, I had to take a black box approach most of the time- I could tell via sentinels if I got right or wrong values, but it was difficult to determine the cause when I got a wrong value out of the decoder. Eventually, I ended up implementing a set of unit tests (using catch.hpp) which helped me constantly check each step of the encoding and decoding process so I could quickly spot any issues when I made a change.

Arithmetic coding has two parts – the actual coder (which is clever and small) and the context. The context models what values are likely at each point in the stream, and the arithmetic coder ensures that likely values are encoded with fewer bits than unlikely values. Good contexts make a huge difference in the effectiveness of compression.

In preparation for all of the following techniques, I captured eleven megabytes of RLE encoded DCT data at various quantizations to use as a training set. For convenience I regenerate my contexts at application launch, but in the real world we would use a much larger training set and only ship the final statistics. This data could easily be included in a header file, and would add just kilobytes to the executable footprint.

Order 0 Context

Contexts are classified by how many symbols back they consider for their prediction. An order 0 context looks only at the current symbol state to determine likely and unlikely symbols. Typically this boils down to a single static list of symbol frequencies.

Order 0 context was easy to implement (Ryg conveniently provides such an implementation in his ryg_rans project!) and I found it saved us about 1.9 bits per pixel.

Order 1 Context

I next tried to implement an order 1 context that looks at the previous symbol and the current symbol to determine its odds. However, I ran into huge issues – I just could not round trip data properly. The reason why is interesting.

RANS-style arithmetic coding is as a deterministic, symmetric process. Think of it like braiding a rope. To unbraid the rope, you do the exact same things but in reverse. Similarly, you run RANS backwards to generate the encoded data, and forward to get the decoded data.

When you are working with an order 0 context, this isn’t a big deal. You get the same statistics out whether you submit symbols forwards or backwards. However, more complex contexts aren’t symmetric (ie, odds of A after B are different than B after A). So no matter what I tried I could not get my order 1 context to work!

Subsequently, I got some advice from the excellent and wise Fabian Giesen, who pointed out that I should buffer symbols for encoding so I can process them forwards and write them backwards. This unblocked progress, but unfortunately I ran out of time to implement the order 1 context – I had already moved on to something else.

Dual Order 0 Context

Since the order 1 context was initially a bust, I tried an alternate approach. I wanted to take advantage of the structured nature of the RLE encoded data, which is short aligned. So I built two order 0 contexts, one for the first byte of all the shorts and the other for the second byte. This was simple and worked pretty well – and it was symmetric, so it bypassed the RANS order concerns. This took me down another 0.3 bits per pixel.

Extended Quantization

JPEG only has quantization levels from 1 to 22. However, we can tolerate much worse quality in video than we can in a still image – motion hides artifacts. We can extend the quantization range, lose some quality, and drop up to another 0.9 bits per pixel.

Block State Format (v3)

If you will recall how we encoded our blocks in our packets, you might have thought it wasn’t as efficient as possible. Specifically, a few things cost us in overhead. We re-send the compression type and level for every macroblock, when we don’t change them very often and could probably send them once per packet. This gains us 0.03 bits per pixel.

We also have a flag to indicate if there is another macroblock coming. This is cheap for small numbers of macroblocks, but now we have good compression so we can send a lot in a single packet. If we have 100 macroblocks, we burned 100 bits. It would be a lot cheaper to use 8 or 9 bits to encode the macroblock count. This gains us 0.003 bits per pixel. Not a huge win but it dovetails with the next change well.

Ryg kindly pointed out this win to me via Twitter. Every time you flush RANS, you have to write out 4 bytes. When we flush after every block, then we burn ~32 bits per block. If we modify our packet to have one continuous RANS stream for all the macroblocks, we only have to flush once for perhaps a hundred macroblocks. This gains us around 0.12 bits per pixel.

One cool thing about RANS is you can insert uncompressed bits directly without any additional overhead. (It’s symmetrical and deterministic, remember? So you just stuff bits in/out and if you are consistent it works itself out.) We use this bypass mode to handle special cases like the flat encoding mode data and the flag that determines if we’re a flat or a DCT block. This allows us to put all our macroblock data through RANS with no hassle.

We could arithmetically code this data based on the likelihood of flat vs. non-flat blocks, or based on common RGB values, but since flat mode is already so much cheaper than DCT, I skipped it. So, no change in bits per pixel, but it keeps our implementation straightforward.

Now the packet look like this:

4 bits - mode (raw, zip, lzo, dct) for packet
6 bits - quality level (if dct)
8 bits - block count
[RANS bytes follow to end of packet]

There is one gotcha. Because RANS pops out bytes at irregular intervals, we can only know if our packet is full by compressing all our pending data after each block is added, and checking to see if we ran out of space. Currently, I do this after every block and as a result I spend about two thirds of my encoding time re-compressing the same data over and over. Because RANS operates in reverse, I can’t easily “checkpoint” my encoding state and only do work for the new content – I have to process the whole packet’s worth of data every time.

I believe this could be substantially improved by heuristically checking more or less often based on the available space. So I might check only once every 10 blocks at first, then more and more regularly as I get towards the end of the packet. This would give a substantial reduction in CPU load.

A Word on Performance

(Notes for self: I tested commit c8e59a2, on a Core i7-2600k 3.4ghz quad core. Program is single threaded.)

Overall, the system has no problem maintaining 60hz performance in release build. Deblocking or very high bandwidth usage can cause it to miss this target, but there are clear paths for optimization. Playback is surprisingly lightweight.

Memory footprint is around 75MB and does not vary substantially. Higher latency connections will cause slightly higher usage, and a lower latency connection slightly less (as we store per-packet information until the packet is received or dropped).

% of Frame Time Description
25% Speculative compression
12% Block serialization
25% GUI rendering
5% Decompression
3% Update error metric

Overall, we see about 40% of frame time spent encoding, and 5% on decoding.

The system is designed to be easy to parallelize. Specifically, since frames are often sent with multiple packets, we would recommend divvying up pending blocks and encoding packets in worker threads. Decoding could also be done with workers for a good speedup. On a fast desktop CPU this is less of an issue, but when trying to achieve low latency on embedded hardware, it could make a big difference.

We can also move to SIMD versions of hot path code (RANS, DCT, and color conversion are all eligible). Our bitblt is also naive, and contributes to about half of the GUI rendering time. A SIMDized blitter would run much faster.

We estimate an optimized decoder implementation on the same hardware would take around 1% frametime, and an optimized encoder implementation would take around 5% frametime. This puts screensharing of 1080p on a single core well within reach – we would expect to see 7% frametime decoding and 40% frametime encoding.

Closing Thoughts

Writing my own video chat app and video codec was a great experience. Much like writing a toy compiler, language, or operating system, it is a learning experience more than a practical application. It gave me a deeper appreciation for the excellent video codecs available on the market today, and a better understanding of the issues facing things like Skype and WebRTC.

Ultimately, for someone often frustrated by video chat applications, it was cathartic to finally take a swing at the problem. And now I have something to tinker with when Skype breaks on me.


Final Statistics

Avg bits per pixel: 0.6 bits for low quality, 1.0 bits for good quality
Max usable sustained packet loss: 25%
Max survivable sustained packet loss: 75%

Thanks for reading. My team and I solve hard problems in the streaming video, AR/VR, scalable backend, gaming, and IoT spaces. If we can help you, contact us at ben AT theengine DOT co.

Video Conference Part 3: Getting Online

Last time we got up close and intimate with the core compression techniques used in the JPEG format, and applied them to our own situation for better compression. We got our data rates low enough that we have a shot at realtime video under ideal network conditions.

Now it’s time to actually send data over a network (or at least loopback)!

The Notify Protocol

We will use a UDP networking protocol derived from one used in the game engines I worked on at my first startup, GarageGames. The Torque networking protocol was state of the art when it was first used in the Tribes series from 1998 to 2001. It enabled realtime gameplay over a 56.6kb or worse modem, and represented a substantial improvement over the networking model used in Quake. You can read the original Tribes networking paper for a deeper discussion of the basic capabilities of the system.

For our purposes in this section, the essential feature of this networking model is the ability to be notified in realtime whether packets have been received by the other end of the connection, or dropped. We receive a callback in our code for each packet that was sent when its fate is known, along with a description of what was in the packet. This is the “notify” part of the protocol. For certain applications, this is a quantum leap relative to TCP, because it allows the application to respond to packet loss intelligently, rather than stalling the connection and resending old data.

You will recall back in the first post that we maintain a model of what we think the client is displaying, and prioritize macroblock updates based on the RMS error of our current local frame versus the client frame.

By using this protocol, we know when updates didn’t make it. If an update was lost, we revert the corresponding macroblock in the client frame to its state before that update was sent. Then the error metric will automatically reprioritize the macroblock for transmission. In the event that we have a newer update “in flight”, we ignore the failure – if the newer update makes it, the problem is resolved, and if it does not, it will cause a retransmission of the macroblock.


In other words, we only resend data when absolutely necessary, and only in the right order relative to other updates. This behavior makes the system robust under extremely high packet loss. You might not get full frames if on a bad connection, but you will see some changes, and eventually your view will become fully correct if a) the scene is static long enough or b) the network recovers.


The protocol, being designed for the high latency modem connections of the late 90s, handles latencies as high as 1000ms without an issue. Of course, the other end sees frames later, but they see them as soon as physically possible. There is no penalty beyond the time spent waiting on the data to get there.

We had to modify the protocol to support higher data rates, as it was originally optimized for connection speeds around 10kB/sec. With our changes, it can reliably transfer data at around 8 megabits/sec. Higher speeds are possible, but as you will see in the last post of this series, largely unnecessary.

Master & Arranged Connections

One benefit of using a game networking protocol is that it has robust support for NAT traversal. By running a light weight master server, we can track many thousands of clients on very modest hardware and make arranged connections that bypass most firewalls.

The NAT punching algorithm is simple but effective. When two peers request an arranged connection, the master server gives both a list of potential IPs/ports where the other peer might be found (typically where the master sees that peer’s traffic originating from, and the peer’s local IP/port), and part of a shared secret. Then the peers send punch packets to those IPs and ports.

For our prototype, we started with local network peer discovery (ie broadcast ping), then added support for entering an IP. Finally, we added automatic discovery of another random peer via the master server.

The master server connection can run at very low bandwidth, so a logical next step would be to do some authentication, track presence, and let you add contacts and initiate a conversation on demand. However, this is just a prototype and none of that is vital functionality!

Packet Math

One important thing to consider is all in packet overhead. It’s easy to get a false sense of confidence by focusing only on the compressed image data without considering the overhead of the full networking stack.

Our packet format header is about 94 bits or 12 bytes. UDP IPV4 header overhead is 28 bytes. Our packet overhead is therefore 40 bytes.

Reliable MTU size seems to be around 1200 bytes, leaving about 1160 bytes for data (3.3% overhead). This can/should be negotiated but is currently hard coded. On local network segments much larger MTUs are possible, and over the general Internet 1500 bytes often works.

The overhead per pixel will decrease as our compression gets better, but at a rough estimate, assuming 50 macroblocks per packet, we will see an overhead of 0.02bits/pixel.

A quick comparison to TCP suggests that on IPV4 we would see 4.3% overhead as opposed to 3.3% for our protocol. This is of course assuming an ideal connection. In the face of packet loss, TCP rapidly throttles down to try to find a data rate that lets the link behave well, losing lots of bandwidth. Our protocol accepts lost packets and rolls forward as best it can. So depending on the situation the difference will be a lot more than 1%.

Simulation Options

An important detail when building a robust system is being able to simulate network problems. If you don’t regularly test under bad conditions, your system rapidly becomes brittle. We have support for adding latency (ie delaying packet delivery) and randomly dropping packets (with a tunable frequency).

Next Steps

Here’s the best stream quality we can achieve in realtime over our network connection:


We’re running at a little over 8 bits per pixel or 7.3 megabits/sec. We can do a lot better – so now that we’re networking properly, we’ll go back for a final round of improvements in the next post (click to visit)!

Video Conference Part 2: Joint Photographic Hell (For Beginners)

Last time, I got angry at Skype and metaphorically flipped the table by starting to write my own video conference application. We got a skeleton of a chat app and a video codec up, only to find it had a ludicrous 221 megabit bandwidth requirement. We added some simple compression but didn’t get much return from our efforts.

We definitely need better compression before we implement networking. There’s no point trying to stream hundreds of megabits over a real world link. We need to slim things way down.

It turns out that most video codecs use some variant of the lossy compression techniques from the JPEG format. I’ve always wanted to learn the details of JPEG. With the excellent mango library from my friend Jukka Liimatta as a reference, I started down the well trodden road of abusing JPEG for my own purposes.


The biggest building block of JPEG is the Discrete Cosine Transform. The DCT is related to the Fourier Transform. It takes a signal (in this case, a 8 by 8 array of pixels) and breaks it down into frequency components. Later, we can take the frequency components and run them through the Inverse Discrete Cosine Transform to get our original signal back.

Lower frequencies live in the top left of the output matrix, while higher frequencies live in the bottom right. Larger numbers indicate more of that frequency, smaller numbers less. The following image shows the frequencies that each of the 64 coefficients you get from a DCT transform encode. By combining them in varying amounts, you can encode any 8×8 pixel image.

Courtesy of many JPEG explanation sites on the internet.

(A word on blocks and macroblocks. You will notice that this is an 8×8 DCT but we are working with 16×16 macroblocks. Macroblocks are so named because they are made up of a combination of smaller blocks. At this point in the game, we are breaking our 16×16 RGB pixel macroblock into 12 8×8 DCT blocks. However, we will see later how we can do much better than that. Thanks for reading, now back to your regularly scheduled programming.)

Lossy compression is all about removing data that doesn’t contribute much to the final output. By working in frequency space, we can start dropping components that are less important. For instance, imagine a macroblock with a smooth gradient. If we are working with RGB values directly, it might be difficult to efficiently detect or encode the gradient. We would be easily confused by noise in the image. By running it through a DCT, we’d immediately see that the largest components were low frequency elements, and the smallest high frequency. It is much simpler to drop the smallest components in a grid!

JPEG is built around making the best use of DCT. One of several tricks it uses is zigzag tables to reorder the DCT components for better encoding efficiency. In adapting the JPEG code from Mango, one issue that got me was that the encoding and decoding order aren’t symmetric. That is, if you zigzagged an 8×8 matrix with the encoding table, then run it through the corresponding decoding table, you get a transposed version of the matrix. This manifested as blown out colors in the output image, and I’ll discuss why in the next section…


The biggest trick JPEG uses for better encoding efficiency is quantization. This is where things get lossy – DCT, like FFT, is fully reversible if you preserve the full precision of its output. By rounding the values from the DCT – also known as quantizing them – you can control how much data is lost. Frequencies that don’t contribute much tend to have small components in the DCT, so they rapidly become zero as they are quantized by larger amounts. Frequencies that matter a great deal have larger coefficients and will not be removed or will be removed very late in the process.

The following image shows similar video encoded with varying quantization level, going from high quantization (which looks blocky) to low quantization (which looks smooth and detailed).


To quantize, JPEG divides each DCT component by a rounding factor derived from the quality level setting. In fact, it generates a whole quantization matrix – one quantization factor for each of the 64 values of DCT output – that biases towards preserving component values or low frequencies. Each value is rounded a little bit differently.

This is how I was getting blown out colors (as mentioned in the previous section). Imagine a partial DCT output matrix consisting of frequency coefficients [40, 30, 20, 10]. We generate a (partial) quantization matrix [1, 2, 3, 4]. We divide and get quantized values of [40, 15, 7, 3] which are then written out.

Originally I was not taking the zigzag’s transposition into account, so I was scaling up by a rotated version of the quantization matrix. This meant that some values were being scaled up with a different factor than they were scaled down. If we represent rotation of the full quantization matrix by scrambling our small demo matrix, it would be like dequantizing by [1, 4, 3, 2]. We would end up with [40, 60, 21, 6] at the end. In compression speak, we failed to preserve the overall “energy” of the block – we ended up with bigger coefficient factors than we started, causing the output to be brighter! The effects vary depending on specifically how you get it wrong – here’s a shot where the luminance quantization coefficients were wrong:


Great – so now we’ve gotten the coefficients for unimportant parts of the DCT to be smaller, and in many cases zero. We can adjust how much we lose by varying a quality level factor. But we didn’t actually reduce the amount of data we transfer at all! Wait what?!


The reason why JPEG tries to get everything closer to zero is simple. Zeroes are cheap to send!

Why? It’s not just because of the convenient hole in the middle of a 0. The next step in JPEG is to collapse all those zeroes with run length encoding. We walk through the array, and replace adjacent zeroes with a special value indicating how many zeroes there were. Suppose our quantization makes the last 32 coefficients zero. We would go from 128 bytes (JPEG DCT output is one signed short per coefficient) to 66 bytes (32 populated coefficients plus one special value indicating 32 zeroes). That’s 50% savings with a pretty simple compression algorithm!

It sounds super simple, and once written it is, but getting RLE encoding to work reliably took several important debugging techniques. In addition to the old fashioned “stare at it in the debugger” technique, I added sentinels – known values at specific locations in the data stream – to help catch any misalignments. If I didn’t see sentinels exactly when expected, the code would instantly assert. I also put some asserts into the RLE decoding routines to check that we always got 64 values out and that a run would not cause us to over flow the buffer (a common sign of bad data would be runs longer than the remaining space in the block).

Because any bugs in later stages of processing would corrupt the RLE data, these asserts were usually my first sign that I had gotten something wrong in later parts of the compression pipeline.

Block State Format (v1)

At this point, although the app isn’t network capable, it starts making sense to describe the format for macroblock updates. Again, keeping things simple, we have the following structure:

 1 bit - is there another block update? if false, stop processing.
 4 bits - compression type (raw, zip, lzo, dct).
 6 bits - only present for zip/dct, indicating quality level.
 6 bits - macroblock x index
 6 bits - macroblock y index
 10 bits - size of compressed data
 size * 8 bits - output from the compression algorithm

Note that the macroblock x and y index are counting macroblocks, not pixels. So in this version of the protocol we can have images up to 1024 16px macroblocks or 16,384px on a side. We could shave a few bits if needed by taking into account the actual image size.

As you can see, this incurs a fixed overhead of 33 bits per macroblock, or 0.51 bits per pixel.

We see an average data rate of 8 bits per pixel with the subset of JPEG implemented above at average quality (ie, quality level 50). Lowest quality setting ran at 6 bits per pixel, and highest quality setting was 20 bits per pixel.

Now we’re down to only 40-60 megabits per second. It’s progress! Our video codec is starting to be feasible over very fast broadband connections, but we can do a lot better.

Next Time

Now that we can transfer little enough data to run the codec over a network, we should implement some networking and test it for real. And that’s exactly what we’ll do in Part 3!

Video Conference Part 1: These Things Suck

What I cannot create, I do not understand. – Richard Feynman

I do a lot of video chat for work. If it’s not a one on one, it’s pair programming. If it’s not pair programming, it’s a client meeting. I use a lot of Skype and Hangouts.

Sometimes they don’t work for unclear reasons. Sometimes file transfers fail. Sometimes screenshare breaks, or when it’s active you don’t get webcam, too. Or the connection lags or drops even though everything is running fast.

Every time I experience such a failure, I get really angry and think, “I could do this better!” But I never quite got angry enough… until now. I guess the weight of years of frustration finally got to me.

I wrote my own (prototype) video conferencing app. It turned out pretty well. And that’s what these posts are about.

Conventions & Caveats

We will be referencing a 640×480 24 bit color 24fps video stream throughout this series of posts. We will spell out bits vs bytes in most cases to avoid confusion, and when abbreviating will use Mb (lowercase) for bits and MB (uppercase) for bytes.

I am not a video codec professional and this is an experiment. MPEG-2, H.264, VP9 and other codecs represent the state of the art and are tremendously more complex and capable than what I will describe. I believe there are some good tradeoffs to my system (which I will discuss later), but it is by no means exhaustively optimized or tuned. There are plenty of obvious improvements I simply didn’t have time to explore in this side project.

Basic Update Algorithm

I started by prototyping a basic algorithm with no network communication – just moving data around in-process. I used dear imgui for the UI, and videoinput for the webcam capture. (I really enjoyed working with both for the first time.) I maintain two buffers, one holding the current frame from the webcam, and the other holding a model of what I’ve “sent” over the simulated network. I also show the per pixel error between the two.


I divide the images into 16px by 16px macroblocks, and calculate the error for each macroblock by taking the root mean square (RMS) of the client frame’s RGB values vs. the local frame’s RGB values for that region. I prioritize blocks with high error and transfer as many as I can every packet. I went with 16px macroblocks out of laziness – there’s lots of research and sample code based on that size.


As you can see, this is a self correcting system. Macroblocks with large errors – indicated by white – are constantly being “transmitted” and reduced to lower error – indicated by black. That is, it’s always trying to minimize the difference between what the client is seeing and the current state of the video feed. The rate at which the system converges on the new state is proportional to how fast we can transfer macroblocks. This allows us to scale to varying bandwidth situations, and it also strongly motivates us to have good compression.


As long as we have some feedback from the client, we can also handle data corrupted or dropped by the network. When we learn about lost data, we’ll update our model of the client state, and let the error correcting behavior handle it. More on that in a later post.

The main flaws with the system at this point are a) we aren’t networking anything yet and b) even if we did, it would require 221 megabits/second for 480p 30hz video sent as uncompressed RGB24. This means you’d have to have a well tuned 802.11n network at minimum – 802.11g would not be even close to fast enough. Peak LTE performance also would not come close to handling this much traffic.

Currently, macroblock updates cost us approximately 26 bits per pixel. We are a bit worse than just sending 8 bit RGB values because of overhead in the data protocol – we have to send macroblock positions, note the current compression settings, and so on.

Raw, zip & lzo

So we have an overall approach, but the bandwidth is way too high for any sort of real world use. We need to reduce our bandwidth by a factor of 30 for this system to be remotely plausible for use on the average 7 megabit broadband internet connection!

As MPEG-2, H.264, HEVC, VP9 and other codecs demonstrate, compressing video is definitely possible. 🙂 But those codecs are all complicated and often complex to integrate, modify or debug (said as someone maintaining a production system using ffmpeg). For instance, x264 is 100k lines of code without a lot of comments. Some codecs have substantial licensing fees. They also tend to have problems when data is lost during transmission. And many introduce substantial latency due to complex (but highly efficient) encoding processes.

A good rule for prototyping is to do the simplest thing first, then iterate. Add complexity as needed.

So I grabbed miniz and minilzo, and set it up so I could choose which compression technique to use on the macroblock data.

Since these are lossless compression algorithms, there was no change in image quality. However, we do see changes in macroblock update size. ZLib at level 9 achieved 23.8 bits per pixel. LZO achieved 28.9 bits per pixel. Not so good! Why are we getting such terrible results?

The biggest reason is that neither algorithm is particularly good at short blocks of data. Both have a “startup” phase where they can’t efficiently compress until a history is built up. Since every packet in our data stream must be self contained, we can’t rely on a shared history. This leads to a big efficiency loss. Even if we could have big blocks, noisy image data is hard to compress with this family of compressors – they are much better at repetitive, byte aligned data such as text.

We found basic run of the mill compressors to be a bust, but we did build the infrastructure to have compressed macroblocks, which is a vital step forward.

Next Time

Enough for prototyping today! We built a basic algorithm, got some basic performance parameters, and took our first baby steps with compression. We also set up an application framework that can display a complex UI and capture video.


The full development UI. Click to expand.


Join us next time in Part 2 as we get our bandwidth down to a plausible range with some lossy compression!

Huffman Serialization Techniques

Networking expert Glenn Fielder’s article on network serialization techniques is a great read. It discusses a number of scenarios including serializing a sparse set of items from an array. This blog post discusses Glenn’s implementation of sparse array serialization and my thoughts on it. To be sure, Glenn’s code has the virtue of simplicity, and the concepts I’m going to discuss are more complex. Don’t assume the space or time trade offs of different serialization techniques without measuring them in practice.

Without further ado, the original code:

template <typename Stream> 
bool serialize_object_index_internal( Stream & stream, 
                                      int & previous, 
                                      int & current )
    uint32_t difference;
    if ( Stream::IsWriting )
        assert( previous < current );
        difference = current - previous;
        assert( difference > 0 );

    // +1 (1 bit)
    bool plusOne;
    if ( Stream::IsWriting )
       plusOne = difference == 1;
    serialize_bool( stream, plusOne );
    if ( plusOne )
        if ( Stream::IsReading )
            current = previous + 1;
        previous = current;
        return true;

    // [+2,5] -> [0,3] (2 bits)
    bool twoBits;
    if ( Stream::IsWriting )
        twoBits = difference <= 5;
    serialize_bool( stream, twoBits );
    if ( twoBits )
        serialize_int( stream, difference, 2, 5 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [6,13] -> [0,7] (3 bits)
    bool threeBits;
    if ( Stream::IsWriting )
        threeBits = difference <= 13;
    serialize_bool( stream, threeBits );
    if ( threeBits )
        serialize_int( stream, difference, 6, 13 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [14,29] -> [0,15] (4 bits)
    bool fourBits;
    if ( Stream::IsWriting )
        fourBits = difference <= 29;
    serialize_bool( stream, fourBits );
    if ( fourBits )
        serialize_int( stream, difference, 14, 29 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [30,61] -> [0,31] (5 bits)
    bool fiveBits;
    if ( Stream::IsWriting )
        fiveBits = difference <= 61;
    serialize_bool( stream, fiveBits );
    if ( fiveBits )
        serialize_int( stream, difference, 30, 61 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [62,125] -> [0,63] (6 bits)
    bool sixBits;
    if ( Stream::IsWriting )
        sixBits = difference <= 125;
    serialize_bool( stream, sixBits );
    if ( sixBits )
        serialize_int( stream, difference, 62, 125 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [126,MaxObjects+1] 
    serialize_int( stream, difference, 126, MaxObjects + 1 );
    if ( Stream::IsReading )
        current = previous + difference;
    previous = current;
    return true;

Intuition and Rules of Thumb

Let’s talk about why this code immediately stood out to me – it triggered an intuition: code structure correlates with entropy.  Entropy in the information theory sense means that repetitive data has lower entropy, and unique data has high entropy. This intuition cues me to think about refactoring along DRY principles or moving to a more data driven approach. These kinds of changes take low entropy code and “compress” it so it has higher entropy – in other words, code that is less repetitive to accomplish the same results.

Consider the following example:

// Choose largest of provided parameters.
int pickHighest(int a, int b, int c)
   if(a > b)
      if(a > c)
         return a;
         return c;
      if (b > c)
         return b;
         return c;

The repetitive tree structure of the code immediately suggests that it could be written more concisely, especially as you have more arguments from which to pick. A sort algorithm is an obvious choice, e.g.

int pickHighest(vector<int> items)
   items.sort(); // Sort items in descending order.
   return items[0]; // Pick the highest item.

This version is much shorter, it can handle any number of items, and gets right to the point.

(Caveats: Of course, the astute reader will note that there are situations where for small argument counts the tree can be a faster choice. Rewriting it with ternary operators doesn’t count – we’re concerned with execution structure here, not syntactic structure. Finally, there are plenty of loose ends to tune in the new implementation for correctness and speed.)

Huffman Codes

How does this apply to Glenn’s implementation above? Well, we see it has a repetitive structure. It’s attempting to encode more common numbers in fewer bits, and rarer numbers with more bits. That smells like a Huffman code to me! A Huffman code is a tree-like encoding structure that assigns unique bit strings of varying lengths to different values based on their expected frequency. So the most common value is encoded with a couple of bits, while the least common value might take ten or twenty bits. This lets you get a big win assuming that your assumptions about what’s common are right. As an example, algorithms like deflate/zip rely on Huffman codes. There are also techniques like arithmetic coding that are more powerful and complex expressions of this concept.

A canonical Huffman code is one which generates N codes where the first value is expected to be most frequent, and the last value is least frequent. Implicitly, this means the first value’s code is shortest and the last value’s code is longest. Fortunately, this is exactly the situation we want to deal with! If you knew more about the actual distribution of values you could generate your own tree based on the observed values and get better efficiency in your serialization.

(In the following code, I’m going to assume you have a serialize function to read/write a canonical Huffman code that can represent values between 0 and 127. This is definitely sweeping complexity under the rug, although a Huffman coder is a handy thing to have in general. If I was implementing this, I’d start with the HuffmanProcessor from Torque, although there are surely more efficient implementations.)

Here’s the implementation using Huffman coding:

bool serialize_object_index_internal( Stream & stream, 
                                      int & previous, 
                                      int & current )
    uint32_t difference;
    uint32_t cappedDifference;
    if ( Stream::IsWriting )
        assert( previous < current );
        difference = current - previous;
        assert( difference > 0 );

        // Store a capped difference so we know when to 
        // write overflow values.
        cappedDifference = difference > 127 ? 127 : difference;

    // Write huffman encoded delta.
    serialize_huffman127(stream, cappedDifference);

    if(cappedDifference == 127)
        // Highest value keys us to code large gaps.
        serialize_int( stream, difference, 126, MaxObjects + 1 );
        // The serialized valus is the actual value.
        difference = cappedDifference;

    if ( Stream::IsWriting )
        current = previous + difference;
    return true;

Notice that there is now no repetitive structure to the code. We have to deal with a couple of details (what if it’s a number bigger than we can Huffman code?), but the basic flow is much simpler. And it achieves a very similar result.


What are the trade offs of this approach? Well, first off, it may not be as bit efficient in the “every item is consecutive” case with the canonical Huffman code because the canonical Huffman code won’t assign a single bit to that most common case. You could add a special case for that (and spend an extra bit per serialization in the non-consecutive case).

Second, Huffman codes are more complex to read and write. If you are doing a LOT of serialization this might affect CPU usage. Most likely it will not be a significant difference – I’ve rarely seen any game be bottleneck on network serialization.

Third, tuning is more indirect. With Glenn’s implementation, you can modify the code to directly affect the bit cost of various scenarios. With this approach, your main tuning option is to regenerate the Huffman code based on actual network traffic. You can also add special cases e.g. for the consecutive item case. Regardless of the approach you use, you always want to check your changes by running your game and measuring actual update sizes in various scenarios to confirm you are really getting a win.

Fourth, there are better coding techniques than Huffman – in practice you probably want a variant of arithmetic coding. Compressing the entire packet with a standard compression algorithm like LZMA is also an option. This article is just a piece in the puzzle of the ultimate network protocol.

Is the Huffman approach a slam dunk win over the manual approach? Unclear. But it gives a more formal perspective on the smart serialization technique Glenn uses, and is a good entree into a larger discussion on coding techniques for network serialization. I hope you found this article interesting!

(Thanks to  @glennfiedler for prompting me to write this.)

Fundamentally F$#ed Up Projects


My buddy Chris Benjaminsen asked me for a list of the ways in which game projects are fundamentally “f#$ed”. I couldn’t ask for a better prompt for a blog post.

Games are often developed by passionate people. Most game developers didn’t stumble into games – they had a longstanding interest and worked hard to make their way into the field. As a result, they are often “interesting” in addition to being passionate. More on that in a moment.

The natural forces surrounding game projects are chaotic. Games often take a non-linear path during development, meaning it’s hard for managers to determine if the project is “on track” or not. Small changes can make them dramatically more or less viable. Game studios are chronically over-extended and under-funded, adding another level of risk and stress.

Finally, games are demanding technically. They must deliver a high fidelity, responsive experience using meager resources on a wide variety of hardware. They have to adapt to a wide variety of needs from the artistic, design, accessibility, and business parts of the team. And there is never enough time.

How does this translate to projects being fundamentally f#$ed up? Let me share a few source of chaos from my own experience:

  • Battle of the wills. Working on a sequel for a beloved franchise. The upside: everyone on the team was a long time fan, passionate and engaged. The downside: Everyone on the team had their own conflicting ideas on design and implementation. This project didn’t last very long.
  • Lacking key development hardware. I’ve worked on several projects where we were developing software against hardware that didn’t exist yet. To understate the situation: it is very challenging to build a great experience when you are building it on an approximation of final hardware!
  • Burn out. I’ve worked with a surprising number of companies where the founders didn’t work on the core product for long periods of time.
  • Upper management. Game projects mature at varying rates. You have to keep upper management happy and involved so that they want to keep funding it. At the same time, you have to be careful what they see – a series of bad demo sessions can make them defund/deprioritize. In addition, much like almighty Zeus, they have a tendency to see random things from their exalted positions and hurl lightning bolts at the masses.
  • Unstable personalities. They come into work on the first day and everything is fine. The next day, they’re in tears. The day after, they want to fight you. They may not last, but they just might take the project down with them… Or drive away your most talented developers. Or maybe it’s the CEO and you’re gonna have to live with it.
  • Bad processes and continuity. QA processes that can’t catch showstopper bugs. Build systems that have no consistency from build to build. The staff with knowledge of deployment process leaving and being replaced frequently.
  • Fear and loathing. Very often team members will get concerned about something. It could be something serious, or something minor. Maybe a studio head came by and said something, or maybe they saw a detail that isn’t right. Or it could even be a broken process causing them extra work/stress. Often if they aren’t in a lead role they will have a hard time evaluating the significance and/or the steps being taken to deal with it. An important job for every lead is to help fight these fires and keep everyone on an even keel. Often the best thing to do is simply ignore it and move ahead with the plan. But if left unchecked it can destroy morale.
  • Technical debt. Maybe you inherited some code that is of questionable quality or maybe you built it yourself. Maybe some corner cases you swept under the rug are coming out to bite you. Whatever the case, technical debt can accrue very rapidly in a game, and if it’s not managed right it kills forward progress.

Stressed out yet?! As a technical contributor to games, I’ve long viewed my role as one of providing flexibility and stability. This takes two parts. First, there is a continuous and ongoing negotiation with the other elements of the team. Design needs gameplay mechanics. Art wants to execute a certain look. Production has schedule and budget constraints and needs to manage upwards to keep the project funded. The business guys need monetization to work and want all sorts of weird SDKs integrated so they can track their users. Executives are putting their two cents on all sides, introducing random changes. QA is reporting bugs and asking for diagnostic tools.

My job as a technical lead is to take all of these conflicting needs and somehow produce a reasonable series of stable builds of the game suitable for testing and release without anyone going insane. Most of the time this takes the form of communication so everyone can stay calm about where things are going. Sometimes we have to take real technical steps to resolve something, but it’s amazing how often just talking through stuff results in no changes or only a minor change to the course.

The second part of my role is more subtle. I need to tackle the technical elements of the project that will best allow it to weather the inevitable storms. For instance, suppose I think the basic idea of the game is good but the game design is poorly thought out. I may bring it up in a casual way, but rather than getting in a big fight about it, what I’ll do instead is rough out the gameplay such that the flaws in the design can be exposed. Once those issues are visible to the designers, I can focus on building systems that I know will be crucial for the game’s success. That might be a scalable websockets server, or some advanced rendering tech, or some other independent piece that is crucial for the game but not deeply affected by the specifics of the gameplay.

Fundamentally, I am using my technical capabilities and resources to act as an anchor for the project. By getting builds in front of the team and prioritizing effort based on risk/uncertainty, I can help ensure there is visible progress that keeps everyone happy and engaged even in the face of serious systemic badness. And sometimes, when everything is on fire, that’s all you need to find success