Marbling It Up

@MarbleItUp has gone public. We will be launching soon on Switch. I wanted to talk a little bit about the game and how it came to be… and share some sweet GIFs (in high framerate black and white for download size reasons).

An extremely early baby shot of the game. Almost everything here – including the code – was later replaced. And that new stuff runs at 60hz on Switch in handheld mode.

Just over a year ago, Mark Frohnmayer came to me with a proposal – let’s build a sweet marble game and put it on Switch. The Switch at that time was a brand new platform, but the potential was immediately obvious to both of us. I agreed to partner with Mark and all of a sudden – after several years away – I was back in the game development world.

I didn’t know it on day one, but making Marble It Up! was going to be a rare privilege. I am incredibly proud of this game – the team, the levels, the visuals and the gameplay are all amazing, and I count myself privileged to be involved. This is really a great game.

As we near release, I feel confident it will be a hit. I am just hoping it will be blockbuster Michael Bay hit, not a indie niche Kevin Smith hit. Anyway, on with the story…

The Team

Like Danny Ocean, we had to get the old crew together. I already had Miha, an awesome coder, on my team from past projects, so he and I got rolling (hah) right away. Meanwhile, we immediately reached out to Alex, original level designer for Marble Blast, and Todd, one of the most talented (and hardest working) artists I know. We also pulled in the incredible Solovox to make an awesome soundtrack and Mike Jones to do the rest of the audio.

Along the way we ran into long time fans of our work on the original Marble Blast series. We brought in the team behind QuiVR, Jonathan and Fletcher, as well as long time community members, speed runners and developers Matan and Conor.

With the team thus completed, we went into full production. Development slowed down a little bit when a few months into development my Loom technology was acquired and I had to devote my focus to a roll (ho ho ho) with the acquiring company. This ended up being short lived and I was able to refocus my attention after about 10 months of distraction (so to speak).

All of our development was self funded. This gave us total control over direction, but also required all of us to really commit to the game. We’re using a studio model for the game itself – it has its own entity that holds the IP and pays out royalties. Maybe I’ll talk about the basic structure we used for that some day down the road.

Development Path

For this project, we used Unity 2017. It comes free with Switch developer access, and it looked like the best path forward on short notice. I briefly considered using Torque, the first commercial game engine I worked on. Because it was long in the tooth, it would require a solid porting effort before development could begin in earnest. We just didn’t have that luxury.

Although older versions of Unity were shaky, it has really matured in recent years and 2017 was easy to work with. Very quickly we had something up, and I started taking little progress GIFs (in black and white so they’d show up in Slack quickly). We got a ball rolling around in very short order:

Very early rolling behavior.

The movement took a while to get right (more on that later). But ProBuilder made it super easy to throw a level together, and with a little work we got elevators going.

One of the first elevators.
And collapsing platforms!


Being able to easily tie any geometry into a mover was a huge benefit for level design. It did require some discipline with prefabs to create consistent reusable elevators, but one-off showpieces like those found in Cog Valley or Staying Alive were super easy to make.

The physics simulation was very detailed, and one of the elements needed for both physics networking was a capacity for rewind. Instead of only allowing for a tick or two of rewind, I generalized to allow for arbitrary rewind. This made level testing MUCH easier:

Check out the hook while my DJ revolves it.

And it also set the stage for two other features, ghost racing and replay. (I love ghost races way more than anyone else on the team, but I’m the lead programmer so they had to ship it.) Being able to watch and race against other players routes, as well as compete against yourself, adds loads of replay value to the game.

What better challenger for you to face…. than YOURSELF!

Not to say everything was wine and roses. We quickly found that Unity’s physics were completely inadequate for our needs. In the end, about half of the game code was for our custom marble physics engine. I feel comfortable claiming that no other marble game comes close to our level of fidelity. Part of the reason we have our own physics is in order to make them unrealistic, too – sometimes bending the laws of physics in juuust the right way makes things way more fun.

An extreme case of gravity surfaces.

We also found that the default Unity renderer was really not appropriate for the game. Shadows provide important depth cues, but Unity’s dynamic shadows were fuzzy and required very careful level design and lighting to be present in all gameplay situations. In the end, a simple projected blob shadow gave us a crisp look and perfect visibility to support the gameplay. It also saved us many man hours tuning levels.

Particles were also a huge pain – especially mesh particles. Unity does horrible slow things to these and cost us massive amounts of performance on the CPU-light Switch hardware. In the end, we had to make small systems to handle the costliest cases in order to meet our performance goal of 60hz in handheld mode.

Without custom rendering, this effect could bring the Switch to its knees.

We also provide a full reflection on the marble. When I implemented this for Marble Blast Ultra, it took me a week or so to modify the scene graph to re-render the level in all six directions. Performance took some care, but ultimately it ran fine using more-or-less the default engine code. This was on the Xbox 360, which has way less RAM, a slower GPU and weaker CPU. Under Unity, on the more powerful Switch, we ended up having to write our own custom CommandBuffer based renderer and our own set of lightweight shaders in order to hit 60hz. This was an unexpected surprise late in development.

In general, Unity was excellent for prototyping and able to be coaxed into viability for production.


Man, I am stoked to see this game go out into the market. It has been a year of my life, some great challenges, some great fun, and hopefully, a game that brings a little bit of joy into the world. Let’s marble it up!

Video Conference Part 4: Making RANS the Ryg way

Last post, we got our networking stack up and running, figured out how to work around firewalls, and saw how our codec performs over a real network link. This motivated us to revisit our compression schemes, which is what we’ll do in today’s post.

We’ll start with some algorithmic improvements to reduce the amount of data we need, then implement an entropy coding technique to reduce our data rate even further.

YUV & Chroma Subsampling

Up to now, we have skipped a key part of JPEG – conversion to YUV and chroma subsampling. DCT works on any kind of data – RGB, YUV, monochrome, etc. The DCT we have been using is for 8×8 blocks. So up to now we have been sending 4 sets of 3 blocks – one each for the red, green, and blue channels in each of the four 8×8 regions in a macroblock. This is a lot of data!

What JPEG and a lot of other formats do is take advantage of limitations in the human visual system to save bandwidth. Specifically, people are much better at detecting changes in brightness then they are in color. So you can send color data at half resolution, while you keep brightness at full resolution. To do this, you must first separate brightness from color, which is accomplished by converting to the YUV color space.


Doing this saves about 50% bandwidth with a very minimal reduction in quality (6 blocks instead of 12 blocks). It brings us down to 4 bits per pixel.

Flat Blocks

Right now the shortest block we can send is around 18 bytes (8 coefficients at 2 bytes each plus 2 more bytes to encode zeroes for the remaining 56 coefficients). So a macroblock will be at least 108 bytes plus overhead. This works out to around 0.4 bits per pixel for even a very simple macroblock.

However, during high motion frames, we need to prioritize quantity over quality. Sending fully detailed macroblocks is costly in CPU and bandwidth for pixels that are blurry and rapidly changing. Further, macroblocks that aren’t updated are very obvious in this sort of situation – so we want to prefer a full frame of updates rather than part of a frame.


To this end, we create a new macroblock encoding type called “flat.” We add a single bit at the start of the macroblock indicating if it’s DCT or flat. If flat, then the format is as follows:

 6 bits - red   color quantized by 4
 6 bits - green color quantized by 4
 6 bits - blue  color quantized by 4

We select this type when the RMS error of the flat color is less than some threshold. The threshold is based on the current total error of all macroblocks, so when we have lots of motion causing lots of error, we are more likely to encode flat macroblocks. Even a very low bandwidth link is capable of sending full frames where every macroblock is flat.


Flat mode has a flat cost of 0.07 bits per macroblock, which is very low. It improves the responsiveness of video feeds quite a bit. There are a whole family of techniques along these lines used by H.264 and other codecs (32×32 or larger macroblocks, special gradient blocks, etc.) that can make a big difference in high motion or scenes with out of focus regions. These techniques do tend to make high motion frames that can’t fit in available bandwidth “blur out” but this is a lot less offensive than partial updates.

Entropy Coding with RANS

We’ve gotten pretty far – down to 4 bits per pixel on an average feed. But there’s one important piece of the JPEG approach we’ve neglected up till now: entropy coding. Right now, all coefficients are sent as 2 byte signed shorts. We’re doing a lot of work to make them close to zero, but it only reduces size if we get adjacent coefficients to zero and RLE gives us a win.

JPEG uses a Huffman coder, which uses fewer bits for numbers closer to zero. Common +1 and -1 coefficients get sent in only a couple of bits, while rare, larger coefficients such as 1000 cost more bits. The balance is a significant reduction in size. JPEG also stores some coefficients relative to previous coefficients to help reduce the average magnitude and thus the bit count.

Because I’ve worked with Huffman codes before, I decided to try something new to me – arithmetic coders. I’ve read about them but never used one directly. I had been wanting to try the RANS coder from @rygorous (found at, so I started implementing one based on his code.

Getting it fully integrated and working was a beast! I lost a couple of days to getting the RANS coder up and running with no issues. A lot of this was my learning curve, as the algorithm is subtle. Because the operation of an arithmetic coder is fairly opaque, I had to take a black box approach most of the time- I could tell via sentinels if I got right or wrong values, but it was difficult to determine the cause when I got a wrong value out of the decoder. Eventually, I ended up implementing a set of unit tests (using catch.hpp) which helped me constantly check each step of the encoding and decoding process so I could quickly spot any issues when I made a change.

Arithmetic coding has two parts – the actual coder (which is clever and small) and the context. The context models what values are likely at each point in the stream, and the arithmetic coder ensures that likely values are encoded with fewer bits than unlikely values. Good contexts make a huge difference in the effectiveness of compression.

In preparation for all of the following techniques, I captured eleven megabytes of RLE encoded DCT data at various quantizations to use as a training set. For convenience I regenerate my contexts at application launch, but in the real world we would use a much larger training set and only ship the final statistics. This data could easily be included in a header file, and would add just kilobytes to the executable footprint.

Order 0 Context

Contexts are classified by how many symbols back they consider for their prediction. An order 0 context looks only at the current symbol state to determine likely and unlikely symbols. Typically this boils down to a single static list of symbol frequencies.

Order 0 context was easy to implement (Ryg conveniently provides such an implementation in his ryg_rans project!) and I found it saved us about 1.9 bits per pixel.

Order 1 Context

I next tried to implement an order 1 context that looks at the previous symbol and the current symbol to determine its odds. However, I ran into huge issues – I just could not round trip data properly. The reason why is interesting.

RANS-style arithmetic coding is as a deterministic, symmetric process. Think of it like braiding a rope. To unbraid the rope, you do the exact same things but in reverse. Similarly, you run RANS backwards to generate the encoded data, and forward to get the decoded data.

When you are working with an order 0 context, this isn’t a big deal. You get the same statistics out whether you submit symbols forwards or backwards. However, more complex contexts aren’t symmetric (ie, odds of A after B are different than B after A). So no matter what I tried I could not get my order 1 context to work!

Subsequently, I got some advice from the excellent and wise Fabian Giesen, who pointed out that I should buffer symbols for encoding so I can process them forwards and write them backwards. This unblocked progress, but unfortunately I ran out of time to implement the order 1 context – I had already moved on to something else.

Dual Order 0 Context

Since the order 1 context was initially a bust, I tried an alternate approach. I wanted to take advantage of the structured nature of the RLE encoded data, which is short aligned. So I built two order 0 contexts, one for the first byte of all the shorts and the other for the second byte. This was simple and worked pretty well – and it was symmetric, so it bypassed the RANS order concerns. This took me down another 0.3 bits per pixel.

Extended Quantization

JPEG only has quantization levels from 1 to 22. However, we can tolerate much worse quality in video than we can in a still image – motion hides artifacts. We can extend the quantization range, lose some quality, and drop up to another 0.9 bits per pixel.

Block State Format (v3)

If you will recall how we encoded our blocks in our packets, you might have thought it wasn’t as efficient as possible. Specifically, a few things cost us in overhead. We re-send the compression type and level for every macroblock, when we don’t change them very often and could probably send them once per packet. This gains us 0.03 bits per pixel.

We also have a flag to indicate if there is another macroblock coming. This is cheap for small numbers of macroblocks, but now we have good compression so we can send a lot in a single packet. If we have 100 macroblocks, we burned 100 bits. It would be a lot cheaper to use 8 or 9 bits to encode the macroblock count. This gains us 0.003 bits per pixel. Not a huge win but it dovetails with the next change well.

Ryg kindly pointed out this win to me via Twitter. Every time you flush RANS, you have to write out 4 bytes. When we flush after every block, then we burn ~32 bits per block. If we modify our packet to have one continuous RANS stream for all the macroblocks, we only have to flush once for perhaps a hundred macroblocks. This gains us around 0.12 bits per pixel.

One cool thing about RANS is you can insert uncompressed bits directly without any additional overhead. (It’s symmetrical and deterministic, remember? So you just stuff bits in/out and if you are consistent it works itself out.) We use this bypass mode to handle special cases like the flat encoding mode data and the flag that determines if we’re a flat or a DCT block. This allows us to put all our macroblock data through RANS with no hassle.

We could arithmetically code this data based on the likelihood of flat vs. non-flat blocks, or based on common RGB values, but since flat mode is already so much cheaper than DCT, I skipped it. So, no change in bits per pixel, but it keeps our implementation straightforward.

Now the packet look like this:

4 bits - mode (raw, zip, lzo, dct) for packet
6 bits - quality level (if dct)
8 bits - block count
[RANS bytes follow to end of packet]

There is one gotcha. Because RANS pops out bytes at irregular intervals, we can only know if our packet is full by compressing all our pending data after each block is added, and checking to see if we ran out of space. Currently, I do this after every block and as a result I spend about two thirds of my encoding time re-compressing the same data over and over. Because RANS operates in reverse, I can’t easily “checkpoint” my encoding state and only do work for the new content – I have to process the whole packet’s worth of data every time.

I believe this could be substantially improved by heuristically checking more or less often based on the available space. So I might check only once every 10 blocks at first, then more and more regularly as I get towards the end of the packet. This would give a substantial reduction in CPU load.

A Word on Performance

(Notes for self: I tested commit c8e59a2, on a Core i7-2600k 3.4ghz quad core. Program is single threaded.)

Overall, the system has no problem maintaining 60hz performance in release build. Deblocking or very high bandwidth usage can cause it to miss this target, but there are clear paths for optimization. Playback is surprisingly lightweight.

Memory footprint is around 75MB and does not vary substantially. Higher latency connections will cause slightly higher usage, and a lower latency connection slightly less (as we store per-packet information until the packet is received or dropped).

% of Frame Time Description
25% Speculative compression
12% Block serialization
25% GUI rendering
5% Decompression
3% Update error metric

Overall, we see about 40% of frame time spent encoding, and 5% on decoding.

The system is designed to be easy to parallelize. Specifically, since frames are often sent with multiple packets, we would recommend divvying up pending blocks and encoding packets in worker threads. Decoding could also be done with workers for a good speedup. On a fast desktop CPU this is less of an issue, but when trying to achieve low latency on embedded hardware, it could make a big difference.

We can also move to SIMD versions of hot path code (RANS, DCT, and color conversion are all eligible). Our bitblt is also naive, and contributes to about half of the GUI rendering time. A SIMDized blitter would run much faster.

We estimate an optimized decoder implementation on the same hardware would take around 1% frametime, and an optimized encoder implementation would take around 5% frametime. This puts screensharing of 1080p on a single core well within reach – we would expect to see 7% frametime decoding and 40% frametime encoding.

Closing Thoughts

Writing my own video chat app and video codec was a great experience. Much like writing a toy compiler, language, or operating system, it is a learning experience more than a practical application. It gave me a deeper appreciation for the excellent video codecs available on the market today, and a better understanding of the issues facing things like Skype and WebRTC.

Ultimately, for someone often frustrated by video chat applications, it was cathartic to finally take a swing at the problem. And now I have something to tinker with when Skype breaks on me.


Final Statistics

Avg bits per pixel: 0.6 bits for low quality, 1.0 bits for good quality
Max usable sustained packet loss: 25%
Max survivable sustained packet loss: 75%

Thanks for reading. My team and I solve hard problems in the streaming video, AR/VR, scalable backend, gaming, and IoT spaces. If we can help you, contact us at ben AT theengine DOT co.

Video Conference Part 3: Getting Online

Last time we got up close and intimate with the core compression techniques used in the JPEG format, and applied them to our own situation for better compression. We got our data rates low enough that we have a shot at realtime video under ideal network conditions.

Now it’s time to actually send data over a network (or at least loopback)!

The Notify Protocol

We will use a UDP networking protocol derived from one used in the game engines I worked on at my first startup, GarageGames. The Torque networking protocol was state of the art when it was first used in the Tribes series from 1998 to 2001. It enabled realtime gameplay over a 56.6kb or worse modem, and represented a substantial improvement over the networking model used in Quake. You can read the original Tribes networking paper for a deeper discussion of the basic capabilities of the system.

For our purposes in this section, the essential feature of this networking model is the ability to be notified in realtime whether packets have been received by the other end of the connection, or dropped. We receive a callback in our code for each packet that was sent when its fate is known, along with a description of what was in the packet. This is the “notify” part of the protocol. For certain applications, this is a quantum leap relative to TCP, because it allows the application to respond to packet loss intelligently, rather than stalling the connection and resending old data.

You will recall back in the first post that we maintain a model of what we think the client is displaying, and prioritize macroblock updates based on the RMS error of our current local frame versus the client frame.

By using this protocol, we know when updates didn’t make it. If an update was lost, we revert the corresponding macroblock in the client frame to its state before that update was sent. Then the error metric will automatically reprioritize the macroblock for transmission. In the event that we have a newer update “in flight”, we ignore the failure – if the newer update makes it, the problem is resolved, and if it does not, it will cause a retransmission of the macroblock.


In other words, we only resend data when absolutely necessary, and only in the right order relative to other updates. This behavior makes the system robust under extremely high packet loss. You might not get full frames if on a bad connection, but you will see some changes, and eventually your view will become fully correct if a) the scene is static long enough or b) the network recovers.


The protocol, being designed for the high latency modem connections of the late 90s, handles latencies as high as 1000ms without an issue. Of course, the other end sees frames later, but they see them as soon as physically possible. There is no penalty beyond the time spent waiting on the data to get there.

We had to modify the protocol to support higher data rates, as it was originally optimized for connection speeds around 10kB/sec. With our changes, it can reliably transfer data at around 8 megabits/sec. Higher speeds are possible, but as you will see in the last post of this series, largely unnecessary.

Master & Arranged Connections

One benefit of using a game networking protocol is that it has robust support for NAT traversal. By running a light weight master server, we can track many thousands of clients on very modest hardware and make arranged connections that bypass most firewalls.

The NAT punching algorithm is simple but effective. When two peers request an arranged connection, the master server gives both a list of potential IPs/ports where the other peer might be found (typically where the master sees that peer’s traffic originating from, and the peer’s local IP/port), and part of a shared secret. Then the peers send punch packets to those IPs and ports.

For our prototype, we started with local network peer discovery (ie broadcast ping), then added support for entering an IP. Finally, we added automatic discovery of another random peer via the master server.

The master server connection can run at very low bandwidth, so a logical next step would be to do some authentication, track presence, and let you add contacts and initiate a conversation on demand. However, this is just a prototype and none of that is vital functionality!

Packet Math

One important thing to consider is all in packet overhead. It’s easy to get a false sense of confidence by focusing only on the compressed image data without considering the overhead of the full networking stack.

Our packet format header is about 94 bits or 12 bytes. UDP IPV4 header overhead is 28 bytes. Our packet overhead is therefore 40 bytes.

Reliable MTU size seems to be around 1200 bytes, leaving about 1160 bytes for data (3.3% overhead). This can/should be negotiated but is currently hard coded. On local network segments much larger MTUs are possible, and over the general Internet 1500 bytes often works.

The overhead per pixel will decrease as our compression gets better, but at a rough estimate, assuming 50 macroblocks per packet, we will see an overhead of 0.02bits/pixel.

A quick comparison to TCP suggests that on IPV4 we would see 4.3% overhead as opposed to 3.3% for our protocol. This is of course assuming an ideal connection. In the face of packet loss, TCP rapidly throttles down to try to find a data rate that lets the link behave well, losing lots of bandwidth. Our protocol accepts lost packets and rolls forward as best it can. So depending on the situation the difference will be a lot more than 1%.

Simulation Options

An important detail when building a robust system is being able to simulate network problems. If you don’t regularly test under bad conditions, your system rapidly becomes brittle. We have support for adding latency (ie delaying packet delivery) and randomly dropping packets (with a tunable frequency).

Next Steps

Here’s the best stream quality we can achieve in realtime over our network connection:


We’re running at a little over 8 bits per pixel or 7.3 megabits/sec. We can do a lot better – so now that we’re networking properly, we’ll go back for a final round of improvements in the next post (click to visit)!

Video Conference Part 2: Joint Photographic Hell (For Beginners)

Last time, I got angry at Skype and metaphorically flipped the table by starting to write my own video conference application. We got a skeleton of a chat app and a video codec up, only to find it had a ludicrous 221 megabit bandwidth requirement. We added some simple compression but didn’t get much return from our efforts.

We definitely need better compression before we implement networking. There’s no point trying to stream hundreds of megabits over a real world link. We need to slim things way down.

It turns out that most video codecs use some variant of the lossy compression techniques from the JPEG format. I’ve always wanted to learn the details of JPEG. With the excellent mango library from my friend Jukka Liimatta as a reference, I started down the well trodden road of abusing JPEG for my own purposes.


The biggest building block of JPEG is the Discrete Cosine Transform. The DCT is related to the Fourier Transform. It takes a signal (in this case, a 8 by 8 array of pixels) and breaks it down into frequency components. Later, we can take the frequency components and run them through the Inverse Discrete Cosine Transform to get our original signal back.

Lower frequencies live in the top left of the output matrix, while higher frequencies live in the bottom right. Larger numbers indicate more of that frequency, smaller numbers less. The following image shows the frequencies that each of the 64 coefficients you get from a DCT transform encode. By combining them in varying amounts, you can encode any 8×8 pixel image.

Courtesy of many JPEG explanation sites on the internet.

(A word on blocks and macroblocks. You will notice that this is an 8×8 DCT but we are working with 16×16 macroblocks. Macroblocks are so named because they are made up of a combination of smaller blocks. At this point in the game, we are breaking our 16×16 RGB pixel macroblock into 12 8×8 DCT blocks. However, we will see later how we can do much better than that. Thanks for reading, now back to your regularly scheduled programming.)

Lossy compression is all about removing data that doesn’t contribute much to the final output. By working in frequency space, we can start dropping components that are less important. For instance, imagine a macroblock with a smooth gradient. If we are working with RGB values directly, it might be difficult to efficiently detect or encode the gradient. We would be easily confused by noise in the image. By running it through a DCT, we’d immediately see that the largest components were low frequency elements, and the smallest high frequency. It is much simpler to drop the smallest components in a grid!

JPEG is built around making the best use of DCT. One of several tricks it uses is zigzag tables to reorder the DCT components for better encoding efficiency. In adapting the JPEG code from Mango, one issue that got me was that the encoding and decoding order aren’t symmetric. That is, if you zigzagged an 8×8 matrix with the encoding table, then run it through the corresponding decoding table, you get a transposed version of the matrix. This manifested as blown out colors in the output image, and I’ll discuss why in the next section…


The biggest trick JPEG uses for better encoding efficiency is quantization. This is where things get lossy – DCT, like FFT, is fully reversible if you preserve the full precision of its output. By rounding the values from the DCT – also known as quantizing them – you can control how much data is lost. Frequencies that don’t contribute much tend to have small components in the DCT, so they rapidly become zero as they are quantized by larger amounts. Frequencies that matter a great deal have larger coefficients and will not be removed or will be removed very late in the process.

The following image shows similar video encoded with varying quantization level, going from high quantization (which looks blocky) to low quantization (which looks smooth and detailed).


To quantize, JPEG divides each DCT component by a rounding factor derived from the quality level setting. In fact, it generates a whole quantization matrix – one quantization factor for each of the 64 values of DCT output – that biases towards preserving component values or low frequencies. Each value is rounded a little bit differently.

This is how I was getting blown out colors (as mentioned in the previous section). Imagine a partial DCT output matrix consisting of frequency coefficients [40, 30, 20, 10]. We generate a (partial) quantization matrix [1, 2, 3, 4]. We divide and get quantized values of [40, 15, 7, 3] which are then written out.

Originally I was not taking the zigzag’s transposition into account, so I was scaling up by a rotated version of the quantization matrix. This meant that some values were being scaled up with a different factor than they were scaled down. If we represent rotation of the full quantization matrix by scrambling our small demo matrix, it would be like dequantizing by [1, 4, 3, 2]. We would end up with [40, 60, 21, 6] at the end. In compression speak, we failed to preserve the overall “energy” of the block – we ended up with bigger coefficient factors than we started, causing the output to be brighter! The effects vary depending on specifically how you get it wrong – here’s a shot where the luminance quantization coefficients were wrong:


Great – so now we’ve gotten the coefficients for unimportant parts of the DCT to be smaller, and in many cases zero. We can adjust how much we lose by varying a quality level factor. But we didn’t actually reduce the amount of data we transfer at all! Wait what?!


The reason why JPEG tries to get everything closer to zero is simple. Zeroes are cheap to send!

Why? It’s not just because of the convenient hole in the middle of a 0. The next step in JPEG is to collapse all those zeroes with run length encoding. We walk through the array, and replace adjacent zeroes with a special value indicating how many zeroes there were. Suppose our quantization makes the last 32 coefficients zero. We would go from 128 bytes (JPEG DCT output is one signed short per coefficient) to 66 bytes (32 populated coefficients plus one special value indicating 32 zeroes). That’s 50% savings with a pretty simple compression algorithm!

It sounds super simple, and once written it is, but getting RLE encoding to work reliably took several important debugging techniques. In addition to the old fashioned “stare at it in the debugger” technique, I added sentinels – known values at specific locations in the data stream – to help catch any misalignments. If I didn’t see sentinels exactly when expected, the code would instantly assert. I also put some asserts into the RLE decoding routines to check that we always got 64 values out and that a run would not cause us to over flow the buffer (a common sign of bad data would be runs longer than the remaining space in the block).

Because any bugs in later stages of processing would corrupt the RLE data, these asserts were usually my first sign that I had gotten something wrong in later parts of the compression pipeline.

Block State Format (v1)

At this point, although the app isn’t network capable, it starts making sense to describe the format for macroblock updates. Again, keeping things simple, we have the following structure:

 1 bit - is there another block update? if false, stop processing.
 4 bits - compression type (raw, zip, lzo, dct).
 6 bits - only present for zip/dct, indicating quality level.
 6 bits - macroblock x index
 6 bits - macroblock y index
 10 bits - size of compressed data
 size * 8 bits - output from the compression algorithm

Note that the macroblock x and y index are counting macroblocks, not pixels. So in this version of the protocol we can have images up to 1024 16px macroblocks or 16,384px on a side. We could shave a few bits if needed by taking into account the actual image size.

As you can see, this incurs a fixed overhead of 33 bits per macroblock, or 0.51 bits per pixel.

We see an average data rate of 8 bits per pixel with the subset of JPEG implemented above at average quality (ie, quality level 50). Lowest quality setting ran at 6 bits per pixel, and highest quality setting was 20 bits per pixel.

Now we’re down to only 40-60 megabits per second. It’s progress! Our video codec is starting to be feasible over very fast broadband connections, but we can do a lot better.

Next Time

Now that we can transfer little enough data to run the codec over a network, we should implement some networking and test it for real. And that’s exactly what we’ll do in Part 3!

Video Conference Part 1: These Things Suck

What I cannot create, I do not understand. – Richard Feynman

I do a lot of video chat for work. If it’s not a one on one, it’s pair programming. If it’s not pair programming, it’s a client meeting. I use a lot of Skype and Hangouts.

Sometimes they don’t work for unclear reasons. Sometimes file transfers fail. Sometimes screenshare breaks, or when it’s active you don’t get webcam, too. Or the connection lags or drops even though everything is running fast.

Every time I experience such a failure, I get really angry and think, “I could do this better!” But I never quite got angry enough… until now. I guess the weight of years of frustration finally got to me.

I wrote my own (prototype) video conferencing app. It turned out pretty well. And that’s what these posts are about.

Conventions & Caveats

We will be referencing a 640×480 24 bit color 24fps video stream throughout this series of posts. We will spell out bits vs bytes in most cases to avoid confusion, and when abbreviating will use Mb (lowercase) for bits and MB (uppercase) for bytes.

I am not a video codec professional and this is an experiment. MPEG-2, H.264, VP9 and other codecs represent the state of the art and are tremendously more complex and capable than what I will describe. I believe there are some good tradeoffs to my system (which I will discuss later), but it is by no means exhaustively optimized or tuned. There are plenty of obvious improvements I simply didn’t have time to explore in this side project.

Basic Update Algorithm

I started by prototyping a basic algorithm with no network communication – just moving data around in-process. I used dear imgui for the UI, and videoinput for the webcam capture. (I really enjoyed working with both for the first time.) I maintain two buffers, one holding the current frame from the webcam, and the other holding a model of what I’ve “sent” over the simulated network. I also show the per pixel error between the two.


I divide the images into 16px by 16px macroblocks, and calculate the error for each macroblock by taking the root mean square (RMS) of the client frame’s RGB values vs. the local frame’s RGB values for that region. I prioritize blocks with high error and transfer as many as I can every packet. I went with 16px macroblocks out of laziness – there’s lots of research and sample code based on that size.


As you can see, this is a self correcting system. Macroblocks with large errors – indicated by white – are constantly being “transmitted” and reduced to lower error – indicated by black. That is, it’s always trying to minimize the difference between what the client is seeing and the current state of the video feed. The rate at which the system converges on the new state is proportional to how fast we can transfer macroblocks. This allows us to scale to varying bandwidth situations, and it also strongly motivates us to have good compression.


As long as we have some feedback from the client, we can also handle data corrupted or dropped by the network. When we learn about lost data, we’ll update our model of the client state, and let the error correcting behavior handle it. More on that in a later post.

The main flaws with the system at this point are a) we aren’t networking anything yet and b) even if we did, it would require 221 megabits/second for 480p 30hz video sent as uncompressed RGB24. This means you’d have to have a well tuned 802.11n network at minimum – 802.11g would not be even close to fast enough. Peak LTE performance also would not come close to handling this much traffic.

Currently, macroblock updates cost us approximately 26 bits per pixel. We are a bit worse than just sending 8 bit RGB values because of overhead in the data protocol – we have to send macroblock positions, note the current compression settings, and so on.

Raw, zip & lzo

So we have an overall approach, but the bandwidth is way too high for any sort of real world use. We need to reduce our bandwidth by a factor of 30 for this system to be remotely plausible for use on the average 7 megabit broadband internet connection!

As MPEG-2, H.264, HEVC, VP9 and other codecs demonstrate, compressing video is definitely possible. 🙂 But those codecs are all complicated and often complex to integrate, modify or debug (said as someone maintaining a production system using ffmpeg). For instance, x264 is 100k lines of code without a lot of comments. Some codecs have substantial licensing fees. They also tend to have problems when data is lost during transmission. And many introduce substantial latency due to complex (but highly efficient) encoding processes.

A good rule for prototyping is to do the simplest thing first, then iterate. Add complexity as needed.

So I grabbed miniz and minilzo, and set it up so I could choose which compression technique to use on the macroblock data.

Since these are lossless compression algorithms, there was no change in image quality. However, we do see changes in macroblock update size. ZLib at level 9 achieved 23.8 bits per pixel. LZO achieved 28.9 bits per pixel. Not so good! Why are we getting such terrible results?

The biggest reason is that neither algorithm is particularly good at short blocks of data. Both have a “startup” phase where they can’t efficiently compress until a history is built up. Since every packet in our data stream must be self contained, we can’t rely on a shared history. This leads to a big efficiency loss. Even if we could have big blocks, noisy image data is hard to compress with this family of compressors – they are much better at repetitive, byte aligned data such as text.

We found basic run of the mill compressors to be a bust, but we did build the infrastructure to have compressed macroblocks, which is a vital step forward.

Next Time

Enough for prototyping today! We built a basic algorithm, got some basic performance parameters, and took our first baby steps with compression. We also set up an application framework that can display a complex UI and capture video.


The full development UI. Click to expand.


Join us next time in Part 2 as we get our bandwidth down to a plausible range with some lossy compression!

Huffman Serialization Techniques

Networking expert Glenn Fielder’s article on network serialization techniques is a great read. It discusses a number of scenarios including serializing a sparse set of items from an array. This blog post discusses Glenn’s implementation of sparse array serialization and my thoughts on it. To be sure, Glenn’s code has the virtue of simplicity, and the concepts I’m going to discuss are more complex. Don’t assume the space or time trade offs of different serialization techniques without measuring them in practice.

Without further ado, the original code:

template <typename Stream> 
bool serialize_object_index_internal( Stream & stream, 
                                      int & previous, 
                                      int & current )
    uint32_t difference;
    if ( Stream::IsWriting )
        assert( previous < current );
        difference = current - previous;
        assert( difference > 0 );

    // +1 (1 bit)
    bool plusOne;
    if ( Stream::IsWriting )
       plusOne = difference == 1;
    serialize_bool( stream, plusOne );
    if ( plusOne )
        if ( Stream::IsReading )
            current = previous + 1;
        previous = current;
        return true;

    // [+2,5] -> [0,3] (2 bits)
    bool twoBits;
    if ( Stream::IsWriting )
        twoBits = difference <= 5;
    serialize_bool( stream, twoBits );
    if ( twoBits )
        serialize_int( stream, difference, 2, 5 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [6,13] -> [0,7] (3 bits)
    bool threeBits;
    if ( Stream::IsWriting )
        threeBits = difference <= 13;
    serialize_bool( stream, threeBits );
    if ( threeBits )
        serialize_int( stream, difference, 6, 13 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [14,29] -> [0,15] (4 bits)
    bool fourBits;
    if ( Stream::IsWriting )
        fourBits = difference <= 29;
    serialize_bool( stream, fourBits );
    if ( fourBits )
        serialize_int( stream, difference, 14, 29 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [30,61] -> [0,31] (5 bits)
    bool fiveBits;
    if ( Stream::IsWriting )
        fiveBits = difference <= 61;
    serialize_bool( stream, fiveBits );
    if ( fiveBits )
        serialize_int( stream, difference, 30, 61 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [62,125] -> [0,63] (6 bits)
    bool sixBits;
    if ( Stream::IsWriting )
        sixBits = difference <= 125;
    serialize_bool( stream, sixBits );
    if ( sixBits )
        serialize_int( stream, difference, 62, 125 );
        if ( Stream::IsReading )
            current = previous + difference;
        previous = current;
        return true;

    // [126,MaxObjects+1] 
    serialize_int( stream, difference, 126, MaxObjects + 1 );
    if ( Stream::IsReading )
        current = previous + difference;
    previous = current;
    return true;

Intuition and Rules of Thumb

Let’s talk about why this code immediately stood out to me – it triggered an intuition: code structure correlates with entropy.  Entropy in the information theory sense means that repetitive data has lower entropy, and unique data has high entropy. This intuition cues me to think about refactoring along DRY principles or moving to a more data driven approach. These kinds of changes take low entropy code and “compress” it so it has higher entropy – in other words, code that is less repetitive to accomplish the same results.

Consider the following example:

// Choose largest of provided parameters.
int pickHighest(int a, int b, int c)
   if(a > b)
      if(a > c)
         return a;
         return c;
      if (b > c)
         return b;
         return c;

The repetitive tree structure of the code immediately suggests that it could be written more concisely, especially as you have more arguments from which to pick. A sort algorithm is an obvious choice, e.g.

int pickHighest(vector<int> items)
   items.sort(); // Sort items in descending order.
   return items[0]; // Pick the highest item.

This version is much shorter, it can handle any number of items, and gets right to the point.

(Caveats: Of course, the astute reader will note that there are situations where for small argument counts the tree can be a faster choice. Rewriting it with ternary operators doesn’t count – we’re concerned with execution structure here, not syntactic structure. Finally, there are plenty of loose ends to tune in the new implementation for correctness and speed.)

Huffman Codes

How does this apply to Glenn’s implementation above? Well, we see it has a repetitive structure. It’s attempting to encode more common numbers in fewer bits, and rarer numbers with more bits. That smells like a Huffman code to me! A Huffman code is a tree-like encoding structure that assigns unique bit strings of varying lengths to different values based on their expected frequency. So the most common value is encoded with a couple of bits, while the least common value might take ten or twenty bits. This lets you get a big win assuming that your assumptions about what’s common are right. As an example, algorithms like deflate/zip rely on Huffman codes. There are also techniques like arithmetic coding that are more powerful and complex expressions of this concept.

A canonical Huffman code is one which generates N codes where the first value is expected to be most frequent, and the last value is least frequent. Implicitly, this means the first value’s code is shortest and the last value’s code is longest. Fortunately, this is exactly the situation we want to deal with! If you knew more about the actual distribution of values you could generate your own tree based on the observed values and get better efficiency in your serialization.

(In the following code, I’m going to assume you have a serialize function to read/write a canonical Huffman code that can represent values between 0 and 127. This is definitely sweeping complexity under the rug, although a Huffman coder is a handy thing to have in general. If I was implementing this, I’d start with the HuffmanProcessor from Torque, although there are surely more efficient implementations.)

Here’s the implementation using Huffman coding:

bool serialize_object_index_internal( Stream & stream, 
                                      int & previous, 
                                      int & current )
    uint32_t difference;
    uint32_t cappedDifference;
    if ( Stream::IsWriting )
        assert( previous < current );
        difference = current - previous;
        assert( difference > 0 );

        // Store a capped difference so we know when to 
        // write overflow values.
        cappedDifference = difference > 127 ? 127 : difference;

    // Write huffman encoded delta.
    serialize_huffman127(stream, cappedDifference);

    if(cappedDifference == 127)
        // Highest value keys us to code large gaps.
        serialize_int( stream, difference, 126, MaxObjects + 1 );
        // The serialized valus is the actual value.
        difference = cappedDifference;

    if ( Stream::IsWriting )
        current = previous + difference;
    return true;

Notice that there is now no repetitive structure to the code. We have to deal with a couple of details (what if it’s a number bigger than we can Huffman code?), but the basic flow is much simpler. And it achieves a very similar result.


What are the trade offs of this approach? Well, first off, it may not be as bit efficient in the “every item is consecutive” case with the canonical Huffman code because the canonical Huffman code won’t assign a single bit to that most common case. You could add a special case for that (and spend an extra bit per serialization in the non-consecutive case).

Second, Huffman codes are more complex to read and write. If you are doing a LOT of serialization this might affect CPU usage. Most likely it will not be a significant difference – I’ve rarely seen any game be bottleneck on network serialization.

Third, tuning is more indirect. With Glenn’s implementation, you can modify the code to directly affect the bit cost of various scenarios. With this approach, your main tuning option is to regenerate the Huffman code based on actual network traffic. You can also add special cases e.g. for the consecutive item case. Regardless of the approach you use, you always want to check your changes by running your game and measuring actual update sizes in various scenarios to confirm you are really getting a win.

Fourth, there are better coding techniques than Huffman – in practice you probably want a variant of arithmetic coding. Compressing the entire packet with a standard compression algorithm like LZMA is also an option. This article is just a piece in the puzzle of the ultimate network protocol.

Is the Huffman approach a slam dunk win over the manual approach? Unclear. But it gives a more formal perspective on the smart serialization technique Glenn uses, and is a good entree into a larger discussion on coding techniques for network serialization. I hope you found this article interesting!

(Thanks to  @glennfiedler for prompting me to write this.)

Fundamentally F$#ed Up Projects


My buddy Chris Benjaminsen asked me for a list of the ways in which game projects are fundamentally “f#$ed”. I couldn’t ask for a better prompt for a blog post.

Games are often developed by passionate people. Most game developers didn’t stumble into games – they had a longstanding interest and worked hard to make their way into the field. As a result, they are often “interesting” in addition to being passionate. More on that in a moment.

The natural forces surrounding game projects are chaotic. Games often take a non-linear path during development, meaning it’s hard for managers to determine if the project is “on track” or not. Small changes can make them dramatically more or less viable. Game studios are chronically over-extended and under-funded, adding another level of risk and stress.

Finally, games are demanding technically. They must deliver a high fidelity, responsive experience using meager resources on a wide variety of hardware. They have to adapt to a wide variety of needs from the artistic, design, accessibility, and business parts of the team. And there is never enough time.

How does this translate to projects being fundamentally f#$ed up? Let me share a few source of chaos from my own experience:

  • Battle of the wills. Working on a sequel for a beloved franchise. The upside: everyone on the team was a long time fan, passionate and engaged. The downside: Everyone on the team had their own conflicting ideas on design and implementation. This project didn’t last very long.
  • Lacking key development hardware. I’ve worked on several projects where we were developing software against hardware that didn’t exist yet. To understate the situation: it is very challenging to build a great experience when you are building it on an approximation of final hardware!
  • Burn out. I’ve worked with a surprising number of companies where the founders didn’t work on the core product for long periods of time.
  • Upper management. Game projects mature at varying rates. You have to keep upper management happy and involved so that they want to keep funding it. At the same time, you have to be careful what they see – a series of bad demo sessions can make them defund/deprioritize. In addition, much like almighty Zeus, they have a tendency to see random things from their exalted positions and hurl lightning bolts at the masses.
  • Unstable personalities. They come into work on the first day and everything is fine. The next day, they’re in tears. The day after, they want to fight you. They may not last, but they just might take the project down with them… Or drive away your most talented developers. Or maybe it’s the CEO and you’re gonna have to live with it.
  • Bad processes and continuity. QA processes that can’t catch showstopper bugs. Build systems that have no consistency from build to build. The staff with knowledge of deployment process leaving and being replaced frequently.
  • Fear and loathing. Very often team members will get concerned about something. It could be something serious, or something minor. Maybe a studio head came by and said something, or maybe they saw a detail that isn’t right. Or it could even be a broken process causing them extra work/stress. Often if they aren’t in a lead role they will have a hard time evaluating the significance and/or the steps being taken to deal with it. An important job for every lead is to help fight these fires and keep everyone on an even keel. Often the best thing to do is simply ignore it and move ahead with the plan. But if left unchecked it can destroy morale.
  • Technical debt. Maybe you inherited some code that is of questionable quality or maybe you built it yourself. Maybe some corner cases you swept under the rug are coming out to bite you. Whatever the case, technical debt can accrue very rapidly in a game, and if it’s not managed right it kills forward progress.

Stressed out yet?! As a technical contributor to games, I’ve long viewed my role as one of providing flexibility and stability. This takes two parts. First, there is a continuous and ongoing negotiation with the other elements of the team. Design needs gameplay mechanics. Art wants to execute a certain look. Production has schedule and budget constraints and needs to manage upwards to keep the project funded. The business guys need monetization to work and want all sorts of weird SDKs integrated so they can track their users. Executives are putting their two cents on all sides, introducing random changes. QA is reporting bugs and asking for diagnostic tools.

My job as a technical lead is to take all of these conflicting needs and somehow produce a reasonable series of stable builds of the game suitable for testing and release without anyone going insane. Most of the time this takes the form of communication so everyone can stay calm about where things are going. Sometimes we have to take real technical steps to resolve something, but it’s amazing how often just talking through stuff results in no changes or only a minor change to the course.

The second part of my role is more subtle. I need to tackle the technical elements of the project that will best allow it to weather the inevitable storms. For instance, suppose I think the basic idea of the game is good but the game design is poorly thought out. I may bring it up in a casual way, but rather than getting in a big fight about it, what I’ll do instead is rough out the gameplay such that the flaws in the design can be exposed. Once those issues are visible to the designers, I can focus on building systems that I know will be crucial for the game’s success. That might be a scalable websockets server, or some advanced rendering tech, or some other independent piece that is crucial for the game but not deeply affected by the specifics of the gameplay.

Fundamentally, I am using my technical capabilities and resources to act as an anchor for the project. By getting builds in front of the team and prioritizing effort based on risk/uncertainty, I can help ensure there is visible progress that keeps everyone happy and engaged even in the face of serious systemic badness. And sometimes, when everything is on fire, that’s all you need to find success

Putting Money on the Screen


Jeff Tunnell is an awesome producer (among his other talents), and I’ve learned a lot from working with him. One of his frequent words of guidance while wearing his producer hat? “Put the money on the screen.”

There are a lot of ways to prioritize building software. What feature/change do you tackle first? Depending on the phase of the project you might want to focus on…

  • Vertical slice order. Get full functionality going in a small area first. In games this is usually one complete and playable level. In apps it might be the central interaction, like making a purchase or accessing a service. In websites it might be one complete page out of the entire site. Programmers like to think of this as depth-first traversal.
  • Horizontal slice order. Focus on roughing out broad areas before getting into the details. When I do this, my project usually ends up with a bunch of brightly colored screens with words like “STORE” and “MAIN MENU” and “WORLD MAP” on them. They don’t do much but you can quickly experience the flow of the app and adjust. In programmer land, this is a breadth first traversal.
  • Severity order. Also known as “QA driven development” – you ship builds off to be tested, get bugs ranked by severity, and tackle the most serious first. You end up with stable software, but functionality, usability, and polish can suffer depending on the criteria of evaluation. It’s also hard to really change the overall feel of the project when working in this mode.
  • Money on the Screen order. Maximize bang for buck. In other words, look for features that require small amounts of developer time but move the perceived value/quality of the project ahead substantially. The converse is also valuable – look for things that detract from perceived quality that are also easy to fix.

At different times, these are all valid paths. Early on, to flesh out your project, you probably want to do a horizontal or vertical slice. To ship a stable release, you might need to strictly follow severity order. Everything is contextual.

The magic of “money on the screen” is that it gets you thinking about your project from a fresh angle. What is the smallest change I can make to improve this product the most? It’s not the same as polish – you can have a very polished interaction that’s actively bad! Sometimes it’s the last little step to show off a ton of behind the scenes hard work.

grunts Here’s an example. When I was working on Grunts: Skirmish, a tower defense game I built at PushButton Labs, I did a ton of work on the unit behavior. Each on screen character ran a finite state machine driven by complex AI routines. An animation engine sequenced hundreds of pre-rendered 3D sprites. A data-driven component system tracked weapon status, health status, and updated the navmap. But despite all that the gameplay felt flat until I added little numbers that floated up when characters were damaged, healed, or upgraded. All of a sudden the complex behaviors behind the scenes were more visible to the users and it felt GOOD! Money on the screen!

MarbleBlastUltra_screenshotAnother example is Marble Blast Ultra, an XBox 360 launch title I helped build at GarageGames. We had a great (and for the time cutting edge) shader system but until very late in the project, our marbles were pretty lackluster in appearance. Similarly, we struggled with repetitive looking tiles on level geometry. Not good for the central element of the game! Small amounts of work enhancing the shaders in both cases dramatically improved the look of the game and really leveraged our investment in a new rendering architecture. Money on the screen!

loomdocs_searchA non-game example is the documentation for the Loom SDK. Actually writing the documentation and auto-generation scripts for the docs was a ton of work. But even with all that work, it wasn’t very useful. It was cumbersome to find the right page in the docs and generally the hard work we put in was wasted. Then we realized that the best way to get money on the screen was by adding a quick search box. Users could quickly see their available options, choose them, or even just try typing terms in to find a match. It only took a day or two of work and made the docs dramatically more useful. Money on the screen!

Every project is an investment, if only in time. Don’t you owe it to yourself to maximize your return? Too many times I’ve seen amazing systems that never put their full value front and center. From time to time, take a step back, look at what you’re doing and ask yourself – “How can I put money on the screen?” More often than you’d think, a small change can have a huge pay off.

Code Caving

Lately, we’ve been doing a lot of code caving at The Engine Co. Expert code explorers, like Fabien Sanglard, have an amazing ability to take very complex codebases and understand their core concepts. They can extend them and make them turn all kinds of tricks.

While you shouldn’t miss Fabien’s analysis of Quake 3, you don’t have to be an expert game developer working with major engines to gain huge benefits from a little code literacy. You are far more likely to work within the confines of a big codebase than you are to be working on a blue sky project with no dependencies or baggage.

Being able to dig through layers of code and understand complex existing systems is a huge benefit to your capabilities as a coder. So let me lay out my guidelines for becoming an awesome code caver able to dig through the muddiest of codebases to get things done. Starting with…

Look Deeper Than The Surface

“But wait,” you say, “I’m a JavaScript coder. My language is so simple and elegant! All I have to know is jQuery and I’m set.”

You poor, poor fool. What you aren’t seeing when you declare your very clever closure-based MVC microframework is the 5 million lines of C++ and 1500 man-years that went into the browser that is running it. The JIT that converts your template parser to machine code so it runs fast. The highly tuned GC that lets you kind of forget about memory management. And the vast cross platform rendering codebase that actually lets you draw “Hello World” and funny mustache pictures on screen.

And that browser sits on top of the operating system, which is ten times bigger in lines of code and years of effort. Now hopefully you don’t have to worry about the operating system. But suppose you have a neat little node.js server, and you run a neat little JS powered web service with it, and you get a neat little denial of service attack from some jerk in Eastern Europe. Now all of a sudden those unimportant details like how the operating system handles sockets become vastly important.

Tools of the Trade

How do you cut through a huge codebase like butter? How can you hone in on the insights that make you look like a genius? How can you do all this without breaking a sweat?

grep to grok

The first essential tool is grep. grep looks for a string in a set of files. There are many equivalents to grep (WinGrep, find-in-files in a favorite IDE, find on Windows/DOS), but the key is you can give it a folder or set of files to search and a regular expression to look for, and get back a list of results and the files that contained them. I like to use find-in-files in Sublime Text
for this because I can click search results and jump to them.

Grepping is super powerful. Suppose you get some weird error message – “Floorblop invalid.” – and nothing else from your app or service. You might be left with some questions. What’s a floorblop? Why is it invalid? Well, it might not matter. Just grep the error string and look at the surrounding code to determine a cause. Sometimes errors are generated dynamically so you might have to look for part of the string like ‘ invalid.”‘ if Floorblop was determined at runtime. With a little bit of cleverness, you can almost always reduce the number of potential sites for the error in the codebase to a manageable number – then just inspect the search results manually to find the place triggering the error.

Now you don’t care about the text of the error, but instead can start looking for functions that might be failing and causing the error to be emitted. In effect you can think of the error string as a unique identifier for some situation rather than a human-relevant string. (And in many codebases, error messages are cryptic at best.)

In the course of investigating your error message, you might encounter some variables or functions that seem to be related. Grep comes to the rescue once again: grep for them! Generally symbols are unique in a project, so you can quickly find all the code relating to a variable or function. You can also take advantage of the language’s syntax to help narrow down your results – for instance, in C++ function implementations in a class or namespace often look like this:

void Foo::bar()

So if you want to find the definition of bar, you can grep for “::bar”, or if it’s a global function, use a return type, ie, “void bar” and go right to the definition. Think about fragments that would match lines you want to see – you don’t have to get an exact match, just one that’s close enough you can find what you want quickly.

It should be obvious that these techniques can apply to any language, not just C/C++ – in Ruby you might use “def” in your search strings, in JS you might use “function”, in assembly, colons. The goal is to just filter down the number of results you have to consider, not take them to zero.

Binary Spelunking with strings

It’s easy to think of a binary as an opaque file. So it can be frustrating when you have a codebase with binaries in it. But fear not: DLLs, EXEs, .SOs, and a.out files have a ton of useful information in them. They have to – otherwise the operating system wouldn’t be able to run them!

One of the first tools to pull out is strings. Strings simply scans any binary file looking for bytes that look like they might be a string (displayable ascii characters ending with \0). When it finds a match, it prints it out. You can store the results in a file, or run them through grep or another tool, to help narrow results.

So suppose you have a function for which you cannot find the implementation – but you notice that the codebase includes a binary .so or DLL. You can check if the function is prepackaged in the binary by running strings and grepping for the name. If you get some hits there and nowhere else, you now have a good hint that you might not have all the source code.

Pulling Back the Hood: strace and wireshark

Maybe you’re getting a weird failure when interacting with your app. The first thing to do is establish exactly what’s happening! strace is a super powerful tool on Linux that dumps every system call a program makes. With this, you can see when and what is being read from disk, what network activity is happening, if it’s requesting memory from the system, and so on. All of a sudden a cryptic dynamic linker error will be easy to diagnose – you can see if libraries are being loaded and if symbols are being found or not. Or you can see what OS call is hanging the process. Super useful, if a bit tedious.

On Windows, SysInternals offers similar tools to inspect process activity.

Cross platform Wireshark is a packet sniffer that can really help you understand network activity. Suppose you are getting a cryptic error relating to some HTTP service. You might not be sure if the service is up, if you have a bad header, if the connection is being dropped… Wireshark will let you tell right away if data is making it to the server and what it is exactly, as well as what the server is sending back to you. It will identify things like corrupt TCP packets that are very rare failures but can totally break your app.

Once you’ve established the lowest levels are working properly, it’s easy to move up the stack to HTTP inspection tools like Charles or Telerik Fiddler. These can inspect HTTPS traffic and give you a higher level view of your app’s behavior. Chrome and Firefox also have built in tools that are similar.

Abusing Your Profiler and Debugger

You can also abuse profiling tools like Instruments to find out what an app is doing. On Windows, Windows Performance Analyzer and xperfview are the tools to try first. These tools allow you to attach to a running process on your system and see the callstacks and where time is spent in them.

In other words, they show you HOW the code is running in various situations and which are most important (due to being at top of call stack or being called often). It’s like a self-guided tour through an unknown codebase – how convenient!

You can also attach a debugger and use the time honored trigger of hitting pause a lot to see where time is spent during execution – or stepping through function calls to see what calls what. This is a bit more tedious but also very useful in understanding a large codebase.

Study the Fundamentals

The best way to learn to understand big codebases is to… wait for it… study big codebases. While every project is unique, after a while you start to internalize common layouts and development philosophies. As you learn to recognize patterns and see how bigger systems fit together, you gain building blocks that let you map out unfamiliar systems and build your own systems to higher standards.

A few books I’ve found educational in this area:

  • Operating Systems Design and Implementation – A walkthrough of Minix, a small self contained POSIX environment. This goes through every detail of a “pocket” operating system and explains it in detail. Great introduction for the next three books…
  • Mac OS Internals – A great guide to OS X’s structure with detailed examples and code walkthroughs. Understand the fundamentals of OS X while building valuable code analysis skills.
  • Windows Internals, Part 1 (6th Edition) – The same idea, but for Windows. A completely different take on OS design with a different set of constraints on development. In depth and fascinating.

Conclusions: Fighting Dirty & Getting to the Bottom

At the highest level of mastery, you should be prepared to reason about the behavior of any program from its highest levels of UI down to the behavior of the CPU. Every abstraction is leaky, and even though you want to preserve abstractions to keep development manageable, you shouldn’t let that limit your understanding. Get to the bottom of the stack first, then work your way back.

Don’t be afraid to fight dirty. Do you think something is important? Do you have an assumption about how some code works? Try breaking it to see how it fails. You’d be amazed how many problems can be solved by asking “Is it plugged in?”. Use version control or backups to make sure you can safely make dangerous changes and roll back to a safe state.

Finally – optimize for your sanity and development time. Don’t spend 12 hours when you can spend 1. Think about how you can narrow down your problem and check your assumptions quickly so you can focus on gaining knowledge and making the system do what you need. There are a universe of tools to make it possible to wrangle complex systems. Use them!

Some Notes on Mac Configuration

I was recently advising a fellow TEC-er on setting up their Mac, and they suggested I write it up. These are the tools I use daily for development! Here we go…

  1. Alfred is an awesome tool. Like the Windows key on Win8, you can hit a keyboard shortcut (cmd-space by default) and type an app name or other command. Super handy.
  2. iTerm2 is a replacement for the terminal. The best thing about OS X is that it has a great UNIX command line alongside a nice GUI, and iTerm2 makes the terminal even better. Don’t forget to configure and use the global terminal hotkey. So handy!
  3. Spectacles lets you control your windows with the keyboard. So handy, especially on my Air with limited screenspace.
  4. Sublime Text 3 is my go to text editor on every platform, and especially Mac. Favorite features include cmd-d (to select multiple matching text ranges and edit them simultaneously) and using the subl command line tool to open folders/files from the terminal.
  5. Oh My ZSH! is a great upgrade for your terminal. Maintaining your own custom config is even better… but as a default setup it’s pretty good.
  6. RVM is a good way to get assorted Rubies install on your system. Ruby is handy but it suffers from versioning hell. RVM can help, sometimes.
  7. The GitHub app is also mega handy. No replacement for command line but it simplifies auth and basic commit/branch switching.

There are also some miscellaneous things you probably need to do: get the latest git, possibly install brew and ports (but I find more and more I prefer to compile and install from source), get all the latest updates from the App Store, and install XCode 5 (from the App Store these days). You might need CMake for Loom builds, along with Android SDK and Android NDK. We like to use the HipChat app for communication amongst our team.

As you can see, I really like a keyboard oriented workflow. Perhaps it is the inevitable outcome of programming for so long. I’ve trended more and more towards it. I haven’t quite gotten to the point of using tmux and emacs in a fullscreen terminal, but who knows…

%d bloggers like this: