Video Conference Part 2: Joint Photographic Hell (For Beginners)

Last time, I got angry at Skype and metaphorically flipped the table by starting to write my own video conference application. We got a skeleton of a chat app and a video codec up, only to find it had a ludicrous 221 megabit bandwidth requirement. We added some simple compression but didn’t get much return from our efforts.

We definitely need better compression before we implement networking. There’s no point trying to stream hundreds of megabits over a real world link. We need to slim things way down.

It turns out that most video codecs use some variant of the lossy compression techniques from the JPEG format. I’ve always wanted to learn the details of JPEG. With the excellent mango library from my friend Jukka Liimatta as a reference, I started down the well trodden road of abusing JPEG for my own purposes.


The biggest building block of JPEG is the Discrete Cosine Transform. The DCT is related to the Fourier Transform. It takes a signal (in this case, a 8 by 8 array of pixels) and breaks it down into frequency components. Later, we can take the frequency components and run them through the Inverse Discrete Cosine Transform to get our original signal back.

Lower frequencies live in the top left of the output matrix, while higher frequencies live in the bottom right. Larger numbers indicate more of that frequency, smaller numbers less. The following image shows the frequencies that each of the 64 coefficients you get from a DCT transform encode. By combining them in varying amounts, you can encode any 8×8 pixel image.

Courtesy of many JPEG explanation sites on the internet.

(A word on blocks and macroblocks. You will notice that this is an 8×8 DCT but we are working with 16×16 macroblocks. Macroblocks are so named because they are made up of a combination of smaller blocks. At this point in the game, we are breaking our 16×16 RGB pixel macroblock into 12 8×8 DCT blocks. However, we will see later how we can do much better than that. Thanks for reading, now back to your regularly scheduled programming.)

Lossy compression is all about removing data that doesn’t contribute much to the final output. By working in frequency space, we can start dropping components that are less important. For instance, imagine a macroblock with a smooth gradient. If we are working with RGB values directly, it might be difficult to efficiently detect or encode the gradient. We would be easily confused by noise in the image. By running it through a DCT, we’d immediately see that the largest components were low frequency elements, and the smallest high frequency. It is much simpler to drop the smallest components in a grid!

JPEG is built around making the best use of DCT. One of several tricks it uses is zigzag tables to reorder the DCT components for better encoding efficiency. In adapting the JPEG code from Mango, one issue that got me was that the encoding and decoding order aren’t symmetric. That is, if you zigzagged an 8×8 matrix with the encoding table, then run it through the corresponding decoding table, you get a transposed version of the matrix. This manifested as blown out colors in the output image, and I’ll discuss why in the next section…


The biggest trick JPEG uses for better encoding efficiency is quantization. This is where things get lossy – DCT, like FFT, is fully reversible if you preserve the full precision of its output. By rounding the values from the DCT – also known as quantizing them – you can control how much data is lost. Frequencies that don’t contribute much tend to have small components in the DCT, so they rapidly become zero as they are quantized by larger amounts. Frequencies that matter a great deal have larger coefficients and will not be removed or will be removed very late in the process.

The following image shows similar video encoded with varying quantization level, going from high quantization (which looks blocky) to low quantization (which looks smooth and detailed).


To quantize, JPEG divides each DCT component by a rounding factor derived from the quality level setting. In fact, it generates a whole quantization matrix – one quantization factor for each of the 64 values of DCT output – that biases towards preserving component values or low frequencies. Each value is rounded a little bit differently.

This is how I was getting blown out colors (as mentioned in the previous section). Imagine a partial DCT output matrix consisting of frequency coefficients [40, 30, 20, 10]. We generate a (partial) quantization matrix [1, 2, 3, 4]. We divide and get quantized values of [40, 15, 7, 3] which are then written out.

Originally I was not taking the zigzag’s transposition into account, so I was scaling up by a rotated version of the quantization matrix. This meant that some values were being scaled up with a different factor than they were scaled down. If we represent rotation of the full quantization matrix by scrambling our small demo matrix, it would be like dequantizing by [1, 4, 3, 2]. We would end up with [40, 60, 21, 6] at the end. In compression speak, we failed to preserve the overall “energy” of the block – we ended up with bigger coefficient factors than we started, causing the output to be brighter! The effects vary depending on specifically how you get it wrong – here’s a shot where the luminance quantization coefficients were wrong:


Great – so now we’ve gotten the coefficients for unimportant parts of the DCT to be smaller, and in many cases zero. We can adjust how much we lose by varying a quality level factor. But we didn’t actually reduce the amount of data we transfer at all! Wait what?!


The reason why JPEG tries to get everything closer to zero is simple. Zeroes are cheap to send!

Why? It’s not just because of the convenient hole in the middle of a 0. The next step in JPEG is to collapse all those zeroes with run length encoding. We walk through the array, and replace adjacent zeroes with a special value indicating how many zeroes there were. Suppose our quantization makes the last 32 coefficients zero. We would go from 128 bytes (JPEG DCT output is one signed short per coefficient) to 66 bytes (32 populated coefficients plus one special value indicating 32 zeroes). That’s 50% savings with a pretty simple compression algorithm!

It sounds super simple, and once written it is, but getting RLE encoding to work reliably took several important debugging techniques. In addition to the old fashioned “stare at it in the debugger” technique, I added sentinels – known values at specific locations in the data stream – to help catch any misalignments. If I didn’t see sentinels exactly when expected, the code would instantly assert. I also put some asserts into the RLE decoding routines to check that we always got 64 values out and that a run would not cause us to over flow the buffer (a common sign of bad data would be runs longer than the remaining space in the block).

Because any bugs in later stages of processing would corrupt the RLE data, these asserts were usually my first sign that I had gotten something wrong in later parts of the compression pipeline.

Block State Format (v1)

At this point, although the app isn’t network capable, it starts making sense to describe the format for macroblock updates. Again, keeping things simple, we have the following structure:

 1 bit - is there another block update? if false, stop processing.
 4 bits - compression type (raw, zip, lzo, dct).
 6 bits - only present for zip/dct, indicating quality level.
 6 bits - macroblock x index
 6 bits - macroblock y index
 10 bits - size of compressed data
 size * 8 bits - output from the compression algorithm

Note that the macroblock x and y index are counting macroblocks, not pixels. So in this version of the protocol we can have images up to 1024 16px macroblocks or 16,384px on a side. We could shave a few bits if needed by taking into account the actual image size.

As you can see, this incurs a fixed overhead of 33 bits per macroblock, or 0.51 bits per pixel.

We see an average data rate of 8 bits per pixel with the subset of JPEG implemented above at average quality (ie, quality level 50). Lowest quality setting ran at 6 bits per pixel, and highest quality setting was 20 bits per pixel.

Now we’re down to only 40-60 megabits per second. It’s progress! Our video codec is starting to be feasible over very fast broadband connections, but we can do a lot better.

Next Time

Now that we can transfer little enough data to run the codec over a network, we should implement some networking and test it for real. And that’s exactly what we’ll do in Part 3!

Author: Ben Garney

See and @bengarney for details.

4 thoughts on “Video Conference Part 2: Joint Photographic Hell (For Beginners)”

  1. “breaking our 16×16 RGB pixel macroblock into 12 8×8 DCT blocks” that doesn’t check out

    1. Hmm. 16px square macroblock divided into 8px square blocks equals 4 blocks. RGB means 3 channels. So you end up with 3 channels of 4 blocks each equals 12 DCT blocks? If not perhaps I have found an amazing new way to do lossy compression. 😛

Comments are closed.

%d bloggers like this: