Last post, we got our networking stack up and running, figured out how to work around firewalls, and saw how our codec performs over a real network link. This motivated us to revisit our compression schemes, which is what we’ll do in today’s post.
We’ll start with some algorithmic improvements to reduce the amount of data we need, then implement an entropy coding technique to reduce our data rate even further.
YUV & Chroma Subsampling
Up to now, we have skipped a key part of JPEG – conversion to YUV and chroma subsampling. DCT works on any kind of data – RGB, YUV, monochrome, etc. The DCT we have been using is for 8×8 blocks. So up to now we have been sending 4 sets of 3 blocks – one each for the red, green, and blue channels in each of the four 8×8 regions in a macroblock. This is a lot of data!
What JPEG and a lot of other formats do is take advantage of limitations in the human visual system to save bandwidth. Specifically, people are much better at detecting changes in brightness than they are in color. So you can send color data at half resolution, while you keep brightness at full resolution. To do this, you must first separate brightness from color, which is accomplished by converting to the YUV color space.
Doing this saves about 50% bandwidth with a very minimal reduction in quality (6 blocks instead of 12 blocks). It brings us down to 4 bits per pixel.
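As a rough illustration of the conversion and subsampling step, here is a minimal Python sketch. The JFIF-style BT.601 coefficients, the 2×2 averaging filter, and all function names are assumptions made for the example; the post does not say which matrix or filter the codec actually uses.

```python
# Sketch only: the color matrix and the averaging filter are assumptions.

def rgb_to_yuv(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, U, V), each roughly in 0..255."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    v = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, u, v

def subsample_2x2(plane):
    """Halve a plane in both directions by averaging each 2x2 group."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def macroblock_to_planes(rgb):
    """rgb: 16x16 grid of (r, g, b). Returns Y (16x16), U (8x8), V (8x8)."""
    yuv = [[rgb_to_yuv(*px) for px in row] for row in rgb]
    y_plane = [[p[0] for p in row] for row in yuv]
    u_plane = subsample_2x2([[p[1] for p in row] for row in yuv])
    v_plane = subsample_2x2([[p[2] for p in row] for row in yuv])
    return y_plane, u_plane, v_plane

def count_8x8_blocks(planes):
    """Number of 8x8 DCT blocks needed to cover the given planes."""
    return sum((len(p) // 8) * (len(p[0]) // 8) for p in planes)
```

For a 16×16 macroblock this yields four Y blocks plus one U and one V block, which is where the 6 blocks instead of 12 comes from.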
Right now the shortest block we can send is around 18 bytes (8 coefficients at 2 bytes each plus 2 more bytes to encode zeroes for the remaining 56 coefficients). So a macroblock of 6 blocks will be at least 108 bytes plus overhead. Spread over the 256 pixels of a 16×16 macroblock, that is 864 bits, or around 3.4 bits per pixel for even a very simple macroblock.
However, during high motion frames, we need to prioritize quantity over quality. Sending fully detailed macroblocks is costly in CPU and bandwidth for pixels that are blurry and rapidly changing. Further, macroblocks that aren’t updated are very obvious in this sort of situation – so we want to prefer a full frame of updates rather than part of a frame.
To this end, we create a new macroblock encoding type called “flat.” We add a single bit at the start of the macroblock indicating if it’s DCT or flat. If flat, then the format is as follows:
6 bits - red color quantized by 4
6 bits - green color quantized by 4
6 bits - blue color quantized by 4
We select this type when the RMS error of the flat color is less than some threshold. The threshold is based on the current total error of all macroblocks, so when we have lots of motion causing lots of error, we are more likely to encode flat macroblocks. Even a very low bandwidth link is capable of sending full frames where every macroblock is flat.
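A minimal Python sketch of the flat path might look like the following. The flag value (1 = flat), the channel order within the packed bits, and every function name are assumptions for illustration, and how the threshold is derived from the total error is left to the caller.

```python
import math

def flat_color(pixels):
    """Average (r, g, b) over a macroblock given as a flat list of pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) // n for c in range(3))

def rms_error(pixels, color):
    """RMS difference between the pixels and a single flat color."""
    total = sum((p[c] - color[c]) ** 2 for p in pixels for c in range(3))
    return math.sqrt(total / (3 * len(pixels)))

def choose_flat(pixels, threshold):
    """True when the flat approximation is good enough for this threshold."""
    return rms_error(pixels, flat_color(pixels)) < threshold

def pack_flat(color):
    """1 flag bit (assumed: 1 = flat) followed by three 6-bit channels."""
    bits = 1
    for channel in color:
        bits = (bits << 6) | (channel // 4)  # quantize 0..255 down to 0..63
    return bits  # 19 significant bits

def unpack_flat(bits):
    """Inverse of pack_flat; returns the dequantized (r, g, b)."""
    b = (bits & 0x3F) * 4
    g = ((bits >> 6) & 0x3F) * 4
    r = ((bits >> 12) & 0x3F) * 4
    return r, g, b
```

Quantizing by 4 loses the low two bits of each channel, so a value of 50 comes back as 48 after a round trip.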
Flat mode has a fixed cost of 19 bits per macroblock (1 flag bit plus 18 color bits), or about 0.07 bits per pixel, which is very low. It improves the responsiveness of video feeds quite a bit. There is a whole family of techniques along these lines used by H.264 and other codecs (32×32 or larger macroblocks, special gradient blocks, etc.) that can make a big difference in high motion or scenes with out of focus regions. These techniques do tend to make high motion frames that can’t fit in available bandwidth “blur out”, but this is a lot less offensive than partial updates.
Entropy Coding with RANS
We’ve gotten pretty far – down to 4 bits per pixel on an average feed. But there’s one important piece of the JPEG approach we’ve neglected up till now: entropy coding. Right now, all coefficients are sent as 2 byte signed shorts. We’re doing a lot of work to make them close to zero, but it only reduces size if we get adjacent coefficients to zero and RLE gives us a win.
JPEG uses a Huffman coder, which uses fewer bits for numbers closer to zero. Common +1 and -1 coefficients get sent in only a couple of bits, while rare, larger coefficients such as 1000 cost more bits. The net result is a significant reduction in size. JPEG also stores some coefficients relative to previous coefficients to help reduce the average magnitude and thus the bit count.
Because I’ve worked with Huffman codes before, I decided to try something new to me – arithmetic coders. I’ve read about them but never used one directly. I had been wanting to try the RANS coder from @rygorous (found at https://github.com/rygorous/ryg_rans), so I started implementing one based on his code.
Getting it fully integrated and working was a beast! I lost a couple of days to getting the RANS coder up and running with no issues. A lot of this was my learning curve, as the algorithm is subtle. Because the operation of an arithmetic coder is fairly opaque, I had to take a black box approach most of the time – I could tell via sentinels if I got right or wrong values, but it was difficult to determine the cause when I got a wrong value out of the decoder. Eventually, I ended up implementing a set of unit tests (using catch.hpp) which helped me constantly check each step of the encoding and decoding process so I could quickly spot any issues when I made a change.
Arithmetic coding has two parts – the actual coder (which is clever and small) and the context. The context models what values are likely at each point in the stream, and the arithmetic coder ensures that likely values are encoded with fewer bits than unlikely values. Good contexts make a huge difference in the effectiveness of compression.
In preparation for all of the following techniques, I captured eleven megabytes of RLE encoded DCT data at various quantizations to use as a training set. For convenience I regenerate my contexts at application launch, but in the real world we would use a much larger training set and only ship the final statistics. This data could easily be included in a header file, and would add just kilobytes to the executable footprint.
Order 0 Context
Contexts are classified by how many symbols back they consider for their prediction. An order 0 context looks only at the current symbol state to determine likely and unlikely symbols. Typically this boils down to a single static list of symbol frequencies.
Order 0 context was easy to implement (Ryg conveniently provides such an implementation in his ryg_rans project!) and I found it saved us about 1.9 bits per pixel.
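To make "a single static list of symbol frequencies" concrete, here is a small Python sketch that builds one from training bytes. It is not the ryg_rans code: the 12-bit precision, the rounding fix-up, and the function names are all assumptions chosen for the example.

```python
import math
from collections import Counter

PROB_BITS = 12                 # assumed precision: frequencies sum to 4096
PROB_SCALE = 1 << PROB_BITS

def build_order0(training_bytes):
    """Scale raw byte counts so they sum to exactly PROB_SCALE."""
    counts = Counter(training_bytes)
    total = sum(counts.values())
    # Every symbol seen in training keeps a frequency of at least 1.
    freqs = {s: max(1, c * PROB_SCALE // total) for s, c in counts.items()}
    # Repair rounding drift one unit at a time on the largest entry.
    drift = PROB_SCALE - sum(freqs.values())
    while drift != 0:
        step = 1 if drift > 0 else -1
        top = max(freqs, key=freqs.get)
        freqs[top] += step
        drift -= step
    return freqs

def cumulative(freqs):
    """Map each symbol to (start, freq) within 0..PROB_SCALE."""
    start, table = 0, {}
    for s in sorted(freqs):
        table[s] = (start, freqs[s])
        start += freqs[s]
    return table

def ideal_bits(freqs, symbol):
    """Approximate cost in bits of coding one symbol under this model."""
    return -math.log2(freqs[symbol] / PROB_SCALE)
```

With a table like this, a byte that makes up 90% of the training data costs a fraction of a bit, while a byte seen once in a hundred costs several bits.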
Order 1 Context
I next tried to implement an order 1 context that looks at the previous symbol and the current symbol to determine its odds. However, I ran into huge issues – I just could not round trip data properly. The reason why is interesting.
RANS-style arithmetic coding is a deterministic, symmetric process. Think of it like braiding a rope. To unbraid the rope, you do the exact same things but in reverse. Similarly, you run RANS backwards to generate the encoded data, and forward to get the decoded data.
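The last-in, first-out behaviour is easy to see in a toy version. The Python sketch below keeps the whole state in one arbitrarily large integer, which sidesteps the renormalization (emitting bytes as the state grows) that a practical coder such as ryg_rans needs; the function names and the table layout of symbol to (start, freq) are assumptions for the example.

```python
def rans_push(x, symbol, table, total):
    """Fold one symbol into the state x. table maps symbol -> (start, freq)."""
    start, freq = table[symbol]
    return (x // freq) * total + start + (x % freq)

def rans_pop(x, table, total):
    """Undo the most recent push. Returns (previous state, symbol)."""
    slot = x % total
    for symbol, (start, freq) in table.items():
        if start <= slot < start + freq:
            return freq * (x // total) + slot - start, symbol
    raise ValueError("slot not covered by table")

def encode_in_order(symbols, table, total, x=1):
    """Push symbols in the order given."""
    for s in symbols:
        x = rans_push(x, s, table, total)
    return x

def encode(symbols, table, total):
    """Push in reverse so the decoder pops symbols in their original order."""
    return encode_in_order(list(reversed(symbols)), table, total)

def decode(x, count, table, total):
    out = []
    for _ in range(count):
        x, s = rans_pop(x, table, total)
        out.append(s)
    return out
```

Pushing a message front to back and then decoding returns it back to front, which is why the encoder has to walk its input in reverse.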
When you are working with an order 0 context, this isn’t a big deal. You get the same statistics out whether you submit symbols forwards or backwards. However, more complex contexts aren’t symmetric (i.e., the odds of A after B are different than B after A). So no matter what I tried I could not get my order 1 context to work!
Subsequently, I got some advice from the excellent and wise Fabian Giesen, who pointed out that I should buffer symbols for encoding so I can process them forwards and write them backwards. This unblocked progress, but unfortunately I ran out of time to implement the order 1 context – I had already moved on to something else.
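Since I never finished this, the following is only a sketch of how I understand the buffering idea, not code from the project. It uses the same big-integer toy coder idea (no renormalization), trains the model on the message itself, and uses `None` as the context of the first symbol; all of those choices and all names are assumptions.

```python
from collections import defaultdict

START = None  # assumed context for the first symbol in a stream

def build_order1(training):
    """One frequency table per previous symbol: model[ctx] = (table, total)."""
    counts = defaultdict(lambda: defaultdict(int))
    prev = START
    for s in training:
        counts[prev][s] += 1
        prev = s
    model = {}
    for ctx, row in counts.items():
        start, table = 0, {}
        for s in sorted(row):
            table[s] = (start, row[s])
            start += row[s]
        model[ctx] = (table, start)  # start now equals the context total
    return model

def encode(symbols, model):
    # Pass 1 (forward): pair each symbol with the context the decoder will see.
    buffered, prev = [], START
    for s in symbols:
        buffered.append((prev, s))
        prev = s
    # Pass 2 (backward): push the buffered pairs onto the state in reverse.
    x = 1
    for ctx, s in reversed(buffered):
        table, total = model[ctx]
        start, freq = table[s]
        x = (x // freq) * total + start + (x % freq)
    return x

def decode(x, count, model):
    out, prev = [], START
    for _ in range(count):
        table, total = model[prev]
        slot = x % total
        for s, (start, freq) in table.items():
            if start <= slot < start + freq:
                break
        x = freq * (x // total) + slot - start
        out.append(s)
        prev = s
    return out
```

The key point is that contexts are computed in the forward pass, where "previous symbol" means the same thing for encoder and decoder, and only the pushes happen in reverse.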
Dual Order 0 Context
Since the order 1 context was initially a bust, I tried an alternate approach. I wanted to take advantage of the structured nature of the RLE encoded data, which is short aligned. So I built two order 0 contexts, one for the first byte of all the shorts and the other for the second byte. This was simple and worked pretty well – and it was symmetric, so it bypassed the RANS order concerns. This took me down another 0.3 bits per pixel.
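Here is a small Python sketch of the split, along with a Shannon entropy estimate for comparing one shared table against two per-position tables. The interleaved byte layout and the function names are assumptions, and the estimate ignores coder overhead.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Total Shannon cost in bits of a stream with these symbol counts."""
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def split_shorts(data):
    """Split short-aligned data into (first bytes, second bytes)."""
    return data[0::2], data[1::2]

def compare_contexts(data):
    """Estimated bits for one order 0 table versus one table per position."""
    first, second = split_shorts(data)
    single = entropy_bits(Counter(data))
    dual = entropy_bits(Counter(first)) + entropy_bits(Counter(second))
    return single, dual
```

The dual estimate can never be worse than the single one, because each table only has to describe the bytes that actually occur at its position.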
JPEG only has quantization levels from 1 to 22. However, we can tolerate much worse quality in video than we can in a still image – motion hides artifacts. We can extend the quantization range, lose some quality, and drop up to another 0.9 bits per pixel.
Block State Format (v3)
If you recall how we encoded our blocks in our packets, you might have thought it wasn’t as efficient as possible. Specifically, a few things cost us in overhead. We re-send the compression type and level for every macroblock, when we don’t change them very often and could probably send them once per packet. This gains us 0.03 bits per pixel.
We also have a flag to indicate if there is another macroblock coming. This is cheap for small numbers of macroblocks, but now we have good compression so we can send a lot in a single packet. If we have 100 macroblocks, we burned 100 bits. It would be a lot cheaper to use 8 or 9 bits to encode the macroblock count. This gains us 0.003 bits per pixel. Not a huge win but it dovetails with the next change well.
Ryg kindly pointed out this win to me via Twitter. Every time you flush RANS, you have to write out 4 bytes. When we flush after every block, then we burn ~32 bits per block. If we modify our packet to have one continuous RANS stream for all the macroblocks, we only have to flush once for perhaps a hundred macroblocks. This gains us around 0.12 bits per pixel.
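The per-pixel savings from the count field and the single flush can be checked with a few lines of Python arithmetic. This sketch assumes 16×16 macroblocks (256 pixels), about a hundred macroblocks per packet, and an 8-bit count; the function names are made up for the example.

```python
PIXELS_PER_MACROBLOCK = 16 * 16

def count_field_savings(macroblocks, count_bits=8):
    """Bits per pixel saved by one count field instead of a flag per block."""
    saved_bits = macroblocks * 1 - count_bits
    return saved_bits / (macroblocks * PIXELS_PER_MACROBLOCK)

def single_flush_savings(macroblocks, flush_bits=32):
    """Bits per pixel saved by flushing once per packet instead of per block."""
    saved_bits = flush_bits * (macroblocks - 1)
    return saved_bits / (macroblocks * PIXELS_PER_MACROBLOCK)

print(round(count_field_savings(100), 4))
print(round(single_flush_savings(100), 4))
```

For 100 macroblocks these come out to roughly 0.0036 and 0.124 bits per pixel, in line with the figures above.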
One cool thing about RANS is you can insert uncompressed bits directly without any additional overhead. (It’s symmetrical and deterministic, remember? So you just stuff bits in/out and if you are consistent it works itself out.) We use this bypass mode to handle special cases like the flat encoding mode data and the flag that determines if we’re a flat or a DCT block. This allows us to put all our macroblock data through RANS with no hassle.
We could arithmetically code this data based on the likelihood of flat vs. non-flat blocks, or based on common RGB values, but since flat mode is already so much cheaper than DCT, I skipped it. So, no change in bits per pixel, but it keeps our implementation straightforward.
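In a toy big-integer coder, bypass bits are just symbols with frequency 1 out of a total of 2^n, which reduces to shifts and masks. The Python sketch below illustrates that with a flat block; the flag convention (1 = flat), the field order, and the function names are assumptions, and a real coder would also renormalize.

```python
def push_raw(x, value, nbits):
    """Append nbits of uncompressed data to the coder state."""
    assert 0 <= value < (1 << nbits)
    return (x << nbits) | value

def pop_raw(x, nbits):
    """Remove the most recently pushed nbits. Returns (state, value)."""
    return x >> nbits, x & ((1 << nbits) - 1)

def encode_flat_block(x, rgb6):
    """Push a flat block; fields go in reverse of the order they are read."""
    for channel in reversed(rgb6):
        x = push_raw(x, channel, 6)
    return push_raw(x, 1, 1)  # flag bit, assumed 1 = flat

def decode_block(x):
    """Pop one block. Only the flat case is handled in this sketch."""
    x, is_flat = pop_raw(x, 1)
    if not is_flat:
        raise NotImplementedError("DCT blocks are out of scope for this sketch")
    rgb6 = []
    for _ in range(3):
        x, channel = pop_raw(x, 6)
        rgb6.append(channel)
    return x, tuple(rgb6)
```

Because these pushes and pops use the same state as the modeled symbols, flat and DCT blocks can be interleaved freely as long as both sides agree on the order.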
Now the packet looks like this:
4 bits - mode (raw, zip, lzo, dct) for packet
6 bits - quality level (if dct)
8 bits - block count
[RANS bytes follow to end of packet]
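As a sketch of how that header could be packed and unpacked, here is a Python version that works on a plain integer. The numeric mode values, the choice to store the block count directly (rather than count minus one), and the function names are assumptions.

```python
MODES = {"raw": 0, "zip": 1, "lzo": 2, "dct": 3}  # numeric values are assumed
MODE_NAMES = {v: k for k, v in MODES.items()}

def pack_header(mode, block_count, quality=0):
    """Return (bits, nbits): 4-bit mode, 6-bit quality if dct, 8-bit count."""
    assert 0 <= block_count < 256 and 0 <= quality < 64
    bits, nbits = MODES[mode], 4
    if mode == "dct":
        bits, nbits = (bits << 6) | quality, nbits + 6
    bits, nbits = (bits << 8) | block_count, nbits + 8
    return bits, nbits

def unpack_header(bits, nbits):
    """Inverse of pack_header. Returns (mode, quality or None, block_count)."""
    mode = MODE_NAMES[(bits >> (nbits - 4)) & 0xF]
    quality = (bits >> 8) & 0x3F if mode == "dct" else None
    return mode, quality, bits & 0xFF
```

A DCT packet header takes 18 bits and the other modes take 12, paid once per packet instead of once per macroblock.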
There is one gotcha. Because RANS pops out bytes at irregular intervals, we can only know if our packet is full by compressing all our pending data after each block is added, and checking to see if we ran out of space. Currently, I do this after every block and as a result I spend about two thirds of my encoding time re-compressing the same data over and over. Because RANS operates in reverse, I can’t easily “checkpoint” my encoding state and only do work for the new content – I have to process the whole packet’s worth of data every time.
I believe this could be substantially improved by heuristically checking more or less often based on the available space. So I might check only once every 10 blocks at first, then more and more regularly as I get towards the end of the packet. This would give a substantial reduction in CPU load.
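I have not implemented this, but the idea could be sketched roughly as follows in Python. Summing a list of block sizes stands in for re-running the coder over the whole packet, and the interval formula, the halving back-off, and all names are assumptions.

```python
def check_interval(bytes_free, est_bytes_per_block, max_interval=10):
    """How many blocks to add before the next full re-compression."""
    blocks_left = bytes_free // max(1, est_bytes_per_block)
    return max(1, min(max_interval, blocks_left // 2))

def fill_packet(block_sizes, capacity, est_bytes_per_block):
    """Returns (blocks accepted, number of full re-compressions)."""
    def compressed_size(n):
        # Stand-in for compressing the first n pending blocks from scratch.
        return sum(block_sizes[:n])

    accepted = checks = 0
    step = None
    while accepted < len(block_sizes):
        if step is None:
            free = capacity - compressed_size(accepted)
            step = check_interval(free, est_bytes_per_block)
        candidate = min(len(block_sizes), accepted + step)
        checks += 1
        if compressed_size(candidate) <= capacity:
            accepted, step = candidate, None
        elif step == 1:
            break                      # the very next block does not fit
        else:
            step = max(1, step // 2)   # overshot: retry with a smaller batch
    return accepted, checks
```

The loop only stops when adding a single block overflows, so it still fills the packet as fully as checking after every block would, just with far fewer re-compressions while there is plenty of room.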
A Word on Performance
(Notes for self: I tested commit c8e59a2, on a Core i7-2600k 3.4 GHz quad core. Program is single threaded.)
Overall, the system has no problem maintaining 60 Hz performance in release build. Deblocking or very high bandwidth usage can cause it to miss this target, but there are clear paths for optimization. Playback is surprisingly lightweight.
Memory footprint is around 75MB and does not vary substantially. Higher latency connections will cause slightly higher usage, and a lower latency connection slightly less (as we store per-packet information until the packet is received or dropped).
| % of Frame Time | Description |
| 3% | Update error metric |
Overall, we see about 40% of frame time spent encoding, and 5% on decoding.
The system is designed to be easy to parallelize. Specifically, since frames are often sent with multiple packets, we would recommend divvying up pending blocks and encoding packets in worker threads. Decoding could also be done with workers for a good speedup. On a fast desktop CPU this is less of an issue, but when trying to achieve low latency on embedded hardware, it could make a big difference.
We can also move to SIMD versions of hot path code (RANS, DCT, and color conversion are all eligible). Our bitblt is also naive, and contributes to about half of the GUI rendering time. A SIMDized blitter would run much faster.
We estimate an optimized decoder implementation on the same hardware would take around 1% frametime, and an optimized encoder implementation would take around 5% frametime. This puts screensharing of 1080p on a single core well within reach – we would expect to see 7% frametime decoding and 40% frametime encoding.
Writing my own video chat app and video codec was a great experience. Much like writing a toy compiler, language, or operating system, it is a learning experience more than a practical application. It gave me a deeper appreciation for the excellent video codecs available on the market today, and a better understanding of the issues facing things like Skype and WebRTC.
Ultimately, for someone often frustrated by video chat applications, it was cathartic to finally take a swing at the problem. And now I have something to tinker with when Skype breaks on me.
Final Statistics
Avg bits per pixel: 0.6 bits for low quality, 1.0 bits for good quality
Max usable sustained packet loss: 25%
Max survivable sustained packet loss: 75%
Thanks for reading. My team and I solve hard problems in the streaming video, AR/VR, scalable backend, gaming, and IoT spaces. If we can help you, contact us at ben AT theengine DOT co.