    This commit reworks the SSSE3 implementation of the forward 8x8
    2D-DCT to use a cyclic rotation scheme for the temporary xmm
    registers. This reduces the average cycle count from 158 to 154.
    For comparison, the SSE2 version takes 169 cycles.
