1. 11 Jul, 2013 - 1 commit
  2. 10 Jul, 2013 - 1 commit
    • Jingning Han's avatar
      SSE2 16x16 ADST/DCT hybrid transform · 11442353
      Jingning Han authored
      This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
      operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
      no compression performance loss.
      
      Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
      11442353
  3. 03 Jul, 2013 - 1 commit
    • Jingning Han's avatar
      Refactor SSE2 8x8 functional units · 2cb75c96
      Jingning Han authored
      These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
      hybrid transform coding.
      
      Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
      2cb75c96
  4. 02 Jul, 2013 - 1 commit
    • Ronald S. Bultje's avatar
      Use pmovmskb to skip quantize loops over empty coefficients. · e5fb4b61
      Ronald S. Bultje authored
      If none of the 16 coefficients that we quantize per loop iteration
      are larger than the zbin, directly skip to the next round of coeffs,
      rather than doing a full quantize loop that will eventually result
      in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
      32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
      no significantly positive results for smaller transforms, so is not
      used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).
      
      Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
      e5fb4b61
  5. 01 Jul, 2013 - 2 commits
    • Ronald S. Bultje's avatar
      Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd
      Ronald S. Bultje authored
      Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
      2min10.1, i.e. a 2.3% overall speed increase.
      
      Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
      c8defcfd
    • Ronald S. Bultje's avatar
      Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only, it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  6. 29 Jun, 2013 - 2 commits
  7. 28 Jun, 2013 - 2 commits
    • Jingning Han's avatar
      Fix switch statement in 8x8 transform · 9def7f72
      Jingning Han authored
      Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
      9def7f72
    • Ronald S. Bultje's avatar
      Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
      af660715
  8. 26 Jun, 2013 - 1 commit
    • Yaowu Xu's avatar
      fixed a compiling problem with MSVC win32 build · 60dc7375
      Yaowu Xu authored
      The aligned array in parameter list caused win32 build to report
      c2719 error. This commit fixed the issue by make the parameter
      type a pointer instead of an array.
      
      Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
      60dc7375
  9. 25 Jun, 2013 - 4 commits
    • Ronald S. Bultje's avatar
      Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223
      Ronald S. Bultje authored
      Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
      i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
      the variance of the averaging predictor. This is slightly suboptimal
      because the function is subpixel-position-aware, but it will (at least
      for the SSE2 version) not actually use a bilinear filter for a full-pixel
      position, thus leading to approximately the same performance compared to
      if we implemented an actual average-aware full-pixel variance function.
      That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
      leading to a total gain of 2.7%.
      
      Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
      c24d9223
    • Jingning Han's avatar
      Tune the rounding operations in 8x8 ADST/DCT sse2 · 0084e61d
      Jingning Han authored
      Improve the round-trip precision to meet the unit test setttings.
      
      Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
      0084e61d
    • Jingning Han's avatar
      Use aligned buffer operations in 8x8/16x16 2D-DCT · 82d504b5
      Jingning Han authored
      This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.
      
      Change-Id: I137758b81cd127b936175284310e81378db64552
      82d504b5
    • Jingning Han's avatar
      Enable sse2 implmentation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit makes use of the butterfly structure to enable the sse2
      version implementation of 8x8 ADST/DCT hybrid transform coding.
      
      The runtime of hybrid transform module goes down from 1170 cycles
      to 245 cycles. Overall speed-up around 1.5%.
      
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
      a32a086d
  10. 21 Jun, 2013 - 4 commits
    • Ronald S. Bultje's avatar
      Remove emms - that shouldn't be there. · fc033b38
      Ronald S. Bultje authored
      Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
      fc033b38
    • Ronald S. Bultje's avatar
      Add missing SECTION .text marker in assembly file. · ba42c026
      Ronald S. Bultje authored
      Fixes a crash on Windows when building with MSVC.
      
      Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
      ba42c026
    • Ronald S. Bultje's avatar
      Implement SSE2 block_error. · 54b2a596
      Ronald S. Bultje authored
      Change vp9_block_error() to return a 64bit error variable, change all
      callers to expect a 64bit return value (this will prevent overflows,
      which we basically don't check for at all right now). Remove duplicate
      block_error() function, which fixed that through truncation. Remove
      old (incompatible) mmx/sse2 block_error SIMD versions and replace with
      a new one that returns a 64bit value.
      
      Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
      3min23, i.e. a 3% overall speedup.
      
      Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
      54b2a596
    • Ronald S. Bultje's avatar
      Add subtract_block SSE2 version and unit test. · 25c588b1
      Ronald S. Bultje authored
      3% faster overall (3min35.0 to 3min28.5).
      
      Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
      25c588b1
  11. 20 Jun, 2013 - 2 commits
    • Ronald S. Bultje's avatar
      SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1
      Ronald S. Bultje authored
      Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
      3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
      which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
      perfectly interleaved, and can probably be improved further in the
      future. I've marked this with a few TODOs/FIXMEs in the code.
      
      Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
      1e6a32f1
    • Ronald S. Bultje's avatar
      Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581
      Ronald S. Bultje authored
      Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
      3min58). Specific changes to timings for each function compared to
      original assembly-optimized versions (or just new version timings if
      no previous assembly-optimized version was available):
      
      sse2   4x4:    99 ->   82 cycles
      sse2   4x8:           128 cycles
      sse2   8x4:           121 cycles
      sse2   8x8:   149 ->  129 cycles
      sse2   8x16:  235 ->  245 cycles (?)
      sse2  16x8:   269 ->  203 cycles
      sse2  16x16:  441 ->  349 cycles
      sse2  16x32:          641 cycles
      sse2  32x16:          643 cycles
      sse2  32x32: 1733 -> 1154 cycles
      sse2  32x64:         2247 cycles
      sse2  64x32:         2323 cycles
      sse2  64x64: 6984 -> 4442 cycles
      
      ssse3  4x4:           100 cycles (?)
      ssse3  4x8:           103 cycles
      ssse3  8x4:            71 cycles
      ssse3  8x8:           147 cycles
      ssse3  8x16:          158 cycles
      ssse3 16x8:   188 ->  162 cycles
      ssse3 16x16:  316 ->  273 cycles
      ssse3 16x32:          535 cycles
      ssse3 32x16:          564 cycles
      ssse3 32x32:          973 cycles
      ssse3 32x64:         1930 cycles
      ssse3 64x32:         1922 cycles
      ssse3 64x64:         3760 cycles
      
      Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
      8fb6c581
  12. 17 Jun, 2013 - 1 commit
  13. 14 Jun, 2013 - 1 commit
    • Jingning Han's avatar
      Enable sse2 version of sad8x4/4x8 · c43af9a8
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      c43af9a8
  14. 13 Jun, 2013 - 1 commit
    • Jingning Han's avatar
      Enable sse2 version of sad8x4/4x8 · 15f50e7b
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      15f50e7b
  15. 12 Jun, 2013 - 1 commit
  16. 22 May, 2013 - 1 commit
    • Yunqing Wang's avatar
      Optimize variance functions · f4fcfe30
      Yunqing Wang authored
      Added SSE2 version of variance functions for super blocks.
      
      Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
      f4fcfe30
  17. 01 May, 2013 - 1 commit
  18. 26 Apr, 2013 - 2 commits
    • Johann's avatar
      Whitespace nit · e3038ca8
      Johann authored
      Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66
      e3038ca8
    • Johann's avatar
      Normalize more intrinsic filenames · 863601c5
      Johann authored
      vp9_dequantize_x86 has only sse2 functions.
      
      vp9_dct_sse2_intrinsics has no namespace collision and can drop
      _intrinsics.
      
      vp9_idct_mmx.h is unused.
      
      Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec
      863601c5
  19. 25 Apr, 2013 - 1 commit
  20. 23 Apr, 2013 - 1 commit
    • John Koleszar's avatar
      Move src_diff to per-plane MACROBLOCK data · cbd1315a
      John Koleszar authored
      First in a series of commits making certain MACROBLOCK members
      addressable per-plane. This commit also refactors the block subtraction
      functions vp9_subtract_b, vp9_subtract_sby_c, etc to be
      loops-over-planes and variable subsampling aware.
      
      Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3
      cbd1315a
  21. 18 Apr, 2013 - 1 commit
    • Jingning Han's avatar
      Make the use of pred buffers consistent in MB/SB · 6f43ff58
      Jingning Han authored
      Use in-place buffers (dst of MACROBLOCKD) for  macroblock prediction.
      This makes the macroblock buffer handling consistent with those of
      superblock. Remove predictor buffer MACROBLOCKD.
      
      Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df
      6f43ff58
  22. 17 Apr, 2013 - 1 commit
  23. 16 Apr, 2013 - 2 commits
  24. 10 Apr, 2013 - 1 commit
    • Ronald S. Bultje's avatar
      Make RD superblock mode search size-agnostic. · b4f6098e
      Ronald S. Bultje authored
      Merge various super_block_yrd and super_block_uvrd versions into one
      common function that works for all sizes. Make transform size selection
      size-agnostic also. This fixes a slight bug in the intra UV superblock
      code where it used the wrong transform size for txsz > 8x8, and stores
      the txsz selection for superblocks properly (instead of forgetting it).
      Lastly, it removes the trellis search that was done for 16x16 intra
      predictors, since trellis is relatively expensive and should thus only
      be done after RD mode selection.
      
      Gives basically identical results on derf (+0.009%).
      
      Change-Id: If4485c6f0a0fe4038b3172f7a238477c35a6f8d3
      b4f6098e
  25. 04 Apr, 2013 - 1 commit
  26. 18 Mar, 2013 - 1 commit
    • Yunqing Wang's avatar
      Optimize 8x8 idct function · 6344c84c
      Yunqing Wang authored
      Wrote sse2 functions of vp9_short_idct8x8 and vp9_short_idct10_8x8.
      Compared to c version, the sse2 version is 2X faster. The decoder
      test didn't show noticeable gain since 8x8 idct doesn't take much
      of decoding time (less than 1% in my test).
      
      Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3
      6344c84c
  27. 15 Mar, 2013 - 1 commit
    • Christian Duvivier's avatar
      Faster vp9_short_fdct16x16. · 4418b790
      Christian Duvivier authored
      Scalar path is about 1.5x faster (3.1% overall encoder speedup).
      SSE2 path is about 7.2x faster (7.8% overall encoder speedup).
      
      Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289
      4418b790
  28. 28 Feb, 2013 - 1 commit