1. 24 Sep, 2013 - 1 commit
  2. 07 Sep, 2013 - 1 commit
    • Fix overflow issue in 16x16 quantization SSSE3 · 09bc942b
      Jingning Han authored
      The 16x16 transform unit test suggested that the peak coefficient
      value can reach 32639. This could cause an overflow in the SSSE3
      implementation of 16x16 block quantization. This commit fixes the
      issue by replacing addition with saturated addition.
      
      Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
      09bc942b
  3. 05 Sep, 2013 - 1 commit
    • Use saturated addition in SSSE3 of 32x32 quant · 458c2833
      Jingning Han authored
      The 32x32 forward transform can reach a peak coefficient value
      close to 32700, while the rounding factor can go up to 610.
      This could cause an overflow in the SSSE3 implementation of the
      32x32 quantization process.
      
      This commit resolves this issue by replacing the addition operations
      with saturated addition operations in 32x32 block quantization.
      
      Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
      458c2833
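The arithmetic behind the commit above can be sketched in scalar C. This is only an illustration, not the actual SSSE3 code: the function names are hypothetical, and the two helpers model the behaviour of a wrapping and a saturating signed 16-bit add (on x86 these would presumably be `paddw` and `paddsw`). With the figures from the commit message, 32700 + 610 = 33310, which does not fit in a signed 16-bit lane.

```c
#include <stdint.h>

/* Wrapping signed 16-bit add: the result is reduced modulo 2^16,
 * which is what a plain packed 16-bit addition does per lane. */
int16_t add_wrap_i16(int16_t a, int16_t b) {
  int32_t s = (int32_t)a + (int32_t)b;
  if (s > INT16_MAX) s -= 65536;
  else if (s < INT16_MIN) s += 65536;
  return (int16_t)s;
}

/* Saturating signed 16-bit add: the result is clamped to
 * [INT16_MIN, INT16_MAX] instead of wrapping around. */
int16_t add_sat_i16(int16_t a, int16_t b) {
  int32_t s = (int32_t)a + (int32_t)b;
  if (s > INT16_MAX) return INT16_MAX;
  if (s < INT16_MIN) return INT16_MIN;
  return (int16_t)s;
}
```

Under these assumptions, `add_wrap_i16(32700, 610)` yields a large negative value, while `add_sat_i16(32700, 610)` stays at 32767, so the quantizer still sees a large positive coefficient.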
  4. 01 Sep, 2013 - 1 commit
    • Fix 32x32 forward transform SSE2 version · 3cf46fa5
      Jingning Han authored
      This commit fixes a potential overflow in the SSE2
      implementation of the 32x32 forward DCT. It resolves the corrupted
      coded frames at the border of scenes.
      
      Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
      3cf46fa5
  5. 29 Aug, 2013 - 1 commit
    • Fix overflow issue in SSSE3 32x32 quantization · abff6788
      Jingning Han authored
      The 32x32 quantization process can produce intermediate values that
      exceed the 16-bit range, causing an enc/dec mismatch. This commit
      fixes this overflow issue in the SSSE3 implementation, as well as the
      prototype, of 32x32 quantization.
      
      This fixes issue 607 from webm@googlecode.
      
      Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
      abff6788
  6. 27 Aug, 2013 - 1 commit
    • Fixed reading too many bytes · 9482c079
      Yaowu Xu authored
      In the subpel_avg_variance functions, code similar to the following

punpckldq m2, [addr]

      actually reads 8 bytes. For functions that are supposed to work on
      buffers with fewer than 8 bytes per line, this caused a valgrind
      error about reading uninitialized memory.
      
      Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
      9482c079
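The idea of the fix above, reading only as many bytes as the row actually holds, can be sketched in C. The helper names are hypothetical and this is not the assembly from the commit; on x86 a 4-byte load such as `movd` would presumably play the role of `load_4_bytes`.

```c
#include <stdint.h>
#include <string.h>

/* Copies exactly 4 bytes from src; src[4] and beyond are never read. */
uint32_t load_4_bytes(const uint8_t *src) {
  uint32_t v;
  memcpy(&v, src, sizeof(v));
  return v;
}

/* Sums one 4-pixel row using only a 4-byte read, so a buffer with
 * 4 bytes per line is never over-read. */
unsigned sum_row4(const uint8_t *row) {
  uint32_t v = load_4_bytes(row);
  uint8_t px[4];
  memcpy(px, &v, sizeof(px));
  return (unsigned)px[0] + px[1] + px[2] + px[3];
}
```

A memory-operand unpack of the kind quoted in the commit message would instead touch bytes past the end of such a row, which is what valgrind reported.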
  7. 26 Aug, 2013 - 1 commit
  8. 12 Aug, 2013 - 1 commit
    • SSE2 high precision 32x32 forward DCT · 78136edc
      Jingning Han authored
      Enable the SSE2 implementation of the high precision 32x32 forward
      DCT. The intermediate values are 32 bits wide. The run-time goes
      down from 32126 cycles to 13442 cycles.
      
      Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
      78136edc
  9. 06 Aug, 2013 - 3 commits
  10. 11 Jul, 2013 - 1 commit
  11. 10 Jul, 2013 - 1 commit
    • SSE2 16x16 ADST/DCT hybrid transform · 11442353
      Jingning Han authored
      This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
      operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
      no compression performance loss.
      
      Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
      11442353
  12. 03 Jul, 2013 - 1 commit
    • Refactor SSE2 8x8 functional units · 2cb75c96
      Jingning Han authored
      These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
      hybrid transform coding.
      
      Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
      2cb75c96
  13. 02 Jul, 2013 - 1 commit
    • Use pmovmskb to skip quantize loops over empty coefficients. · e5fb4b61
      Ronald S. Bultje authored
      If none of the 16 coefficients that we quantize per loop iteration
      are larger than the zbin, directly skip to the next round of coeffs,
      rather than doing a full quantize loop that will eventually result
      in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
      32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
      no significantly positive results for smaller transforms, so is not
      used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).
      
      Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
      e5fb4b61
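The early-out described above can be sketched in scalar C. This is a simplified, hypothetical quantizer (a dead zone `zbin`, a rounding term and a plain division by `step`), not the real SSSE3 routine; the `any_above` flag stands in for the comparison mask that `pmovmskb` would extract into a general-purpose register.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified quantizer; n must be a multiple of 16.
 * Returns how many 16-coefficient groups took the early-out path. */
int quantize_skip_sketch(const int16_t *coeff, int n, int zbin, int round,
                         int step, int16_t *qcoeff) {
  int skipped = 0;
  for (int i = 0; i < n; i += 16) {
    /* Check whether any coefficient in this group exceeds zbin. */
    int any_above = 0;
    for (int j = 0; j < 16; j++) any_above |= abs(coeff[i + j]) > zbin;
    if (!any_above) {
      /* The full loop would produce 16 zeroes anyway: write them and jump. */
      for (int j = 0; j < 16; j++) qcoeff[i + j] = 0;
      skipped++;
      continue;
    }
    /* Full quantization for groups with at least one large coefficient. */
    for (int j = 0; j < 16; j++) {
      int a = abs(coeff[i + j]);
      int q = a > zbin ? (a + round) / step : 0;
      qcoeff[i + j] = (int16_t)(coeff[i + j] < 0 ? -q : q);
    }
  }
  return skipped;
}

/* Demo: 32 coefficients, only coeff[3] is non-zero. */
int demo_skipped_groups(void) {
  int16_t coeff[32] = {0}, qcoeff[32];
  coeff[3] = 100;
  return quantize_skip_sketch(coeff, 32, 20, 8, 16, qcoeff);
}

int demo_first_group_value(void) {
  int16_t coeff[32] = {0}, qcoeff[32];
  coeff[3] = 100;
  quantize_skip_sketch(coeff, 32, 20, 8, 16, qcoeff);
  return qcoeff[3];
}
```

In the demo the second group of 16 is all zero and is skipped, while the first group goes through the full loop. The cost is one extra branch per group, which matches the commit's observation that the trick pays off mainly for 32x32 blocks.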
  14. 01 Jul, 2013 - 2 commits
    • Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd
      Ronald S. Bultje authored
      Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
      2min10.1, i.e. a 2.3% overall speed increase.
      
      Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
      c8defcfd
    • Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only; it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  15. 29 Jun, 2013 - 2 commits
  16. 28 Jun, 2013 - 2 commits
    • Fix switch statement in 8x8 transform · 9def7f72
      Jingning Han authored
      Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
      9def7f72
    • Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
      af660715
  17. 26 Jun, 2013 - 1 commit
    • fixed a compiling problem with MSVC win32 build · 60dc7375
      Yaowu Xu authored
      The aligned array in the parameter list caused the win32 build to
      report a C2719 error. This commit fixed the issue by making the
      parameter type a pointer instead of an array.
      
      Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
      60dc7375
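The shape of the change above can be sketched as follows. The function name and body are hypothetical; only the parameter style (a plain pointer rather than an alignment-annotated array) reflects what the commit message describes.

```c
#include <stdint.h>

/* A declaration along these lines is what 32-bit MSVC rejects with
 * C2719, since a formal parameter cannot carry an alignment request:
 *
 *   void double_block(__declspec(align(16)) int16_t block[16]);
 *
 * Pointer form: the caller owns the (aligned) buffer and passes its
 * address, so no alignment attribute appears in the parameter list. */
void double_block(int16_t *block) {
  for (int i = 0; i < 16; i++) block[i] = (int16_t)(block[i] * 2);
}

/* Demo caller: the buffer is declared at the call site. */
int demo_double_block(void) {
  int16_t buf[16] = {1, 2, 3};
  double_block(buf);
  return buf[0] + buf[1] + buf[2];
}
```

Any alignment requirement then lives on the caller's buffer declaration, where the compiler can honour it.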
  18. 25 Jun, 2013 - 4 commits
    • Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223
      Ronald S. Bultje authored
      Encode time for the first 50 frames of bus @ 1500kbps goes from 3min22.7 to
      3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
      the variance of the averaging predictor. This is slightly suboptimal
      because the function is subpixel-position-aware, but it will (at least
      for the SSE2 version) not actually use a bilinear filter for a full-pixel
      position, thus leading to approximately the same performance compared to
      if we implemented an actual average-aware full-pixel variance function.
      That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
      leading to a total gain of 2.7%.
      
      Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
      c24d9223
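What an averaging SAD computes can be sketched in plain C. The signature and the rounding rule (add 1, then shift right by 1) are assumptions made for illustration; the SIMD functions added by the commit above are not reproduced here.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between src and the rounded average of ref and second_pred.
 * second_pred is assumed to be a packed width x height block. */
unsigned sad_avg_sketch(const uint8_t *src, int src_stride,
                        const uint8_t *ref, int ref_stride,
                        const uint8_t *second_pred, int width, int height) {
  unsigned sad = 0;
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      int avg = (ref[x] + second_pred[x] + 1) >> 1; /* compound predictor */
      sad += (unsigned)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += width;
  }
  return sad;
}

/* Demo on a 2x2 block. */
unsigned demo_sad_avg(void) {
  const uint8_t src[4] = {10, 20, 30, 40};
  const uint8_t ref[4] = {12, 18, 30, 44};
  const uint8_t pred[4] = {10, 20, 34, 40};
  return sad_avg_sketch(src, 2, ref, 2, pred, 2, 2);
}
```

Folding the average into the SAD loop avoids materialising the averaged predictor in a separate buffer before each comparison, which is one plausible source of the speedup reported above.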
    • Tune the rounding operations in 8x8 ADST/DCT sse2 · 0084e61d
      Jingning Han authored
      Improve the round-trip precision to meet the unit test settings.
      
      Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
      0084e61d
    • Use aligned buffer operations in 8x8/16x16 2D-DCT · 82d504b5
      Jingning Han authored
      This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.
      
      Change-Id: I137758b81cd127b936175284310e81378db64552
      82d504b5
    • Enable sse2 implementation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit makes use of the butterfly structure to enable the sse2
      implementation of 8x8 ADST/DCT hybrid transform coding.
      
      The runtime of hybrid transform module goes down from 1170 cycles
      to 245 cycles. Overall speed-up around 1.5%.
      
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
      a32a086d
  19. 21 Jun, 2013 - 4 commits
    • Remove emms - that shouldn't be there. · fc033b38
      Ronald S. Bultje authored
      Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
      fc033b38
    • Add missing SECTION .text marker in assembly file. · ba42c026
      Ronald S. Bultje authored
      Fixes a crash on Windows when building with MSVC.
      
      Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
      ba42c026
    • Implement SSE2 block_error. · 54b2a596
      Ronald S. Bultje authored
      Change vp9_block_error() to return a 64bit error variable, change all
      callers to expect a 64bit return value (this will prevent overflows,
      which we basically don't check for at all right now). Remove duplicate
      block_error() function, which fixed that through truncation. Remove
      old (incompatible) mmx/sse2 block_error SIMD versions and replace with
      a new one that returns a 64bit value.
      
      Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
      3min23, i.e. a 3% overall speedup.
      
      Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
      54b2a596
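Why a 64-bit return value matters can be seen from a scalar sketch of a block error function. The name and signature below are hypothetical, not the actual vp9_block_error() prototype; the point is only that a sum of squared 16-bit differences over a large block can exceed the 32-bit range.

```c
#include <stdint.h>

/* Sum of squared differences between original and dequantized
 * coefficients, accumulated in 64 bits so it cannot overflow. */
int64_t block_error_sketch(const int16_t *coeff, const int16_t *dqcoeff,
                           int n) {
  int64_t error = 0;
  for (int i = 0; i < n; i++) {
    int64_t d = (int64_t)coeff[i] - dqcoeff[i];
    error += d * d;
  }
  return error;
}

/* Demo: a 32x32 block (1024 coefficients) where every coefficient
 * differs by 32767, an extreme case chosen to show the magnitude. */
int64_t demo_worst_case_error(void) {
  static int16_t coeff[1024], dqcoeff[1024];
  for (int i = 0; i < 1024; i++) {
    coeff[i] = 32767;
    dqcoeff[i] = 0;
  }
  return block_error_sketch(coeff, dqcoeff, 1024);
}
```

In this extreme case the total is 2^40 - 2^26 + 2^10, far beyond what a 32-bit integer can hold, so a 32-bit accumulator would either wrap or need the truncation the commit removes.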
    • Add subtract_block SSE2 version and unit test. · 25c588b1
      Ronald S. Bultje authored
      3% faster overall (3min35.0 to 3min28.5).
      
      Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
      25c588b1
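For reference, the operation that a subtract_block routine performs can be sketched in scalar C. The name and parameter list are assumptions for illustration; the sketch computes the residual (source minus prediction) for every pixel of a block, widened to 16 bits because the difference of two 8-bit pixels can be negative.

```c
#include <stddef.h>
#include <stdint.h>

/* diff[r][c] = src[r][c] - pred[r][c] for a rows x cols block. */
void subtract_block_sketch(int rows, int cols,
                           int16_t *diff, ptrdiff_t diff_stride,
                           const uint8_t *src, ptrdiff_t src_stride,
                           const uint8_t *pred, ptrdiff_t pred_stride) {
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) diff[c] = (int16_t)(src[c] - pred[c]);
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}

/* Demo on a 2x2 block; returns 1 when all residuals match. */
int demo_subtract_ok(void) {
  const uint8_t src[4] = {10, 20, 30, 40};
  const uint8_t pred[4] = {12, 18, 30, 45};
  int16_t diff[4];
  subtract_block_sketch(2, 2, diff, 2, src, 2, pred, 2);
  return diff[0] == -2 && diff[1] == 2 && diff[2] == 0 && diff[3] == -5;
}
```

A scalar reference like this is also the kind of function a unit test would compare a SIMD version against, element by element.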
  20. 20 Jun, 2013 - 2 commits
    • SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1
      Ronald S. Bultje authored
      Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
      3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
      which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
      perfectly interleaved, and can probably be improved further in the
      future. I've marked this with a few TODOs/FIXMEs in the code.
      
      Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
      1e6a32f1
    • Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581
      Ronald S. Bultje authored
      Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
      3min58). Specific changes to timings for each function compared to
      original assembly-optimized versions (or just new version timings if
      no previous assembly-optimized version was available):
      
      sse2   4x4:    99 ->   82 cycles
      sse2   4x8:           128 cycles
      sse2   8x4:           121 cycles
      sse2   8x8:   149 ->  129 cycles
      sse2   8x16:  235 ->  245 cycles (?)
      sse2  16x8:   269 ->  203 cycles
      sse2  16x16:  441 ->  349 cycles
      sse2  16x32:          641 cycles
      sse2  32x16:          643 cycles
      sse2  32x32: 1733 -> 1154 cycles
      sse2  32x64:         2247 cycles
      sse2  64x32:         2323 cycles
      sse2  64x64: 6984 -> 4442 cycles
      
      ssse3  4x4:           100 cycles (?)
      ssse3  4x8:           103 cycles
      ssse3  8x4:            71 cycles
      ssse3  8x8:           147 cycles
      ssse3  8x16:          158 cycles
      ssse3 16x8:   188 ->  162 cycles
      ssse3 16x16:  316 ->  273 cycles
      ssse3 16x32:          535 cycles
      ssse3 32x16:          564 cycles
      ssse3 32x32:          973 cycles
      ssse3 32x64:         1930 cycles
      ssse3 64x32:         1922 cycles
      ssse3 64x64:         3760 cycles
      
      Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
      8fb6c581
  21. 17 Jun, 2013 - 1 commit
  22. 14 Jun, 2013 - 1 commit
    • Enable sse2 version of sad8x4/4x8 · c43af9a8
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      c43af9a8
  23. 13 Jun, 2013 - 1 commit
    • Enable sse2 version of sad8x4/4x8 · 15f50e7b
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      15f50e7b
  24. 12 Jun, 2013 - 1 commit
  25. 22 May, 2013 - 1 commit
    • Optimize variance functions · f4fcfe30
      Yunqing Wang authored
      Added SSE2 version of variance functions for super blocks.
      
      Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
      f4fcfe30
  26. 01 May, 2013 - 1 commit
  27. 26 Apr, 2013 - 2 commits
    • Whitespace nit · e3038ca8
      Johann authored
      Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66
      e3038ca8
    • Normalize more intrinsic filenames · 863601c5
      Johann authored
      vp9_dequantize_x86 has only sse2 functions.
      
      vp9_dct_sse2_intrinsics has no namespace collision and can drop
      _intrinsics.
      
      vp9_idct_mmx.h is unused.
      
      Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec
      863601c5