1. 08 Aug, 2014 1 commit
    • Fix bug 807 · 69a5f5ec
      levytamar82 authored
      In the sub_pixel_*variance* functions, dst is aligned to 16 bytes, not
      32 bytes, so the data is now loaded with unaligned loads.
      
      Change-Id: I2e0b9745543697efc56fefa32857ea10117af135
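The alignment issue behind this fix can be illustrated with a small portable sketch. It shows that a 16-byte-aligned pointer need not be 32-byte aligned, and uses `memcpy` as a stand-in for an unaligned 32-byte load. The names `is_aligned`, `vec256_sketch`, and `load_unaligned_32` are hypothetical and are not taken from the commit. With AVX2 intrinsics, the corresponding choice is presumably between `_mm256_load_si256` (requires 32-byte alignment) and `_mm256_loadu_si256` (no alignment requirement).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if p is a multiple of `alignment` bytes (a power of two). */
int is_aligned(const void *p, uintptr_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical stand-in for a 256-bit register. */
typedef struct {
  uint8_t lane[32];
} vec256_sketch;

/* Portable equivalent of an unaligned 32-byte load: memcpy places no
 * alignment requirement on src. */
vec256_sketch load_unaligned_32(const uint8_t *src) {
  vec256_sketch v;
  memcpy(v.lane, src, sizeof(v.lane));
  return v;
}
```

A pointer that is 16 bytes past a 32-byte boundary passes `is_aligned(p, 16)` but fails `is_aligned(p, 32)`, which is the case an aligned 256-bit load would fault on.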
  2. 07 Aug, 2014 1 commit
    • Fix bug 806 · af10457e
      levytamar82 authored
      In sad32x32x4d and sad64x64x4d, the source is aligned to 16 bytes, not
      32 bytes, so the loads are now unaligned.
      
      Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c
  3. 28 Jul, 2014 1 commit
    • Fix bug 805 · 4ba92dc5
      levytamar82 authored
      Remove all the redundant AVX2 dct functions (dct4x4, dct8x8) except
      dct32x32. Those functions were originally copied from dct_sse2.
      
      Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e
  4. 08 Jul, 2014 1 commit
    • Re-design quantization process for 32x32 transform block · 9ad1b9fc
      Jingning Han authored
      This commit enables a new quantization process for 32x32 2D-DCT
      transform coefficient blocks. It improves the compression
      performance of speed 5 by 1.4%. The overall compression gain of
      speed 5 due to the new quantization scheme is 4.7%. It also includes
      the SSSE3 implementation of the 32x32 quantization process.
      
      Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b
  5. 07 Jul, 2014 1 commit
  6. 02 Jul, 2014 1 commit
    • Re-design quantization process · 9ac2f663
      Jingning Han authored
      This commit re-designs the quantization process for transform
      coefficient blocks of size 4x4 to 16x16. It improves compression
      performance for speed 7 by 3.85%. The SSSE3 version for the
      new quantization process is included.
      
      The average runtime of the 8x8 block quantization is reduced
      from 285 cycles to 255 cycles, i.e., over 10% faster.
      
      Change-Id: I61278aa02efc70599b962d3314671db5b0446a50
  7. 12 Jun, 2014 1 commit
    • Fast computation path for forward transform and quantization · ccba289f
      Jingning Han authored
      This commit enables a fast path computational flow for forward
      transformation. It checks the sse and variance of prediction
      residuals and decides if the quantized coefficients are all
      zero, dc only, or more. It then selects the corresponding coding
      path in the forward transformation and quantization stage.
      
      It is currently enabled in rtc coding mode. Support for rd
      coding mode will follow.
      
      In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps
      goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up.
      Overall coding performance for rtc set is changed by -0.18%.
      
      Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1
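The decision described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `choose_tx_path`, the enum, the thresholds `dc_thresh` and `ac_thresh`, and the exact comparison rules are all hypothetical and are not taken from the commit. The sketch splits the residual energy (sse) into a DC part and an AC part (the variance term) and picks a coding path from them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three coding paths named in the commit. */
typedef enum { TX_PATH_ALL_ZERO, TX_PATH_DC_ONLY, TX_PATH_FULL } tx_path;

/* Classifies a block of n prediction residuals.
 * dc_thresh and ac_thresh are assumed tuning parameters. */
tx_path choose_tx_path(const int16_t *residual, int n, int64_t dc_thresh,
                       int64_t ac_thresh) {
  assert(n > 0);
  int64_t sum = 0;
  int64_t sse = 0;
  for (int i = 0; i < n; i++) {
    sum += residual[i];
    sse += (int64_t)residual[i] * residual[i];
  }
  /* sse = DC energy + AC energy, where DC energy is sum^2 / n. */
  const int64_t dc_energy = (sum * sum) / n;
  const int64_t ac_energy = sse - dc_energy;

  if (ac_energy >= ac_thresh) return TX_PATH_FULL;    /* full transform */
  if (dc_energy >= dc_thresh) return TX_PATH_DC_ONLY; /* only DC survives */
  return TX_PATH_ALL_ZERO;                            /* skip transform */
}
```

A flat residual block has all of its energy in the DC term and would take the DC-only path, while an alternating +8/-8 block has zero DC energy and takes the full path.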
  8. 03 Jun, 2014 1 commit
    • Fix potential overflow issue in SSSE3 forward 8x8 2D-DCT · 540d9103
      Jingning Han authored
      The SSSE3 implementation could overflow in its second 1-D
      transform if all input residual pixels are close to
      255. This commit fixes the issue and re-enables the unit test on
      the SSSE3 version.
      
      Change-Id: I0520478abdab7afd3ff2842516bec951111e9b3c
  9. 28 May, 2014 2 commits
  10. 27 May, 2014 1 commit
  11. 22 May, 2014 1 commit
  12. 21 May, 2014 1 commit
    • Renames x86_64 specific asm files · e2722734
      Deb Mukherjee authored
      Renames all x86_64 specific assembly files to consistently
      end in _x86_64.asm. This will be useful for build systems to
      handle these files differently.
      All new 64-bit specific assembly files should use the new
      naming convention.
      
      Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
  13. 20 May, 2014 1 commit
  14. 19 May, 2014 1 commit
    • Adjust the forward 16x16 DCT computation steps · 7f547336
      Jingning Han authored
      This commit adjusts the forward 16x16 DCT computation steps to
      simplify the register level operations. It fixes the corresponding
      sse2 version accordingly.
      
      Change-Id: I72a9c25b8ca9442fc5e113f47cd701ae55aa7f08
  15. 14 May, 2014 1 commit
    • AVX2 To VP9 Block Error Optimization · 1fbab853
      levytamar82 authored
      vp9_block_error_sse2 can only handle 16 bytes at a time, but the
      function needs to process sequences of 32 bytes, so each 16-byte
      half is handled in a separate register.
      With the AVX2 optimization, the 32 bytes fit in one register instead
      of the two needed with SSE2.
      The function-level gain for vp9_block_error is 85%, and the
      user-level gain is 1.2%.
      
      Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd
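For reference, a scalar sketch of what a block-error kernel of this kind computes. It assumes 16-bit coefficients and assumes the function returns the sum of squared differences between the original and dequantized coefficients while also reporting the sum of squared original coefficients. The name `block_error_sketch` and its signature are hypothetical. Under the 16-bit assumption, a 16-byte SSE2 register holds 8 coefficients and a 32-byte AVX2 register holds 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns sum((coeff - dqcoeff)^2) over n coefficients and stores
 * sum(coeff^2) in *ssz. Scalar reference only; no SIMD. */
int64_t block_error_sketch(const int16_t *coeff, const int16_t *dqcoeff,
                           size_t n, int64_t *ssz) {
  assert(ssz != NULL);
  int64_t error = 0;
  int64_t sqcoeff = 0;
  for (size_t i = 0; i < n; i++) {
    const int64_t diff = (int64_t)coeff[i] - dqcoeff[i];
    error += diff * diff;
    sqcoeff += (int64_t)coeff[i] * coeff[i];
  }
  *ssz = sqcoeff;
  return error;
}
```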
  16. 08 May, 2014 1 commit
  17. 07 May, 2014 1 commit
    • Revert "Add an MMX fwht4x4" · 33b1c457
      Paul Wilkins authored
      Includes changes that are not compatible with VS windows builds.
      Amongst other things stdint.h is not supported in VS.
      
      This reverts commit 89fbf3de.
      
      Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd
  18. 05 May, 2014 2 commits
    • Add an MMX fwht4x4 · 89fbf3de
      Alex Converse authored
      7% faster encoding a desktop lossless at RT speed 4.
      
      Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64
    • SSSE3 implementation of full inverse 8x8 2D-DCT · 52ae97b6
      Jingning Han authored
      This commit enables the SSSE3 version of the full inverse 8x8 2D-DCT
      and reconstruction. It brings the runtime of vp9_idct8x8_64_add down
      from 256 cycles (SSE2) to 246 cycles.
      
      Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
  19. 01 May, 2014 1 commit
  20. 30 Apr, 2014 1 commit
  21. 29 Apr, 2014 1 commit
    • Enable SSSE3 implementation of 8x8 forward 2D-DCT · 1eaa3a76
      Jingning Han authored
      Assembly implementation of ssse3 8x8 forward 2D-DCT. The current
      version is turned on only for x86_64. The average unit runtime
      goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster.
      This translates into about 1.5% speed-up for pedestrian_area 1080p
      at speed 2.
      
      Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
  22. 25 Apr, 2014 1 commit
  23. 14 Apr, 2014 1 commit
    • Removing unused vp9_mcomp_x86.h file. · 2fc3a186
      Dmitry Kovalev authored
      We don't use declarations from this file. The real declarations
      (differently named) are in vp9_rtcd_defs.pl, e.g. vp9_full_search_sad.
      
      Change-Id: I73cbf064305710ba20747233cfdbe67366f069a0
  24. 21 Mar, 2014 1 commit
    • AVX2 SAD Optimization: · 0fa8b668
      levytamar82 authored
      Two functions were optimized for AVX2 by using the full 256-bit
      register to handle 32 elements in parallel instead of only 16:
      1. vp9_sad32x32x4d
      2. vp9_sad64x64x4d
      
      The function level gain is 66% and the user level gain is ~1%.
      
      Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
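A scalar sketch of what an "x4d" SAD kernel computes, for orientation. It assumes the function takes one source block and four candidate reference blocks and returns the sum of absolute differences against each. The name `sad_x4d_sketch` and its parameter list are hypothetical. An AVX2 version would process 32 of the inner-loop bytes per instruction instead of one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Computes the SAD of a width x height source block against each of
 * four reference blocks that share one stride. Scalar reference only. */
void sad_x4d_sketch(const uint8_t *src, int src_stride,
                    const uint8_t *const ref[4], int ref_stride, int width,
                    int height, uint32_t sad[4]) {
  assert(width > 0 && height > 0);
  for (int k = 0; k < 4; k++) {
    uint32_t total = 0;
    for (int r = 0; r < height; r++) {
      for (int c = 0; c < width; c++) {
        total += (uint32_t)abs((int)src[r * src_stride + c] -
                               (int)ref[k][r * ref_stride + c]);
      }
    }
    sad[k] = total;
  }
}
```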
  25. 17 Mar, 2014 1 commit
  26. 03 Mar, 2014 1 commit
    • improved speed of 4x4 sse2 fdct. · a46f5459
      Andrew Russell authored
      * speed improvement of 30 percent achieved
      * multiplies and adds remain the same
      * non-arithmetic instructions minimized by hand, by:
         -expanding 2 pass loop
         -removing irrelevant "shuffles"
         -combining last two rounding steps
      * further improvements may be possible
      
      Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
  27. 01 Mar, 2014 1 commit
    • AVX2 SubPixel AVG Variance Optimization · ea149096
      levytamar82 authored
      Optimizing 2 functions to process 32 elements in parallel instead of 16:
      1. vp9_sub_pixel_avg_variance64x64
      2. vp9_sub_pixel_avg_variance32x32
      Both of these functions previously called vp9_sub_pixel_avg_variance16xh_ssse3.
      They now call vp9_sub_pixel_avg_variance32xh_avx2 instead, which is
      written in AVX2 and processes 32 elements in parallel.
      This optimization gave an 80% function-level gain and a 2% user-level gain.
      
      Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8
  28. 19 Feb, 2014 1 commit
  29. 14 Feb, 2014 1 commit
    • AVX2 SubPixel Variance Optimization · 52dac5d1
      levytamar82 authored
      Optimizing 2 functions to process 32 elements in parallel instead of 16:
      1. vp9_sub_pixel_variance64x64
      2. vp9_sub_pixel_variance32x32
      Both of these functions previously called vp9_sub_pixel_variance16xh_ssse3.
      They now call vp9_sub_pixel_variance32xh_avx2 instead, which is
      written in AVX2 and processes 32 elements in parallel.
      This optimization gave a 70% function-level gain and a 2% user-level gain.
      
      Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde
  30. 13 Feb, 2014 1 commit
  31. 07 Feb, 2014 1 commit
    • Bug fix in ssse3 quantize function · 0d43bd77
      Yunqing Wang authored
      A bug was reported in Issue 702: "SIGILL (Illegal instruction) when
      transcoding with vp9 - using FFmpeg". It was reproduced and fixed.
      
      Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700
  32. 06 Feb, 2014 1 commit
  33. 28 Jan, 2014 1 commit
  34. 24 Jan, 2014 1 commit
  35. 08 Jan, 2014 1 commit
    • AVX2 Variance Optimization · 357b6536
      levytamar82 authored
      Optimizes the variance functions vp9_variance16x16, vp9_variance32x32,
      vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, and
      vp9_mse16x16 by migrating them to AVX2.
      Some of the functions were optimized by processing 32 elements instead of 16.
      Others were optimized by processing 2 loop strides of 16
      elements in a single 256-bit register.
      This optimization gives a 2.4% to 2.7% user-level performance gain
      and a 42% function-level gain.
      
      Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
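A scalar sketch of a block variance computation of the kind these functions perform. It assumes the usual formulation: accumulate the sum and the sum of squares of the pixel differences, then return sse minus sum squared divided by the pixel count. The name `variance_sketch` and its signature are hypothetical, and the commit's actual rounding behavior is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns sse - sum^2 / (w * h) for the differences between a w x h
 * source block and a reference block, and stores sse in *sse. */
uint32_t variance_sketch(const uint8_t *src, int src_stride,
                         const uint8_t *ref, int ref_stride, int w, int h,
                         uint32_t *sse) {
  assert(w > 0 && h > 0 && sse != NULL);
  int64_t sum = 0;
  uint64_t sq = 0;
  for (int r = 0; r < h; r++) {
    for (int c = 0; c < w; c++) {
      const int diff = (int)src[r * src_stride + c] -
                       (int)ref[r * ref_stride + c];
      sum += diff;
      sq += (uint64_t)(diff * diff);
    }
  }
  *sse = (uint32_t)sq;
  return (uint32_t)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

The two accumulators are independent across pixels, which is what lets a 256-bit register process 32 pixels per step before a final horizontal reduction.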
  36. 17 Dec, 2013 1 commit
  37. 21 Nov, 2013 2 commits