Commits · 83c7e13a6bcd1535d9547ef3c89816bf993b458b · BC / public / external / libvpx

11 Jul, 2013 - 1 commit

Remove unused fwalsh/fdct x86 SIMD implementations. · c13e0bcb

Ronald S. Bultje authored 11 years ago

Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e

c13e0bcb

10 Jul, 2013 - 1 commit

SSE2 16x16 ADST/DCT hybrid transform · 11442353

Jingning Han authored 11 years ago

This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.

Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230

11442353

03 Jul, 2013 - 1 commit

Refactor SSE2 8x8 functional units · 2cb75c96

Jingning Han authored 11 years ago

These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
hybrid transform coding.

Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d

2cb75c96

02 Jul, 2013 - 1 commit

Use pmovmskb to skip quantize loops over empty coefficients. · e5fb4b61

Ronald S. Bultje authored 11 years ago

If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).

Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647

e5fb4b61
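
The skip described in e5fb4b61 can be sketched in portable C. This is a scalar model under stated assumptions, not the SSSE3 assembly from the commit: the function name, the `quant_shift` parameter, and the shift-based quantizer are hypothetical placeholders. The inner mask loop stands in for a vector compare followed by `pmovmskb`, which gathers the top bit of each byte of an xmm register into an integer mask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of the group skip; returns how many
 * 16-coefficient groups were skipped without running the quantizer. */
int quantize_with_group_skip(const int16_t *coeff, int16_t *qcoeff,
                             int n, int zbin, int quant_shift) {
  int skipped = 0;
  assert(n % 16 == 0); /* coefficients are processed 16 at a time */
  for (int i = 0; i < n; i += 16) {
    /* Stand-in for compare + pmovmskb: one mask bit per coefficient
     * whose magnitude is larger than zbin. */
    unsigned mask = 0;
    for (int j = 0; j < 16; ++j)
      if (abs(coeff[i + j]) > zbin) mask |= 1u << j;

    if (mask == 0) {
      /* Nothing survives the zbin: write 16 zeroes and jump ahead. */
      for (int j = 0; j < 16; ++j) qcoeff[i + j] = 0;
      ++skipped;
      continue;
    }

    /* Placeholder quantizer (assumption): shift the magnitude. */
    for (int j = 0; j < 16; ++j) {
      int a = abs(coeff[i + j]);
      int q = (a > zbin) ? (a >> quant_shift) : 0;
      qcoeff[i + j] = (int16_t)(coeff[i + j] < 0 ? -q : q);
    }
  }
  return skipped;
}
```

The early `continue` is the jump cost the message mentions; it pays off when many groups are empty, which fits the reported gain for 32x32 and the flat results for 8x8 and 16x16.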

01 Jul, 2013 - 2 commits

Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd

Ronald S. Bultje authored 11 years ago

Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 3.3% overall speed increase.

Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87

c8defcfd

Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab

Ronald S. Bultje authored 11 years ago

Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code
is x86-64 only; it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.

Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904

7353ceab

29 Jun, 2013 - 2 commits

SSE2 version of vp9_short_fdct32x32_rd. · 466e0cf3

Christian Duvivier authored 11 years ago

43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0

466e0cf3

Enable SSE2 4x4 ADST/DCT transform · 1109b6b8

Jingning Han authored 11 years ago

This commit enables SSE2 4x4 forward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660

1109b6b8

28 Jun, 2013 - 2 commits

Fix switch statement in 8x8 transform · 9def7f72

Jingning Han authored 11 years ago

Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f

9def7f72

Make coefficient skip condition an explicit RD choice. · af660715

Ronald S. Bultje authored 11 years ago

This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.

The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.

Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.

Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4

af660715
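
The explicit skip-block choice from af660715 amounts to a single cost comparison per block. A minimal sketch, assuming a cost of the form distortion plus lambda times rate; the function names and the cost model are hypothetical and are not taken from the libvpx RD loop:

```c
#include <stdint.h>

/* Hypothetical RD cost model: distortion plus lambda-weighted rate. */
static int64_t rd_cost(int64_t rate, int64_t dist, int64_t lambda) {
  return dist + lambda * rate;
}

/* Returns 1 when zeroing the whole block is cheaper in RD terms than
 * coding its coefficients, 0 otherwise. */
int choose_block_skip(int64_t rate_coded, int64_t dist_coded,
                      int64_t rate_skip, int64_t dist_skip,
                      int64_t lambda) {
  return rd_cost(rate_skip, dist_skip, lambda) <
         rd_cost(rate_coded, dist_coded, lambda);
}
```

One such check per block replaces the per-coefficient bias, so the quantization loop itself no longer depends on the preceding run of zeroes.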

26 Jun, 2013 - 1 commit

fixed a compiling problem with MSVC win32 build · 60dc7375

Yaowu Xu authored 11 years ago

The aligned array in the parameter list caused the win32 build to
report error C2719. This commit fixes the issue by making the
parameter type a pointer instead of an array.

Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992

60dc7375

25 Jun, 2013 - 4 commits

Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223

Ronald S. Bultje authored 11 years ago

Encoding the first 50 frames of bus @ 1500kbps goes from 3min22.7 to
3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions
to calculate the variance of the averaging predictor. This is slightly
suboptimal because the function is subpixel-position-aware, but (at least
in the SSE2 version) it does not actually apply a bilinear filter for a
full-pixel position, so performance is approximately the same as with a
dedicated average-aware full-pixel variance function. That gains another
0.8 seconds (i.e. encode time goes to 3min17.4), for a total gain of
2.7%.

Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd

c24d9223
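
An averaging SAD, as added in c24d9223, compares the source block against the rounded average of two predictors. A minimal scalar sketch; the function name, the parameter list, and the contiguous layout assumed for `second_pred` (stride equal to the block width) are illustrative and not the libvpx prototypes:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between src and the rounded average of
 * ref and second_pred. second_pred is assumed to be stored
 * contiguously, w bytes per row. */
unsigned int sad_avg(const uint8_t *src, int src_stride,
                     const uint8_t *ref, int ref_stride,
                     const uint8_t *second_pred, int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      /* Rounded average of the two predictors. */
      int avg = (ref[x] + second_pred[x] + 1) >> 1;
      sad += (unsigned int)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += w;
  }
  return sad;
}
```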

Tune the rounding operations in 8x8 ADST/DCT sse2 · 0084e61d

Jingning Han authored 11 years ago

Improve the round-trip precision to meet the unit test settings.

Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79

0084e61d

Use aligned buffer operations in 8x8/16x16 2D-DCT · 82d504b5

Jingning Han authored 11 years ago

This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.

Change-Id: I137758b81cd127b936175284310e81378db64552

82d504b5

Enable sse2 implementation of 8x8 ADST/DCT · a32a086d

Jingning Han authored 11 years ago

This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f

a32a086d

21 Jun, 2013 - 4 commits

Remove emms - that shouldn't be there. · fc033b38

Ronald S. Bultje authored 11 years ago

Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897

fc033b38

Add missing SECTION .text marker in assembly file. · ba42c026

Ronald S. Bultje authored 11 years ago

Fixes a crash on Windows when building with MSVC.

Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf

ba42c026

Implement SSE2 block_error. · 54b2a596

Ronald S. Bultje authored 11 years ago

Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.

Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.

Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68

54b2a596
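
The overflow that 54b2a596 guards against is easy to see in a scalar sketch: one squared difference of two 16-bit coefficients can already exceed the 32-bit signed range. The function below is an illustrative stand-in with a hypothetical name and signature, not the `vp9_block_error()` prototype:

```c
#include <stdint.h>

/* Sum of squared differences between original and dequantized
 * coefficients, accumulated in 64 bits. */
int64_t block_error_64(const int16_t *coeff, const int16_t *dqcoeff,
                       int n) {
  int64_t error = 0;
  for (int i = 0; i < n; ++i) {
    int diff = coeff[i] - dqcoeff[i];
    /* Widen before multiplying: diff can be as large as 65535, and
     * 65535 * 65535 does not fit in a 32-bit signed int. */
    error += (int64_t)diff * diff;
  }
  return error;
}
```

With four coefficients at opposite ends of the 16-bit range the sum is 17179344900, well beyond what a 32-bit return value could hold.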

Add subtract_block SSE2 version and unit test. · 25c588b1

Ronald S. Bultje authored 11 years ago

3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e

25c588b1

20 Jun, 2013 - 2 commits

SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1

Ronald S. Bultje authored 11 years ago

Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.

Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9

1e6a32f1

Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581

Ronald S. Bultje authored 11 years ago

Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):

sse2   4x4:    99 ->   82 cycles
sse2   4x8:           128 cycles
sse2   8x4:           121 cycles
sse2   8x8:   149 ->  129 cycles
sse2   8x16:  235 ->  245 cycles (?)
sse2  16x8:   269 ->  203 cycles
sse2  16x16:  441 ->  349 cycles
sse2  16x32:          641 cycles
sse2  32x16:          643 cycles
sse2  32x32: 1733 -> 1154 cycles
sse2  32x64:         2247 cycles
sse2  64x32:         2323 cycles
sse2  64x64: 6984 -> 4442 cycles

ssse3  4x4:           100 cycles (?)
ssse3  4x8:           103 cycles
ssse3  8x4:            71 cycles
ssse3  8x8:           147 cycles
ssse3  8x16:          158 cycles
ssse3 16x8:   188 ->  162 cycles
ssse3 16x16:  316 ->  273 cycles
ssse3 16x32:          535 cycles
ssse3 32x16:          564 cycles
ssse3 32x32:          973 cycles
ssse3 32x64:         1930 cycles
ssse3 64x32:         1922 cycles
ssse3 64x64:         3760 cycles

Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d

8fb6c581

17 Jun, 2013 - 1 commit

Move subpixel variance function from common/ to encoder/. · d9fc4516

Ronald S. Bultje authored 11 years ago

This seems to be used only in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
define any functions.

Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b

d9fc4516

14 Jun, 2013 - 1 commit

Enable sse2 version of sad8x4/4x8 · c43af9a8

Jingning Han authored 11 years ago

The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1

c43af9a8

13 Jun, 2013 - 1 commit

Enable sse2 version of sad8x4/4x8 · 15f50e7b

Jingning Han authored 11 years ago

The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1

15f50e7b

12 Jun, 2013 - 1 commit

Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. · fa96eeb8

Ronald S. Bultje authored 11 years ago

Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56
to 4min42.

Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f

fa96eeb8

22 May, 2013 - 1 commit

Optimize variance functions · f4fcfe30

Yunqing Wang authored 11 years ago

Added SSE2 version of variance functions for super blocks.

Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d

f4fcfe30

01 May, 2013 - 1 commit

Remove unused quantize optimizations. · e43662e8

Johann authored 11 years ago

Files were copied from vp8 and never maintained.

Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871

e43662e8

26 Apr, 2013 - 2 commits

Whitespace nit · e3038ca8

Johann authored 11 years ago

Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66

e3038ca8

Normalize more intrinsic filenames · 863601c5

Johann authored 11 years ago

vp9_dequantize_x86 has only sse2 functions.

vp9_dct_sse2_intrinsics has no namespace collision and can drop
_intrinsics.

vp9_idct_mmx.h is unused.

Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec

863601c5

25 Apr, 2013 - 1 commit

Move dequant from BLOCKD to per-plane MACROBLOCKD · 15255eef

John Koleszar authored 11 years ago

This data can vary per-plane, but not per-block.

Change-Id: I1971b0b2c2e697d2118e38b54ef446e52f63c65a

15255eef

23 Apr, 2013 - 1 commit

Move src_diff to per-plane MACROBLOCK data · cbd1315a

John Koleszar authored 11 years ago

First in a series of commits making certain MACROBLOCK members
addressable per-plane. This commit also refactors the block subtraction
functions vp9_subtract_b, vp9_subtract_sby_c, etc to be
loops-over-planes and variable subsampling aware.

Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3

cbd1315a
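
A loop-over-planes, subsampling-aware subtraction of the kind cbd1315a describes might look like the sketch below. The `PlaneBuffers` struct, its field names, and both function names are hypothetical and only illustrate the idea; they are not the MACROBLOCK layout from libvpx.

```c
#include <stdint.h>

/* Hypothetical per-plane buffers; not the libvpx MACROBLOCK layout. */
typedef struct {
  const uint8_t *src;
  int src_stride;
  const uint8_t *pred;
  int pred_stride;
  int16_t *diff;
  int diff_stride;
  int subsampling_x; /* 1 if the plane is horizontally subsampled */
  int subsampling_y; /* 1 if the plane is vertically subsampled */
} PlaneBuffers;

/* Residual of one rectangular block: diff = src - pred. */
static void subtract_rect(int rows, int cols, int16_t *diff,
                          int diff_stride, const uint8_t *src,
                          int src_stride, const uint8_t *pred,
                          int pred_stride) {
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) diff[c] = (int16_t)(src[c] - pred[c]);
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}

/* Same code path for every plane; only the block size changes with
 * the plane's subsampling. */
void subtract_planes(const PlaneBuffers *planes, int num_planes,
                     int luma_w, int luma_h) {
  for (int p = 0; p < num_planes; ++p) {
    const PlaneBuffers *pl = &planes[p];
    subtract_rect(luma_h >> pl->subsampling_y,
                  luma_w >> pl->subsampling_x, pl->diff,
                  pl->diff_stride, pl->src, pl->src_stride, pl->pred,
                  pl->pred_stride);
  }
}
```

Because the block dimensions are derived from each plane's subsampling, the same loop would also cover chroma layouts other than 4:2:0.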

18 Apr, 2013 - 1 commit

Make the use of pred buffers consistent in MB/SB · 6f43ff58

Jingning Han authored 11 years ago

Use in-place buffers (dst of MACROBLOCKD) for macroblock prediction.
This makes the macroblock buffer handling consistent with that of
superblocks. Remove the predictor buffer from MACROBLOCKD.

Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df

6f43ff58

17 Apr, 2013 - 1 commit

Add SSE2 versions for rectangular sad and sad4d functions. · 0c481f4d

Ronald S. Bultje authored 11 years ago

About 11% overall encoder speedup with the sbsegment experiment enabled.

Change-Id: Iffb1bdba6932d9f11a6c791cda8697ccf9327183

0c481f4d

16 Apr, 2013 - 2 commits

Faster vp9_short_fdct4x4 and vp9_short_fdct8x4. · 5b6d33f9

Christian Duvivier authored 12 years ago

Scalar path is about 1.3x faster (2.1% overall encoder speedup).
SSE2 path is about 5.0x faster (8.4% overall encoder speedup).

Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda

5b6d33f9

Faster vp9_short_fdct4x4 and vp9_short_fdct8x4. · f13b69d0

Christian Duvivier authored 12 years ago

Scalar path is about 1.3x faster (2.1% overall encoder speedup).
SSE2 path is about 5.0x faster (8.4% overall encoder speedup).

Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda

f13b69d0

10 Apr, 2013 - 1 commit

Make RD superblock mode search size-agnostic. · b4f6098e

Ronald S. Bultje authored 11 years ago

Merge various super_block_yrd and super_block_uvrd versions into one
common function that works for all sizes. Make transform size selection
size-agnostic also. This fixes a slight bug in the intra UV superblock
code where it used the wrong transform size for txsz > 8x8, and stores
the txsz selection for superblocks properly (instead of forgetting it).
Lastly, it removes the trellis search that was done for 16x16 intra
predictors, since trellis is relatively expensive and should thus only
be done after RD mode selection.

Gives basically identical results on derf (+0.009%).

Change-Id: If4485c6f0a0fe4038b3172f7a238477c35a6f8d3

b4f6098e

04 Apr, 2013 - 1 commit

Move qcoeff, dqcoeff from BLOCKD to per-plane data · 4c05a051

John Koleszar authored 11 years ago

Start grouping data per-plane, as part of refactoring to support
additional planes, and chroma planes with other-than 4:2:0
subsampling.

Change-Id: Idb76a0e23ab239180c818025bae1f36f1608bb23

4c05a051

18 Mar, 2013 - 1 commit

Optimize 8x8 idct function · 6344c84c

Yunqing Wang authored 12 years ago

Wrote sse2 versions of vp9_short_idct8x8 and vp9_short_idct10_8x8.
Compared to the C version, the sse2 version is 2X faster. The decoder
test didn't show a noticeable gain since 8x8 idct doesn't take much
of the decoding time (less than 1% in my test).

Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3

6344c84c

15 Mar, 2013 - 1 commit

Faster vp9_short_fdct16x16. · 4418b790

Christian Duvivier authored 12 years ago

Scalar path is about 1.5x faster (3.1% overall encoder speedup).
SSE2 path is about 7.2x faster (7.8% overall encoder speedup).

Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289

4418b790

28 Feb, 2013 - 1 commit

mv dct_sse2.c dct_sse2_intrinsics.c to avoid collision · 8f270acf

Jim Bankoski authored 12 years ago

Change-Id: Id786be31da3c91d95d2955aa569ecdc6e66650df

8f270acf