Commits · ad6ed536d5084a5839a57f8c02b73a5acc010b95 · BC / public / external / libvpx

24 Sep, 2013 - 1 commit

Number of instructions in fdct4_1d_sse2 reduced by two. · 13c7715a

A.Mahfoodh authored 11 years ago

Mathematically the results are the same.

Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0

13c7715a

07 Sep, 2013 - 1 commit

Fix overflow issue in 16x16 quantization SSSE3 · 09bc942b

Jingning Han authored 11 years ago

The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause an overflow in the SSSE3
implementation of 16x16 block quantization. This commit fixes the
issue by replacing addition with saturated addition.

Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e

09bc942b

05 Sep, 2013 - 1 commit

Use saturated addition in SSSE3 of 32x32 quant · 458c2833

Jingning Han authored 11 years ago

The 32x32 forward transform can potentially reach a peak coefficient
value close to 32700, while the rounding factor can go up to 610.
This could cause an overflow in the SSSE3 implementation of the 32x32
quantization process.

This commit resolves this issue by replacing the addition operations
with saturated addition operations in 32x32 block quantization.

Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70

458c2833
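
The overflow described in 458c2833 can be reproduced with scalar arithmetic. The sketch below is not the libvpx code: `add_wrap16` and `add_sat16` are hypothetical helpers that model one 16-bit lane of a wrapping packed add and of a saturating packed add respectively, and the operands are the peak coefficient and rounding factor quoted in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a wrapping 16-bit add: the sum is reduced modulo 2^16 and
 * reinterpreted as a signed value. */
int16_t add_wrap16(int16_t a, int16_t b) {
  const uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b);
  return (int16_t)(u >= 0x8000 ? (int32_t)u - 0x10000 : (int32_t)u);
}

/* One lane of a saturating 16-bit add: the sum is clamped to the
 * representable range instead of wrapping. */
int16_t add_sat16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

/* Worked example: coefficient 32700 plus rounding factor 610. */
void overflow_example(void) {
  assert(add_wrap16(32700, 610) == -32226);   /* 33310 wraps negative */
  assert(add_sat16(32700, 610) == INT16_MAX); /* clamps at 32767 */
}
```

With wrapping addition the large positive coefficient turns into a large negative one, so the sign of the quantized value flips. Saturation keeps the sign and costs at most a small clipping error at the very top of the range.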

01 Sep, 2013 - 1 commit

Fix 32x32 forward transform SSE2 version · 3cf46fa5

Jingning Han authored 11 years ago

This commit fixes the potential overflow issue in the SSE2
implementation of the 32x32 forward DCT. It resolves the corrupted
coded frames at the borders of scenes.

Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9

3cf46fa5

29 Aug, 2013 - 1 commit

Fix overflow issue in SSSE3 32x32 quantization · abff6788

Jingning Han authored 11 years ago

The 32x32 quantization process can potentially have the intermediate
stacks over 16-bit range, thereby causing enc/dec mismatch. This commit
fixes this overflow issue in the SSSE3 implementation, as well as the
prototype, of 32x32 quantization.

This fixes issue 607 from webm@googlecode.

Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806

abff6788

27 Aug, 2013 - 1 commit

Fixed the reading of too many bytes · 9482c079

Yaowu Xu authored 11 years ago

In subpel_avg_variance functions, code similar to the following

punpckldq m2, [addr]

actually reads 8 bytes. For functions that are supposed to work on
buffers that have fewer than 8 bytes per line, this caused a valgrind
error of reading uninitialized memory.

Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34

9482c079

26 Aug, 2013 - 1 commit

Fix the reading of too many input pixels · 6c5433c8

Yaowu Xu authored 11 years ago

in VP9_get4x4var_mmx

Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227

6c5433c8

12 Aug, 2013 - 1 commit

SSE2 high precision 32x32 forward DCT · 78136edc

Jingning Han authored 11 years ago

Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.

Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56

78136edc

06 Aug, 2013 - 3 commits

variance x86inc guards · 5b307886

Jim Bankoski authored 11 years ago

also fixed bug in sad calcs

Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d

5b307886

Place holder for high-precision 32x32 fdct · 28566a6c

Jingning Han authored 11 years ago

Resolve compile warnings about redefinition of the FDCT32x32_2D template.

Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad

28566a6c

Move fdct32x32 SSE2 implementation in separate file. · 3d98205f

Christian Duvivier authored 11 years ago

This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT which will share a lot of code with the existing
low precision version used for rate-distortion search.

Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0

3d98205f

11 Jul, 2013 - 1 commit

Remove unused fwalsh/fdct x86 SIMD implementations. · c13e0bcb

Ronald S. Bultje authored 11 years ago

Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e

c13e0bcb

10 Jul, 2013 - 1 commit

SSE2 16x16 ADST/DCT hybrid transform · 11442353

Jingning Han authored 11 years ago

This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.

Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230

11442353

03 Jul, 2013 - 1 commit

Refactor SSE2 8x8 functional units · 2cb75c96

Jingning Han authored 11 years ago

These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
hybrid transform coding.

Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d

2cb75c96

02 Jul, 2013 - 1 commit

Use pmovmskb to skip quantize loops over empty coefficients. · e5fb4b61

Ronald S. Bultje authored 11 years ago

If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).

Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647

e5fb4b61
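
The skip test described in e5fb4b61 can be sketched with SSE2 intrinsics, where `_mm_movemask_epi8` is the intrinsic form of pmovmskb. Everything here is an illustration under stated assumptions: the function names are hypothetical, the quantization rule (divide by a step size) is a deliberately simplified stand-in for the real quantizer, and the scalar branch exists only so the example also builds where `__SSE2__` is not defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns nonzero if any of 16 coefficients has magnitude above zbin.
 * Assumes no coefficient equals INT16_MIN. */
int any_coeff_above_zbin(const int16_t *coeff, int16_t zbin) {
#if defined(__SSE2__)
  const __m128i zero = _mm_setzero_si128();
  const __m128i zb = _mm_set1_epi16(zbin);
  __m128i c0 = _mm_loadu_si128((const __m128i *)coeff);
  __m128i c1 = _mm_loadu_si128((const __m128i *)(coeff + 8));
  /* |x| computed as max(x, -x), since SSE2 has no packed abs. */
  c0 = _mm_max_epi16(c0, _mm_sub_epi16(zero, c0));
  c1 = _mm_max_epi16(c1, _mm_sub_epi16(zero, c1));
  /* Each lane becomes all ones where |coeff| > zbin. */
  const __m128i gt =
      _mm_or_si128(_mm_cmpgt_epi16(c0, zb), _mm_cmpgt_epi16(c1, zb));
  /* pmovmskb: collect the top bit of every byte into an integer mask. */
  return _mm_movemask_epi8(gt) != 0;
#else
  for (int i = 0; i < 16; i++) {
    const int a = coeff[i] < 0 ? -coeff[i] : coeff[i];
    if (a > zbin) return 1;
  }
  return 0;
#endif
}

/* Simplified quantizer that skips groups of 16 with nothing above zbin.
 * Returns the number of groups skipped. */
int quantize_skip_empty(const int16_t *coeff, int n, int16_t zbin,
                        int16_t step, int16_t *qcoeff) {
  int skipped = 0;
  assert(n % 16 == 0 && step > 0);
  for (int i = 0; i < n; i += 16) {
    if (!any_coeff_above_zbin(coeff + i, zbin)) {
      /* The full loop would produce 16 zeroes anyway. */
      memset(qcoeff + i, 0, 16 * sizeof(*qcoeff));
      skipped++;
      continue;
    }
    for (int j = i; j < i + 16; j++) {
      const int a = coeff[j] < 0 ? -coeff[j] : coeff[j];
      const int q = a > zbin ? a / step : 0;
      qcoeff[j] = (int16_t)(coeff[j] < 0 ? -q : q);
    }
  }
  return skipped;
}
```

The early exit pays off only when empty groups are common, which is consistent with the commit's observation that 32x32 blocks benefit while 8x8 and 16x16 blocks do not.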

01 Jul, 2013 - 2 commits

Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd

Ronald S. Bultje authored 11 years ago

Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.

Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87

c8defcfd

Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab

Ronald S. Bultje authored 11 years ago

Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code
is x86-64 only; it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.

Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904

7353ceab

29 Jun, 2013 - 2 commits

SSE2 version of vp9_short_fdct32x32_rd. · 466e0cf3

Christian Duvivier authored 11 years ago

43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0

466e0cf3

Enable SSE2 4x4 ADST/DCT transform · 1109b6b8

Jingning Han authored 11 years ago

This commit enables the SSE2 4x4 forward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660

1109b6b8

28 Jun, 2013 - 2 commits

Fix switch statement in 8x8 transform · 9def7f72

Jingning Han authored 11 years ago

Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f

9def7f72

Make coefficient skip condition an explicit RD choice. · af660715

Ronald S. Bultje authored 11 years ago

This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.

The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.

Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.

Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4

af660715

26 Jun, 2013 - 1 commit

fixed a compiling problem with MSVC win32 build · 60dc7375

Yaowu Xu authored 11 years ago

The aligned array in the parameter list caused the win32 build to
report a C2719 error. This commit fixed the issue by making the
parameter type a pointer instead of an array.

Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992

60dc7375
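
The shape of the fix in 60dc7375 can be sketched in portable C11. This is an assumption-laden illustration, not the libvpx change itself: `sum16` and `sum16_example` are hypothetical names, and standard `alignas` stands in for whatever alignment macro the original code used. The point is that the alignment request moves from the parameter list to the caller's array definition.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* The parameter is a plain pointer, so no alignment attribute appears in
 * the parameter list for a 32-bit MSVC build to reject. */
int32_t sum16(const int16_t *coeff) {
  /* Alignment is now a caller obligation; checked here for illustration. */
  assert(((uintptr_t)coeff & 15) == 0);
  int32_t sum = 0;
  for (int i = 0; i < 16; i++) sum += coeff[i];
  return sum;
}

/* The caller declares the aligned storage and passes its address. */
int32_t sum16_example(void) {
  alignas(16) int16_t coeff[16];
  for (int i = 0; i < 16; i++) coeff[i] = (int16_t)i;
  return sum16(coeff);
}
```

An array parameter decays to a pointer in C anyway, so the callee sees the same thing in both forms. Only the declaration syntax changes.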

25 Jun, 2013 - 4 commits

Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223

Ronald S. Bultje authored 11 years ago

Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.

Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd

c24d9223
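
An averaging SAD of the kind added in c24d9223 can be sketched in scalar C as follows. The function name, the parameter order, the contiguous layout of `second_pred`, and the round-half-up average are all assumptions made for illustration rather than the confirmed libvpx interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between src and the per-pixel average of two
 * predictors. second_pred is assumed to be stored contiguously with a
 * stride equal to width. */
unsigned int sad_avg(const uint8_t *src, int src_stride,
                     const uint8_t *ref, int ref_stride,
                     const uint8_t *second_pred, int width, int height) {
  unsigned int sad = 0;
  assert(width > 0 && height > 0);
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      /* Rounded average of the two predictors (round half up). */
      const int avg = (ref[x] + second_pred[x] + 1) >> 1;
      sad += (unsigned int)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += width;
  }
  return sad;
}
```

Folding the average into the SAD loop avoids writing the averaged predictor to a temporary buffer for every candidate position in the motion search.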

Tune the rounding operations in 8x8 ADST/DCT sse2 · 0084e61d

Jingning Han authored 11 years ago

Improve the round-trip precision to meet the unit test settings.

Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79

0084e61d

Use aligned buffer operations in 8x8/16x16 2D-DCT · 82d504b5

Jingning Han authored 11 years ago

This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.

Change-Id: I137758b81cd127b936175284310e81378db64552

82d504b5

Enable sse2 implementation of 8x8 ADST/DCT · a32a086d

Jingning Han authored 11 years ago

This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f

a32a086d

21 Jun, 2013 - 4 commits

Remove emms - that shouldn't be there. · fc033b38

Ronald S. Bultje authored 11 years ago

Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897

fc033b38

Add missing SECTION .text marker in assembly file. · ba42c026

Ronald S. Bultje authored 11 years ago

Fixes a crash on Windows when building with MSVC.

Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf

ba42c026

Implement SSE2 block_error. · 54b2a596

Ronald S. Bultje authored 11 years ago

Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.

Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.

Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68

54b2a596
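
The reason for the 64bit return value in 54b2a596 shows up with a short worked case. The sketch below is a scalar illustration with a hypothetical name and signature, not the actual vp9_block_error() prototype. With 1024 coefficients (one 32x32 block) that each differ by 2000, the sum of squares is 1024 * 4,000,000 = 4,096,000,000, which exceeds INT32_MAX (2,147,483,647).

```c
#include <assert.h>
#include <stdint.h>

/* Sum of squared differences between original and dequantized
 * coefficients, accumulated in 64 bits so that large blocks cannot
 * overflow the result. */
int64_t block_error64(const int16_t *coeff, const int16_t *dqcoeff, int n) {
  int64_t error = 0;
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    const int32_t diff = (int32_t)coeff[i] - (int32_t)dqcoeff[i];
    /* Widen before multiplying so the product is computed in 64 bits. */
    error += (int64_t)diff * diff;
  }
  return error;
}
```

Callers that stored the result in a 32-bit variable would have to be widened as well, which matches the commit's note that all callers were changed.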

Add subtract_block SSE2 version and unit test. · 25c588b1

Ronald S. Bultje authored 11 years ago

3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e

25c588b1

20 Jun, 2013 - 2 commits

SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1

Ronald S. Bultje authored 11 years ago

Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.

Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9

1e6a32f1

Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581

Ronald S. Bultje authored 11 years ago

Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):

sse2   4x4:    99 ->   82 cycles
sse2   4x8:           128 cycles
sse2   8x4:           121 cycles
sse2   8x8:   149 ->  129 cycles
sse2   8x16:  235 ->  245 cycles (?)
sse2  16x8:   269 ->  203 cycles
sse2  16x16:  441 ->  349 cycles
sse2  16x32:          641 cycles
sse2  32x16:          643 cycles
sse2  32x32: 1733 -> 1154 cycles
sse2  32x64:         2247 cycles
sse2  64x32:         2323 cycles
sse2  64x64: 6984 -> 4442 cycles

ssse3  4x4:           100 cycles (?)
ssse3  4x8:           103 cycles
ssse3  8x4:            71 cycles
ssse3  8x8:           147 cycles
ssse3  8x16:          158 cycles
ssse3 16x8:   188 ->  162 cycles
ssse3 16x16:  316 ->  273 cycles
ssse3 16x32:          535 cycles
ssse3 32x16:          564 cycles
ssse3 32x32:          973 cycles
ssse3 32x64:         1930 cycles
ssse3 64x32:         1922 cycles
ssse3 64x64:         3760 cycles

Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d

8fb6c581

17 Jun, 2013 - 1 commit

Move subpixel variance function from common/ to encoder/. · d9fc4516

Ronald S. Bultje authored 11 years ago

This seems to be used only in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
define any actual functions.

Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b

d9fc4516

14 Jun, 2013 - 1 commit

Enable sse2 version of sad8x4/4x8 · c43af9a8

Jingning Han authored 11 years ago

The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1

c43af9a8

13 Jun, 2013 - 1 commit

Enable sse2 version of sad8x4/4x8 · 15f50e7b

Jingning Han authored 11 years ago

The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1

15f50e7b

12 Jun, 2013 - 1 commit

Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. · fa96eeb8

Ronald S. Bultje authored 11 years ago

Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56
to 4min42.

Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f

fa96eeb8

22 May, 2013 - 1 commit

Optimize variance functions · f4fcfe30

Yunqing Wang authored 11 years ago

Added SSE2 version of variance functions for super blocks.

Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d

f4fcfe30

01 May, 2013 - 1 commit

Remove unused quantize optimizations. · e43662e8

Johann authored 11 years ago

Files were copied from vp8 and never maintained.

Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871

e43662e8

26 Apr, 2013 - 2 commits

Whitespace nit · e3038ca8

Johann authored 11 years ago

Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66

e3038ca8

Normalize more intrinsic filenames · 863601c5

Johann authored 11 years ago

vp9_dequantize_x86 has only sse2 functions.

vp9_dct_sse2_intrinsics has no namespace collision and can drop
_intrinsics.

vp9_idct_mmx.h is unused.

Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec

863601c5