- 24 Sep, 2013 - 1 commit
-
-
A.Mahfoodh authored
Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
-
- 07 Sep, 2013 - 1 commit
-
-
Jingning Han authored
The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implmentation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
-
- 05 Sep, 2013 - 1 commit
-
-
Jingning Han authored
The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go upto 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
-
- 01 Sep, 2013 - 1 commit
-
-
Jingning Han authored
This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
-
- 29 Aug, 2013 - 1 commit
-
-
Jingning Han authored
The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
-
- 27 Aug, 2013 - 1 commit
-
-
Yaowu Xu authored
In subpel_avg_variance functions, code similar to the following punpkldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers only have less 8 bytes a line, this caused valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
-
- 26 Aug, 2013 - 1 commit
-
-
Yaowu Xu authored
in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227
-
- 12 Aug, 2013 - 1 commit
-
-
Jingning Han authored
Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
-
- 06 Aug, 2013 - 3 commits
-
-
Jim Bankoski authored
also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d
-
Jingning Han authored
Resolve compile warnings on re-define FDCT32x32_2D template. Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad
-
Christian Duvivier authored
This is in preparation for the SSE2 version of the high-precision 32x32 forward DCT which will share a lot of code with the existing low precision version used for rate-distortion search. Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
-
- 11 Jul, 2013 - 1 commit
-
-
Ronald S. Bultje authored
Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e
-
- 10 Jul, 2013 - 1 commit
-
-
Jingning Han authored
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
-
- 03 Jul, 2013 - 1 commit
-
-
Jingning Han authored
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT hybrid transform coding. Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
-
- 02 Jul, 2013 - 1 commit
-
-
Ronald S. Bultje authored
If none of the 16 coefficients that we quantize per loop iteration are larger than the zbin, directly skip to the next round of coeffs, rather than doing a full quantize loop that will eventually result in 16 zeroes. This incurs a jump cost, but saves a lot of other work. 32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded no significantly positive results for smaller transforms, so is not used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles). Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
-
- 01 Jul, 2013 - 2 commits
-
-
Ronald S. Bultje authored
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
-
Ronald S. Bultje authored
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
-
- 29 Jun, 2013 - 2 commits
-
-
Christian Duvivier authored
43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
-
Jingning Han authored
This commit enables SSE2 4x4 foward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660
-
- 28 Jun, 2013 - 2 commits
-
-
Jingning Han authored
Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
-
Ronald S. Bultje authored
This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
-
- 26 Jun, 2013 - 1 commit
-
-
Yaowu Xu authored
The aligned array in parameter list caused win32 build to report c2719 error. This commit fixed the issue by make the parameter type a pointer instead of an array. Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
-
- 25 Jun, 2013 - 4 commits
-
-
Ronald S. Bultje authored
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
-
Jingning Han authored
Improve the round-trip precision to meet the unit test setttings. Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
-
Jingning Han authored
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles. Change-Id: I137758b81cd127b936175284310e81378db64552
-
Jingning Han authored
This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
-
- 21 Jun, 2013 - 4 commits
-
-
Ronald S. Bultje authored
Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
-
Ronald S. Bultje authored
Fixes a crash on Windows when building with MSVC. Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
-
Ronald S. Bultje authored
Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
-
Ronald S. Bultje authored
3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
-
- 20 Jun, 2013 - 2 commits
-
-
Ronald S. Bultje authored
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
-
Ronald S. Bultje authored
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available): sse2 4x4: 99 -> 82 cycles sse2 4x8: 128 cycles sse2 8x4: 121 cycles sse2 8x8: 149 -> 129 cycles sse2 8x16: 235 -> 245 cycles (?) sse2 16x8: 269 -> 203 cycles sse2 16x16: 441 -> 349 cycles sse2 16x32: 641 cycles sse2 32x16: 643 cycles sse2 32x32: 1733 -> 1154 cycles sse2 32x64: 2247 cycles sse2 64x32: 2323 cycles sse2 64x64: 6984 -> 4442 cycles ssse3 4x4: 100 cycles (?) ssse3 4x8: 103 cycles ssse3 8x4: 71 cycles ssse3 8x8: 147 cycles ssse3 8x16: 158 cycles ssse3 16x8: 188 -> 162 cycles ssse3 16x16: 316 -> 273 cycles ssse3 16x32: 535 cycles ssse3 32x16: 564 cycles ssse3 32x32: 973 cycles ssse3 32x64: 1930 cycles ssse3 64x32: 1922 cycles ssse3 64x64: 3760 cycles Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
-
- 17 Jun, 2013 - 1 commit
-
-
Ronald S. Bultje authored
This seems to only be used in the encoder. Also remove an empty wrapper file that contained forward declarations for this function, but didn't actually define any actual functions. Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
-
- 14 Jun, 2013 - 1 commit
-
-
Jingning Han authored
The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
-
- 13 Jun, 2013 - 1 commit
-
-
Jingning Han authored
The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
-
- 12 Jun, 2013 - 1 commit
-
-
Ronald S. Bultje authored
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
-
- 22 May, 2013 - 1 commit
-
-
Yunqing Wang authored
Added SSE2 version of variance functions for super blocks. Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
-
- 01 May, 2013 - 1 commit
-
-
Johann authored
Files were copied from vp8 and never maintained. Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871
-
- 26 Apr, 2013 - 2 commits