- 08 Jul, 2013 - 2 commits
-
-
Ronald S. Bultje authored
Overall, on all test sets, this gains about +0.2% on all metrics. City is a clip where this really hurts (-1.0% on all metrics), I'm not quite sure why yet. Maybe interesting to look into in the future. Change-Id: I6f0eecb20e72f0194633270d30bf00d76d9eae78
-
Deb Mukherjee authored
Skips mode searches for intra and compound inter modes depending on the best mode so far and the reference frames. The various heuristics to be used are selected by bits from a flag. The previous direction based intra mode search pruning is also absorbed in this framework. Specifically the flags and their impact are: 1) FLAG_SKIP_INTRA_BESTINTER (skip intra mode search for oblique directional modes and TM_PRED if the best so far is an inter mode) derfraw300: -0.15%, 10% speedup 2) FLAG_SKIP_INTRA_DIRMISMATCH (skip D27, D63, D117 and D153 mode search if the best so far is not one of the closest hor/vert/diagonal directions. derfraw300: -0.05%, about 9% speedup 3) FLAG_SKIP_COMP_BESTINTRA (skip compound prediction mode search if the best so far is an intra mode) derfraw300: -0.06%, about 7-8% speedup 4) FLAG_SKIP_COMP_REFMISMATCH (skip compound prediction search if the best single ref inter mode does not have the same ref as one of the two references being tested in the compound mode) derfraw300: -0.56%, about 10% speedup Change-Id: I1a736cd29b36325489e7af9f32698d6394b2c495
-
- 03 Jul, 2013 - 7 commits
-
-
Dmitry Kovalev authored
Change-Id: I221126f22ab9067348eb0efb8a73b15a8f49c3fd
-
Jingning Han authored
This commit allows encoder to detect the cumulative rate-distortion cost per transformed block inside a partition. If the cumulative rd cost is already above the best rd value, it terminates the rest operations and continue to next prediction mode test. It reduces the runtime of bus at target bit-rate 2000 from 308 second to 266 second, i.e., about 13% speed-up at no performance penalty. Change-Id: I5f15a3d8955d97031d5653006027866a00654e7a
-
Dmitry Kovalev authored
Change-Id: I32276552b3ea6dc1dce8e298be114cfe1019b31c
-
Yaowu Xu authored
Change-Id: Ib41f0643fdcc088500e7420708f4e72f1f64c710
-
Jingning Han authored
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT hybrid transform coding. Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
-
Paul Wilkins authored
When this is 0 (BLOCK_SIZE_AB4X4) we want to do the inter joint search for all sizes. Change-Id: Id40cd6fe7790e7e1165352b9cef5e12fa8c0bc88
-
Paul Wilkins authored
sf->unused_mode_skip_lvl. Tests modes as normal for all sizes at or below the given level. At larger sizes it skips all modes that were not chosen at any smaller size. Hence setting BLOCK_SIZE_SB64X64 is in effect off. Setting BLOCK_SIZE_AB4X4 will only consider modes that were chosen for one or more 4x4 blocks at larger sizes. sf->reference_masking. Do a test encode of the NONE partition at one size and create a reference frame mask based on the best rd choice. In the full search only allow this reference frame. Currently it is testing 64x64 and repeats this in the full search. This does not work well with Jim's Partition code just now and is disabled by default. Change-Id: I8f8c52d2ef4a0c08100150b0ea4155d1aaab93dd
-
- 02 Jul, 2013 - 16 commits
-
-
Dmitry Kovalev authored
Change-Id: I08fc6e474ff2c12cfa065bae4989c724276e2c83
-
Dmitry Kovalev authored
Change-Id: I143b430b7c24a964ccd0ebb75944cf317a072214
-
Yaowu Xu authored
This commit adds a speed feature where only squared partition are evaluated in partition picking. Enable this feature in cpu-used 2 reduces encoding time by ~30%. loss of compression: -0.9% on cif set -1.23% on stdhd Change-Id: Ia6fad11210f0b78365abb889f9245604513be5b9
-
Ronald S. Bultje authored
If none of the 16 coefficients that we quantize per loop iteration are larger than the zbin, directly skip to the next round of coeffs, rather than doing a full quantize loop that will eventually result in 16 zeroes. This incurs a jump cost, but saves a lot of other work. 32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded no significantly positive results for smaller transforms, so is not used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles). Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
-
Ronald S. Bultje authored
Change-Id: Ibfd2def2c088f4bc541a1de25990d73480b53d4b
-
Deb Mukherjee authored
This speed feature will skip searching the directional intra prediction modes D63, D117, D27, D153 if the best intra mode so far is not one of the diagonal, horizontal or vertical directions closest to the respective directions being tested. In other words, this implements a sort of binary search in the angular domain. Speedup: about 9-10% Results: -0.05% only on derfraw300. Change-Id: I413584c41f2a3e8dabfbdeb40718c8fc4b1d63a2
-
Deb Mukherjee authored
(1) Refines the modeling function and uses that to add some speed features. Specifically, intead of using a flag use_largest_txfm as a speed feature, an enum tx_size_search_method is used, of which two of the types are USE_FULL_RD and USE_LARGESTALL. Two other new types are added: USE_LARGESTINTRA (use largest only for intra) USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for inter) (2) Another change is that the framework for deciding transform type is simplified to use a heuristic count based method rather than an rd based method using txfm_cache. In practice the new method is found to work just as well - with derf only -0.01 down. The new method is more compatible with the new framework where certain rd costs are based on full rd and certain others are based on modeled rd or are not computed. In this patch the existing rd based method is still kept for use in the USE_FULL_RD mode. In the other modes, the count based method is used. However the recommendation is to remove it eventually since the benefit is limited, and will remove a lot of complications in the code (3) Finally a bug is fixed with the existing use_largest_txfm speed feature that causes mismatches when the lossless mode and 4x4 WH transform is forced. Results on derf: USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a pretty good compromise) USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction (currently the benefit of modeling is limited for txfm size selection, but keeping this enum as a placeholder) . USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing use_largest_txfm speed feature). Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
-
Deb Mukherjee authored
Uses mapping tables instead of complicated modulo/division operations for prob mapping for forward updates. No bit-stream or output change. Change-Id: Ifd9ce8ac1437835c305c94f64c18273c7a68f546
-
Dmitry Kovalev authored
Change-Id: I8a2983fb14274a6ac53681fa4cd5d4209cbd2905
-
Yunqing Wang authored
Added a speed feature in speed 1 to disable splitmv for HD (>=720) clips. Test result on stdhd set: 0.3% psnr loss and 0.07% ssim loss. Encoding speedup is 36%. (For reference: The test result on derf set showed 2% psnr loss and 1.6% ssim loss. Encoding speedup is 34%. SPLITMV should be enabled for small resolution videos.) Change-Id: I54f72b94f506c6d404b47c42e71acaa5374d6ee6
-
Jingning Han authored
Compute the rate-distortion cost per transformed block, and cumulate the cost through all blocks inside a partition. This allows encoder to detect if the cumulative rd cost is already above the best rd cost, thereby enabling early termination in the rate-distortion optimization search. Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
-
Paul Wilkins authored
Remove the use of sf->comp_inter_joint_search_thresh from the baseline speed 0. Approx +0.4% on derf. Change-Id: Icc14db98909830f40e5ac66130d40e78d2e55c71
-
Paul Wilkins authored
This reverts commit 13772781. Also fixes a spelling mistake. Change-Id: I5be8aa4d8d3c0323d4a6f41968a7b2c048949c3f
-
Yaowu Xu authored
Change-Id: Icc4f70f0b0f91c9e7d5d00eedd67841afe2f2679
-
Jim Bankoski authored
This cl converts use partition from last frame to do the following: if part is none,horz, vert -> try split if part != none and one of the children is not split - try none Change-Id: I5b6c659e35f3ac9f11c051b92ba98af6d7e8aa87 Signed-off-by:
Jim Bankoski <jimbankoski@google.com>
-
Dmitry Kovalev authored
Change-Id: Ia547a5dd7650b771fd00edd673ab9f920270731c
-
- 01 Jul, 2013 - 8 commits
-
-
Ronald S. Bultje authored
This should significantly speedup cost_coeffs(). Basically what the patch does is to make the neighbour arrays padded by one item to prevent an eob check in get_coef_context(), then it populates each col/row scan and left/top edge coefficient with two times the same neighbour - this prevents a single/double context branch in get_coef_context(). Lastly, it populates neighbour arrays in pixel order (rather than scan order), so we don't have to dereference the scantable to get the correct neighbours. Total encoding time of first 50 frames of bus (speed 0) at 1500kbps goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase. Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
-
Dmitry Kovalev authored
Change-Id: I5b413bc0884af0bda38c05332d86490103905b3b
-
Ronald S. Bultje authored
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
-
Ronald S. Bultje authored
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
-
Dmitry Kovalev authored
Moving vp9_default_inter_mode_probs array to vp9_entropymode.c. Change-Id: I88ebda86ccc07f2a43c6c01d4b37898214cfb6de
-
Yaowu Xu authored
Change-Id: I921c9faba6386535aaf717a54301dd346a9b8540
-
Paul Wilkins authored
Added a speed feature that focuses only on thresholds for new motion modes. Moved sf->comp_inter_joint_search_thresh into speed 1. This has ~+0.4% impact on quality at speed 0 as our quality reference baseline. Slight adjustment to baseline thresholds. Change-Id: I7ebf104f1fe29af77ed4837b2e84be065621bbe5
-
Dmitry Kovalev authored
Change-Id: I30ea91561ffac7e5065ba41b2d3ab7dedb720593
-
- 29 Jun, 2013 - 7 commits
-
-
Christian Duvivier authored
43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
-
Dmitry Kovalev authored
Change-Id: I83ca53bf6def871f199a382a671f26ad7cbecbca
-
chm authored
- Add add_constant_residual_8x8 16x16 32x32 functions - Tested under RealView debugger enviroment Change-Id: I5c3a432f651b49bf375de6496353706a33e3e68e
-
Dmitry Kovalev authored
Change-Id: I41711bb994f542c5ba3d0cefd9b2e79db3c2c3a1
-
James Zern authored
since: 92479d95 Make update_partition_context faster fixes: vp9/common/vp9_blockd.h:408:22: error: non-constant-expression cannot be narrowed from type 'int' to 'char' in initializer list [-Wc++11-narrowing] char pcvalue[2] = {~(0xe << boffset), ~(0xf <<boffset)}; ^~~~~~~~~~~~~~~~~ Change-Id: Id5b00b9a72d00a2b314081a23879bd1fa3ce983b
-
Jingning Han authored
This commit enables SSE2 4x4 foward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660
-
Yaowu Xu authored
Change-Id: I692d800af1f976c84a76f8bd66864c4b39540abc
-