    subpel variance neon: reduce stack usage · f3c97ed3
    Johann authored
    Unlike x86, Arm does not impose additional alignment restrictions on
    vector loads. For incoming values to the first pass, the code uses
    vld1_u32(), which typically does require 4-byte alignment. However, as
    the first pass operates on user-supplied values, we must handle
    unaligned values anyway (and already do; see mem_neon.h).
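
A minimal sketch of the unaligned-load idea in portable C, for illustration only. The helper names are hypothetical and are not the functions in mem_neon.h; the point is that copying through memcpy avoids assuming 4-byte alignment for user-supplied rows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from a pointer that carries no
 * alignment guarantee. memcpy lets the compiler choose a safe load;
 * dereferencing (const uint32_t *)buf would assume 4-byte alignment. */
static uint32_t load_u32_unaligned(const uint8_t *buf) {
  uint32_t v;
  assert(buf != NULL);
  memcpy(&v, buf, sizeof(v));
  return v;
}

/* Hypothetical helper: gather two 4-pixel rows, 'stride' bytes apart,
 * into one contiguous 8-byte block, as a first pass over a 4-wide
 * block might do before filtering. */
static void load_two_rows_4(const uint8_t *src, int stride, uint8_t out[8]) {
  const uint32_t r0 = load_u32_unaligned(src);
  const uint32_t r1 = load_u32_unaligned(src + stride);
  memcpy(out, &r0, sizeof(r0));
  memcpy(out + 4, &r1, sizeof(r1));
}
```

Neither the source pointer nor the stride needs to be a multiple of 4 here, which is the situation with caller-provided buffers.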
    
    For the local temporary values, however, there is no stride, and the
    load uses vld1_u8(), which does not require 4-byte alignment.
    
    There are 3 temporary structures. In the C code, one is uint16_t. The
    Arm code saturates between passes but still passes the tests. If this
    becomes an issue, new functions will be needed.
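
One way to see why an 8-bit intermediate can match a 16-bit one is sketched below. It assumes a 2-tap filter with non-negative taps that sum to 128 and a rounding shift by 7; those values and all function names are assumptions of this sketch, not taken from this change.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: taps are non-negative and sum to 128. */
enum { kFilterBits = 7 };

/* One filtered pixel kept in a 16-bit intermediate. */
static uint16_t filter_px_u16(uint8_t a, uint8_t b, const uint8_t taps[2]) {
  const unsigned sum = a * taps[0] + b * taps[1] + (1u << (kFilterBits - 1));
  return (uint16_t)(sum >> kFilterBits);
}

/* The same pixel, saturated to 8 bits before the next pass. */
static uint8_t filter_px_u8_sat(uint8_t a, uint8_t b, const uint8_t taps[2]) {
  const unsigned v = filter_px_u16(a, b, taps);
  return (uint8_t)(v > 255 ? 255 : v);
}

/* Returns 1 if both intermediates agree for every pair of 8-bit inputs. */
static int intermediates_agree(const uint8_t taps[2]) {
  assert(taps[0] + taps[1] == 128);
  for (int a = 0; a < 256; ++a) {
    for (int b = 0; b < 256; ++b) {
      if (filter_px_u16((uint8_t)a, (uint8_t)b, taps) !=
          filter_px_u8_sat((uint8_t)a, (uint8_t)b, taps)) {
        return 0;
      }
    }
  }
  return 1;
}
```

Under these assumptions the largest intermediate is (255 * 128 + 64) >> 7 = 255, so the saturation never changes a value. A filter whose taps sum to more than 128 would break that, which is the case where separate functions would be needed.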
    
    Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1