* VPRO[LR] (bitwise rotate on vectors, super-valuable for hashing; yet another one that replaces what used to be 3 ops: shift, shift, or)
* VPERM[IT]2* (permute across two source registers instead of one - PPC/SPUs had this, ARM has this; it's yet another one that usually replaces 3 instructions with one: two permutes and an OR or similar. And 2 of those 3 go to the often-contended full shuffle network)
* Broadcast load-operands (4-16x reduction in L1D$ space for all-lane constants for free)
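
To make the first two bullets concrete, here is a rough scalar model in portable C of what the lanes compute. It is only a sketch of the semantics, not real intrinsics code: `LANES`, `rotl32`, `vec_rotl32` and `vec_permute2` are names made up for this example, and the index layout (low 4 bits pick the lane, bit 4 picks the source) is my reading of the 32-bit-lane, 512-bit form of the two-source permute.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumption: 16 x 32-bit lanes, i.e. one 512-bit register */

/* Rotate-left of one lane the old way: shift, shift, OR (three ops).
 * The count is masked so that n == 0 and n == 32 stay well-defined in C. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Scalar model of a per-lane vector rotate: every lane rotated by the same n.
 * A native vector rotate does this whole loop body as one instruction. */
static void vec_rotl32(uint32_t dst[LANES], const uint32_t src[LANES], unsigned n)
{
    for (size_t i = 0; i < LANES; i++)
        dst[i] = rotl32(src[i], n);
}

/* Scalar model of a two-source permute: each index selects one of the
 * 2 * LANES elements held in a and b. Low 4 bits pick the lane, bit 4 picks
 * the source. Without it you would permute a, permute b, then OR or blend
 * the two results. dst must not alias a or b in this model. */
static void vec_permute2(uint32_t dst[LANES], const uint32_t a[LANES],
                         const uint32_t b[LANES], const uint32_t idx[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        uint32_t sel = idx[i] & 31;
        dst[i] = (sel & 16) ? b[sel & 15] : a[sel & 15];
    }
}
```

For instance, `rotl32(0x80000001u, 1)` wraps the top bit around to give `0x00000003`, and an index of `16 + k` in `vec_permute2` pulls lane `k` from `b` rather than `a`.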