I found a very good hands-on example of SSE in practice.
The example carries out the optimization in the following steps:
Naïve C++
Basic SSE
Batch Processing
16-byte memory alignment
Instruction Pairing
Prefetching
Increase Temporal Locality of Memory I/O
Application-Specific Specialization
In the end, the function's execution time dropped from 90 cycles/vector to 17 cycles/vector, heh.
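
To give a feel for what the first few steps (Naïve C++, Basic SSE, Batch Processing, 16-byte memory alignment) might look like, here is a minimal sketch. This note does not say which function the example optimizes, so the sketch assumes a 4x4 matrix times 4-component vector transform purely for illustration. The names `Vec4`, `Mat4`, `transform_naive`, `transform_sse` and `transform_batch_sse` are all hypothetical and not taken from the original example; only the `_mm_*` intrinsics are real SSE calls.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_load_ps, _mm_mul_ps, ...)
#include <cstddef>

// Hypothetical types for illustration. alignas(16) gives the 16-byte
// alignment that the aligned load/store intrinsics below require.
struct alignas(16) Vec4 { float v[4]; };
struct alignas(16) Mat4 { float col[4][4]; };  // col[j] holds column j

// Naive C++: out = m * in, one scalar row at a time.
// (in and out must not alias.)
inline void transform_naive(const Mat4& m, const Vec4& in, Vec4& out) {
    for (int r = 0; r < 4; ++r) {
        out.v[r] = m.col[0][r] * in.v[0] + m.col[1][r] * in.v[1]
                 + m.col[2][r] * in.v[2] + m.col[3][r] * in.v[3];
    }
}

// Basic SSE: broadcast each input component and accumulate the four
// matrix columns, so all four output components are computed at once.
inline void transform_sse(const Mat4& m, const Vec4& in, Vec4& out) {
    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);
    const __m128 x = _mm_set1_ps(in.v[0]);
    const __m128 y = _mm_set1_ps(in.v[1]);
    const __m128 z = _mm_set1_ps(in.v[2]);
    const __m128 w = _mm_set1_ps(in.v[3]);
    const __m128 r = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
        _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
    _mm_store_ps(out.v, r);  // aligned store, relies on alignas(16)
}

// Batch processing: load the matrix columns once and reuse them for
// every vector, instead of reloading them on each call.
inline void transform_batch_sse(const Mat4& m, const Vec4* in, Vec4* out,
                                std::size_t n) {
    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);
    for (std::size_t i = 0; i < n; ++i) {
        const __m128 x = _mm_set1_ps(in[i].v[0]);
        const __m128 y = _mm_set1_ps(in[i].v[1]);
        const __m128 z = _mm_set1_ps(in[i].v[2]);
        const __m128 w = _mm_set1_ps(in[i].v[3]);
        const __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
            _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_store_ps(out[i].v, r);
    }
}
```

The later steps in the list (instruction pairing, prefetching, temporal locality, application-specific specialization) depend much more on the target CPU and the data layout, so they are left out of this sketch. Note also that the SSE versions add the four products in a different order than the naive loop, so results can differ in the last bits for general floating-point inputs.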