The first optimization is nice, but if you also have SSSE3 (not SSE3) you can use PSHUFB. PSHUFB does the rotate in one instruction, but it needs a register to hold the shuffle mask and an extra load from memory. With triple interlacing on x86-64 you use 15 registers, leaving one free for the mask:
Code:
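uint8 rot16[16] = {  2,  3,  0,  1,
                     6,  7,  4,  5,
                    10, 11,  8,  9,
                    14, 15, 12, 13 }; // swaps the two 16-bit halves of each dword; must be 16-byte aligned for MOVDQA
...
MOVDQA xmm15,rot16 ; load the shuffle mask once
...
PSHUFB xmm0,xmm15 ; rotate each dword in xmm0 by 16 bits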
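For anyone who prefers intrinsics over raw assembly, here is a minimal sketch of the same PSHUFB rotate in C. The function names (`rot16_pshufb`, `rotl16_scalar`, `rot16_pshufb_self_test`) are made up for illustration, and the `target("ssse3")` attribute assumes GCC or Clang on x86-64. The mask bytes are the same as the rot16 table above.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 compiles to PSHUFB */

/* Scalar reference: rotate one 32-bit word by 16 bits. */
static uint32_t rotl16_scalar(uint32_t x) {
    return (x << 16) | (x >> 16);
}

/* Rotate four packed 32-bit words by 16 bits with a single byte shuffle.
   The target attribute (GCC/Clang, an assumption here) enables SSSE3 for
   this function only, so no extra compiler flag is needed. */
__attribute__((target("ssse3")))
static void rot16_pshufb(const uint32_t in[4], uint32_t out[4]) {
    /* _mm_setr_epi8 lists byte 0 first, matching the rot16 table order. */
    const __m128i mask = _mm_setr_epi8(2, 3, 0, 1, 6, 7, 4, 5,
                                       10, 11, 8, 9, 14, 15, 12, 13);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi8(v, mask);
    _mm_storeu_si128((__m128i *)out, v);
}

/* Check the shuffle against the scalar rotate on a few sample words. */
static void rot16_pshufb_self_test(void) {
    const uint32_t in[4] = {0x11223344u, 0xAABBCCDDu, 0x00000001u, 0xFFFF0000u};
    uint32_t out[4];
    rot16_pshufb(in, out);
    for (int i = 0; i < 4; i++)
        assert(out[i] == rotl16_scalar(in[i]));
}
```

Calling `rot16_pshufb_self_test()` once at startup is a cheap way to confirm the table really is a rotate by 16. The CPU still has to support SSSE3 at run time.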
I think PSHUFB is slow on some CPUs, so doing a PSHUFHW and a PSHUFLW (both plain SSE2) might actually be faster.
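A rough sketch of the PSHUFLW + PSHUFHW variant in C intrinsics, assuming an x86-64 compiler where SSE2 is always available. The names `rot16_sse2` and `rot16_sse2_self_test` are hypothetical. Whether this beats PSHUFB depends on the CPU, so it needs measuring.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_shufflelo_epi16 / _mm_shufflehi_epi16 */

/* Rotate four packed 32-bit words by 16 bits using two SSE2 word shuffles.
   No mask register and no memory load are needed. */
static void rot16_sse2(const uint32_t in[4], uint32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE(2, 3, 0, 1) swaps 16-bit words 0<->1 and 2<->3. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)); /* PSHUFLW: low 64 bits  */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)); /* PSHUFHW: high 64 bits */
    _mm_storeu_si128((__m128i *)out, v);
}

/* Compare against a scalar rotate by 16 on a few sample words. */
static void rot16_sse2_self_test(void) {
    const uint32_t in[4] = {0x11223344u, 0xAABBCCDDu, 0x00000001u, 0xFFFF0000u};
    uint32_t out[4];
    rot16_sse2(in, out);
    for (int i = 0; i < 4; i++)
        assert(out[i] == ((in[i] << 16) | (in[i] >> 16)));
}
```

The trade-off is two instructions per rotate instead of one, in exchange for keeping the sixteenth register free.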
The best option, though, is to check for XOP. It is only available on AMD, but it has a packed rotate instruction:
Code:
VPROTD xmm0,xmm0,16
Unlike the PSHUFB trick, which only handles a rotate by 16, this works in all rounds, saving 3 * 64 instructions.
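To make the instruction count concrete, here is a sketch of a generic packed rotate in C. The SSE2 path is my assumption about what VPROTD replaces: two shifts and an OR, plus a register copy in non-AVX code, so four instructions against one. The macro name `ROTL32X4` and the helpers are made up. The XOP path is only compiled when the compiler defines `__XOP__` (e.g. with `-mxop`).

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 shifts and OR */

#ifdef __XOP__
#include <x86intrin.h>
/* XOP: one VPROTD per rotate; n must be a compile-time constant. */
#define ROTL32X4(v, n) _mm_roti_epi32((v), (n))
#else
/* SSE2 fallback: PSLLD + PSRLD + POR (plus a MOVDQA copy without AVX).
   Note that v is evaluated twice; valid for 0 < n < 32. */
#define ROTL32X4(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))
#endif

/* Scalar reference rotate, valid for 0 < n < 32. */
static uint32_t rotl32_scalar(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Check two rotate amounts against the scalar reference. */
static void rotl32x4_self_test(void) {
    const uint32_t in[4] = {0x11223344u, 0xAABBCCDDu, 0x00000001u, 0x80000000u};
    uint32_t out[4];
    const __m128i v = _mm_loadu_si128((const __m128i *)in);

    _mm_storeu_si128((__m128i *)out, ROTL32X4(v, 7));
    for (int i = 0; i < 4; i++)
        assert(out[i] == rotl32_scalar(in[i], 7));

    _mm_storeu_si128((__m128i *)out, ROTL32X4(v, 16));
    for (int i = 0; i < 4; i++)
        assert(out[i] == rotl32_scalar(in[i], 16));
}
```

A build that has to run on both Intel and AMD would still need a run-time CPUID check before taking the XOP path. The compile-time switch here only keeps the sketch short.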

----------
The second optimization is not good because it trades a PXOR rd,rs for a MOVDQA rd,rs. With AVX it might be useful, because AVX lets you do VPXOR rd,rs1,rs2 (rd = rs1^rs2). Even with AVX, though, you need more registers: you now need two temp registers per interlace, which brings us to 18 ((4+1+1)*3) registers, but x86-64 has only 16. If you only interlace 2 times this will work, but the benefits of triple interlacing are probably worth more than saving 8 instructions. This might help with 32-bit code and AVX, since there you can't interlace without using memory.
OK, I got it: it is more than likely useful if you have AVX and XOP on x86-64, because you no longer need a MOVDQA (AVX) and no longer need a temp register to do a rotate (XOP). I just noticed that triple interlacing without this optimization fills every clock perfectly, assuming 3 bit operations per clock, 2 adds per clock, and a wait until the next clock before the result of a previous instruction can be used. That makes it 5+5 clocks vs 6+4 clocks. Oh wait, that's 5 2/3+4 clocks, so it might be faster.
----------
blazer wrote: can it be applied to cuda-multiforcer?
No, because it is CUDA only.