FAST* functions: Any use?

Discussion and support for the CUDA Multiforcers (Windows and Linux)

FAST* functions: Any use?

Postby Bitweasil » Mon Nov 08, 2010 1:00 am

Ignoring the fact that the FAST* functions right now are rather epically broken, does anyone find this concept of use?

I've been playing around, and can now get pretty darn nice performance with very short kernel run times (keeping the desktop GUI quite usable).

I can also gain some performance by hardcoding full-us-charset-95 into the code. Do many people run their tests with that particular setup? The full US charset? Enough to make it worth hardcoding?
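For reference, by full-us-charset-95 I mean the 95 printable US-ASCII characters, space through '~'. A throwaway sketch of the table, just to illustrate the idea; this is not the actual Multiforcer code:

Code: Select all
/* Illustrative only: generate the 95 printable US-ASCII characters, 0x20 (' ') through 0x7E ('~'). */
#include <stdio.h>

int main(void)
{
    char charset[96];                    /* 95 characters plus a NUL terminator */
    for (int i = 0; i < 95; i++)
        charset[i] = (char)(0x20 + i);
    charset[95] = '\0';

    printf("%d characters: %s\n", 95, charset);
    return 0;
}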
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: FAST* functions: Any use?

Postby blazer » Mon Nov 08, 2010 8:48 am

I never use the FAST* functions; I thought they were broken.
The full US charset is a bit too ambitious for me.

By the way, did you see the post "dalibdor" made at hashcat about optimizing MD5?

Code: Select all
Here comes one of the optimizations (a very small one) of the SSE2 code.
In steps 35, 39, 43 and 47 (round 3 of MD5, where rotating by 16 bits is performed), you can use:

a = _mm_shufflehi_epi16(a, 0xB1);
a = _mm_shufflelo_epi16(a, 0xB1);

instead of

tmp = _mm_slli_epi32(a, 16);
a = _mm_srli_epi32(a, 32-16);
a = _mm_or_si128(tmp, a);

Or if you are using assembler, the above means that you can use

PSHUFHW xmm0, xmm0, 0xB1
PSHUFLW xmm0, xmm0, 0xB1

for rotating the xmm0 register by 16 bits rather than

MOVDQA xmm7, xmm0
PSLLD xmm0, 16
PSRLD xmm7, 16
por xmm0, xmm7

This saves 2 instructions and 1 'temporary' xmm register. Round 3 includes 4 of these operations, so in total you can save up to 8 instructions.

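A quick way to sanity-check that trick on the CPU before touching any cracker code; this is my own throwaway sketch with plain SSE2 intrinsics, not code from hashcat or the Multiforcer:

Code: Select all
/* Sketch: confirm the PSHUFHW/PSHUFLW (0xB1) form equals the shift/or rotate by 16. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static __m128i rot16_shift(__m128i a)
{
    __m128i tmp = _mm_slli_epi32(a, 16);      /* a << 16 per 32-bit lane */
    a = _mm_srli_epi32(a, 32 - 16);           /* a >> 16 per 32-bit lane */
    return _mm_or_si128(tmp, a);
}

static __m128i rot16_shuffle(__m128i a)
{
    a = _mm_shufflehi_epi16(a, 0xB1);         /* swap 16-bit halves, upper two lanes */
    a = _mm_shufflelo_epi16(a, 0xB1);         /* swap 16-bit halves, lower two lanes */
    return a;
}

int main(void)
{
    __m128i x = _mm_set_epi32(0x11223344, 0x55667788, 0x99AABBCC, 0xDDEEFF00);
    uint32_t a[4], b[4];
    _mm_storeu_si128((__m128i *)a, rot16_shift(x));
    _mm_storeu_si128((__m128i *)b, rot16_shuffle(x));
    printf("%s\n", memcmp(a, b, sizeof a) == 0 ? "identical" : "DIFFERENT");
    return 0;
}

And the second optimization from the same post: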

Code: Select all
First two steps of round 3:
a += ((b ^ c) ^ d) + const + pass;
a = ((a<<const) | (a>>(32-const)) ) + b;

d += (a ^ (b ^ c)) + const + pass;
d = ((d<<const) | (d>>(32-const)) ) + a;

Optimized first two steps of round 3:
tmp = b ^ c;
a += (tmp ^ d) + const + pass;
a = ((a<<const) | (a>>(32-const)) ) + b;

d += (tmp ^ a) + const + pass;
d = ((d<<const) | (d>>(32-const)) ) + a;

You can rewrite all paired steps in round 3 in that manner (i.e. compute tmp in each (2n)th step and reuse it in the (2n+1)th step). Just to be sure, the variable tmp should be declared 'register' to avoid writing and reading 'tmp' to/from memory.

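And a tiny scalar sketch of my own showing that hoisting b^c into tmp gives bit-identical results; the constants, rotate amounts, and "pass" words below are stand-ins, not real MD5 values, since only the step structure matters:

Code: Select all
/* Sketch: the 'tmp = b ^ c' factoring from above does not change the result.
   All constants here are stand-ins; only the step structure matters. */
#include <stdint.h>
#include <stdio.h>

static uint32_t rotl32(uint32_t x, int s) { return (x << s) | (x >> (32 - s)); }

int main(void)
{
    uint32_t a = 0x67452301, b = 0xefcdab89, c = 0x98badcfe, d = 0x10325476;
    uint32_t k1 = 0x11111111, k2 = 0x22222222;   /* stand-ins for the step constants  */
    uint32_t p1 = 0x41414141, p2 = 0x42424242;   /* stand-ins for the password words  */
    int s1 = 4, s2 = 11;                         /* stand-ins for the rotate amounts  */

    /* Original form: b ^ c evaluated in both steps. */
    uint32_t a1 = a, d1 = d;
    a1 += ((b ^ c) ^ d1) + k1 + p1;  a1 = rotl32(a1, s1) + b;
    d1 += (a1 ^ (b ^ c)) + k2 + p2;  d1 = rotl32(d1, s2) + a1;

    /* Optimized form: b ^ c hoisted into tmp and reused in the second step. */
    uint32_t a2 = a, d2 = d, tmp = b ^ c;
    a2 += (tmp ^ d2) + k1 + p1;      a2 = rotl32(a2, s1) + b;
    d2 += (tmp ^ a2) + k2 + p2;      d2 = rotl32(d2, s2) + a2;

    printf("%s\n", (a1 == a2 && d1 == d2) ? "identical" : "DIFFERENT");
    return 0;
}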

Can it be applied to the cuda-multiforcer?
blazer
 
Posts: 104
Joined: Fri Jan 23, 2009 10:18 am

Re: FAST* functions: Any use?

Postby Sc00bz » Fri Nov 12, 2010 1:37 am

The first optimization is nice, but if you also have SSSE3 (not SSE3) you can use PSHUFB. PSHUFB does it in one instruction, but you need a temp register and an extra move from memory to load the mask. With triple interlacing on x86-64 you use 15 registers, leaving one open :):
Code: Select all
uint8_t rot16[16] = { 2, 3, 0, 1,
                      6, 7, 4, 5,
                     10,11, 8, 9,
                     14,15,12,13}; // I hope that's correct
...
MOVDQA xmm15,rot16    ; rot16 must be 16-byte aligned for MOVDQA
...
PSHUFB xmm0,xmm15     ; rotate each dword in xmm0 by 16 bits

I think PSHUFB is slow so doing PSHUFHW and PSHUFLW might actually be faster.
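For anyone who wants to check the mask instead of hoping: a quick SSSE3 sketch of mine (not Multiforcer code) that compares _mm_shuffle_epi8, the intrinsic behind PSHUFB, against the plain shift/or rotate:

Code: Select all
/* Sketch: verify the rot16 byte mask by comparing _mm_shuffle_epi8 (PSHUFB)
   against the plain shift/or rotate by 16.  Build with -mssse3. */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const __m128i rot16_mask = _mm_setr_epi8( 2, 3, 0, 1,
                                              6, 7, 4, 5,
                                             10,11, 8, 9,
                                             14,15,12,13);
    __m128i x = _mm_set_epi32(0x01234567, 0x89ABCDEF, 0x00FF00FF, 0xDEADBEEF);

    __m128i by_pshufb = _mm_shuffle_epi8(x, rot16_mask);           /* one op once the mask is loaded */
    __m128i by_shift  = _mm_or_si128(_mm_slli_epi32(x, 16),
                                     _mm_srli_epi32(x, 16));       /* reference rotate by 16 */

    uint32_t a[4], b[4];
    _mm_storeu_si128((__m128i *)a, by_pshufb);
    _mm_storeu_si128((__m128i *)b, by_shift);
    printf("mask is %s\n", memcmp(a, b, sizeof a) == 0 ? "correct" : "WRONG");
    return 0;
}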

But the best option is to check for XOP, which is only on AMD :(, because it has:
Code: Select all
VPROTD xmm0,xmm0,16

This is good for all rounds, saving 3 * 64 instructions :D

----------

The second optimization is not good because it trades a PXOR rd,rs for a MOVDQA rd,rs. Now with AVX this might be useful, because it lets you do VPXOR rd,rs1,rs2 (rd = rs1^rs2). Even with AVX you need more registers, because you now need two temp registers per interlace, which brings us to 18 ((4+1+1)*3) registers, but there are only 16 with x86-64. If you only interlace 2 times this will work, but the benefits of triple interlacing are probably worth more than saving 8 instructions. This might help with 32-bit code and AVX, since you can't interlace there without using memory.

OK, I got it: it is more than likely useful if you have AVX and XOP on x86-64, because you no longer need a MOVDQA (AVX) and no longer need a temp register to do a rotate (XOP). I just noticed that triple interlacing without this optimization fills every clock perfectly, assuming 3 bit operations per clock and 2 adds per clock, with results only usable on the clock after they are produced. That makes it 5+5 clocks vs 6+4 clocks... oh wait, that's 5 2/3 + 4 clocks, so it might be faster.

----------

blazer wrote: Can it be applied to the cuda-multiforcer?

No, because the Multiforcer is CUDA only.
Sc00bz
 
Posts: 93
Joined: Thu Jan 22, 2009 9:31 pm

Re: FAST* functions: Any use?

Postby Bitweasil » Fri Nov 12, 2010 3:05 am

I've actually had some other thoughts.

Fermi has an option to use 48 KB of shared memory. And, it seems, a lot of people use the tools to work on short hash lists.

I think I can optimize the code rather significantly for this case and speed things up a non-trivial amount. If I go with a 32 KB bitmap, that works nicely for reducing hits, and I have a few other ideas.
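Roughly what I have in mind for the bitmap, as a plain-C sketch of the idea; the 18-bit index, the names, and the sample values here are just for illustration, and the real thing would be a __shared__ array on the GPU rather than host code:

Code: Select all
/* Illustrative only: a 32 KB bitmap indexed by 18 bits of a hash word.
   Candidates that miss the bitmap cannot be in the hash list, so the
   expensive full comparison is skipped for almost all of them. */
#include <stdint.h>
#include <stdio.h>

#define BITMAP_BYTES (32 * 1024)            /* 32 KB = 2^18 bits */
#define BIT_INDEX(h) ((h) & 0x3FFFF)        /* low 18 bits of the hash word */

static uint8_t bitmap[BITMAP_BYTES];

static void bitmap_add(uint32_t hash_word)
{
    uint32_t bit = BIT_INDEX(hash_word);
    bitmap[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}

static int bitmap_maybe_present(uint32_t hash_word)
{
    uint32_t bit = BIT_INDEX(hash_word);
    return (bitmap[bit >> 3] >> (bit & 7)) & 1;   /* 1 = maybe in list, 0 = definitely not */
}

int main(void)
{
    uint32_t target_first_words[] = { 0x12345678, 0xDEADBEEF };   /* made-up hash list */
    for (size_t i = 0; i < sizeof target_first_words / sizeof target_first_words[0]; i++)
        bitmap_add(target_first_words[i]);

    /* A candidate only goes to the full hash-list comparison on a bitmap hit. */
    uint32_t candidate = 0xCAFEBABE;
    printf("candidate %s the bitmap\n",
           bitmap_maybe_present(candidate) ? "hits" : "misses");
    return 0;
}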

I'll see what I can do with them...
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

