Cryptohaze.com

by **Bitweasil** » Mon Nov 08, 2010 1:00 am

Ignoring the fact that the FAST* functions right now are rather epically broken, does anyone find this concept of use?

I've been playing around, and can get pretty darn nice performance on very short kernel lengths (keeping the desktop GUI quite useful) now.

I can also gain some by hardcoding full-us-charset-95 into the code. Do many people run their tests with that particular setup? Full US charset? Enough to make it useful to hardcode?

by **blazer** » Mon Nov 08, 2010 8:48 am

I never use the FAST functions; I thought they were broken.
Wide US charset is a bit too ambitious for me.

By the way, did you see the post "dalibdor" made at hashcat about optimizing MD5?

Code:
Here comes one of the optimizations (a very small one) of SSE2 code. In steps 35, 39, 43 and 47 (round 3 of MD5, where rotating by 16 bits is performed), you can use:

    a = _mm_shufflehi_epi16(a, 0xB1);
    a = _mm_shufflelo_epi16(a, 0xB1);

instead of

    tmp = _mm_slli_epi32(a, 16);
    a = _mm_srli_epi32(a, 32-16);
    a = tmp | a;

Or if you are using assembler, the above means that you can use

    PSHUFHW xmm0, xmm0, 0xB1
    PSHUFLW xmm0, xmm0, 0xB1

for rotating the xmm0 register by 16 bits rather than

    MOVDQA xmm7, xmm0
    PSLLD xmm0, 16
    PSRLD xmm7, 16
    POR xmm0, xmm7

It means you can save 2 instructions and 1 'temporary' xmm register. Round 3 includes 4 of these operations, so you can save up to 8 instructions in total.
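To make the quoted claim easy to check, here is a minimal sketch in C with SSE2 intrinsics that runs both rotate-by-16 variants on the same input and compares the results. The function names (`rotl16_shift`, `rotl16_shuffle`, `rot16_variants_agree`, `rot16_self_check`) are made up for this illustration and do not come from hashcat or the multiforcer. It assumes an x86 target where `<emmintrin.h>` is available.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Rotate each 32-bit lane left by 16 using two shifts and an OR. */
static __m128i rotl16_shift(__m128i a)
{
    __m128i hi = _mm_slli_epi32(a, 16);
    __m128i lo = _mm_srli_epi32(a, 32 - 16);
    return _mm_or_si128(hi, lo);
}

/* Same rotation, done by swapping the two 16-bit halves of every lane.
 * 0xB1 selects words in the order 1,0,3,2 within each 64-bit half. */
static __m128i rotl16_shuffle(__m128i a)
{
    a = _mm_shufflehi_epi16(a, 0xB1);
    return _mm_shufflelo_epi16(a, 0xB1);
}

/* Returns 1 if both variants produce identical output for four words. */
static int rot16_variants_agree(const uint32_t in[4])
{
    uint32_t x[4], y[4];
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)x, rotl16_shift(v));
    _mm_storeu_si128((__m128i *)y, rotl16_shuffle(v));
    return memcmp(x, y, sizeof x) == 0;
}

/* Small built-in check on arbitrary sample words. */
static void rot16_self_check(void)
{
    const uint32_t sample[4] = {0x12345678u, 0xdeadbeefu, 0x00000001u, 0xffff0000u};
    assert(rot16_variants_agree(sample));
}
```

Calling `rot16_self_check()` from a test harness is enough to confirm the two forms are interchangeable. Whether the shuffle pair is actually faster depends on the CPU, so it is worth timing both.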

Code:
First two steps of round 3:

    a += ((b ^ c) ^ d) + const + pass;
    a = ((a<<const) | (a>>(32-const)) ) + b;
    d += (a ^ (b ^ c)) + const + pass;
    d = ((d<<const) | (d>>(32-const)) ) + a;

Optimized first two steps of round 3:

    tmp = b ^ c;
    a += (tmp ^ d) + const + pass;
    a = ((a<<const) | (a>>(32-const)) ) + b;
    d += (tmp ^ a) + const + pass;
    d = ((d<<const) | (d>>(32-const)) ) + a;

You can rewrite all pairs of steps in round 3 in that manner (i.e. compute tmp in each (2n)th step and use it in each (2n+1)th step). To be safe, the variable tmp should be declared as 'register' to avoid writing and reading 'tmp' to/from memory.
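The shared-XOR rewrite quoted above can be checked in plain scalar C. This is only a sketch: the `md5_state` struct, the function names, and the `k0/m0/k1/m1` parameters (standing in for the post's `const` and `pass`) are assumptions made for the example. The rotate amounts 4 and 11 are the ones MD5 uses for the first two steps of round 3.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t a, b, c, d; } md5_state;

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* First two round-3 steps, computing b ^ c twice. */
static void round3_pair_plain(md5_state *s, uint32_t k0, uint32_t m0,
                              uint32_t k1, uint32_t m1)
{
    s->a += ((s->b ^ s->c) ^ s->d) + k0 + m0;
    s->a = rotl32(s->a, 4) + s->b;
    s->d += (s->a ^ (s->b ^ s->c)) + k1 + m1;
    s->d = rotl32(s->d, 11) + s->a;
}

/* Same two steps, computing b ^ c once and reusing it. */
static void round3_pair_shared(md5_state *s, uint32_t k0, uint32_t m0,
                               uint32_t k1, uint32_t m1)
{
    uint32_t tmp = s->b ^ s->c;
    s->a += (tmp ^ s->d) + k0 + m0;
    s->a = rotl32(s->a, 4) + s->b;
    s->d += (tmp ^ s->a) + k1 + m1;
    s->d = rotl32(s->d, 11) + s->a;
}

/* Returns 1 if both forms leave the state identical. */
static int round3_pair_variants_agree(md5_state s, uint32_t k0, uint32_t m0,
                                      uint32_t k1, uint32_t m1)
{
    md5_state p = s, q = s;
    round3_pair_plain(&p, k0, m0, k1, m1);
    round3_pair_shared(&q, k0, m0, k1, m1);
    return p.a == q.a && p.b == q.b && p.c == q.c && p.d == q.d;
}

/* Small built-in check using the MD5 initial state and arbitrary inputs. */
static void round3_pair_self_check(void)
{
    md5_state s = {0x67452301u, 0xefcdab89u, 0x98badcfeu, 0x10325476u};
    assert(round3_pair_variants_agree(s, 1u, 2u, 3u, 4u));
}
```

The reuse works because `b` and `c` are unchanged between the two steps; only `a` is updated in between. A modern compiler may already do this common-subexpression elimination on scalar code, so the gain is most likely to show up in hand-written SSE or assembler.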

can it be applied to cuda-multiforcer?

by **Sc00bz** » Fri Nov 12, 2010 1:37 am

The first optimization is nice, but if you also have SSSE3 (not SSE3) you can use PSHUFB. PSHUFB does it in one instruction, but you need a temp register and an extra move from memory. With triple interlacing on x86-64 you use 15 registers, leaving one open:

Code:

    uint8 rot16[16] = {
         2, 3, 0, 1,
         6, 7, 4, 5,
        10,11, 8, 9,
        14,15,12,13}; // I hope that's correct
    ...
    MOVDQA xmm15,rot16
    ...
    PSHUFB xmm0,xmm15
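Since the table above carries an "I hope that's correct" comment, here is a small portable sketch that checks it without needing SSSE3 hardware. It models PSHUFB in scalar C (each destination byte takes the source byte selected by the low four bits of the mask byte, or zero if the mask byte's top bit is set) and compares the result with an ordinary 32-bit rotate by 16. `pshufb_model` and `rot16_table_is_correct` are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The shuffle mask from the post, as standard C. */
static const uint8_t rot16[16] = {
     2,  3,  0,  1,
     6,  7,  4,  5,
    10, 11,  8,  9,
    14, 15, 12, 13
};

/* Scalar model of PSHUFB on a 128-bit register held as 16 bytes. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, 16);
}

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Returns 1 if shuffling with rot16 rotates all four words by 16 bits. */
static int rot16_table_is_correct(const uint32_t words[4])
{
    uint8_t src[16], dst[16];
    uint32_t out[4];
    memcpy(src, words, 16);
    pshufb_model(dst, src, rot16);
    memcpy(out, dst, 16);
    for (int i = 0; i < 4; i++)
        if (out[i] != rotl32(words[i], 16))
            return 0;
    return 1;
}

/* Small built-in check on arbitrary sample words. */
static void rot16_table_self_check(void)
{
    const uint32_t sample[4] = {0x12345678u, 0xdeadbeefu, 0x00000001u, 0x80000000u};
    assert(rot16_table_is_correct(sample));
}
```

The table swaps the two 16-bit halves of each 32-bit word, which is exactly a rotate by 16, so the check passes for any input.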

I think PSHUFB is slow so doing PSHUFHW and PSHUFLW might actually be faster.

But the best is checking for XOP, which is only on AMD, but it has:

Code:

    VPROTD xmm0,xmm0,16

This is good for all rounds, saving 3 * 64 instructions.

----------

The second optimization is not good because it trades a PXOR rd,rs for a MOVDQA rd,rs. Now with AVX this might be useful because it lets you do VPXOR rd,rs1,rs2 (rd = rs1^rs2). Even with AVX you need more registers, because you now need two temp registers per interlace, which brings us to 18 ((4+1+1)*3) registers, but there are only 16 with x86-64. If you only interlace 2 times this will work, but the benefits of triple interlacing are probably more than saving 8 instructions. This might help with 32-bit code and AVX, since you can't interlace without using memory.

OK, I got it: it is more than likely useful if you have AVX and XOP on x86-64, because you no longer need a MOVDQA (AVX) and no longer need a temp register to do a rotate (XOP). I just noticed that triple interlacing without this optimization fills every clock perfectly, assuming 3 bit operations per clock and 2 adds per clock, and that an instruction needs to wait for the next clock to use the result of a previous one. That makes it 5+5 clocks vs 6+4 clocks. Oh wait, that's 5 2/3+4 clocks, so it might be faster.

----------

blazer wrote:can it be applied to cuda-multiforcer?

No, because it is CUDA only.

by **Bitweasil** » Fri Nov 12, 2010 3:05 am

I've actually had some other thoughts.

Fermi has an option to use 48 KB of shared memory. And, it seems, a lot of people use the tools to work on short hash lists.

I think I can optimize the code rather significantly for this case and speed things up a non-trivial amount. If I go with 32 KB of bitmap, that works nicely for reducing hits, and I have a few other ideas.
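For readers unfamiliar with the bitmap idea, here is a rough host-side sketch in plain C of how a 32 KB bitmap could act as a cheap pre-filter before the full hash-list comparison. Everything in it is an assumption for illustration, not the multiforcer's actual code: the post does not say how the bitmap is indexed, so this version simply uses the low 18 bits of one 32-bit hash word (32 KB = 262144 bits = 2^18). On the GPU the array would sit in shared memory; here it is an ordinary struct.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_BYTES (32u * 1024u)              /* 32 KB */
#define BITMAP_MASK  (BITMAP_BYTES * 8u - 1u)   /* 0x3FFFF: low 18 bits */

typedef struct { uint8_t bits[BITMAP_BYTES]; } hash_bitmap;

static void bitmap_clear(hash_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

/* Mark the bit selected by the low 18 bits of a target hash word. */
static void bitmap_add(hash_bitmap *bm, uint32_t hash_word)
{
    uint32_t i = hash_word & BITMAP_MASK;
    bm->bits[i >> 3] |= (uint8_t)(1u << (i & 7u));
}

/* Returns 0 if the candidate is definitely not in the list, 1 if it
 * might be (a real match or a false positive that needs a full check). */
static int bitmap_maybe_present(const hash_bitmap *bm, uint32_t hash_word)
{
    uint32_t i = hash_word & BITMAP_MASK;
    return (bm->bits[i >> 3] >> (i & 7u)) & 1u;
}

/* Small built-in check with one arbitrary target word. */
static void bitmap_self_check(void)
{
    static hash_bitmap bm;
    bitmap_clear(&bm);
    bitmap_add(&bm, 0xd41d8cd9u);
    assert(bitmap_maybe_present(&bm, 0xd41d8cd9u));
    assert(!bitmap_maybe_present(&bm, 0xd41d8cdau));
}
```

With a short hash list only a handful of the 262144 bits are set, so nearly every candidate is rejected with a single lookup and the slower full comparison runs only on the rare "maybe" results.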

I'll see what I can do with them...

FAST* functions: Any use?
