Progress on the GPU accelerated RTs!

Discussion of the upcoming GPU accelerated rainbow table implementation

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Sat Mar 27, 2010 12:47 am

So, uh...

My shit works. :D

Code: Select all
time ../../bin/linux/release/CTCandidates -t writechains -s e9982ec5ca981bd365603623cf4b2277 -h MD5 --threads 512 --blocks 512 -m 200 
Hash type: MD5
Success
CUDA Device Information:

Device 0: "GeForce GTX 260"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  Number of cores:                               216
  Clock rate:                                    1.24 GHz
  Performance Number:                            33534
  Note: Performance number is clock in mhz * core count, for comparing devices.


Launching run group 0 (indexes 0 to 262144)
Kernel Time: 63.450 ms           


Searching table...

At step 0
..................................................
At step 50
.............................

!!!!!!!!!!! PASSWORD FOUND !!!!!!!!!!!
Password: 31415926

real   0m29.379s
user   0m29.310s
sys   0m0.070s


My MD5 len8 (numeric) table, with chain length 100k, is a 300 kB file.

On to make this stuff work with something *other* than MD5 len8, and I'll be ready for a beta release!

The CPU chain regen side will eventually have to go, though... it's slow.
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Sat Mar 27, 2010 2:23 am

Yeah. CPU chain regen *has* to go. It's really, really slow that way, and I'm getting way too many false alarms/chain merges at reasonable coverage of a space.
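
For anyone following along, every one of those false alarms costs a full chain walk on the CPU. Per matching endpoint, the check is roughly the sketch below (a minimal sketch only - md5(), reduce(), and the buffer handling are placeholders, not the real code):

Code: Select all
// Minimal sketch of a false-alarm check for one chain whose stored endpoint
// matched a candidate. md5(), reduce(), and MAX_PLAIN are placeholders.
#include <stdint.h>
#include <string.h>

#define MAX_PLAIN 16

void md5(const char *plain, uint8_t hash[16]);               /* placeholder */
void reduce(const uint8_t hash[16], int step, char *plain);  /* placeholder */

// Walk the chain from its start point up to match_step and test whether the
// hash at that position is the hash we are trying to crack.
int regen_check(const char *start_plain, int match_step,
                const uint8_t target_hash[16], char *found_plain)
{
    char plain[MAX_PLAIN + 1];
    uint8_t hash[16];

    strncpy(plain, start_plain, MAX_PLAIN);
    plain[MAX_PLAIN] = '\0';

    for (int step = 0; step < match_step; step++) {
        md5(plain, hash);
        reduce(hash, step, plain);   // step-dependent reduction function
    }

    md5(plain, hash);
    if (memcmp(hash, target_hash, 16) == 0) {
        strcpy(found_plain, plain);  // real hit: plain hashes to the target
        return 1;
    }
    return 0;                        // false alarm (chain merge/collision)
}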
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Sat Mar 27, 2010 3:35 am

So, I'm sitting on some test MD5 loweralpha len8 tables now. Took... oh, an hour or so to generate. Actually, I only have one table right now; coverage could certainly be improved with multiple tables. :)

Code: Select all
At step 21600
..................................................
At step 21650
..................................................
At step 21700
..................................................
At step 21750
..................................................
At step 21800
..................................................
At step 21850
..................................................
At step 21900
.................................

!!!!!!!!!!! PASSWORD FOUND !!!!!!!!!!!
Password: abcdefgh

real   75m40.544s
user   75m40.340s
sys   0m0.190s


Good news: It's finding stuff.

Bad news: there are massive numbers of false alarms and chain merges, side effects of the long chain length and the (for now) non-perfect tables.

Good news: I'm going to handle this with the GPU, so it won't be *fast* but it will be a lot better than the CPU code I have now.

Long term, this will be expanded to use as many devices as the system has (CPUs, GPUs, etc), if it looks like there will be major performance benefits from that.
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby blazer » Sat Mar 27, 2010 8:43 am

Wow, nicely done - GPU table searching!
If the table were to be perfected, does that mean a lot of work would need to be done (generation-wise), since the chains are so long?
And since the tables are so darn big, I guess it's going to be a CUDA perfector?
Oh, by the way, I think the GTX 480 finally launched - I wonder what the speeds would be like using the Multiforcer and RT gen.

http://hothardware.com/articles/NVIDIA- ... as-Landed/
blazer
 
Posts: 104
Joined: Fri Jan 23, 2009 10:18 am

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Sat Mar 27, 2010 5:11 pm

Actually, the table searching is done on the CPU right now, as is the chain regen.

The steps, and what does them:

Table gen: Fully GPU accelerated, single GPU. Right now, if you want to utilize multiple GPUs, you just launch multiple table generators with different device IDs passed in.
Table merging/perfecting: CPU. I haven't found a good way to GPU accelerate this, and perfecting is a simple process here. This code actually needs updating, it's not table-header aware right now.

The following are currently in one binary:
Candidate hash generation: GPU accelerated, single GPU. I would like to make this multi-GPU aware and additionally use CPUs to assist (or for those without GPUs).
Table search: CPU, requires 64-bit OS. I memory map the table file into the process space and binary search through it. Sorry 32-bit users, memory mapped files are superior, and you can't handle my file size without some serious hacking.
Chain regen: Currently CPU. This step needs to be GPU based, it appears, due to the number of hits/false alarms I'm getting. This will also likely get the multi-GPU/multi-CPU treatment eventually.

Sound sane?
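
To expand on the table search a bit: conceptually it's just a binary search over the sorted endpoints in the memory-mapped file, something like the sketch below (the Chain layout and names are assumptions for illustration, not the actual file format, and error handling is omitted):

Code: Select all
// Sketch of the CPU table search: mmap the sorted chain file, binary search
// the 16-byte end hashes. Record layout here is an assumption: 32 bytes per
// chain, a null-padded start password plus the endpoint MD5 as the sort key.
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    char    start[16];   // null-padded start password
    uint8_t end[16];     // endpoint MD5 (file is sorted on this)
} Chain;

// Returns the index of a chain whose endpoint matches, or -1 if none does.
long search_table(const char *path, const uint8_t candidate[16])
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    size_t count = st.st_size / sizeof(Chain);
    Chain *chains = (Chain *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    long lo = 0, hi = (long)count - 1, found = -1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp = memcmp(candidate, chains[mid].end, 16);
        if (cmp == 0) { found = mid; break; }
        if (cmp < 0)  hi = mid - 1;
        else          lo = mid + 1;
    }

    munmap(chains, st.st_size);
    close(fd);
    return found;
}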
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby foobar2342 » Sun Mar 28, 2010 12:24 am

Bitweasil wrote:Actually, the table searching is done on the CPU right now, as is the chain regen.

The steps, and what does them:

Table gen: Fully GPU accelerated, single GPU. Right now, if you want to utilize multiple GPUs, you just launch multiple table generators with different device IDs passed in.
Table merging/perfecting: CPU. I haven't found a good way to GPU accelerate this, and perfecting is a simple process here. This code actually needs updating, it's not table-header aware right now.

The following are currently in one binary:
Candidate hash generation: GPU accelerated, single GPU. I would like to make this multi-GPU aware and additionally use CPUs to assist (or for those without GPUs).
Table search: CPU, requires 64-bit OS. I memory map the table file into the process space and binary search through it. Sorry 32-bit users, memory mapped files are superior, and you can't handle my file size without some serious hacking.
Chain regen: Currently CPU. This step needs to be GPU based, it appears, due to the number of hits/false alarms I'm getting. This will also likely get the multi-GPU/multi-CPU treatment eventually.

Sound sane?


If your file search strategy turns out to be a bottleneck, you can go from the log(N) disk accesses of a binary search down to 1 disk access per lookup. Divide the sorted endpoints into equally spaced runs of chains and create an index that maps an end value to two pointers into the file; the index can be stored in DRAM. Say you have 2^30 chains in the file: if you store 2^25 file offsets, each 32 bits in size, you have a 128 MB index with 32 million file pointers and 32 chains between two pointers on average, and that many chains fits into a single block on the disk. You can also save 25 bits on each end value, because they are effectively stored in the index.
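
In code, the lookup would be roughly this (illustrative only - the Chain layout and names are made up, and the index is assumed to hold chain numbers, built once by a linear scan of the sorted endpoints):

Code: Select all
// Sketch of the indexed lookup: an in-RAM array of 32-bit chain numbers keyed
// by the top 25 bits of the end value, so each lookup touches one short run
// of chains (on average ~32, i.e. a single disk block) instead of log(N)
// scattered blocks. Names and the Chain layout are illustrative.
#include <stdint.h>
#include <string.h>

#define INDEX_BITS 25

typedef struct {
    char    start[16];
    uint8_t end[16];     // endpoint, file sorted on this
} Chain;

// index[b] = number of the first chain whose endpoint falls in bucket b;
// index[1 << INDEX_BITS] = total chain count.
long indexed_lookup(const Chain *chains, const uint32_t *index,
                    const uint8_t end_value[16])
{
    uint32_t prefix = ((uint32_t)end_value[0] << 24) |
                      ((uint32_t)end_value[1] << 16) |
                      ((uint32_t)end_value[2] << 8)  |
                       (uint32_t)end_value[3];
    uint32_t bucket = prefix >> (32 - INDEX_BITS);

    for (uint32_t i = index[bucket]; i < index[bucket + 1]; i++) {
        if (memcmp(chains[i].end, end_value, 16) == 0)
            return (long)i;
    }
    return -1;
}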

Another optimization would be to generate and sort concurrently: first generate chains to 1/16 of the final length, sort them, dump the merges, then generate the second 1/16th, sort, generate, and so on. Since you can sort and generate at the same time, you do not lose efficiency, but you spend less time on merges that you would throw away at the end.

If the number of false alarms is too high you can trade them for some extra storage as described in:
lasecwww.epfl.ch/~oechslin/publications/oechslin-indocrypt-05.pdf

You can probably also get away with less than a full hash as the end value. Since you are mapping from far fewer than 128 bits into a 128-bit hash (in the case of MD5), the end points will probably still be unique if you store only (say) 80 bits of the hash. By unique I mean that this would not result in any extra merges, or maybe only around 1% more.
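
A rough sanity check on that, assuming the 2^30 chains from above: the expected number of extra collisions introduced by truncating the end values to 80 bits is about (2^30 choose 2) / 2^80, roughly 2^59 / 2^80 = 2^-21, so in practice the truncated end points should essentially never cause an additional merge.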

What register size are you using, 32 or 64 bits? CUDA has 64-bit instructions (at least at the PTX assembler level), and it seems that it takes less than 200% of the time to crunch with 64 bits instead of 32, but I am not sure about that.

Hope that helps. Happy hacking.
foobar2342
 
Posts: 17
Joined: Sun Apr 05, 2009 7:41 pm

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Sun Mar 28, 2010 9:48 pm

There are certainly many optimizations I can make.

I'm quite aware my table storage is non-optimal - it takes 32 bytes to represent a single chain. This can be compressed down in future versions - the table headers have a "Version" field that will be used for things like this. I'm "throwing away" space right now with a null-padded 16 character password field, but it does make code much cleaner for now.

Right now, storage is cheap, compute is (relatively) expensive, so I want to optimize for compute efficiency at the expense of storage. This means storing merged chains (at least as an option) - they are still representing useful space, just are less efficient space-wise than a perfect table.

I'm certainly aware I can do some table indexing and will be doing that down the road as an addon feature. For now, binary searches are fast enough on the small test sets I'm playing with.

The most efficient option for tables would be to do something with compressing the password into an index (the character set is provided in the header), and then only store as much of the hash as needed to accurately represent the space (say password space + 16 bits or something) - subject to byte bounds. This is certainly something I will look into down the road as the table sizes grow, but for the added complexity, I don't think I'd see improvements of more than a factor of 2 - I intend to run up on 64-bit space limits fairly soon, meaning I'd need at least 128 bits to store a chain, vs the 256 bits I'm using now. Really, storage is cheap, coding time is expensive, and disk will just get cheaper and cheaper. Not having a 64-bit limit hardcoded in my code is worth quite a bit to me, though.
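
For illustration only (the real field widths would depend on the table header), the kind of packed record I'm describing is something like this, at 16 bytes per chain instead of the current 32:

Code: Select all
// Illustrative packed chain record: a 64-bit index into the password space
// (character set and length come from the table header) plus the leading
// 64 bits of the endpoint hash. Names and widths are hypothetical.
#include <stdint.h>

typedef struct {
    uint64_t start_index;   // index of the start password in the keyspace
    uint64_t end_bits;      // leading 64 bits of the endpoint MD5
} PackedChain;              // 128 bits per chain

// Expand a start index back into a password using the header's character set.
static void index_to_password(uint64_t idx, const char *charset,
                              int charset_len, int pw_len, char *out)
{
    for (int i = 0; i < pw_len; i++) {
        out[i] = charset[idx % charset_len];
        idx /= charset_len;
    }
    out[pw_len] = '\0';
}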

I'm using 32-bit registers, as that's what MD5 uses. CUDA may have 64-bit instructions on some hardware, but they don't do me much good when I'm working on hashes that are designed to use 32-bit registers. What were you thinking of using them for?
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby foobar2342 » Sun Mar 28, 2010 11:46 pm

Bitweasil wrote:I'm using 32-bit registers, as that's what MD5 uses. CUDA may have 64-bit instructions on some hardware, but they don't do me much good when I'm working on hashes that are designed to use 32-bit registers. What were you thinking of using them for?


You can apply each operation to a 64-bit register that holds one MD5 state register in the low word and one in the high word. If I am correct in my assumption that 64-bit instructions are implemented in hardware, then you can crunch two MD5 states at once for the same instruction count. With 128-bit SSE instructions on the host, you would have four parallel instances of MD5 being worked on at the same time.
foobar2342
 
Posts: 17
Joined: Sun Apr 05, 2009 7:41 pm

Re: Progress on the GPU accelerated RTs!

Postby Bitweasil » Mon Mar 29, 2010 12:14 am

Interesting. I should look over the PTX reference and see if I can find vector ops.

The thing with SSE instructions is that they're designed to work on 128 bit vectors as 4 32-bit words - apply the same transform to each. If you only have 64-bit primitives, I'm not sure that you can do the same thing - unless they're designed to work on a 64-bit register as 2x 32-bit registers.

Also, to the best of my knowledge, only the GT200 series have the 64-bit registers. So I'd still need a fork of the code for the pre-GT200 cored cards (8xxx/9xxx series). That's definitely something I will look into though - do you happen to know the PTX instructions that would be relevant?
Bitweasil
Site Admin
 
Posts: 912
Joined: Tue Jan 20, 2009 4:26 pm

Re: Progress on the GPU accelerated RTs!

Postby foobar2342 » Mon Mar 29, 2010 2:20 am

EDIT: It seems that I am totally wrong. The CUDA profiler tells me that I use exactly twice as many instructions and the runtime is twice as long, so there has to be another reason for the behaviour I observed in my own 32/64-bit code. So the original argument is not true, and there is no speedup with 64-bit operands on a GT200.

[ original message removed ]

Here is the test program:

#include <stdio.h>

// 64-bit operand type under test.
typedef unsigned long long T;

// Burn ALU cycles on 64-bit integer ops so the profiler can compare
// instruction counts and runtime against a 32-bit build.
__global__ void crunch(T *ptr) {
    T r = ptr[threadIdx.x];
    for (T i = 0; i < 5000000; ++i) {
        r ^= i | (i + 1);
    }
    ptr[threadIdx.x] = r;
}

int main() {
    T *p;
    T ph[32 * 8] = {0};

    fprintf(stderr, "%zu\n", sizeof(T));

    cudaMalloc((void **)&p, sizeof(T) * 32 * 8);
    cudaMemcpy(p, ph, 32 * 8 * sizeof(T), cudaMemcpyHostToDevice);

    crunch<<<1, 32 * 8>>>(p);

    cudaMemcpy(ph, p, 32 * 8 * sizeof(T), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32 * 8; ++i) {
        fprintf(stderr, "%llu ", ph[i]);
    }
    fprintf(stderr, "\n");

    cudaFree(p);
    return 0;
}
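
(To get the 32-bit version for comparison, change the typedef to unsigned int, adjust the printf formats, and rebuild with nvcc.)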
foobar2342
 
Posts: 17
Joined: Sun Apr 05, 2009 7:41 pm
