Bitweasil wrote: Actually, the table searching is done on the CPU right now, as is the chain regen.
The steps, and what does them:
Table gen: Fully GPU-accelerated, single GPU. Right now, if you want to use multiple GPUs, you just launch multiple table generators with different device IDs passed in.
Table merging/perfecting: CPU. I haven't found a good way to GPU-accelerate this, and perfecting is a simple process here. This code actually needs updating; it's not table-header aware right now.
The following are currently in one binary:
Candidate hash generation: GPU-accelerated, single GPU. I would like to make this multi-GPU aware and additionally use CPUs to assist (or for those without GPUs).
Table search: CPU, requires a 64-bit OS. I memory-map the table file into the process space and binary-search through it. Sorry, 32-bit users: memory-mapped files are superior, and you can't handle my file size without some serious hacking.
Chain regen: Currently CPU. This step needs to be GPU-based, it appears, given the number of hits/false alarms I'm getting. It will also likely get the multi-GPU/multi-CPU treatment eventually.
Sound sane?
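If I follow, the search step is essentially the following (a minimal sketch to fix ideas, not your actual code; the record layout and all names are my assumptions):

```cuda
// Sketch of the current CPU-side search: mmap the sorted table file and
// binary-search fixed-size chain records by endpoint. Layout is assumed.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

struct ChainRecord {            // assumed: endpoint first, then start point
    uint64_t endpoint;
    uint64_t startpoint;
};

// Returns the start point for `target`, or -1 if not found.
int64_t lookup(const char *path, uint64_t target) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    fstat(fd, &st);
    size_t n = st.st_size / sizeof(ChainRecord);
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  // the mapping stays valid after close
    if (map == MAP_FAILED) return -1;
    const ChainRecord *table = (const ChainRecord *)map;

    // Standard binary search: ~log2(n) page touches, hence log(N) disk hits.
    size_t lo = 0, hi = n;
    int64_t result = -1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].endpoint < target)      lo = mid + 1;
        else if (table[mid].endpoint > target) hi = mid;
        else { result = (int64_t)table[mid].startpoint; break; }
    }
    munmap(map, st.st_size);
    return result;
}
```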
If your file search strategy turns out to be a bottleneck, you can change from the log(N) disk accesses of the binary search to 1 disk access per lookup. Divide the sorted endpoints into equally spaced runs of chains and build an index that maps an end value to the pair of file pointers bracketing its run; the index lives in DRAM. Say you have 2^30 chains in the file: if you store 2^25 file offsets, each 32 bits in size, you have a 128 MB index with 32 million file pointers and 32 chains between adjacent pointers on average, and that many chains fits into a single block on the disk. You can also save 25 bits on each end value, because they are effectively stored in the index.
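A sketch of that scheme, assuming 16-byte chain records sorted by endpoint and the 2^30 / 2^25 numbers above (all names are mine):

```cuda
// Sparse-index lookup: keep every 32nd endpoint and its file offset in DRAM,
// then read one ~512-byte block of records from disk per lookup.
#include <stdio.h>
#include <stdint.h>

#define CHAINS_PER_BUCKET 32            // ~one disk block of 16-byte records
struct ChainRecord { uint64_t endpoint, startpoint; };

// keys[i] = endpoint of the first record in bucket i, offsets[i] = its file
// offset. With 2^30 chains that's 2^25 entries; 32-bit offsets would give the
// 128 MB index (64-bit offsets used here for simplicity).
int64_t lookup(FILE *f, const uint64_t *keys, const uint64_t *offsets,
               size_t buckets, uint64_t target) {
    // Binary search the in-RAM index for the right bucket: no disk I/O yet.
    size_t lo = 0, hi = buckets;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] <= target) lo = mid + 1; else hi = mid;
    }
    if (lo == 0) return -1;             // target below the smallest endpoint
    size_t bucket = lo - 1;

    // Exactly one disk access: fetch the bucket and scan it linearly.
    // (On disk, the top 25 endpoint bits could be dropped: the bucket
    // number implies them.)
    ChainRecord block[CHAINS_PER_BUCKET];
    fseek(f, (long)offsets[bucket], SEEK_SET);
    size_t got = fread(block, sizeof(ChainRecord), CHAINS_PER_BUCKET, f);
    for (size_t i = 0; i < got; i++)
        if (block[i].endpoint == target) return (int64_t)block[i].startpoint;
    return -1;
}
```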
Another optimization would be to generate and sort concurrently: first extend the chains to 1/16 of their final length, sort them, dump the merges, then generate the second 1/16th, sort again, and so on (see the sketch below). Since sorting and generating can overlap, you do not lose efficiency, but you spend less time on merged chains that you would only throw away at the end.
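Something along these lines; step() is a stand-in for your hash+reduce, and the key point is that with a position-dependent reduction, two chains that collide at the same position stay merged forever, so pruning duplicates early is safe:

```cuda
// Interleaved generate/sort/prune: extend all chains by 1/16 of the full
// length, sort by current value, drop duplicates (merged chains), repeat.
// Serial here for clarity; the sort (CPU) can overlap the next generation
// slice (GPU).
#include <algorithm>
#include <vector>
#include <cstdint>

struct Chain { uint64_t start, current; };

// Placeholder for hash-then-reduce; position-dependent, as in rainbow tables.
uint64_t step(uint64_t v, int i) {
    return (v ^ (uint64_t)i) * 0x9E3779B97F4A7C15ULL;   // stand-in mixer
}

void build_table(std::vector<Chain> &chains, int chain_len) {
    const int slices = 16;
    for (int s = 0; s < slices; s++) {
        // Extend every surviving chain by one slice (the GPU-friendly part).
        for (auto &c : chains)
            for (int i = s * chain_len / slices;
                     i < (s + 1) * chain_len / slices; i++)
                c.current = step(c.current, i);

        // Sort by current value and drop merged chains now, instead of
        // running them to full length and discarding them at the end.
        std::sort(chains.begin(), chains.end(),
                  [](const Chain &a, const Chain &b)
                  { return a.current < b.current; });
        chains.erase(std::unique(chains.begin(), chains.end(),
                     [](const Chain &a, const Chain &b)
                     { return a.current == b.current; }),
                     chains.end());
    }
}
```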
If the number of false alarms is too high, you can trade them for some extra storage (checkpoints) as described in:
lasecwww.epfl.ch/~oechslin/publications/oechslin-indocrypt-05.pdf
You can probably also get away with less than a full hash as the end value. Since you are mapping from far fewer than 128 bits into a 128-bit hash (in the case of MD5), the endpoints will probably still be unique if you store only (say) 80 bits of the hash. By unique I mean that this would not cause any extra merges, or maybe around 1%.
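Back-of-the-envelope: with N = 2^30 endpoints truncated to b = 80 bits, the birthday bound gives roughly N^2 / 2^(b+1) = 2^60 / 2^81 = 2^-21 expected accidental collisions, i.e. essentially none; even at b = 64 you would expect only about 2^-5 collisions across the whole table, so 1% is if anything pessimistic.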
What register size are you using, 32 or 64 bits? CUDA has 64-bit instructions (at least at the PTX assembler level), and it seems to take less than 200% of the time to crunch with 64 bits instead of 32, but I am not sure about that.
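As far as I know the hardware emulates 64-bit integer ops as pairs of 32-bit instructions, which would explain a less-than-2x cost. A toy benchmark along these lines (all names mine; add cudaEvent timing around the launches for real numbers) would settle it for your kernels:

```cuda
// Toy kernels for comparing 32- vs 64-bit integer throughput; the 64-bit
// version compiles to PTX add.u64 / mul.lo.u64.
#include <cstdint>

__global__ void bench64(uint64_t *out, uint64_t seed, int iters) {
    uint64_t v = seed + threadIdx.x;
    for (int i = 0; i < iters; i++)
        v = v * 6364136223846793005ULL + 1442695040888963407ULL; // 64-bit LCG
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

__global__ void bench32(uint32_t *out, uint32_t seed, int iters) {
    uint32_t v = seed + threadIdx.x;
    for (int i = 0; i < iters; i++)
        v = v * 1664525u + 1013904223u;                          // 32-bit LCG
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    const int N = 1 << 20, iters = 1 << 14;
    uint64_t *d64; uint32_t *d32;
    cudaMalloc(&d64, N * sizeof(uint64_t));
    cudaMalloc(&d32, N * sizeof(uint32_t));
    bench64<<<N / 256, 256>>>(d64, 12345, iters);
    bench32<<<N / 256, 256>>>(d32, 12345, iters);
    cudaDeviceSynchronize();    // wrap each launch in cudaEvent timers to
    cudaFree(d64);              // measure the actual ratio on your card
    cudaFree(d32);
    return 0;
}
```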
Hope that helps. Happy hacking!