0.7 source: Developer discussion

GPUs and hashes to the max
http://www.cryptohaze.com/forum/
#include <cstdint> // for uint32_t

/*
 * Find the largest bitmap that fits in memory, then the largest hash table that fits in memory.
 * The minimum size for both the bitmap and the hash table is 64 MiB,
 * which is OK since I'm pretty sure the least amount of memory any CUDA-enabled card has is 256 MiB.
 * The maximum size is 2 GiB for the bitmap (34 bits) and 8 GiB for the hash table (31 bits).
 * This function should be changed when cards have over 8 GiB; also change the way lookups in the bitmap and hash table are done.
 *
 * [in]  totalRam             total amount of video RAM in MiB
 * [in]  reserveRam           amount of video RAM in MiB to leave unused (should be around 16 MiB to be safe)
 * [in]  numHashes            total number of hashes we are trying to crack
 * [out] deviceBitmapBits     number of bits for the bitmap on the device
 * [out] deviceHashTableBits  number of bits for the hash table on the device
 * returns true on success
 */
bool calcBitmapAndHashTableBits(uint32_t totalRam, uint32_t reserveRam, uint32_t numHashes, uint32_t &deviceBitmapBits, uint32_t &deviceHashTableBits)
{
    const uint32_t HASH_SIZE = 16 + 16 + 4; // hash size + password size + hash index
    uint32_t ramToUse, ramUsed, hashRam;

    hashRam = ((HASH_SIZE * (numHashes >> 8)) >> 12) + 1; // approximate MiB needed for the hash list, rounded up
    ramToUse = totalRam - reserveRam - hashRam - 64; // 64 MiB is for the hash table
    ramUsed = 64; // use a minimum of 64 MiB for the bitmap on the device
    deviceBitmapBits = 29;
    do
    {
        ramUsed <<= 1;
        deviceBitmapBits++;
    } while (ramUsed <= ramToUse && deviceBitmapBits < 35);
    ramUsed >>= 1;
    deviceBitmapBits--;

    ramToUse = totalRam - reserveRam - hashRam - ramUsed;
    ramUsed = 64; // use a minimum of 64 MiB for the hash table on the device
    deviceHashTableBits = 24;
    do
    {
        ramUsed <<= 1;
        deviceHashTableBits++;
    } while (ramUsed <= ramToUse && deviceHashTableBits < 32);
    deviceHashTableBits--;

    ramUsed = reserveRam + hashRam + (1 << (deviceBitmapBits - 23)) + (1 << (deviceHashTableBits - 18)); // total footprint in MiB
    return ramUsed <= totalRam;
}
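For context, here is a minimal host-side sketch (not from the original post) of how the sizing helper above might be driven. cudaMemGetInfo is a standard CUDA runtime call; the wrapper name pickTableSizes and the 16 MiB reserve are placeholders taken from the comments above.

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical wrapper: queries device memory and feeds it to calcBitmapAndHashTableBits as defined above.
bool pickTableSizes(uint32_t numHashes, uint32_t &bitmapBits, uint32_t &hashTableBits)
{
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess)
        return false;
    uint32_t totalRam = (uint32_t)(totalBytes >> 20); // bytes -> MiB
    uint32_t reserveRam = 16;                         // "around 16 MiB to be safe"
    if (!calcBitmapAndHashTableBits(totalRam, reserveRam, numHashes, bitmapBits, hashTableBits))
        return false;
    printf("bitmap: %u bits (%u MiB), hash table: %u bits (%u MiB)\n",
           bitmapBits, 1u << (bitmapBits - 23),
           hashTableBits, 1u << (hashTableBits - 18));
    return true;
}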
__device__ inline void checkHash(...) // :)
{
    bool goodToGo = false;
    uint32_t x = 0; // bucket index, then index into d_hashes; needs function scope for the second loop below
    if ((s_bitmap[a >> 20] & (1 << (a & 0x7))) &&                     // grab the highest 12 bits from a and the lowest 3 bits from a
        (d_bitmap[b >> (35 - deviceBitmapBits)] & (1 << (c >> 29))))  // grab the highest deviceBitmapBits - 3 bits from b and the highest 3 bits from c
    {
        x = c & ((1 << deviceHashTableBits) - 1); // grab the lowest deviceHashTableBits bits from c
        uint32_t start = d_chainedHashTable[x];
        uint32_t end = d_chainedHashTable[x + 1];
        uint64_t ad = (((uint64_t)a) << 32) | d;
        for (x = start; x < end; x++)
        {
            if (d_hashes[2*x] == ad)
            {
                // I never decided whether to:
                // stop here at 88 to 91 bits matching and report a partial match somehow (and have the CPU check the rest)
                // or continue and check the full hash
                goodToGo = true;
                break;
            }
        }
    }
    __syncthreads();
    if (goodToGo)
    {
        uint64_t bc = (((uint64_t)b) << 32) | c;
        do
        {
            if (d_hashes[2*x+1] == bc)
            {
                // Report success
                // Atomic functions require at least compute capability 1.1 :(
                uint32_t y;
                if (compute_capability_1_0) // :)
                {
                    y = x;
                    d_crackedHash[y] = 1;
                }
                else
                {
                    y = atomicAdd(d_crackedCount, 1);
                    d_crackedHashIndex[y] = x;
                }
                d_crackedPw[y] = curPw; // :)
                break;
            }
            x++;
        } while (d_hashes[2*x] == ad);
    }
    __syncthreads();
}
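checkHash above assumes d_chainedHashTable holds per-bucket start offsets into d_hashes, with each hash stored as two interleaved 64-bit words (ad first, then bc) and the whole list sorted by bucket. A rough host-side sketch of building that layout follows; the struct and function names are hypothetical and not from the original tool.

#include <cstdint>
#include <vector>
#include <algorithm>

struct HashHalves { uint64_t ad; uint64_t bc; }; // the (a,d) and (b,c) halves of one hash

// Bucket = low hashTableBits bits of c, matching the lookup in checkHash above.
void buildChainedHashTable(std::vector<HashHalves> hashes, uint32_t hashTableBits,
                           std::vector<uint32_t> &table, std::vector<uint64_t> &flat)
{
    uint32_t mask = (1u << hashTableBits) - 1;
    auto bucketOf = [&](const HashHalves &h) { return (uint32_t)h.bc & mask; };
    // Sort by bucket so each bucket occupies one contiguous run of entries.
    std::sort(hashes.begin(), hashes.end(),
              [&](const HashHalves &l, const HashHalves &r) { return bucketOf(l) < bucketOf(r); });
    // table[i] .. table[i+1] bound the entries in bucket i (prefix sums of the counts).
    table.assign((size_t(1) << hashTableBits) + 1, 0);
    for (const HashHalves &h : hashes)
        table[bucketOf(h) + 1]++;
    for (size_t i = 1; i < table.size(); i++)
        table[i] += table[i - 1];
    // Interleave the halves exactly as checkHash reads them:
    // flat[2*x] = ad (checked first), flat[2*x + 1] = bc (checked on a hit).
    flat.resize(hashes.size() * 2);
    for (size_t x = 0; x < hashes.size(); x++)
    {
        flat[2 * x]     = hashes[x].ad;
        flat[2 * x + 1] = hashes[x].bc;
    }
    // table would then be copied to d_chainedHashTable and flat to d_hashes.
}

The prefix-sum table is what lets checkHash find the start and end of a bucket with two global reads.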
blendaperry wrote: All of these steps assume the bitmap can live in shared memory. For a very large hash space, the best solution may be a kernel that eliminates the shared bitmap entirely.
Bitweasil wrote: Have 3 kernels for each hash type - FAST (for very small hash lists), MEDIUM (shared bitmap, no big global bitmap), and LARGE (no shared bitmap, large global bitmap) - and then benchmark before running to find out which option is fastest for the given parameters.
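A rough sketch (not from the actual Cryptohaze code) of what "benchmark before running" could look like on the host, using standard CUDA event timing; the enum and the three trial-launch functions are hypothetical placeholders for the FAST/MEDIUM/LARGE kernels described above.

#include <cuda_runtime.h>

enum KernelChoice { KERNEL_FAST, KERNEL_MEDIUM, KERNEL_LARGE };

// Hypothetical stand-ins: each would launch the corresponding kernel on a short trial workload.
void launchTrialFast()   { /* launch the FAST kernel here */ }
void launchTrialMedium() { /* launch the MEDIUM kernel here */ }
void launchTrialLarge()  { /* launch the LARGE kernel here */ }

// Time one trial launch with CUDA events (standard runtime API).
static float timeTrialMs(void (*launch)())
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Pick whichever variant ran fastest for the current hash list and card.
KernelChoice pickKernel()
{
    float tFast = timeTrialMs(launchTrialFast);
    float tMedium = timeTrialMs(launchTrialMedium);
    float tLarge = timeTrialMs(launchTrialLarge);
    if (tFast <= tMedium && tFast <= tLarge) return KERNEL_FAST;
    if (tMedium <= tLarge) return KERNEL_MEDIUM;
    return KERNEL_LARGE;
}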