Cryptohaze.com

by **Bitweasil** » Tue Mar 09, 2010 4:38 pm

That's part of my system - GPU accelerated cracking/candidate hash generation.

Chain length 200k is not really feasible on a CPU for cracking... at 20M links per second, it's about 17 mins per table. And chain length 500k is *much* worse.

So, yes, GPUs will be used for the precalculation stage. They're fast at it.

Right now, I'm working on the finer points of multi-GPU coding - if I'm going to do it, I may as well do it right, and make the systems able to use all GPUs on the system as well as all CPU cores (don't expect great CPU code right now, it'll just be mostly filler for those without GPUs, but I'm intending to implement some SSE magic for the CPU threads at some point).

Theoretically, if I can do this properly, all table generation and all precalc work will be done utilizing the full resources of the system - all CPU cores and all GPU cores (unless otherwise configured).

by **sapling** » Wed Mar 10, 2010 3:45 pm

Nice, I was hoping to hear something along those lines.

Well, when you're ready to release for beta testing I will be more than happy to loan out my GPU for some Linux based testing. Just rebuilt my machine recently so it's clean and ready, but no telling how it will compile up new code.

by **jzeller** » Mon Mar 15, 2010 6:17 pm

I just finished up a system that should hold up well for this.

MountainMods Extended Ascension CYO with Duality
ASUS Supercomputer
Xeon 3520 @ 4.4GHz
12GB RAM
2x60GB OCZ Vertex RAID 0
2x500GB WD RE3 RAID 0
2x 295GTX
NAS storage of 10TB

I have been anticipating your release for a long time, I had to make sure something was waiting when you did.

That system can be headless or GUI, whichever you want for testing. Would be a shame to turn off the 3 24" monitors, but it's all in the name of CUDA. This system was built to start learning CUDA and OpenCL. There are enough funds to drop in a couple of Fermi cards when they are released, in addition to the 295s. With the case, there is room on the other side for the new server board once Intel gets around to releasing their new Beckton CPUs, which should have even more horsepower. It's currently running as a development machine, so Win7, Linux 64-bit, and BackTrack 4. The design allows for the current setup as the workstation side, with the server side coming with the new Intel proc & mobo running all the VMs. Let me know when you've got something so I can finally tap into the power of this thing. BarsWF shined greatly for MD5, yours does as well. I just want to see all those cores light up. :mrgreen:

by **Bitweasil** » Mon Mar 15, 2010 8:36 pm

Current progress:

I've got single GPU generate code for MD5 len 6-10 working. NTLM is a quick patch in, no big deal there. Just a different reduction function.

The tables now have an 8 KB header containing, among other things:
- Hash type
- Password length
- Character set
- Chain length
- Number of chains
- Table index
- Comments

This means the filenames no longer contain important metadata - all the metadata is stored in the file itself. You give the cracking tool a hash and a table file to work with, and it does the rest for you. Obviously the cracking tool can't distinguish between an MD5 hash and an NTLM hash, but the rest of the details are stored in the table. Eventually, I'll probably build a file based system that can take a list of hashes and a list of files, but I'll work on that later.
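To make the self-describing header idea concrete, here is a minimal sketch of packing and parsing an 8 KB header with the fields listed above. The field order, widths, and byte order are assumptions made up for illustration; the post does not describe the actual on-disk layout.

```python
import struct

HEADER_SIZE = 8 * 1024  # 8 KB, as stated in the post

# Hypothetical layout (NOT the real table format): little-endian fixed-width
# fields, followed by zero padding up to HEADER_SIZE.
# hash type, password length, charset, chain length, chain count, index, comments
_FIELDS = struct.Struct("<16sI256sQQI1024s")


def pack_header(hash_type, password_len, charset, chain_len,
                num_chains, table_index, comments=""):
    """Serialize the metadata into a fixed 8 KB block."""
    body = _FIELDS.pack(
        hash_type.encode("ascii"),
        password_len,
        charset.encode("ascii"),
        chain_len,
        num_chains,
        table_index,
        comments.encode("utf-8"),
    )
    return body.ljust(HEADER_SIZE, b"\x00")


def unpack_header(raw):
    """Parse the first 8 KB of a table file back into a dict."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("header is shorter than 8 KB")
    (hash_type, password_len, charset, chain_len,
     num_chains, table_index, comments) = _FIELDS.unpack_from(raw)
    return {
        "hash_type": hash_type.rstrip(b"\x00").decode("ascii"),
        "password_len": password_len,
        "charset": charset.rstrip(b"\x00").decode("ascii"),
        "chain_len": chain_len,
        "num_chains": num_chains,
        "table_index": table_index,
        "comments": comments.rstrip(b"\x00").decode("utf-8"),
    }
```

With something like this, a tool only needs the hash and the file path: it reads the first 8 KB and configures itself from the result.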

I also have table verification code finished - this is a CPU implementation of the algorithm that's useful for table checking. You pass in a table and a stride, and it will check every N chains. The idea here is to catch tables that are catastrophically wrong (also, to catch bugs in my implementation - that's the primary purpose).
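The stride check can be sketched roughly like this. The reduction function, the charset, and the in-memory list of (start, end) pairs are all stand-ins invented for the example, not the real table format or the real reduction function; only the "regenerate every Nth chain on the CPU and compare the endpoint" idea comes from the post.

```python
import hashlib

CHARSET = "abcdefghijklmnopqrstuvwxyz"  # assumed charset for the toy example


def reduce_hash(digest, step, length):
    """Toy reduction function: map a digest plus the step index to a password."""
    value = int.from_bytes(digest[:8], "little") + step
    chars = []
    for _ in range(length):
        value, idx = divmod(value, len(CHARSET))
        chars.append(CHARSET[idx])
    return "".join(chars)


def run_chain(start, chain_len, length):
    """Walk one chain from its start password and return the end password."""
    password = start
    for step in range(chain_len):
        digest = hashlib.md5(password.encode("ascii")).digest()
        password = reduce_hash(digest, step, length)
    return password


def verify_table(chains, chain_len, length, stride):
    """Regenerate every `stride`-th chain; return the indices that do not match."""
    bad = []
    for i in range(0, len(chains), stride):
        start, end = chains[i]
        if run_chain(start, chain_len, length) != end:
            bad.append(i)
    return bad
```

A large stride only samples the table, so it catches tables that are wrong everywhere but can miss a single bad chain; a stride of 1 checks everything at full regeneration cost.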

I'm working on the candidate hash generation code (making it GUI friendly), and then I will be plugging in my existing table search code. The table searching will be 64-bit only - sorry, I /really/ like memory mapped files and can't memory map a 1TB file into a 32-bit address space without some serious work that I don't care to do. Chain regeneration/searching will probably also be CPU based, for now - it'll be slower than GPU based, but it's not doing much work, and it's a lot faster to implement.
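For anyone wondering why memory mapping is attractive here: the OS pages in only the parts of the file a search touches, so the lookup code can treat a huge table like an array. A rough sketch follows, assuming a made-up record format of two little-endian 64-bit integers (endpoint, start) sorted by endpoint after the header. That record layout is a guess for illustration only.

```python
import mmap
import struct

RECORD = struct.Struct("<QQ")  # hypothetical record: (endpoint, start)


def find_endpoint(path, header_size, target):
    """Binary-search endpoint-sorted records in a memory-mapped table file.

    Returns the start value of a matching chain, or None if no record matches.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, (len(mm) - header_size) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            endpoint, start = RECORD.unpack_from(
                mm, header_size + mid * RECORD.size)
            if endpoint == target:
                return start
            if endpoint < target:
                lo = mid + 1
            else:
                hi = mid
    return None
```

A real search would also have to deal with several chains sharing one endpoint; this sketch just returns the first match it lands on.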

Once that's done, I'll release a beta and let people hammer on it. It'll be single GPU only, but all the code should function properly, generate tables, and do hash cracking. I'll be happily taking bug reports here.

The next stage, once everything is working, is to learn me some multithreading and vector operations (pthreads/SSE), and then recode my stuff to be multithreaded. Realistically, if I do it properly, I think I can work in some network-enabled cracking here too - at least for LAN versions of a network. I'm currently brainstorming how to handle the workunits to make this feasible, but if I can do it properly, it may be a sweet setup - able to use a variety of boxes on the network for all aspects of use. This would also mean that the GPU tables would be a bit more useful - throw 3-4 multi-core boxes on a LAN together and you'd have a CPU based setup that could still handle the GPU chain lengths.

Oh, and at some point I really should look into either OpenCL or code for ATI... OpenCL is an interesting option, but I'm not sure how fast it will be on CPUs/ATI cards. Might be worth playing with, though.

by **jzeller** » Wed Mar 17, 2010 6:30 pm

I can definitely check your LAN based setup when you're ready for testing. I have a rack of 6 Dell 850s for playing, along with 2 Dell 6650s and a Dell R300. If we really get something going, I also have a couple of Dell 2900s. I also have access to about 20 Dell lab machines on the same network, with similar specs to the Dell 850s. I have worked with MPI and some clustering OSs. I have had all of them running together with JtR, and it seemed to work well. Provided I can pull some political strings, I hope to have some decent GPUs added to those 20 Dell lab machines, which will give much more horsepower. The problem is they are factory builds, so the power supply doesn't have much room to work with.

Lemme know when you have something.

by **Bitweasil** » Sun Mar 21, 2010 8:07 pm

GUI-friendly candidate hash generation code framework is done.

Right now, MD5 len8 only, but it's trivial to finish writing the functions.

Some benchmarks on a GTX260:

Len 100k chains: 13s (5 000 000 000 operations)
Len 200k chains: 53s (20 000 000 000 operations)

I'd say this is a fair improvement over CPU speeds... roughly 385M links/sec over the space. It's not quite as efficient as table generation due to some complexities (and it's not quite as optimized as it could be - I feel these speeds are reasonable for a beta).

Anyway, the next few steps should be easy - table search and chain regen/search. These will be CPU based for now - at least for the beta. Depending on performance, I may make the chain regen step GPU based at some point, but I'll cross that bridge when I come to it.

If all goes to plan, I should have a beta out sometime next weekend - this is all coming together nicely.

by **blazer** » Mon Mar 22, 2010 12:59 am

I'd just like to ask if there will be any loss of performance with longer lengths, and if so, what causes it?

by **Bitweasil** » Mon Mar 22, 2010 1:27 am

Loss of performance with longer lengths...

Longer passwords, or longer chains?

Longer passwords will be slightly slower as there's more work to do - the reduction function has more to do, and the hash has fewer input registers bound to 0x00000000 - so they take longer to process.

Longer chains dramatically affect the time for candidate hash reduction, because the number of operations for this is ((chainlen^2)/2) - and this is the main limiting factor for chain lengths (well, this and merges). A CPU working on len100k chains will take 30+ minutes to do all the candidate hashes with a basic implementation.
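The chainlen^2/2 cost is easy to sanity check with a few lines. This is only the arithmetic from the thread written out; the function names are made up for the example, and the count is the approximate figure quoted above rather than an exact step count.

```python
def candidate_ops(chain_len):
    """Approximate hash+reduce steps to build all candidate hashes for one
    target hash: (chain_len^2) / 2."""
    return chain_len * chain_len // 2


def seconds_needed(chain_len, links_per_sec):
    """Time to do the candidate hash work at a given link rate."""
    return candidate_ops(chain_len) / links_per_sec


# 100k chains: 5 000 000 000 operations; 200k chains: 20 000 000 000.
ops_100k = candidate_ops(100_000)
ops_200k = candidate_ops(200_000)

# At the 20M links/sec CPU figure mentioned at the top of the thread,
# 200k chains come out at 1000 seconds, i.e. roughly 17 minutes.
cpu_minutes_200k = seconds_needed(200_000, 20e6) / 60
```

Doubling the chain length quadruples the work, which is why the step from 100k to 200k hurts so much more than it looks.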

Does that answer your question? I'm not 100% sure what you're asking, so I'm shotgunning.

by **blazer** » Tue Mar 23, 2010 9:43 am

ahh, thx for that, yea sorry about the unclear question but I was after the Password length.
More information is good.

by **Bitweasil** » Tue Mar 23, 2010 1:50 pm

For passwords, as I said above, there are two factors.

The first (and most obvious) is the reduction function - it has to do another step for every additional character. This is small, but noticeable.

The other, less obvious one involves the hash function.

Hash functions like MD5 and NTLM typically work on 32-bit registers. At the end of the data, there's a marker bit before the zero padding.

So, for a 7 character password, you are using 2 data registers (7 bytes of password, byte 8 has the high bit set), and then the length register. The rest are "wired to zero" and can be optimized a bit due to this.

When you go to 8 characters, you're now using 3 data registers plus a length register.

NTLM is worse, as you only fit 2 characters per 32-bit data register (UTF-16LE encoding).

There's not a HUGE slowdown from longer lengths - it's on the order of a few percent per additional character.
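The register counts above can be checked by building the padded 64-byte input block by hand. This is a sketch of the standard single-block MD4/MD5 padding (0x80 marker byte, zero fill, 64-bit little-endian bit length in the last two words); the helper names are made up, and it only covers passwords short enough to fit in one block.

```python
import struct


def first_block(data):
    """Return the 16 little-endian 32-bit words of a single padded block."""
    if len(data) > 55:
        raise ValueError("data does not fit in one 64-byte block")
    block = bytearray(64)                 # everything starts "wired to zero"
    block[:len(data)] = data
    block[len(data)] = 0x80               # marker bit right after the data
    block[56:64] = struct.pack("<Q", len(data) * 8)  # length in bits
    return struct.unpack("<16I", bytes(block))


def data_words(password, utf16=False):
    """Count the 32-bit words holding password bytes plus the marker byte."""
    data = password.encode("utf-16-le" if utf16 else "ascii")
    if len(data) > 55:
        raise ValueError("data does not fit in one 64-byte block")
    return (len(data) + 1 + 3) // 4       # +1 marker byte, round up to words
```

For example, `data_words("abcdefg")` gives 2 and `data_words("abcdefgh")` gives 3, matching the 7 and 8 character cases described above, while the same 7 characters with `utf16=True` already need 4 words.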

Progress on the GPU accelerated RTs!
