Multihash CUDA Brute Forcer


CUDA Multiforcer - The world's fastest cross-platform MD4/MD5/NTLM cracker for Windows/Mac/Linux

World's fastest on a single hash or 100,000 hashes? No. Fastest on a typical workload of tens to hundreds of hashes? I believe so. World's fastest cross-platform? As far as I know. Plenty of room for improvement? Sure. The first one planning to release the source? To my knowledge, yes.


Performance

Some random performance stats for password length 7 (shorter passwords will be slightly faster due to lower register usage, longer will be slightly slower):

Tested with a GTX260 (216 stream processors) on 64-bit Linux, full US character set.
Reported rates are comparisons per second - that is, (password stepping rate * number of hashes).

The time on the left is the kernel execution time. Shorter times allow a more responsive GUI, but lower performance. 10ms does not interfere with screen updates at all. 20ms is noticeable, but still very usable. 100ms and 500ms are best used either on a headless system or when the user will not be present, as they drop the screen redraw rate to unusable levels.

NTLM Comparison Rates

Kernel time   1 hash    10 hashes   100 hashes   1000 hashes
10ms          571 M/s   4913 M/s    20536 M/s    18100 M/s
20ms          581 M/s   5003 M/s    20919 M/s    27600 M/s
100ms         590 M/s   5078 M/s    21240 M/s    30450 M/s
500ms         599 M/s   5158 M/s    21617 M/s    30800 M/s


MD5 Comparison Rates

Kernel time   1 hash    10 hashes   100 hashes   1000 hashes
10ms          410 M/s   3665 M/s    17940 M/s    17633 M/s
20ms          418 M/s   3740 M/s    18310 M/s    17760 M/s
100ms         425 M/s   3804 M/s    18670 M/s    29850 M/s
500ms         431 M/s   3862 M/s    18970 M/s    30260 M/s


System Requirements

These binaries require CUDA (nVidia's API for programming GPUs to do non-graphics work). You will need the following:
  • An nVidia GPU that supports CUDA - pretty much any 8000 series or later GPU should work
  • Enough video RAM - this varies. OS X uses far more video RAM than a headless Linux server. 256MB should be enough.
  • The appropriate drivers. This is the tricky bit. You have to have recent nVidia drivers with CUDA support.
  • Compatible libraries. This should only be an issue with Linux, and any decently recent Linux should be fine. Non-standard libraries are included.
Driver Downloads

To get the driver you need, go to http://www.nvidia.com/object/cuda_get.html and download the driver for your OS. Recent nVidia drivers for Windows and Linux have CUDA support built in; if it doesn't work, update your driver. On OS X, also download the CUDA Toolkit and make sure to select the CUDA kext in a custom install.


Downloads

MD4/MD5/NTLM - 32/64 bit Linux binaries:
http://cryptohaze.com/releases/CUDA-Multiforcer-Linux-0.61.tar.bz2

MD4/MD5/NTLM - 32 bit Windows binaries (works on both 32-bit and 64-bit Windows; tested on XP and Vista):
http://cryptohaze.com/releases/CUDA-Multiforcer-Windows-0.61.zip

MD4/MD5/NTLM - OS X binaries (Intel-only):
http://cryptohaze.com/releases/CUDA-Multiforcer-Mac-0.61.tar.bz2


FAQ

Why?

Why not? Well, I needed to work out how to do cross-platform CUDA binaries, and how to avoid interfering with the GUI. I had a brute forcer lying around from some other work, solved the problems I needed to, and realized it was really close to a release-quality product, so I finished tweaking it and released it. Plus, all the other CUDA hash cracking products I've seen are Windows-only, and that makes Bitweasel sad. nVidia stream processors belong in headless Linux servers!

Will you release the source code?

Yes - once I clean it up. Once everything is consolidated to a single codebase, with support for hash expansion, I'll release the code and take submissions of new hashes. This also requires updating things to handle hash lengths greater than 128 bits (I'd like to theoretically support out through SHA-512 or beyond).

I get errors and my GPU is listed as "Device Emulation"

Then you either don't have a CUDA compatible GPU or you don't have the drivers updated. Sorry.

What does the -m parameter do?

The -m parameter sets the target execution time, in ms. When running CUDA code on a system with an active display, the display cannot be updated while a kernel is running. This requires the work to be broken into small chunks so the display can update. However, the smaller the work unit, the less efficient the kernel is (as seen in the performance tables). This parameter allows tuning of the kernel execution time to take this into account. A target time of 10-15ms will leave your display effectively "normal" and allow typical desktop activities, watching movies, and possibly light gaming. Longer target execution times of 100ms or greater will dramatically affect screen updates, but will provide better performance. On a system that is not being used, a headless system, or a GPU that does not have any monitors attached, target execution times of 500ms or greater will allow the maximum performance.

It won't run.

Yes... this is a common problem with bleeding edge software. This is also the reason for the support forum. First steps would be to ensure you're running the most recent nVidia provided driver, and then to see if you can run /any/ CUDA enabled programs. CUDA requires an nVidia Geforce 8000 series or above (or some of the Quadros). Additionally, if you are using a large amount of video RAM, there may not be sufficient memory remaining for the kernels to launch. Debugging for this will be added in future versions, but if you are getting vague errors, try rebooting and running the code with no other applications open. If this doesn't help, post details of your error and configuration in the forum and I'll try to help.

Why is it slower on Vista than on Linux?

Classy as it is to say "Blame Microsoft," it's true. They changed the driver model for video cards, and as a result, kernel launches take longer, and the GUI does not let the GPU do whatever it wants. Optimization information welcome.

Forget desktop response - how do I make this run as fast as possible?

The best option is a headless Linux server with a very high end GPU in it. If that's not an option, try passing --maxthreads --autotune -m 500 to it. You may see improvements with an even longer kernel execution time - it depends on the system. Your display will be nearly unusable while this is running.
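As a concrete sketch of the flags above (the charset and hash file names here are placeholders, and the command itself is only printed, not run, since no GPU is assumed):

```shell
# Throughput-oriented flags, as described above. Placeholder file names.
ARGS="--maxthreads --autotune -m 500"

# Print the full command rather than executing it:
echo "./CUDA-Multiforcer -h NTLM -c charsetall -f hashes.txt $ARGS"
```

Expect the display to be nearly frozen for the duration of each 500ms kernel while this runs.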

Can I leave this running while I game?

Yes. The default time of -m 50 should allow for decent gaming performance, and the kernel will auto-tune the amount of work to handle this. You may see better game framerates (but lower cracker performance) with a lower kernel execution time. If it kills performance, just quit the program.

I set a very low -m time and my performance is really, really bad.

There is a certain amount of overhead for launching the kernels and getting them set up. Too short of a kernel execution time will cause the kernels to spend most of their time in setup, not getting much work done. Sorry. Try forcing a lower --threads or --blocks count if you really need to do this.
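A back-of-the-envelope sketch of why this happens. The 3ms per-launch overhead figure is invented for illustration; the real number depends on your OS and driver:

```shell
# Fraction of wall time spent on real work, assuming a fixed
# (hypothetical) 3 ms of setup overhead per kernel launch:
for t in 1 10 50 500; do
  awk -v t="$t" 'BEGIN { printf "-m %3d -> ~%.0f%% useful work\n", t, 100 * t / (t + 3) }'
done
```

With these made-up numbers, -m 1 spends only about a quarter of its time doing useful work, while -m 500 is over 99% efficient.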

[Some other product] is faster/better/gave me a kitten

Great! Use it! If all you need to do is crack a single MD5 hash on a Windows system, BarsWF is faster. If you need to crack a HUGE hashlist, there are some paid products that are faster. If you need to crack a reasonable number of hashes at once, or don't use Windows, this is what there is.

It doesn't run on my ATI card! When will you support ATI?

When either I or someone else gets around to it. I've got higher priorities right now than supporting ATI cards.

How about resume support?

If there's enough demand, I will add it.

I have another question...

The forum is linked in the nav on the left.



Documentation

Parameters

-h / --hashtype {MD4,MD5,NTLM} (required) This specifies the hash type to search.
-c / --charsetfile <filename> (required) This specifies the charset file. The downloads come with several demo character sets, but you can create your own: a simple file containing the characters you wish to search. These brute forcers do support foreign characters within the ASCII space, but do NOT SUPPORT UNICODE (yet). In verbose mode, the active character set will be echoed for confirmation.
-o / --outputfile (optional) This specifies the output file for found hashes. The file will be appended to, not overwritten.
-f / --hashfile (required) This specifies the file of hashes. Hashes should be in ASCII-hex format (as they are typically found), one per line. The file should end with a newline.
-v (optional) Verbose output. Significantly greater detail on what is occurring behind the scenes.
--min / --max (required) These set the minimum and maximum password lengths to search. Lengths of 0 through 14 are currently supported.
-d / --device (optional) If you have multiple CUDA-enabled video cards, this selects which card to use. The current card is printed on program execution. Default is 0 (the first CUDA GPU in the system).
-m / --ms (optional) This specifies the target kernel time, in milliseconds (1/1000th of a second). On a system with a GUI, lower times allow better display response, but lower performance. See below for more details. The default is 50ms, which should not interfere with general system use.
-b / --blocks (optional) Force a certain block count (default 128).
-t / --threads (optional) Force a certain thread count (default 64).
--maxthreads (optional) On higher-end video cards (8800, 9800, GTX series), use the maximum number of threads per block instead of the default 64. This may improve performance. This option will cause kernel launch failures on lower-end cards.
--autotune (optional) Attempt to automatically tune the block count for best performance. This does not work very well on Vista. It is best combined with a very long per-kernel execution time and --maxthreads.
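Putting the required parameters together, a minimal sketch (the file names are examples; the NTLM hash shown is the well-known hash of "password"):

```shell
# Custom charset file: just the literal characters to search, no separators.
printf 'abcdefghijklmnopqrstuvwxyz0123456789' > charset-alnum.txt

# Hash file: ASCII-hex hashes, one per line, ending with a newline.
# (This is the NTLM hash of "password".)
printf '8846f7eaee8fb117ad06bdd830b7586c\n' > hashes.txt

# Example invocation - commented out, since it needs a CUDA GPU and the binary:
# ./CUDA-Multiforcer -h NTLM -c charset-alnum.txt -f hashes.txt \
#     -o found.txt --min 1 --max 8 -m 50
```

Found passwords would be printed to the screen and appended to found.txt.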



Operation

When launched, the Multiforcer displays the version number and basic information on the selected GPU, then loads the character set and hashes. Once this is complete, if --autotune is specified, tuning begins for the GPU kernels. This tests the GPU under the current conditions (GPU, number of hashes, desired kernel time in ms) to find the optimum number of blocks. In verbose mode, this provides information on each test. After the parameters have been optimized, the kernel starts executing. Passwords are displayed as they are found (and written to the output file if specified), and the current speed and progress are displayed.

Kernel execution involves testing a given segment of the password space. Each thread gets its own chunk of password space to check. Eventually, everything has been checked and the segment completes.

The kernel performs some dynamic performance tuning while executing. The amount of work done for each kernel is dynamic, and is adjusted roughly once per second. This allows the kernel to back off if something else (a game, perhaps?) is using the GPU. In testing, this code was able to peacefully coexist with World of Warcraft running at 60fps, full detail. The code reduced its work-per-kernel to remain at the target time, and while the throughput dropped while gaming, it was able to make use of unused GPU cycles to continue calculating. This is less likely to work with a very demanding game such as Crysis.

Parameters displayed during execution:
Work (verbose only): The number of passwords being tested, per thread, per kernel execution. This is adjusted on the fly to keep the execution time near the desired point.
Done: The percentage of the current character space searched.
Time (verbose only): The time of the kernel executions, in ms.
Step rate/search rate: The step rate is the rate at which the GPU iterates through passwords; it slows down as more hashes are added. The search rate is the effective search rate (comparisons per second): the step rate times the number of hashes.
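The step rate/search rate relationship can be checked with quick arithmetic (the numbers here are invented for illustration):

```shell
# Search rate = step rate * number of hashes.
STEP_RATE=200   # stepping rate, in millions of passwords per second
HASHES=100      # number of hashes loaded
echo "Search rate: $((STEP_RATE * HASHES)) M comparisons/sec"
```

So a modest 200 M/s step rate against 100 hashes yields a 20000 M/s effective search rate, which is why the tables above show compare rates climbing as hashes are added.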