Developer Concepts Guide

From Cryptohaze Project Wiki

Jump to: navigation, search

This is the Developer Concepts Guide. This is my attempt to document what I'm thinking, how things are structured, and the general way I'm trying to handle this project.

Overview

If you're reading this, you probably understand at least something about the Cryptohaze tool suite. They're open source, GPLv2 tools for high performance password cracking. There are two tool chains: The rainbow table stuff (GRT), and the brute forcing stuff (Multiforcer/MFN). MFN, by the way, stands for "MultiForcer New" - it's just the new multiforcer framework.

I'm always looking for developers. If you're interested in helping out, this document should give you an overview of what's going on, and where you can possibly help out.

Philosophy

Software requires a philosophy, right? The general guiding principles for the Cryptohaze tools are:

Make it as simple as possible to add new hash types and extend the tools, when possible.
Make the kernels fast. It's a password cracking tool - speed is king here.
Have robust cross platform support. Windows, Linux, and Mac OS X are the supported OSes, with nVidia, ATI, and CPUs being supported through algorithms.
Don't be evil. The code shouldn't do anything outside what one expects a tool to do. No network connections without a reason, keep to your own directories, etc.

The tools can be used for good or evil. The goal here is to at least balance the playing field so everyone has access to the same performance. The specific target audience is pentesters and whitehats. Many of them use NDAs about password hashes, and are therefore at a disadvantage to blackhats who can just spray hashes to websites for cracking. Pentesters also typically have slightly different algorithm requests and are more likely to be using laptops, so performance on lower end machines is important.

As far as developers go, my goal is to put the complicated code into blocks that are safely abstracted out. This obviously won't work in ALL cases, but I'd like to at least handle most of the cases sane so that new algorithms can be added by simply implementing the algorithm & putting the needed boilerplate around it. It's a grand goal, but it also assumes other people care to add kernels... Please, justify the open source aspects of this! Me being the only developer is not useful. There are a few others, so you're in good company.

Required knowledge

This guide, and the developer section in general, assumes a moderate to deep knowledge of CUDA, and at least some OpenCL (though they're definitely similar). The biggest difference is that the OpenCL cards require a full vector architecture to extract performance, and it's more optional with CUDA.

If you're looking to really contribute, you'll need to know CUDA, and OpenCL. Most of the core bits are already written, but you'll need to be able to understand the hows and whys of the kernels.

Also, everything is written in C++. You'll need to know C++, unless you're the type who wants to port everything to Python, just because you can or something. I will not support porting efforts to Python, btw...

Otherwise, general password cracking knowledge is assumed.

If you're familiar with coding in general, but not CUDA (or OpenCL), the best way to start is with the nVidia CUDA Programming Guide. This covers the basics of GPU computing, the quirks of GPU performance, and CUDA. CUDA is much easier to start with than OpenCL, in my opinion.

Another very useful chunk of knowledge is how current hash algorithms (MD5, SHA1, etc) work. Understand the input blocks, output blocks, and other general concepts. This will help make sense of the code.

Threaded execution environment

The compute threads (CUDA, OpenCL, CPU) run as their own individual classes, but they share some data. This is important to know and understand if working with the hash classes in detail. Things like uncracked hash lists, generated bitmaps, etc, are shared among ALL the instantiated classes. Per-device data like device addresses is private to each class. This lets us avoid generating a lot of data on the host, while still allowing each class to have its own private member data related to the current GPU. It works well enough because the different device classes run very similar kernels.

The core feature this enables is the ability to use different devices with common shared data. Each device type has it's own class, but they share common data at whatever level they can share it. Typically this is on a "per-task" level.

Classes

Everything is broken into classes. As is typical for something like this, the high level classes implement the most generic behavior, and subclasses implement more and more specific behavior until the final instantiation implements everything needed for a specific hash type.

Currently, there are 3 supported architectures: CUDA (for nVidia GPUs), OpenCL (for ATI GPUs only, and *maybe* CPU targets), and CPU (SSE2/etc for CPU targets). OpenCL not compiling for nVidia is not a problem, though if you wanted to fix this, I wouldn't complain...

To use the MFN framework as an example for plain hashes, the class breakdown is as follows:

MFNHashType is the very base class. It implements fairly little functionality, but everything else is built from it.

MFNHashTypePlain is the class that implements generic plain hash cracking (unsalted hashes). It is capable of supporting both "little endian" (MD5/MD4/etc) and "big endian" (SHA type) hashes, and does most of the common work to all this hash type. This class contains many static data elements shared between all the running classes, such as the hash list and bitmaps. These are protected by a mutex (as STL types are distinctly not threadsafe for writes!). As most of the functionality to crack hashes is common between device types (set up the lists, create bitmaps, run kernels until done, report results, etc...), much of the functionality lives in this class.

The next layer down are the device classes. These implement generic device functionality for CUDA, OpenCL, and CPU.

MFNHashTypePlain{CUDA,OpenCL,CPU} implement the device-specific common functionality. This is the first layer in which the classes care what type of device they are running on. These classes implement functionality like doing any device setup/teardown, kernel compilation (for OpenCL), allocating host and device buffers for data transfer, copying data to/from the device, etc. They also handle reporting the found hashes, as these must be read from a host buffer that is related to the device type. To understand these classes, you'll need to know the host side of whatever device tech is being used.

Finally, the final leaf (as far as the host code goes) are the hash-specific classes.

MFHHashTypePlainCUDA_MD5 is an example of this. It handles copying constant data to the device (since this is usually in the class namespace to avoid issues), it handles pre and post processing the hash to improve performance, and it handles the actual kernel launch. Note that for CUDA, this is actually a trampoline into the .cu file (where the actual kernel launches must live, since I'm still using the runtime API).

For the CPU device types, this is the end of the line, and these classes are compiled with SSE2 support, containing the actual kernels.

For CUDA and OpenCL, the kernel (and, for CUDA, supporting functions) live in separate files. The kernel files for OpenCL are the final code, and the .cu files for CUDA contain the code and kernel launcher (and, more specifically, they contain the kernel generation macro, and a bunch of invocations of that).

The advantage of this setup is that adding new hashes to an existing hash type is very easy. To add a new plain hash type (say, SHA1 - which is in the works), one simply has to generate the actual SHA1 kernels for the devices, the final hash type classes (MFNHashTypePlainCUDA_SHA1, MFNHashTypePlainOpenCL_SHA1, etc), and a tiny bit of glue.

Kernel Design

Right now, the kernels are designed to do everything on the device. They generate plains, hash the plains, and check the found passwords on the device. This is required for high speed kernels - there's no good way to push a few billion passwords per second between the host and the device. Just for a data point: at 3B 128 bit hashes per second, for length 8 passwords, we're consuming (3B * (8 + 4)) = 33.5GB per second, and generating about (3B * 16) = 45GB per second of data. Sorry. PCI-E is fast, but not *that* fast... (and that's just per-GPU!)

The kernels are mostly copy/paste other than the actual hash algorithm. This is by design. The common components are designed to be abstracted out as much as possible to allow developers to focus on the hash algorithm. Stuff like checking the password and generating the next plain are common tasks, and shouldn't require much developer effort. Yet. :)

Future development will look at further concepts, such as generating likely plains on the host. This will never work well with "fast" algorithms, but is worth doing for slower algorithms like MD5Crypt and other complex, iterated algorithms that are much slower.

GPU Quirks

Much of the weird stuff in the code (especially the device code) is designed to work around GPU quirks. Understanding the architecture is really vital to understanding the code. AMD GPUs are much worse for this than nVidia - they have a lot more weird stuff that has to be handled properly. One of the worst differences is "clauses" - branches are much, MUCH harsher on performance with ATI GPUs than with nVidia. Read the OpenCL developer guide by AMD to help understand the GPUs. Also, if anyone wanted to write an actual disassembler for the AMD GPUs, I'd really appreciate it.

Network

The Cryptohaze Multiforcers have network support built in. This is to deal with the reality that a single system can only host so many GPUs before it either becomes unstable or infeasible. Networking lets people cluster as many systems as they desire together and combine the power. This works well because password cracking is "embarrassingly parallel." It scales more or less linearly with the resources thrown at the problem.

The fundamental method of splitting work up is the workunit. By splitting tasks into workunits, the work can be split across as many machines as desired (and as many GPUs/CPUs locally). The current method is that for the local system, each thread requests workunits independently. For the old Multiforcer, each remote system requested workunits as needed (on demand), but this adds latency. With the new robust workunit class that can retry failed workunits (if a network client disconnects, the incomplete workunits are requeued for execution), remote hosts going forward should request a small queue of workunits so that they can service the requests from other threads immediately.

Bloom filters/Bitmaps

One of the things that is seen throughout the Multiforcer is the use of bitmaps, of various size. There are not images! These are a simple case of a bloom filter. For each size, a word value (usually 32 bits) is turned into an offset in the bitmap. This bit is set. By doing this, the GPU kernels can get an "easy no" answer before continuing to check the hash against the hashlist in memory (an expensive operation). The different sizes are to fit into different regions of memory. The 8kb bitmap is for the shared memory on the GPUs, the 256k bitmap fits in L2 cache on GPUs, and the 128mb bitmap lives in global memory (and a request can answer "No, the hash is not present" in one memory request).

Metaprogramming

The OpenCL kernels involve a metaprogramming class that can be confusing to understand. Because the kernels are compiled at runtime, the "boilerplate" code they need for hash checking and password incrementing can also be generated at runtime, for the specific kernel invocation. This is what we do. For password incrementing, it generates the code based on the current vector width and password length. For the hash checking, it generates code based on the currently allocated bitmaps and vector width. This replaces an awful lot of ugly defines...

"Early Out"

Another "weird thing" seen in the kernels is what's termed "early out." This relies on the fact that for most kernels, they progress through the output words sequentially. "a += (bunch of stuff); b += (bunch of stuff); c += (bunch of stuff);" etc. What this means for password cracking is that we can skip work! Since the bitmaps give us a "definite no" capability, we simply check each final value, as generated, against the bitmap. If it's present, we continue on, but if it's NOT present - we're done! If there is no hash present that has the "a" value we've just calculated, we know FOR SURE that the hash is not present in the hash list! There's no point in calculating b, c, or d at this point!

Developer Concepts Guide

Contents

Overview

Philosophy

Required knowledge

Threaded execution environment

Classes

Kernel Design

GPU Quirks

Network

Bloom filters/Bitmaps

Metaprogramming

"Early Out"

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Toolbox