The defaults for the 0.7 series are pretty bad for Fermis.  I'm working on improving that.
I'm surprised you're finding best results with a thread count that isn't a multiple of 32.  Usually threads should be a multiple of 32 (the warp size), and I've found block sizes that are multiples of 60 work nicely.
Linux is a better environment for this.  The Windows driver model really sucks, and the fact that you're getting significantly more speed on Linux points to that.  The tools are really designed for a headless Linux environment; they just happen to work tolerably with GUIs under Linux, and work with GUIs under Windows.
As for the PHP script... not difficult.  Just generate random character strings, run each one through the built-in md5 function, and write the output to a file.  I don't keep simple stuff like that in version control.
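
If it helps, here is a rough sketch of the same idea.  It is in Python rather than PHP, so hashlib.md5 stands in for PHP's md5(), and it is not the original script.  The character set, the string length, the hash count, the output file name, and the one-hash-per-line layout are all assumptions made for illustration.

```python
import hashlib
import secrets
import string

# Assumed character set: letters and digits. Swap in whatever you want to test.
ALPHABET = string.ascii_letters + string.digits


def random_string(length):
    """Return a random string of the given length drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def write_md5_test_hashes(path, count, length):
    """Write `count` MD5 hex digests of random strings to `path`, one per line.

    Returns the list of plaintexts so the cracked results can be checked later.
    """
    plains = []
    with open(path, "w", encoding="ascii") as out:
        for _ in range(count):
            plain = random_string(length)
            digest = hashlib.md5(plain.encode("ascii")).hexdigest()
            out.write(digest + "\n")
            plains.append(plain)
    return plains


if __name__ == "__main__":
    # Hypothetical values: 1000 hashes of 6-character strings.
    write_md5_test_hashes("test_hashes.txt", 1000, 6)
```

Keeping the returned plaintexts around makes it easy to confirm that whatever the cracker finds actually matches what was generated.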