
Tuesday 28 June 2011

Get your fuzzers ready...

I've been skimping a lot on blog posts of late, partly because I've been doing my exams and relaxing thereafter, but it's also partly down to how busy I've been, and how lazy I am. Anyway, since my last blog post, I've finished my final year project, done my exams and been to both BSidesLondon and BSidesVienna. Both conferences were great, and if I had the money to go to Las Vegas I'm sure I would be going to BSidesLV too (and probably DEF CON while I'm out there). However, money being a little tight means I am unable to do this.

What I have really been doing, though, is focusing on fuzzing, the theory behind it, and setting up my home lab both for bug hunting and for generally trying to get better at pentesting. The idea of fuzzing and bug hunting is really interesting to me, since there are practically unlimited ways to try to break a program.
It's the same idea as all of security -- an attacker only has to find one way in, while a defender has to attempt to stop every single possibility, and the same applies to a security-minded programmer versus a bug hunter.
But back to fuzzing in particular. In case you don't know the basic techniques of fuzzing, there are two main ones -- mutation-based and generation-based.

Generation is a much more comprehensive type of fuzzing, since the input is generated from scratch and can be crafted so that it tests every known feature of a program. The major flaw with this technique (and the main reason it's not used as much as the other) is purely time. You could argue it's also down to knowledge, but I'm lumping knowledge in with time. The reason so much time goes into building a generation-based fuzzer is that in order to generate the data, you need to know how the data is constructed. Now with simple protocols and filetypes this isn't such a bad thing, as there may not be much to learn before you can generate data, but with proprietary protocols and stuff that hasn't been well documented or reverse engineered, this becomes a major problem. You might spend months reverse engineering the protocol and making the best fuzzer possible, just to find out that there were no problems with the particular program (or at least none your fuzzer could find). On the other hand, because generation-based fuzzers are specific to a format, they pay off over time: if you're testing many different programs that handle the same file format or protocol, you only need the one fuzzer to test them all.
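
To make the generation idea concrete, here's a tiny sketch of what it looks like, using a made-up TLV-style protocol (1-byte type, 2-byte big-endian length, then the value). The field layout and the boundary values are purely illustrative, not a real protocol:

#!/usr/bin/python
# Generation-based sketch for a made-up TLV protocol:
# [1-byte type][2-byte big-endian length][value].
# The layout and boundary values are illustrative only.
import struct

BOUNDARY_LENGTHS = [0, 1, 255, 256, 65535]

def build_message(msgType, length, value):
  # the length field is allowed to disagree with len(value) on purpose
  return struct.pack(">BH", msgType, length) + value

caseNum = 0
for msgType in range(256):         # walk every possible type byte
  for length in BOUNDARY_LENGTHS:  # walk the boundary lengths
    for value in (b"A" * min(length, 4096), b""):  # honest and lying bodies
      open("case_%06d.bin" % caseNum, "wb").write(build_message(msgType, length, value))
      caseNum += 1

Even for a toy layout like this you can see where the time goes: you have to know what the fields are before you can walk them.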
However, the startup cost of building a generation fuzzer is usually so great that it is often easier to just use mutation-based fuzzing instead. With this technique the tester takes sample data, changes it, and feeds it into the program being tested. The only problem is that this technique will not get past security measures (you're pretty screwed against encrypted traffic), checksums, or anything else calculated over parts of the data.
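
By contrast, a bare-bones mutation fuzzer really is tiny. This is a sketch, not production code, and sample.pdf is just a stand-in for whatever sample you're using:

#!/usr/bin/python
# Bare-bones mutation sketch: read a sample, overwrite a few random
# bytes with random values, write the mutant out.
import random

def mutate(data, numChanges=8):
  buf = bytearray(data)
  for _ in range(numChanges):
    buf[random.randrange(len(buf))] = random.randrange(256)
  return bytes(buf)

sample = open("sample.pdf", "rb").read()
for i in range(100):
  open("mutant_%03d.pdf" % i, "wb").write(mutate(sample))

Of course, a mutator this dumb runs straight into the checksum problem just described.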

Luckily, there are two compromises:
--find out some of the protocol/file type
--mutate a lot of files

You may think that finding out about the protocol/file type is the same as generation fuzzing, but here it's meant to be used alongside mutation-based fuzzing. And you may also be wondering: if you're already going this far, why not just go and build a generation fuzzer?
This is a valid point; however, this way you need to know a lot less. For example, say you are reverse engineering a proprietary filetype and you've spent days trying to find out what a particular section of 10 bytes is for. If you're making a generation fuzzer, you have to keep getting other files and reversing those too, so that you can generate files correctly; but if you're doing mutation-based fuzzing you can either just fuzz that section or leave it static. The bare minimum you need to know for this type of fuzzing is what stays the same in each sample and what differs. Then you can say "fuzz it" or "it's static, don't fuzz it". Of course, the further into the protocol/file type you get, the better the results you can obtain, and there are programs that try to work this out for you (I haven't personally researched them, but there may be a blog post about them at a later date). Another advantage of this approach is that if there is a checksum or hash that needs to be calculated, this technique can do it. If these are not recalculated, the mutated data will most likely just be rejected, effectively losing a potentially huge chunk of the program that cannot be tested with plain mutation-based fuzzing.
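
As a sketch of what that looks like in practice -- assuming a made-up layout with an 8-byte static header, a fuzzable body, and a trailing CRC32 (real formats will differ; this is just the shape of the idea):

#!/usr/bin/python
# "Smarter" mutation sketch: leave the known-static header alone, fuzz
# the body, then recompute the trailing CRC32 so the mutant isn't
# rejected by the integrity check. The layout here is made up.
import random, struct, zlib

def smartMutate(data, headerLen=8):
  header = data[:headerLen]
  body = bytearray(data[headerLen:-4])  # last 4 bytes = old checksum
  for _ in range(8):
    body[random.randrange(len(body))] = random.randrange(256)
  crc = zlib.crc32(header + bytes(body)) & 0xffffffff
  return header + bytes(body) + struct.pack("<I", crc)

The point is that you only needed to learn three facts about the format (static header, fuzzable body, checksum at the end), not the whole specification.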

Mutating a lot of files basically means you go through a lot more branches of the code's execution, so you get much better code coverage (a program could potentially be used to measure this). This is a technique that has been covered well by Charlie Miller here. It will most likely give a good compromise between the amount of time spent and the problems found. Similar to the technique above, there is a lot of startup time: collecting samples, getting them into an easy form to fuzz, and the fuzzing itself will take a long time since there are a lot of documents (in the slides above Charlie Miller says it would take around 3-5 weeks to fuzz using his own code running in parallel on 1-5 systems at a time). However, this is still often shorter than the amount of time it would take to create a generation fuzzer for a proprietary file format or protocol.
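
The loop for doing this over a whole sample set is nothing special either; something like this, where the directory names and the mutants-per-sample count are assumptions, reusing the same dumb byte-flipping mutator from the sketch further up:

#!/usr/bin/python
# Apply the mutation step across a whole corpus: N mutants per sample.
# Directory names and MUTANTS_PER_SAMPLE are assumptions.
import os, random

def mutate(data, numChanges=8):
  # same dumb byte-flipping as the sketch further up
  buf = bytearray(data)
  for _ in range(numChanges):
    buf[random.randrange(len(buf))] = random.randrange(256)
  return bytes(buf)

MUTANTS_PER_SAMPLE = 50  # tune to taste

if not os.path.isdir("fuzzed"):
  os.mkdir("fuzzed")
for name in os.listdir("pdfs"):
  data = open(os.path.join("pdfs", name), "rb").read()
  base = os.path.splitext(name)[0]
  for i in range(MUTANTS_PER_SAMPLE):
    open("fuzzed/%s_%03d.pdf" % (base, i), "wb").write(mutate(data))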

I have personally spent some time setting up machines for fuzzing the latter way, and here are some problems and solutions I have found thus far:
(note I have created scripts for many of these jobs, which I am not telling you to use, since in some cases they may possibly go against websites' Terms of Use)

So I decided I would start off on my desktop by fuzzing PDF software, so I first investigated what to fuzz that actually uses PDFs. This isn't just Adobe Reader. There are different viewers, editors, converters, creators and other things applications can do with PDFs. A decent list (with specific platforms) can be found on Wikipedia at http://en.wikipedia.org/wiki/List_of_PDF_software, however this isn't a complete list, and I'm pretty sure it doesn't include any web browser plugins.
So now I have a list of possible software - what am I going to fuzz and what platform am I going to use?
Well, I decided that since I know Linux and Windows best, I would use those (I do have the possibility of testing Mac software, but I don't want to fuck up my Mac, and as far as I know it's still not really possible to put OS X in a VM without hacks).
Since I have a desktop able to stay on all night and day fuzzing these (Core i5 with 12GB DDR3), I have installed VMware Workstation and will set up a few VMs of each, with exactly the same environments, since Workstation is able to clone virtual machines.

For downloading files from the internet, I created a script that simply downloaded them from Google. Since I couldn't be bothered to find out how to page through each search result, I just put a different number in with the filetype in each search. It isn't the best working script ever, and won't find every possible PDF in the results, as I had to make a compromise between how long I wanted it to take and what strings the regex would find, because adding even more possibilities to the regex made the run time grow rapidly. I believe I found a decent compromise and have put it here. As I noted before, Google may have something against scripting in its Terms of Service and as such this should not be used; there is also a limit of around 650 results, after which Google brings up a CAPTCHA. (And if you can script your way around a CAPTCHA, you deserve a medal.) If you wish to download more, you could try from different IPs (which is how I imagine it detects scripting) or just wait the required time, which I believe to be around 30-60 minutes.

#!/bin/bash
if [ $# -ne 1 ]
then
  echo "Usage: $0 "
  exit
fi
echo "Getting files.."
userAgent="Mozilla/5.0"
num=1
bottomLimit=1
topLimit=650
for ((i=$bottomLimit; i<=$topLimit; i++))
do
  echo "$num of $topLimit completed"
  ((num++))
  wget -q "www.google.com/search?q=$i+filetype%3Apdf" -U $userAgent -O Temp
  egrep -o "http:\/\/(\w*\.)*\w*(\/\w*|\w*\-\w*)*\.$1" Temp >> Temp2
done
rm Temp
sort Temp2 | uniq > $1List.txt
rm Temp2
resultsNum=`wc -l $1List.txt | awk '{print $1}'`
echo "Searching complete! Total of $resultsNum found. Check file $1List.txt for results"
mkdir $1s
cd $1s
echo "Downloading documents to new directory $1s..."

num=1
for i in $(cat ../$1List.txt)
do
  echo "$num of $resultsNum downloaded"
  wget -q -U $userAgent $i
  ((num++))
done

Note as well that the second part, downloading the files, was taking a very long time, so I made a script of just that part, so that I could run it on several different machines, which sped the download up considerably.
This is that script:

peter@LinuxDesktop:~/PDF$ cat downloadFilesFromListOnly.sh
#!/bin/bash
if [ $# -ne 1 ]
then
  echo "Usage: $0 "
  exit
fi
userAgent="Mozilla/5.0"
resultsNum=`wc -l $1List.txt | awk '{print $1}'`
cd $1s
echo "Downloading documents to new directory $1s..."

num=1
for i in $(cat ../$1List.txt)
do
  echo "$num of $resultsNum downloaded"
  wget -q -U $userAgent $i
  ((num++))
done

This was used by just splitting up the list in the text file created by the first script and running it on different VMs.
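
If you want to do the same, splitting the list only takes a few lines; a rough sketch, where the filename and the number of VMs are assumptions:

#!/usr/bin/python
# Split the URL list into N roughly equal chunks, one per VM.
# The filename and chunk count are assumptions.
N = 4
lines = open("pdfList.txt").read().splitlines()
for n in range(N):
  chunk = lines[n::N]  # round-robin split
  open("pdfList_part%d.txt" % n, "w").write("\n".join(chunk) + "\n")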

Once this was done I decided I wanted to give all the files a generic name, as I would be making lots of slightly different copies of each file. So for this I created (yet another) script so that all the files would be named A.pdf, B.pdf, C.pdf ... AA.pdf, AB.pdf...
This was going to be used with upper and lower case arguments, however after 2 hours of frustration I remembered that it's a case-insensitive drive (damn Macs!!) and therefore ab.pdf is the same as AB.pdf (on the drive, not in the terminal, before anyone tries to argue with me).
Anyway, this script is really bad and I'm sure there's a better way to increment a letter after Z using mod, but I couldn't figure it out and was tired, so here's my script for this part too:

peter@LinuxDesktop:~/PDF$ cat randomRename.py
#!/usr/bin/python
import subprocess
from sys import argv, exit

if len(argv)!=3:
  print "Usage: "
  exit(1)
if argv[1][-1]!='/':
  argv[1]= argv[1]+'/'
array=[]
nextElem=0
for i in range(65, 91):
  array.append(chr(i))
cmd='ls -1 ' + argv[1]
p=subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output=p.stdout.read()
files=output.split()
filesLen=len(files)

for index, i in enumerate(files):
  if index%200==0: #just to have some sort of update to user
    print str(index) + " of " + str(filesLen) + " done."
  #want to make sure am only changing files of same type, so will check extension
  if i[-(len(argv[2])):]!=argv[2]:
    print "File with mismatching extension. Continuing over this file. Best to run on directory filled only with the associated files"
    continue

  if index<len(array):
    newFile = array[index] + '.'  + argv[2]
  elif index<(len(array)**2):
    firstLetter=array[index/len(array)]
    secondPart=index%(len(array))
    secondLetter=array[secondPart]
    newFile = firstLetter + secondLetter + '.' + argv[2]
  elif index<(len(array)**3):
    firstPart=index/(len(array)**2)
    firstLetter=array[firstPart]
    secondPart=index%(len(array)**2)
    secondLetter=array[secondPart/len(array)]
    thirdPart=secondPart%(len(array))
    thirdLetter=array[thirdPart]
    newFile = firstLetter + secondLetter + thirdLetter + '.' + argv[2]
  else:
    continue   
  cmd='mv ' + argv[1] + i + ' ' + argv[1] + newFile
  subprocess.call(cmd, shell=True)  # wait for each mv to finish

You simply use the script stating what directory all the files are in and what their extension is (note: I later found out this only works on Linux when all the extensions are the same case, e.g. png isn't treated the same as PNG, which I should have caught earlier, but I can't be bothered to change the script now). This will only work for a max of 26^3 files (and I think it's so badly written it won't actually do that many, but I don't care, I'm giving it away on a blog for crying out loud).
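
For what it's worth, a cleaner way to turn an index into A, B ... Z, AA, AB ... (spreadsheet-column style) is a repeated divmod; a sketch:

#!/usr/bin/python
# Spreadsheet-column-style names: 0 -> A, 25 -> Z, 26 -> AA, 27 -> AB ...
def indexToName(index):
  name = ""
  index += 1  # work in 1-based "bijective base 26"
  while index > 0:
    index, rem = divmod(index - 1, 26)
    name = chr(65 + rem) + name
  return name

# e.g. indexToName(0) == "A", indexToName(26) == "AA"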

And now how to fuzz it?

This I'll leave up to you. I've looked into both the Sulley and Peach fuzzing frameworks; Sulley looks really good, but I believe it's Windows only. Peach, however, is multi-platform, though it looks as though it may be harder to understand. I would recommend checking it out if you haven't already, because it looks like it can do just about anything.
Or you can fuzz any way you want. You may want to use the '5 lines of python' that Charlie Miller swears by (which was actually about 20-35 lines by the time I got it into a working Python program -- still fairly small, but it's only for generating the fuzzed files). If you're doing it this last way on Windows, one way to test the fuzzed files would be to use FileFuzz http://fuzzing.org/wp-content/FileFuzz.zip It's not amazing, but I can't see any reason why you couldn't put a lot of fuzzed files into a directory and tell it to use them; I'm just not sure exactly what FileFuzz checks for, and some trial and error testing may be needed, a sample at a time, in order to get the time needed per test down properly.
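
If you'd rather roll your own harness on Linux instead, the core loop is just: open each fuzzed file in the target, give it a few seconds, kill it, and log anything that died with a signal. A rough sketch -- the target binary, the directory and the 5-second timeout are all assumptions:

#!/usr/bin/python
# Minimal harness sketch: run the target on each mutant, kill it after
# a few seconds, log abnormal exits. Target path, directory and timeout
# are assumptions for illustration.
import glob, os, subprocess, time

TARGET = "/usr/bin/evince"  # hypothetical PDF viewer under test

for path in glob.glob("fuzzed/*.pdf"):
  proc = subprocess.Popen([TARGET, path],
                          stdout=open(os.devnull, "w"),
                          stderr=subprocess.STDOUT)
  time.sleep(5)              # crude per-test timeout
  if proc.poll() is None:
    proc.kill()              # still alive: assume no crash
  elif proc.returncode < 0:  # killed by a signal => probably crashed
    print("possible crash: %s (signal %d)" % (path, -proc.returncode))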

Anyway, I hope that if you try to do any fuzzing, this gives you some help on the basics and the ways you could go about it.

--lavamunky