This post assumes a working knowledge of basic Linux commands. For help, consult the looping section of the Bash manual (http://linux.die.net/man/1/bash).
Suggested readings:
http://isc.sans.org/diary.html?storyid=7903
http://isc.sans.org/diary.html?storyid=7906
http://isc.sans.org/diary.html?storyid=7867
http://isc.sans.org/diary.html?storyid=7984
In response to these and other posts, I think it's time to get serious about 1) shortening the time from the start of analysis to a determination of 'malicious' and 2) tackling the massive numbers of these files swarming the enterprise. Both goals require essentially the same techniques described in the posts above, implemented in repeatable ways so they can be scripted and automated.
The malicious PDFs I've analyzed have a few things in common that make this process easier. First, they almost always contain the dropper payload they want to execute. Second, they usually come from free (Gmail, Yahoo, Hotmail) or weakly secured (AOL, MSN) webmail accounts. And, best of all, the encoding scheme used to protect the droppers is always the same: a 255-byte decrementing XOR key.
So, to build a body of files for analysis, start by isolating or collecting all of the PDFs delivered from webmail accounts. Once you have these hundreds or thousands of files, you need to start ripping through them and identifying the evil ones.
Before we start, get the latest version of Didier Stevens' pdf-parser.py (http://blog.didierstevens.com/programs/pdf-tools/). These PDF files sometimes contain duplicate object numbers, lots of unlinked objects, and blobs in the unmapped spaces of the PDF (such as after the %%EOF marker). So, to begin, let's assume one hundred objects and start ripping all of the encoded/deflated objects from the PDF. Many of the output files will be empty, since some object numbers don't exist in the PDF. Get rid of those by removing the 0-length files.
$ mkdir pdf.analysis
$ cd pdf.analysis
$ cp ../1.pdf .
$ for (( i=0; i<100; i++ )); do pdf-parser.py -f -o $i 1.pdf > $i.out; done
$ find . -maxdepth 1 -type f -size 0 -delete
Now we have a collection of extracted objects. As mentioned in Bojan's ISC Diary (http://isc.sans.org/diary.html?storyid=7867), we can search for failed FlateDecodes. A failure may indicate an interesting PDF for follow-up and can be an easy indicator of a malicious PDF.
$ grep failed *
31.out: FlateDecode decompress failed
31.out: FlateDecode decompress failed
Binary file 35.out matches
52.out: FlateDecode decompress failed
52.out: FlateDecode decompress failed
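When there are many .out files, the same check can be reduced to one line per suspicious file. The following is a minimal sketch, not part of the original workflow; the helper name list_failed_flate is made up for illustration, and it assumes the extracted objects are stored as *.out files in a single directory.

```shell
#!/usr/bin/env bash
# list_failed_flate DIR   (hypothetical helper name)
# Prints the name of every .out file in DIR that contains the
# "FlateDecode decompress failed" message, once per file.
list_failed_flate() {
    local dir=${1:-.}
    # -l prints only file names, so repeated hits in one file and
    # "Binary file ... matches" notices collapse to one line per file.
    grep -l 'FlateDecode decompress failed' "$dir"/*.out 2>/dev/null
}
```

Calling list_failed_flate . from inside pdf.analysis should list each object file worth a closer look.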
The malicious PDFs contain an encoded dropper. We've seen simple XOR encoding before, but the nefarious folk of the world appear to have moved on to rotating XOR encoding. The key is incremented or decremented by some fixed amount for every byte processed. When the key reaches either end of the 0x00-0xFF range, it wraps around and picks up at the other end. To deal with this, I updated a previously written multi-byte XOR script to handle 256-byte rotating XOR keys with a given offset. Pair it with a for loop that cycles through all 256 possible start keys, and the encoded blob will be decoded and can be found with a simple grep for a known string.
Here's how it works. In this example, I had already located and carved the unknown blob from the PDF capsule. For automation, however, you can pass the entire PDF file and not worry about the other bytes that get mulched. At this point we're only looking to identify the EXE, not carve it.
$ for ((i=0; i<256; i++)); do echo $i; perl multi-xor-v2.pl -f 1.pdf -o $i.ex_ -k "$i" -R -1; done
$ grep -i KERNEL *
Binary file 0.ex_ matches
Apparently the ROTXOR key starts at 0x00 and is decremented by 1 for every byte processed. This rotation is typical for PDFs of the day, though I have also seen different start points. Now that we have our decoded blob, the rest can be disposed of.
$ mv 0.ex_ Carved_decoded_ROTXOR255_key0_step-1.exe
$ rm *.ex_
$ ls
1.pdf Carved_decoded_ROTXOR255_key0_step-1.exe multi-xor-v2.pl
The .EXE can be run through standard analysis routines to discover the call-outs and second stage drops. This PDF is definitely malicious.
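For readers without a copy of multi-xor-v2.pl, which is not reproduced in this post, the rotating XOR described above can be sketched in plain Bash. This is only an illustration and not the script used here: the function name rotxor and its argument order are invented for the example, and it is far too slow for large files.

```shell
#!/usr/bin/env bash
# rotxor IN OUT START STEP   (hypothetical helper, for illustration only)
# XORs every byte of IN with a key that begins at START and changes by
# STEP after each byte, wrapping around within 0x00-0xFF.
rotxor() {
    local in=$1 out=$2 key=$3 step=$4 byte oct
    # od prints each input byte as an unsigned decimal number;
    # tr puts one number per line so the loop sees a single byte at a time.
    od -An -v -tu1 "$in" | tr -s ' ' '\n' | while read -r byte; do
        [ -z "$byte" ] && continue
        # Emit the XORed byte through an octal escape (safe for NUL bytes).
        printf -v oct '%03o' "$(( byte ^ key ))"
        printf "\\$oct"
        # Rotate the key; the mask makes 0x00 - 1 wrap around to 0xFF.
        key=$(( (key + step) & 0xFF ))
    done > "$out"
}
```

Because XOR is its own inverse, running the function twice with the same start key and step returns the original bytes. Assuming the Perl script applies the start key to the first byte, rotxor 1.pdf 0.ex_ 0 -1 should correspond to the first pass of the loop shown earlier.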
So, to take it to step two, addressing the large numbers of these PDFs, just take the above steps, codify into a script, and run in another loop.
$ mkdir pdf.analysis
$ cp *.pdf pdf.analysis
$ cd pdf.analysis
$ find . -type f -name '*.pdf' | while read -r i; do echo "processing $i"; ../analyzepdf.sh "$i"; done && find . -type f -name '*.exe' | while read -r i; do echo "MATCH: $i"; done | tee matches.txt
The above commands create a directory for analysis, find the PDF files available to be analyzed, and run the analysis script on each of them. The analysis script creates an analysis subdirectory for each PDF, performs the above analysis steps and decodings, identifies the interesting tidbits, and leaves behind the interesting artifacts. When the loop finishes, the find command locates the executables left behind and prints a notification for each PDF found to have a drop, recording this information in the matches.txt file.
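The analysis script itself is not listed in this post. As a rough idea of its shape, here is a hypothetical sketch that covers only the key brute-force step and skips the object extraction. The function names, the output naming scheme, and the DECODER variable are assumptions made for the example; the decoder call reuses the multi-xor-v2.pl options shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of analyzepdf.sh; not the original script.
# Assumption: the decoder lives next to this script unless DECODER is set.
DECODER=${DECODER:-$(dirname "$0")/multi-xor-v2.pl}

# decode_with_key IN OUT KEY: one decoding pass, same options as in the post.
decode_with_key() {
    perl "$DECODER" -f "$1" -o "$2" -k "$3" -R -1
}

# analyze_pdf PDF: try all 256 start keys and keep only the decodings
# that contain the string KERNEL (case-insensitive).
analyze_pdf() {
    local pdf=$1 dir="$1.analysis" key out
    mkdir -p "$dir"
    for ((key = 0; key < 256; key++)); do
        out="$dir/decoded_key${key}_step-1.exe"
        decode_with_key "$pdf" "$out" "$key" >/dev/null 2>&1
        if grep -qi KERNEL "$out" 2>/dev/null; then
            echo "possible dropper: $pdf (start key $key)"
        else
            rm -f "$out"
        fi
    done
    # Drop the subdirectory again if no decoding was kept.
    rmdir "$dir" 2>/dev/null || true
}

# Run only when executed directly with a PDF argument.
if [ "${BASH_SOURCE[0]}" = "$0" ] && [ "$#" -gt 0 ]; then
    analyze_pdf "$1"
fi
```

Only decodings that match are left on disk, so the find for *.exe files in the outer loop picks up nothing but candidate droppers.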
Now you can revisit the PDFs identified in the matches.txt file and carve the droppers out of them.
$ dd if=1.pdf of=c1.bin bs=1 skip=27598 count=834887 && xxd c1.bin | less
834887+0 records in
834887+0 records out
834887 bytes (835 kB) copied, 1.7896 s, 467 kB/s
Apply the ROTXOR decoder scripts to the blob to reveal the executable.
$ for ((i=0; i<256; i++)); do echo $i; perl multi-xor-v2.pl -f c1.bin -o $i.ex_ -k "$i" -R -1; done
$ grep KERNEL *
Binary file 0.ex_ matches
$ mv 0.ex_ Carved_decoded_ROTXOR255_key0_step-1.exe
$ rm *.ex_
$ ls
c1.bin Carved_decoded_ROTXOR255_key0_step-1.exe multi-xor-v2.pl