Monday, December 28, 2009

Malware Analysis Toolkit for 2010

Back in 2008 I posted a list of the tools I use for doing malware analysis. The tools I use have changed over time, and rather than just talk about a couple of recent additions, I decided I'd put a current complete list up with links. This is by no means a comprehensive list of malware analysis tools, it's just what I like and use.

  • VMWare Workstation
  • The "vulnerable stuff:"
    • Windows XP
    • Internet Explorer 7/8
    • Firefox
    • Acrobat Reader
    • Flash Player
General Tools
Analysis Tools

Binary Tools
JavaScript & HTTP Tools
PDF & Flash Tools
Web Sites as Tools

Wednesday, November 18, 2009

ArcSight Logger VS Splunk

You are here because you are searching for information on Splunk vs. ArcSight Logger. I actually wrote this post months before posting it, but sat on it for reasons that may become apparent as you read on.

If you want to hear me talk about my experience with Logger 4.0 through the beta process and beyond, you can check out the video case study I did for ArcSight. In short, Logger is good at what it does, and Logger 4.0 is fast. Ridiculously fast.

But that's not what I want to talk about. I want to talk about the question that's on everyone's mind: ArcSight Logger vs. Splunk?

Comparing features, there's not a strong advantage in either camp. Everybody's got built-in collection based on file and syslog. Everybody's got a web interface with pretty graphs. The main way Logger excels here is in its ability to natively front-end data aggregation for ArcSight's ESM SIEM product. But if you've already got ESM, you're going to buy Logger anyway. So that leaves price and performance as the remaining differentiators.

Splunk can compete on price, especially for more specialized use cases where Logger needs the ArcSight Connector software to pick up data (i.e. Windows EventLog via WMI, or database rows via JDBC). And if you don't care about performance, implying that your needs are modest, Splunk may be cheaper for you for even the straightforward use cases because of the different licensing model that scales downward. So for smaller businesses, Splunk scales down.

For larger businesses, Logger scales up. For example, if you need to add storage capacity to your existing Logger install, and you didn't buy the SAN-attached model, you just buy another Logger appliance. You then 'peer' the Logger appliances, split or migrate log flows, and continue to run search & reporting out of the same appliance you've been using, across all peer data stores. With Splunk? You buy and implement more hardware on your own. And pay for more licenses.

My thinking on performance? Logger 4.0 is a Splunk killer, plain and simple. To analogize using cars, Splunk is a Ford Taurus for log search. It gets you down the road, it's reliable, you can pick the entry model up cheap, and by now you know what you're getting. Logger 4.0, however, is a Zonda F with a Volvo price tag.

To bring the comparison to a fine point, I'd like to share a little story with you. It's kind of gossipy, but that makes it fun.

When ArcSight debuted Logger 4.0 and announced its GA release at their Protect conference last fall, they did a live shoot-out of a Logger 7200 running 4.0 with a vanilla install of Splunk 4 on comparable hardware and the same Linux distro (CentOS) that Logger is based on. They performed a simple keyword search in Splunk across 2 million events, which took just over 12 minutes to complete. That's not awful. But that same search against the same data set ran in about 3 seconds on Logger 4.

This would be an interesting end to an otherwise pretty boring story if it weren't for what happened next. Vendors other than ArcSight - partners, integrators, consultants, etc. - participate in their conference both as speakers and on the partner floor. One of these vendors, an integrator of both ArcSight and Splunk products, privately called ArcSight out for the demo. His theory was that a properly-tuned Splunk install would perform much better. Now, it's a little nuts (and perhaps a little more dangerous) to be an invited vendor at a conference and accuse the conference organizer of cooking a demo. But what happened next is even crazier. ArcSight wheeled the gear up to this guy's room and told him that if he could produce a better result during the conference that they would make an announcement to that effect.

Not one to shy away from a technical challenge, this 15-year infosec veteran skipped meals, free beer, presentations, more free beer, and a lot of sleep to tweak the Splunk box to get better performance out of it. That's dedication. There's no doubt in my mind that he wanted to win. Badly. I heard from him personally at the close of the conference that not only did he not make significant headway, but that all of his results were worse than the original 12 minute search time.

You weren't there, you're just reading about it on some dude's blog, so the impact isn't the same. But that was all the convincing I needed.

But if you need more convincing; we stuffed 6mos of raw syslog from various flavors of UNIX and Linux (3TB) into Logger 4 during the beta. I could keyword search the entire data set in 14 seconds. Regex searches were significantly worse. They took 32 seconds.

Monday, November 9, 2009

Reversing JavaScript Shellcode: A Step By Step How-To

With more and more exploits being written in JavaScript, even some 0-day, there is a need to be able to reverse exploits written in JavaScript beyond de-obfuscation. I spent some time this weekend searching Google for a simple way to reverse JavaScript shellcode to assembly. I know people do it all the time. It's hardly rocket science. Yet, I didn't find any good walk-throughs on how to do this. So I thought I'd write one.

For this walk-through, I'll start with JavaScript that has already been extracted from a PDF file and de-obfuscated. So this isn't step 1 of fully reversing a PDF exploit, but for the first several steps, check out Part 2 of this slide deck.

What you'll need:
  1. A safe place to play with exploits (I'll be using an image in VMWare Workstation.)
  2. JavaScript debugger (I highly recommend and will be using Didier Stevens' modified SpiderMonkey.)
  3. Perl
  4. The script, which you'll find further down in this post
  5. A C compiler and your favorite binary debugger

I'll be using one of the example Adobe Acrobat exploits from the aforementioned slides for this example. You can grab it from milw0rm.

Step 1 - Converting from UTF-encoded characters to ASCII
Most JavaScript shellcode is encoded as either UTF-8 or UTF-16 characters. It would be easy enough to write a tool to convert from any one of these formats to the typical \x-ed UTF-8 format that we're used to seeing shellcode in. But because of the diversity of encoding and obfuscation showing up in JavaScript exploits today, it's more reliable to use JavaScript to decode the shellcode.

For this task, you need a JavaScript debugger. Didier Stevens' SpiderMonkey mod is a great choice. Start by preparing the shellcode text for passing to the debugger. In this case, drop the rest of the exploit, and then wrap the unescape function in an eval function:

Now run this code through SpiderMonkey. SpiderMonkey will create two log files for the eval command, the one with our ASCII shellcode is eval.001.log.

Step 2 -
This is why I wrote this script, to take an ASCII dump of some shellcode and automate making it debugger-friendly.


# crap2shellcode - 11/9/2009 Paul Melson
# This script takes stdin from some ascii dump of shellcode
# (i.e. unescape-ed JavaScript sploit) and converts it to
# hex and outputs it in a simple C source file for debugging.
# gcc -g3 -o dummy dummy.c
# gdb ./dummy
# (gdb) display /50i shellcode
# (gdb) break main
# (gdb) run

use strict;
use warnings;

my $crap;
while($crap=<stdin>) {
my $hex = unpack('H*', "$crap");

my $len = length($hex);
my $start = 0;

print "#include <stdio.h>\n\n";
print "static char shellcode[] = \"";

for (my $i = 0; $i < length $hex; $i+=4) {
my $a = substr $hex, $i, 2;
my $b = substr $hex, $i+2, 2;
print "\\x$b\\x$a";
print "\";\n\n";

print "int main(int argc, char *argv[])\n";
print "{\n";
print " void (*code)() = (void *)shellcode;\n";
print " code();\n";
print " exit(0);\n";
print "}\n";
print "\n";


The output of passing eval.001.log through is a C program that makes debugging the shellcode easy.

Step 3 - View the shellcode/assembly in a debugger
First we have to build it. Since we know that this shellcode is a Linux bindshell the logical choice for where and how to build is Linux with gcc. Similarly, we can use gdb to dump the shellcode. For Win32 shellcode, we would probably pick Visual Studio Express and OllyDbg. Just about any Windows C compiler and debugger will work fine, though.

To build the C code we generated in step 2 with gcc, use the following:

gcc -g3 shellcode.c -o shellcode

The '-g3' flag builds the binary with labels for function stack tracing. This is necessary for debugging the binary. Or at least it makes it a whole lot easier.

Now open the binary in gdb, print *shellcode in x/50i format, set a breakpoint at main(), and run it.

$ gdb ./shellcode
(gdb) display /50i shellcode

(gdb) break main

(gdb) run

Sunday, October 18, 2009

Two-For-One Talk: Malware Analysis for Everyone

These two mini-talks were originally going to be blog posts, but I needed a speaker for this month's ISSA meeting. So I volunteered myself. Here are the slides.

Wednesday, September 23, 2009

Queries: Excel vs. ArcSight

Since ArcSight ESM 4.0, reports and trends have been based on queries. Considering that ESM runs on top of Oracle, a query in ESM is exactly what you think it is. Queries are an extremely flexible way to get at event data. But as the name implies, they go against the ARC_EVENT_DATA tablespace, and therefore you can't use them to build data monitors or rule conditions, since those engines run against data prior to insertion into the database.

Anyway, I've got a story about how cool queries are. And about how much of an Excel badass I am. And also about how queries are still better. Last month, I got a request from one of our architects who was running down an issue related to client VPN activity. Specifically, he wanted to know how many remote VPN users we had over time for a particular morning. Since we feed those logs to ESM, I was a logical person to ask for the information.

So I pulled up the relevant events in an active channel and realized that I wasn't going to be able to work this one out just sorting columns. So, without thinking, I exported the events and pulled them up in Excel. So here's the Excel badass part:

If you want to copy it, here it is:

So A is the column that usernames are in. This formula uses the MATCH function to create a list of usernames and then the FREQUENCY function to count the unique values in the match lists. You need two MATCH lists to make FREQUENCY happy because it requires two arguments, hence the redundancy. It took about an hour for me to put it together, most of that was spent finding the row numbers that corresponded to the time segment borders.

But as I finished it up and sent it off to the requesting architect, I thought, there must be an easier way. And of course there is. So here's how you do the same thing in ESM using queries:

So, it's just EndTime with the hour function applied, and TargetUserName with the count function applied, and the Unique box (DISTINCT for the Oracle DBA's playing at home) checked. And then on the Conditions tab you create your filter to select only the events you want to query against. That's it.

Once the query is created, just run the Report Wizard and go. All told, it's about 90 seconds to the same thing with a query and report that it took an hour to do in Excel.

Sunday, September 20, 2009

The 'Cyberwarfare' Problem

Last week I attended ArcSight's annual user conference in Washinton DC. More about that in a later post. During the conference, ArcSight hosted a panel discussion on cyberwarfare. In DC, where many of ArcSight's biggest customer are based, this is a hot topic, and there will be a lot of time spent discussing it and a lot of money spent on defending against it, maybe.

What struck me about the panel discussion were two comments, both made by James Lewis, one of the panelists, and a director at the Center for International and Strategic Studies. At one point, Mr. Lewis invoked Estonia as an example of state-sponsored cyberwarfare, and made the comment that, "the Russians are tickled that they got away with it." Not ten minutes later, an audience member asked a question about retaliation against cyber-attacks. Mr. Lewis responded to the question by pointing out the problem of attribution. That is, from the logs that the victim systems generated, the IP address(es) recorded can't reliably be used to identify the actual individual(s) responsible for the attack.

Now, I don't intend to pick on James Lewis. It just so happened that one person on the panel expressed the paradox of cyberwarfare. The attribution problem is a big problem for all outsider attacks, not just cyberwarfare. A decade ago, security analysts were calling it "the legal firewall" because US-based hackers would first hack computers in China, Indonesia, Venezuela, or another country that doesn't openly cooperate with US law enforcement, and then hack back into the US from there, causing an investigative barrier that would hinder or prevent an investigation being able to get back to the attacker's actual location.

So knowing that there's a very real problem with being able to identify the source country for Internet-based attacks, it stands to reason that using the same limited forensic data to not only identify the actual source of an attack, but to determine that it is in fact state-sponsored, and not, say, a grassroots attack armed by a teenager, is a stretch. And for that reason, the question of cyberwarfare is an open one. Until a government actually comes forward and claims responsiblity for an attack, it's unprovable.

So as the government spends $100M on cyberdefense over the next six months, it's important to try and answer the question, "What is the military actually defending against?" At the very least, it's fair to say nobody knows for certain.

Wednesday, August 12, 2009

Inbox 3

Teguh writes,

Hi Paul,
could you give some guide to administering logger? i searched thru
google, but found nothing significant. How to(s) and tutorial would be enough i
guess. Does it have to have syslog server for the logger to be able to read data

The documentation for Logger is available from ArcSight's download center. Only registered customers have access, but I assume that if you've got a Logger box, that generally qualifies you.

With regard to your second question, yes Logger has a syslog server. It actually has a few. In Logger nomenclature these are "receivers." Logger supports UDP and TCP syslog, FTP and SSH file pull, NFS and CIFS remote filesystem. Logger also supports some ArcSight-specific receivers including a SmartMessage receiver for events forwarded from ESM and CEF-over-syslog (OK, ArcSight wouldn't agree that this is specific to their products, but despite the C standing for Common, CEF is anything but. At least right now.)
  1. Configuring Logger to act as a syslog server is pretty straightforward.
  2. From the web interface, navigate to Configuration, Event Input/Output.
  3. On the "Receivers" tab, click the Add button.
  4. Name your connector and set the type as "UDP Receiver" then click Next.
  5. The defaults for Compression Level and Encoding are fine. Select the IP address you want the listener to reside on, and set the port number. The default syslog server port is UDP/514.
  6. Click Save.
  7. On the "Receivers" tab, click the little no-smoking image next to the new receiver to enable it.

Tuesday, June 23, 2009

Nobody Sells Laptops for The Price of Silver

If you haven't already, I recommend that you take 20 minutes and read "Nobody Sells Gold for the Price of Silver" by Cormac Herley and Dinei Florencio. (PDF Link) This is an excellent analysis of the research into and press coverage of the underground economy. It's a fascinating read, and they make a cogent argument that the underground economy is more myth than reality. I don't want to say more because it will ruin it for you.

Now I have an excercise for you. First, read the Herley/Florencio article. Then, read Bruce Schneier's experiences with trying to sell a laptop on eBay. Now think about the implications of the "Ripper Tax" on eBay. Now ask yourself why you haven't already sold any stock you own in eBay.

Thursday, June 18, 2009

PCI-DSS and Encrypting Card Numbers

OK, I'm about to do something dumb and talk about cryptography and cryptanalysis. I'm an expert in neither of these things. But despite the fact that somebody smarter than me should be telling you this, you're stuck with me, and I think I have a point. So here goes.

I had a bit of an "A-ha!" moment earlier today around PCI-DSS, specifically requirement 3.4 from v1.2 of the standard. Here's the relevant language from that requirement:

3.4 Render PAN, at minimum, unreadable anywhere it is stored (including on portable digital media, backup media, in logs) by using any of the following approaches:
  • One-way hashes based on strong cryptography
  • Truncation
  • Index tokens and pads (pads must be securely stored)
  • Strong cryptography with associated key-management processes and procedures
The bottom line is that this requirement fails to provide adequate protection to card numbers. Here's why.

Truncation and tokenized strings with pads have limited use cases. In the case of truncating card numbers, PCI-DSS recommends only storing the last 4 digits of the card number. You wouldn't choose truncation for a program that validates a card number because there would be too great a potential for false matches. It would only be helpful for including in receipts, billing statements, and for use in validating a customer identity in conjunction with other demographic information. Database tokens only provide adequate protection in environments where there is a multi-user or multi-app security model, and if there are flaws in the applications that have access to the pads, then your data is pwned.

So for the sake of maximum versatility and security, you're likely (or your software vendor is likely) to opt for hashing or encryption. But you still have a serious problem. While one-way hashes like SHA and block ciphers like AES can provide good protection to many forms of plaintext, credit cards aren't one of them. That's right, the problem isn't actually in the way you encrypt credit card numbers, it's that credit card numbers make for lousy plaintext to begin with.

Take for example the following row of data from my hypothetical e-commerce application's cardholder table:


In this case, we have what we need to use the card, except the card number is hashed with MD5. (Ignore what you know about MD5 collisions for a moment, since this problem also exists for SHA or any other method of encrypting the card number.) If we calculate the possible number of values that could be on the other side of that hash, it would be 10^16, or about 10,000 trillion for the 16-digit card number. That's roughly twice as many possibilities as an 8-character complex password (96^8), which is an acceptable keyspace size, but also completely doable for a tool like John The Ripper.

But if you know credit card numbers, then you've already realized that it's even worse than that. The first 4-6 digits of the card number are a misnomer in calculating keyspace. There aren't 1 million actual possible values. Since that row from my e-commerce app's database told me the card issuer, I know within 4-5 guesses the first two to four digits of the card number, and the last four are right there as well for inclusion on statements, etc. In this case, since it's a Discover card, we already know that the card number is 6011XXXXXXXX1111. Now we've cut the possible values we must guess in half, from 10^16 down to 10^8, which is a mere 100 million possibilities. There are other clever things we can do if it's encrypted with a stream cipher like RC4 or FISH, because we know the beginning and end values of the plaintext. But guess what? It's cheaper and easier to brute-force it even if lousy crypto is used. Even on the scale of millions of records. Even with salting, it's still worth it to brute-force the middle digits.

But wait, there's more! As if publicly known prefix values weren't enough, credit card numbers are also designed to be self-checking. That is to say, the numbers contain something like a checksum that, when a known algorithm is applied to the 7-digit account number, 3 digits of which we know from our last-four field, can be used to validate the card number. This was designed as an anti-fraud mechanism that would allow cards to be checked without a need to communicate with a clearinghouse. But this algorithm allows us to only generate valid account numbers, combined with partially-known prefixes, to reduce the keyspace significantly. And since this is a known algorithm I can (and someone already has) very easily write a tool that combines a brute-force password cracker with a credit card generator.

The bottom line is that, because of the already-partially-known nature of credit card numbers, simply encrypting card numbers inside a database or extract file is insufficient protection. The PCI Security Standards Council should revisit this requirement and modify it to, at the very least, require symmetric-key block ciphers and disallow stream ciphers and one-way hashes. But even then, I suspect, encrypted card numbers will be at risk. Certainly row-level encryption of card numbers should not qualify for "safe harbor" when it comes to breach notification laws.

PS - Extra credit if you crack the full card number from the hash above and post it below.

Thursday, June 11, 2009

From The Inbox 2

lmran writes:

Hi Paul,
Do you know any reason why ArcSight ESM does not support the Cisco MARS? Right now, all my firwalls send the syslog feeds into Cisco MARS and I'm trying to set the Cisco MARS to send thoes raw feeds data to ArcSight local connector but I just found out that ArcSight does not support the Cisco MARS. Thanks in ADV for any info reading this subject.

Starting in 4.x, MARS can forward events to another remote syslog listener. ArcSight has a syslog connector. So you ought to be able to forward events from MARS to ArcSight via syslog assuming MARS doesn't change the format of the log events too much. Even if MARS does mangle the event format, ArcSight will still receive them, but then most or all of the event will be parsed into the CEF Name field and categorization and prioritization won't be accurate.

If you are unable to upgrade your MARS appliance to 4.31 or later (I think that's the rev you need), another option would be to use a syslog-ng server out front. It supports forwarding events by source to other syslog servers. You could use this to send the stuff you want in ESM to ArcSight's syslog Connector and the stuff you want in MARS to MARS.

Or, you could do the environmentally conscious thing and unplug then recycle your MARS appliance. ;-)

Tuesday, June 9, 2009

From The Inbox

Anonymous writes,

Hi Paul, I am one of those who, as you say, found your blog by googling ArcSight, trying to do some recon on the product for my employer. (I think I see that the most recent posts here are from 2007 so who knows if you or anybody will be seeing my question.) I'm trying to find out, can Arcsight's data be queried programmatically; i.e. is it stored in a relational database, hopefully SQL Server or Oracle, or if not, is there an API or ADO.NET provider that can allow it to be queried, preferably with SQL? Thanks for any info anyone reading can provide.

ArcSight ESM uses Oracle 10g for its back-end database. At one point, and this may still be true, DB2 was also supported. You can query the database directly, and the schema is pretty straightforward. The table ARC_EVENT_DATA is where most of the event data lives, for example. But depending on your use case, that might not be the best way to get data out of ESM.

Also, since you didn't specify, it may be worth mentioning that the same is not true of the ArcSight Logger platform, which is flat storage. Instead of querying the log store directly, Logger can be configured to forward events based on source, type, etc. to another destination, if you need them in real-time. There is a PostegreSQL database on Logger, but it's my understanding that it supports the reports engine, and doesn't store the raw or CEF events in any comprehensive way.

The interesting thing is that the storage technology behind Logger 3.0, because of its performance and relative "cheapness" may become the data store for ESM down the road. It would only make sense, since you could handle MUCH higher event rates with less disk and no Oracle license fee. If it can be done while maintaining the stability and feature set that the Oracle-based data store has, it's a walk-off home run for ArcSight.

Monday, June 1, 2009

New Rules

After many months off, I'm jumping back in to the blog with both feet. Mostly in a Howard Beale sort of way. Didja miss me? Anyway, stealing a meme from Bill Maher, I've got something to say to security vendors. Without further ado, New Rules.

If you are a vendor, especially a vendor of security products or services, these are the rules I expect your product to follow. These are common sense, and I feel a little condescending telling them to you. But if recent experience is any indicator, you need to hear them. And you deserve the condescension.

  1. Do not store credentials in clear text! Seriously, you can get free libraries to hash credentials or store them in a secure container file that requires a secret key. There's no reason for a password to be in a text file or HKLM Registry key. None.
  2. Do not hardcode passwords! If I can't change every single password associated with your product simply and easily, then there should be a law that strips all of your developers of any degree they hold and forces them to go back to college and learn file IO methods.
  3. Do not use HTTP/Telnet/FTP/LDAP for authentication! Seriously, more than enough free libraries for SSH, TLS, IPSec exist. Use one. Or buy the one you really like. It beats having to issue a "patch" to sell to government and regulated industry.
  4. Don't run as root/SYSTEM/sa/DBA! Your product is not so special that it actually needs administrative privileges to run on the server or database that hosts it. Unless by "special" you mean "coded by lazy fools that don't want to define even the most basic security model." OK, then it is special.
  5. Don't use broken crypto algorithms! Sorry, but if you are shipping new product that uses 56-bit DES, RC4, or ROT13, please see rule #3.
  6. Don't send passwords in e-mail! Remote password reset is easy enough to do properly, there's no reason to be lazy and just send me my password if I forget it. Also, it means you're breaking rule #1. Busted.

There are no excuses for any product to not follow these rules, but especially security/compliance products. Gee, thanks. I just spent six figures on a product to help me manage or achieve compliance, and the product itself can't comply with the regulation I'm trying to address.

Monday, January 19, 2009

The Next Phase

For those of you who haven't given up on my blog (or forgot it was still in your feed list), I want to let you know that I will be back to it later this year. More punditry, more metrics, more SIM, more cool random technical stuff. I'll try anyway. I've been missing it, but I had too much going on, had to prioritize, and this blog has rusted as a result.

A lot has changed since my last blog post in November - a new position at work, a new baby daughter - and the one thing that I've come to realize is that changing is hard work, but if you want it, it's worth it. There's been an excessive amount of talk about change this past year, and on the eve of President Obama's inauguration, I've decided to share with you this story of a moment I had recently.

On November 5th, the day after Election Day 2008, I spoke at the SecureWorld Expo conference in Detroit. I've been in West Michigan for the past several years, but I used to live and work on the East side of the state. It was a gorgeous Wednesday, clear and unseasonably warm for November. And as I was driving westbound on I-96, into the dusk between me and the sunset, I looked up and found myself in familiar territory - Webberville.

You've probably never heard of Webberville, Michigan. That's OK. It's a rural town on the automotive corridor where in the 1990's, companies got huge tax breaks to buy up farmland and build factories. And in 2001, I had an office in one of those factories. That company (a "Tier One" in industry lingo because we sold directly to car makers), like many automotive suppliers, has since gone out of business. And despite working there only a year, I have some very fond and vivid memories of that job. Perhaps the most vivid, however, is driving that stretch of I-96 between Webberville and Wixom and hearing the radio newscaster describe the second plane hitting the World Trade Center on 9/11.

That day changed everything for Americans. I was living in the Midwest, working in a one-story office that had highway on one side and cows on the other, but for the weeks that followed the attacks, I was afraid. We all were. I recall making that drive to Webberville again a week later while all of the planes were still grounded and thinking to myself, "How long until we recover? Can we recover? What will it take for us to move forward?"

Not get over it. Not forget. But move forward - take the next step as a society, as a culture, as a country.

So back to 11/5/2008, and my drive home from SecureWorld, less than 24 hours after learning that Barack Obama - a young, African-American man - would be our next president. And it was there, on that piece of highway in rural Michigan that I answered my own question. Seven years and two months later, I knew America was moving forward. We were moving forward.