Tuesday, June 5, 2007

SIM Sizing - CPU (+ Performance Tuning)

There are a couple of places across a traditional SIM that are more susceptible to performance degradation than others. Here's the short list:

The Database
There are basically two things that can beat up your database. The first is drowning it with INSERTs of new log data. While this will manifest as poor performance and high CPU utilization, the problem is most likely disk array write performance. The answer to that problem is most likely expensive. Sorry.

The second thing that can kill database performance is event searches. This can happen in reporting or table views or pattern discovery or even in charts and graphs. (It can also happen in correlation rule filters - more on that below.) Think of it this way: whatever means you are using to search events, especially historic events (double-especially if there's compression in the mix here), has to be translated into some horrid, fugly SELECT statement, probably with multiple JOINs. Use these in rules, graphs, or regularly scheduled reports and you can drown your database server in work to the point that the stuff you're actively doing is unusably slow. The answer here is a combination of giving lots of CPU to your database servers and writing smart search/filter statements.
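
To make "smart search/filter statements" a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names and index name are all made up for illustration; a real SIM schema will look nothing like this, and other databases plan queries differently. The only point is that a filter bounded by an indexed time range lets the database skip most of the table, while a vague text match forces it to read everything.

```python
import sqlite3

def build_demo_db(n_events=1000):
    """Create an in-memory event table (hypothetical schema) with an index on ts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER, src_ip TEXT, msg TEXT)"
    )
    conn.execute("CREATE INDEX idx_events_ts ON events (ts)")
    rows = [
        (t, "10.0.0.%d" % (t % 5), "login failed" if t % 3 == 0 else "login ok")
        for t in range(n_events)
    ]
    conn.executemany("INSERT INTO events (ts, src_ip, msg) VALUES (?, ?, ?)", rows)
    return conn

def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query; the last column of each row describes a step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Vague: a substring match over all of history, so every row must be examined.
VAGUE = "SELECT * FROM events WHERE msg LIKE '%failed%'"
# Bounded: same match, but restricted to a time range the index can narrow down first.
BOUNDED = "SELECT * FROM events WHERE ts BETWEEN ? AND ? AND msg LIKE '%failed%'"

conn = build_demo_db()
vague_plan = query_plan(conn, VAGUE)
bounded_plan = query_plan(conn, BOUNDED, (900, 999))
print(vague_plan)
print(bounded_plan)
```

Comparing the two printed plans shows whether the time index is being used. The same habit (look at the plan before you schedule the report) applies to whatever database sits under your SIM, assuming it lets you see the generated SQL at all.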

Correlation Rules
Correlation rules are the heart & soul of SIM technology. For more on what they do, check out my old ISSA 'Intro to SIM' preso deck (PDF Link). There are a number of things that can screw you here, and I already mentioned the first one above. Writing filters that are too complex or simple filters that are too vague will come back to haunt you.

Like an IDS, you will need to tune the correlation rules that your SIM ships with. A lot of this will be about eliminating noise and false positives, just like IDS. But also like IDS, some of the tuning will be performance-related. In addition to the filters you write, you will also want to think about things like the number of events to match on, the time frame (how long to wait for event2 after event1 occurs), etc. A cool thing that ArcSight includes is a real-time graph of partial rule matches. In the example below, you can see there are two rules there that need tweaking; fixing them will probably free up measurable memory and CPU cycles.
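
This next bit is not the ArcSight graph, and it is not how any particular SIM implements rules. It is just a small Python sketch of why match count and time frame cost memory: every partial match has to be remembered until it either completes the rule or ages out. The class name, the event fields (`name`, `src`, `ts`) and the numbers are all assumptions for illustration.

```python
from collections import defaultdict, deque

class ThresholdRule:
    """Fire when `threshold` matching events from one source arrive within `window` seconds."""

    def __init__(self, predicate, threshold, window):
        self.predicate = predicate          # the rule's filter
        self.threshold = threshold          # number of events to match on
        self.window = window                # time frame in seconds
        self.pending = defaultdict(deque)   # source -> timestamps of partial matches

    def _expire(self, queue, now):
        # Drop partial matches that are older than the time frame.
        while queue and now - queue[0] > self.window:
            queue.popleft()

    def feed(self, event):
        """Process one event; return True if the rule fires."""
        if not self.predicate(event):
            return False
        queue = self.pending[event["src"]]
        self._expire(queue, event["ts"])
        queue.append(event["ts"])
        if len(queue) >= self.threshold:
            queue.clear()
            return True
        return False

    def partial_matches(self, now):
        """Count partial matches still held in memory as of time `now`."""
        for queue in self.pending.values():
            self._expire(queue, now)
        return sum(len(queue) for queue in self.pending.values())

rule = ThresholdRule(lambda e: e["name"] == "login_failed", threshold=3, window=60)
for ts in (0, 10, 100, 110, 120):
    fired = rule.feed({"name": "login_failed", "src": "10.0.0.5", "ts": ts})
    print(ts, fired, rule.partial_matches(ts))
```

A vague filter or a long time frame means more entries sitting in `pending`, which is roughly what a partial-match graph is showing you.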


One last tip on rules and performance: If your rule creates a new meta-event, make certain that the new event does not match its own correlation filter. Trust me on this one. It's worth the extra time to double-check before turning it on.
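
If your SIM lets you export rule definitions, that double-check can be automated. A rough sketch in Python, assuming a rule can be boiled down to a dictionary with a simple field-equals-value filter and the meta-event it emits (both structures are hypothetical, and real filters are richer than exact matches):

```python
def matches(rule_filter, event):
    """True if every field in the filter equals the same field in the event."""
    return all(event.get(field) == value for field, value in rule_filter.items())

def rule_feeds_itself(rule):
    """True if the meta-event a rule emits would match that rule's own filter."""
    return matches(rule["filter"], rule["emits"])

# Loops: the meta-event keeps the same category the filter is looking for.
bad_rule = {
    "filter": {"category": "auth_failure"},
    "emits": {"name": "Brute force suspected", "category": "auth_failure", "type": "correlated"},
}

# Safe: the filter only accepts base events, and the meta-event is marked as correlated.
safe_rule = {
    "filter": {"category": "auth_failure", "type": "base"},
    "emits": {"name": "Brute force suspected", "category": "auth_failure", "type": "correlated"},
}

print(rule_feeds_itself(bad_rule))
print(rule_feeds_itself(safe_rule))
```

The fix in the second rule is the usual one: give the filter some field that distinguishes base events from correlated ones.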

Log Agents/Parsers/Handlers
The final place where CPU load can grow quickly is your log collection points. Somewhere between the log source and the database is code that your SIM uses to convert the log from its original format to a standard format for insertion into the database. The frequency with which log entries hit this code can have an impact on performance. This is where all of the regex matching, sorting, asset category searching, severity calculation, and so on occurs. For well-defined log formats and sources (like firewalls), this tends not to be that intense a process since there is very little diversity to handle. But for UNIX servers with a variety of services running, there is the potential for serious friction as these parsers try to figure out what the actual log source is and what the message means.
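
For a feel of where the cycles go, here is a toy normalizer in Python. It is only a sketch: the simplified syslog-style line format, the patterns and the category names are my own assumptions, not any vendor's parser. It figures out the program name first and then only tries the patterns registered for that program, which is one common way to avoid running every regex against every line.

```python
import re

# Simplified line format assumed here: "<host> <program>[<pid>]: <message>"
LINE_RE = re.compile(r"^(?P<host>\S+) (?P<prog>[\w\-/]+)(?:\[\d+\])?: (?P<msg>.*)$")

# Per-program patterns (hypothetical); each maps a message shape to a category.
PARSERS = {
    "sshd": [
        (re.compile(r"^Failed password for (?:invalid user )?(?P<user>\S+) from (?P<src>\S+)"),
         "auth_failure"),
        (re.compile(r"^Accepted \S+ for (?P<user>\S+) from (?P<src>\S+)"),
         "auth_success"),
    ],
}

def normalize(line):
    """Convert one raw log line into a normalized dict, or None if it is not recognized at all."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    base = {"host": m.group("host"), "program": m.group("prog")}
    # Only the patterns for this program are tried, not the whole rule set.
    for pattern, category in PARSERS.get(m.group("prog"), []):
        hit = pattern.match(m.group("msg"))
        if hit:
            return {**base, "category": category, **hit.groupdict()}
    return {**base, "category": "unparsed"}

print(normalize("web01 sshd[2211]: Failed password for root from 10.0.0.5 port 22 ssh2"))
print(normalize("web01 crond[301]: (root) CMD (run-parts /etc/cron.hourly)"))
```

A firewall feed would need a handful of patterns at most. A busy UNIX box with dozens of services ends up with a much longer table, and every line that falls through to "unparsed" has paid for every failed match along the way.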

If you have something like a UNIX server farm that generates thousands of events per second and you want to push it through your SIM, you will need to spread this load out or buy big hardware to handle it.
