Friday, September 23, 2011

Using Hadoop-MapReduce for faster log-correlation to detect APT.

The only well-known method of detecting an APT is log analysis and correlation. “You need to put into place the processes and possibly the technology necessary to cultivate the security logs and pinpoint the information needed to keep the infrastructure secure. These efforts absolutely require some type of log management and correlation.” Correlation can be rule-based, statistical, or algorithmic, or any other method that relates different event logs to one another in order to detect an anomaly in the information system. As Alex Stamos of iSEC Partners points out in his paper Aurora Response Recommendations [2], the companies with the most effective response to these attacks have used their central log aggregation mechanisms to track the actions of the attackers.
These logs can live in many locations on a system, and the log files themselves are often very large. At the enterprise level, across a network of systems, manual log correlation and analysis is not practical; the amount of work involved is simply too great.
The solution I propose is to use parallel programming, specifically the Hadoop-MapReduce framework. Hadoop-MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes [21]. In MapReduce the data-processing primitives are called mappers and reducers. The framework can scale an application to run over many (thousands of) machines in a cluster through a configuration change alone. It is a good fit for log correlation and analysis because logs lend themselves naturally to parallel processing. Hunting for a specific event, or a set of events, in thousands of log files (each sometimes holding terabytes of data) spread across many servers is either infeasible or far too slow with a single-node, non-parallel approach.
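To make the model concrete, here is a minimal sketch of a job driver for this kind of log-correlation job. The class names (LogCorrelationDriver, LogEntryMapper, RelevanceReducer), the command-line paths, and the output types are my own illustrative assumptions; the mapper and reducer themselves are sketched in the implementation steps below.

// Minimal sketch of a Hadoop MapReduce job driver (Hadoop 0.20/1.x "new" API).
// LogCorrelationDriver, LogEntryMapper and RelevanceReducer are hypothetical names
// used throughout this post; input/output paths come from the command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCorrelationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job-level parameters (e.g. the time window around a crash) can be passed
        // to all mappers and reducers through this Configuration object.
        Job job = new Job(conf, "apt-log-correlation");

        job.setJarByClass(LogCorrelationDriver.class);
        job.setMapperClass(LogEntryMapper.class);      // groups entries by timestamp (Step 2)
        job.setReducerClass(RelevanceReducer.class);   // scores entries by keyword relevance

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // directory of collected logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory for scored entries

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}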


In a non-Hadoop distributed approach, if all the logs are stored on one central storage server, the bottleneck is that server's bandwidth. Adding more processing machines only helps up to the point where the storage server can no longer keep up. To remove this bottleneck you would have to split the logs among the processing machines so that each machine processes only the logs stored locally. In data-intensive distributed applications, storage and processing have to be tightly coupled. Hadoop differs from traditional distributed data-processing schemes in exactly this way: it focuses on moving code to the data instead of the other way around [23 - Chapter 1]. To see how Hadoop fits into the design of distributed systems, I suggest reading [23] and [22].

Implementation of Hadoop-MapReduce for Log Correlation

Step 1: The starting point for the majority of APT attacks is an application crash, so the first step is to look for any such crashes. On a standalone Windows system, application errors and crashes are recorded in the Application event log. This process can be automated: we can write a Java program that parses just the Application log and detects such entries. Once a crash or exception in an application is found, we record its date and time.
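As a rough illustration of how this step might be automated, the sketch below scans an Application event log that has been exported to CSV (for example via Event Viewer or wevtutil); the column layout and the crash keywords are assumptions and would need to be adapted to the actual export format.

// Sketch of Step 1: scan an exported Application event log for crash entries and
// record their date and time. The CSV column order and the keywords are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CrashFinder {
    public static void main(String[] args) throws IOException {
        List<String> crashTimes = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            // Assumed layout: Level,Date,Time,Source,EventID,Message
            String[] fields = line.split(",", 6);
            if (fields.length < 6) {
                continue;
            }
            String level = fields[0];
            String message = fields[5];
            // Flag application faults/crashes; extend this keyword list as needed.
            if (level.equalsIgnoreCase("Error")
                    && (message.contains("Faulting application")
                        || message.contains("Application Error")
                        || message.contains("crashed"))) {
                crashTimes.add(fields[1] + " " + fields[2]);  // date and time of the crash
            }
        }
        reader.close();
        for (String ts : crashTimes) {
            System.out.println("Crash recorded at: " + ts);
        }
    }
}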

Step 2: The next step is to bring all the related log entries together. Because the map tasks run on different nodes, it is important to gather the entries from the different log files that fall in the same date and time range. This is the job of the MAP function; its input and output are sketched below.
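The sketch assumes each input record is a single log line beginning with a parseable date/time, that the crash window from Step 1 is passed in through the job configuration, and that a TimeUtil.toTimestamp() helper (sketched after the next paragraph) normalizes the different date formats; all of these names are illustrative assumptions.

// Sketch of the MAP step: emit (normalized timestamp, raw log entry) pairs and drop
// entries outside the time window around the crash found in Step 1.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogEntryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long windowStart;  // crash time minus a margin, in epoch milliseconds
    private long windowEnd;    // crash time plus a margin, in epoch milliseconds

    @Override
    protected void setup(Context context) {
        // The window boundaries are assumed to be set in the job configuration by the driver.
        windowStart = context.getConfiguration().getLong("crash.window.start", 0L);
        windowEnd = context.getConfiguration().getLong("crash.window.end", Long.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String entry = line.toString();
        long ts = TimeUtil.toTimestamp(entry);   // hypothetical utility, sketched further below
        if (ts < windowStart || ts > windowEnd) {
            return;  // filter: outside the time range (or unparseable, ts == -1)
        }
        // Zero-padded key so the framework's string sort matches chronological order;
        // the value is the log entry itself, passed through unchanged.
        context.write(new Text(String.format("%020d", ts)), line);
    }
}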


Our MAP function does not do much computation on the log data; its main purpose is to bring together the entries from the various log files that fall in the required time range, so the input can be passed straight through as the output. We can also do some clean-up here: log entries we consider unnecessary can be dropped, as can entries outside our time range. We can also use a utility function that takes dates and times in different formats and converts them to a single timestamp, which makes them easy to compare and sort. The output of this MAP function, and eventually of all MAP functions across all nodes, goes to the MapReduce framework before reaching the REDUCE function. The framework sorts and groups the different pairs by key. One important aspect to note is that the sort is ascending, i.e. the latest event ends up at the bottom of the list. Values with the same key (timestamp) are grouped together. This input then goes into the REDUCE function.
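The date/time normalization utility mentioned above could look roughly like the sketch below; the list of patterns is an assumption and would have to be extended to cover the formats that actually appear in the log sources.

// Sketch of a utility that converts log-entry timestamps in several known formats
// into epoch milliseconds so they can be compared and sorted uniformly.
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class TimeUtil {
    // Assumed set of timestamp formats; add whatever your log sources actually use.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd HH:mm:ss",
        "MM/dd/yyyy hh:mm:ss a",
        "dd/MMM/yyyy:HH:mm:ss"       // e.g. web-server access-log style
    };

    /** Returns the entry's timestamp in epoch milliseconds, or -1 if no pattern matches. */
    public static long toTimestamp(String logEntry) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern);
            fmt.setLenient(false);
            try {
                // parse() reads a matching prefix of the entry and ignores the rest of the line.
                return fmt.parse(logEntry).getTime();
            } catch (ParseException e) {
                // fall through and try the next pattern
            }
        }
        return -1;
    }
}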



There are two places where we could drop unwanted log data. One is in the MAP function, where we can filter out unwanted entries before emitting values; the other is the REDUCE function, where we can remove unwanted entries once the data is sorted. If we wait until the REDUCE function to drop unwanted data, resources are wasted because the framework has already processed it. At this point we have the relevant log data in the desired format, sorted appropriately.

The input to the REDUCE function is the set of all potential log entries that need to be searched for any kind of intrusion attempt. Here we implement our behaviour-based ranking system and assign points to the various log entries. The logic is: if a log entry contains any keyword from the keyword database, then, based on the relevance of that entry, we assign it a value on a scale of 0 to 10. The value reflects how significant the entry is with respect to the application that crashed. For example, a log entry saying a file was downloaded into the system32 folder during that time window would have a value of 8, whereas an entry saying a completely unrelated application made some registry changes would have a value of 0.

A couple of factors favour this approach. First, as we parse the log entries we discover new keywords that can be added to the keyword database, which increases the relevance of log entries higher up in the list; the keyword database would initially be empty, or it could contain the name or process id of the application that crashed. Second, the log entries are sorted in ascending order. The relevance values of the different entries can then be summed, and the total is the output of the program.
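A rough sketch of such a REDUCE function is below, assuming a small in-memory keyword-to-score table; the keywords, the individual scores, and the counter used to accumulate the total are all illustrative assumptions.

// Sketch of the REDUCE step: score each log entry against the keyword set on a 0-10
// scale and accumulate a running relevance total for the whole job.
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RelevanceReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final Map<String, Integer> keywordScores = new LinkedHashMap<String, Integer>();

    @Override
    protected void setup(Context context) {
        // Seed the keyword set; in practice it starts with the name or process id of the
        // application that crashed and grows as new indicators are discovered.
        keywordScores.put("system32", 8);
        keywordScores.put("new service", 7);
        keywordScores.put("registry", 2);
    }

    @Override
    protected void reduce(Text timestamp, Iterable<Text> entries, Context context)
            throws IOException, InterruptedException {
        for (Text entry : entries) {
            String text = entry.toString().toLowerCase();
            int score = 0;
            for (Map.Entry<String, Integer> kw : keywordScores.entrySet()) {
                if (text.contains(kw.getKey())) {
                    score = Math.max(score, kw.getValue());  // keep the most significant match
                }
            }
            if (score > 0) {
                context.write(entry, new IntWritable(score));
                // Running total across all scored entries; compare the final value against
                // a threshold to decide whether an intrusion likely took place.
                context.getCounter("apt-detection", "relevance.total").increment(score);
            }
        }
    }
}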


Tests could be conducted to find an optimal relevance value that signifies that an intrusion has taken place in the system.

There are quite a few problems with this approach. More on the problems and solutions in the next post.

[21] Apache Hadoop MapReduce. http://hadoop.apache.org/mapreduce/
[22] Chuck Lam. Hadoop in Action.
[23] Tom White. Hadoop: The Definitive Guide.