
Creating a Data Driven Detection Lifecycle: Solving the SOC



Every day, your SOC is inundated with a deluge of data, but is it leading to actionable insights? Data volumes grow every day, and in the relentless battle against cyber threats, data is the unsung hero that can make the difference between a breach and a successful defense. Data can drive product design decisions, feed large-scale models, support educated predictions, and point to areas that need improvement. When building out a SOC (Security Operations Center), it’s important to understand that implementing data-driven processes is the most efficient way to build an effective team. As a refresher, the SOC is the team responsible for alert triage and the incident response process within a security organization, and it is generally supported by security engineers who build out the detections behind those alerts and the incident response infrastructure.


A SOC team is built on top of the massive amounts of data being ingested daily, generally via a SIEM (Security Information and Event Management) tool. The largest cost behind SIEM licensing is almost always data ingestion, so it’s vital to make sure you’re getting the most out of all the data being collected.


So, how do you do this within your Detection Lifecycle? A Detection Lifecycle is the process a security team follows to review and improve the quality of its existing detections. It’s also an opportunity to look for coverage gaps where new detections should be created. This is meant to be an iterative process, generally taking place quarterly, or once enough data has been collected since the previous review to draw meaningful conclusions.


Let’s walk through how this looks for a detection from end to end. A detection typically comes to fruition after a gap in monitoring has been identified, after online research or documentation surfaces a new technique, or when a new log source begins ingestion into the SIEM.


For our example, let’s consider the following scenario: a new data source for a remote support tool has been hooked up to our SIEM, and as a member of the Detection Engineering team, you are now tasked with creating a suite of new detections for its audit logs. You start by diving into the documentation to gain an understanding of how the software works, the use cases it solves, and the risk it carries. After digging through the available online resources, you set up a discussion with the IT department, which administers the software, to confirm how it is used. The primary use case is troubleshooting remote employees’ machines by sharing their screens and giving IT the ability to control the computer. You assess the main risk of this software and determine it is unauthorized screen sharing with employees. So, you create a detection that alerts whenever a screen share session begins. After implementation, you quickly realize that due to the size of the company, these screen share sessions happen approximately 10 times per day, quickly contributing to heavy alert fatigue.
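
As a rough illustration, a first pass at this detection might look like the sketch below. The event shape (the event_type, technician, target_user, and source_ip fields) is a stand-in for whatever the remote support tool's audit log actually emits, not a real vendor schema.

```python
# Hypothetical first-pass detection: alert on every screen share session start.
# The audit event fields used here (event_type, technician, target_user, source_ip)
# are placeholders for whatever the remote support tool actually logs.

def evaluate_event(event: dict) -> dict | None:
    """Return an alert payload if the event matches, otherwise None."""
    if event.get("event_type") == "screen_share_started":
        return {
            "detection": "Remote Support - Screen Share Started",
            "severity": "medium",
            "technician": event.get("technician"),
            "target_user": event.get("target_user"),
            "source_ip": event.get("source_ip"),
        }
    return None


if __name__ == "__main__":
    sample = {
        "event_type": "screen_share_started",
        "technician": "it.helpdesk@example.com",
        "target_user": "jdoe@example.com",
        "source_ip": "203.0.113.10",
    }
    print(evaluate_event(sample))
```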


So, how could we use a data driven detection lifecycle to help solve this?


Filter for Most Frequent

The first step in the detection lifecycle is to filter through your alerts to find the detections that fire most frequently. These detections are generally the “low-hanging fruit” - detections that clearly need some tweaking and whose counts can be knocked down with a simple fix. Since our screen share detection is firing ~10 times a day, it’s an easy candidate for review.
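
If your SIEM or ticketing system can export alerts (even as a CSV or JSON dump), surfacing the noisiest detections is a quick aggregation. The sketch below assumes a list of alert dicts with a detection_name field, which is an assumption about your own export format.

```python
from collections import Counter

# Hypothetical export of last quarter's alerts; in practice this would come from
# your SIEM or ticketing system's API or a CSV/JSON export.
alerts = [
    {"detection_name": "Remote Support - Screen Share Started"},
    {"detection_name": "Remote Support - Screen Share Started"},
    {"detection_name": "Impossible Travel"},
    {"detection_name": "Remote Support - Screen Share Started"},
]

# Count alerts per detection and surface the noisiest ones first.
frequency = Counter(alert["detection_name"] for alert in alerts)
for detection, count in frequency.most_common(10):
    print(f"{count:>5}  {detection}")
```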


Classification

When closing out alerts, it’s a good idea to create a field for the classification of why the alert is being closed. This makes it easier to see, from a high level, what is going on with an alert. Here are some straightforward ways to classify your alerts:

  1. True Positive - This was a valid alert that caught potentially malicious activity.

  2. False Positive - This alert caught activity that was deemed not malicious and is normal activity.

  3. Confirmed Activity - The activity was confirmed out-of-band by the employee the alert was about and deemed not malicious.

  4. Expected Activity - The activity has been previously deemed normal and is expected to happen by the team.

  5. Security Testing - The activity is for a security test to gauge the effectiveness of an alert, or to ensure that integrations are working as expected.

These classifications will serve as a baseline for classifying alerts and can point to areas that could use improvement.
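
A closed set of classifications is easy to enforce in whatever SOAR or ticketing workflow you use, and easy to report on afterwards. As a sketch, here is one way the field could be modeled and the classification spread pulled for a single detection; the ticket structure here is assumed, not tied to any particular product.

```python
from collections import Counter
from enum import Enum


class Classification(str, Enum):
    TRUE_POSITIVE = "True Positive"
    FALSE_POSITIVE = "False Positive"
    CONFIRMED_ACTIVITY = "Confirmed Activity"
    EXPECTED_ACTIVITY = "Expected Activity"
    SECURITY_TESTING = "Security Testing"


# Hypothetical closed tickets exported from your ticketing system.
tickets = [
    {"detection": "Remote Support - Screen Share Started", "classification": Classification.CONFIRMED_ACTIVITY},
    {"detection": "Remote Support - Screen Share Started", "classification": Classification.FALSE_POSITIVE},
    {"detection": "Remote Support - Screen Share Started", "classification": Classification.EXPECTED_ACTIVITY},
]

# Classification spread for the one detection under review.
spread = Counter(
    t["classification"].value
    for t in tickets
    if t["detection"] == "Remote Support - Screen Share Started"
)
print(spread)
```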


Now that we’ve identified this detection as low-hanging fruit due to its alert frequency, we can check the classification spread for the alert. After querying the data, it’s apparent that all of the classifications for this alert are False Positives, Confirmed Activity, or Expected Activity. With so many employees in the company, as lead detection engineer you decide it’s time to scope the detection to only the employees with access to sensitive data. After collecting data, you’ve identified that the main risk of the application is a vishing attempt in which a malicious actor poses as an IT engineer and gains access to an employee's laptop via screen share. So, you create a watchlist for this detection so that it only fires an alert whenever a screen share session begins with a C-Suite member (CEO, CFO, CTO, etc.), a Director, a VP, a member of Finance, or a Senior Manager, due to their high levels of access.
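
In query terms this is just one extra condition on the detection. Expressed in Python, a sketch might look like the following; the role lookup (get_role and ROLE_DIRECTORY) is a placeholder for however your environment maps users to roles (IdP, HRIS, etc.), and the role names simply mirror the list above.

```python
# Roles considered high-risk enough to keep alerting on (the VIP watchlist).
VIP_ROLES = {"CEO", "CFO", "CTO", "Director", "VP", "Finance", "Senior Manager"}

# Placeholder directory: in reality this would query your IdP or HR system.
ROLE_DIRECTORY = {
    "jdoe@example.com": "Finance",
    "asmith@example.com": "Engineer",
}


def get_role(user: str) -> str | None:
    return ROLE_DIRECTORY.get(user)


def should_alert(event: dict) -> bool:
    """Only fire the screen share detection for users on the VIP watchlist."""
    return (
        event.get("event_type") == "screen_share_started"
        and get_role(event.get("target_user", "")) in VIP_ROLES
    )


print(should_alert({"event_type": "screen_share_started", "target_user": "jdoe@example.com"}))    # True
print(should_alert({"event_type": "screen_share_started", "target_user": "asmith@example.com"}))  # False
```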


After this fix has been in place for a fair amount of time, you notice the alert frequency has decreased, and the data is there to back it up. But as mentioned previously, the detection lifecycle is meant to be an iterative process. As the new quarter rolls around, all detections are up for review again, and you come across the VIP Screen Share detection you created. This time, though, it is no longer considered low-hanging fruit. What other analytics should you look at to determine how to improve the detection?


Investigation Time

Most ticketing systems will collect analytics on how long a ticket is open (how long it takes to transition from In Progress to Closed). However, this doesn’t always tell the full story. A more telling metric can be gathered by asking the person triaging the ticket to record an Investigation Time: the estimated amount of time spent actively triaging the ticket. Why is this more useful? A ticket could sit open for an hour while the analyst waits for a response from a person of interest, even though only 2-3 minutes were spent actually investigating the alert. In that case, investigation time says far more about which detections could use improvement. If alerts consistently take a long time to triage, it may be time to rethink the detection logic or the automation surrounding it. Here’s how I’d recommend splitting up your Investigation Time field:

  • < 1 min: The Detection Logic and Automation on this alert were good enough to close it incredibly quickly.

  • 1-5 mins: The Detection Logic and Automation on this alert were good, but some additional digging was required. Some improvements could be made.

  • 5-10 mins: There was some digging required on this alert, likely through multiple different systems. There is likely some room for improvement.

  • 10-30 mins: There was some suspicious activity on this alert that required some additional digging and may or may not have led to an escalation to an incident.

  • 30-60 mins: There was something very suspicious that occurred and likely involved getting other security team members involved to investigate fully.

  • >1 hour: This was likely escalated to an incident after involving other security team members.

Generally, detections whose alerts fall primarily in the 1-5 minute or 5-10 minute buckets still have room for improvement, and this metric can be chained with the next field I’d recommend implementing.
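
If your ticketing tool captures investigation time as a raw number of minutes, bucketing it for reporting is straightforward. The sketch below assumes each ticket carries an investigation_minutes field, which is an assumption about your own field naming; the bucket boundaries mirror the recommendations above.

```python
from collections import Counter


def bucket(minutes: float) -> str:
    """Map a raw investigation time in minutes to one of the recommended buckets."""
    if minutes < 1:
        return "< 1 min"
    if minutes < 5:
        return "1-5 mins"
    if minutes < 10:
        return "5-10 mins"
    if minutes < 30:
        return "10-30 mins"
    if minutes < 60:
        return "30-60 mins"
    return "> 1 hour"


# Hypothetical closed tickets with an analyst-reported investigation time.
tickets = [
    {"detection": "VIP Screen Share", "investigation_minutes": 4},
    {"detection": "VIP Screen Share", "investigation_minutes": 8},
    {"detection": "VIP Screen Share", "investigation_minutes": 2},
]

spread = Counter(bucket(t["investigation_minutes"]) for t in tickets)
print(spread)
```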


Resources Used

This field documents exactly what the analyst used to triage the alert, pointing to areas for improvement in either the automation or the detection logic. For example, does triaging the alert rely heavily on your SIEM? That likely points to missing context around the alert, meaning there could be an opportunity to add queries to the automation logic. Are user roles being referenced often? That could point to an opportunity to refine an existing exception or add a new one. The resources used vary from environment to environment - make sure to include all of your possible investigation resources in the list and make inferences based on the detection under review.
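
A simple tally of this field per detection makes the pattern jump out. The sketch below assumes each ticket stores resources_used as a list of strings chosen from your own environment's resource catalog.

```python
from collections import Counter

# Hypothetical closed tickets; resources_used is a multi-select field on the ticket.
tickets = [
    {"detection": "VIP Screen Share", "resources_used": ["SIEM", "Slack"]},
    {"detection": "VIP Screen Share", "resources_used": ["SIEM", "Slack"]},
    {"detection": "VIP Screen Share", "resources_used": ["Slack"]},
]

# How often each resource shows up in triage for the detection under review.
usage = Counter(
    resource
    for t in tickets
    if t["detection"] == "VIP Screen Share"
    for resource in t["resources_used"]
)

total = sum(1 for t in tickets if t["detection"] == "VIP Screen Share")
for resource, count in usage.most_common():
    print(f"{resource}: used in {count}/{total} tickets")
```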


Finalizing the Detection Lifecycle Iteration

Moving through our list of detections, we are back to our VIP Screen Share detection. Looking through the data from the last quarter, it becomes evident that the alerts are split pretty evenly between the 1-5 minute and 5-10 minute buckets. A glance at the classifications shows almost all are now Confirmed Activity, with some Expected Activity sprinkled in. A quick look at Resources Used shows only two - SIEM and Slack. From this information, we can make a quick inference that the detection logic is fine - there are no False Positives, so the detection is doing what it should.


However, there is room for improvement on the automation side since the SIEM is still being used so heavily. The first improvement could be to add a query that looks up the session initiator's most common IP addresses - this would allow the analyst triaging the alert to cross-reference the IP from the alert and make sure it matches the initiator's usual IPs. Any discrepancy could indicate malicious activity.
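
As a sketch of that enrichment step, the idea is simply to pull the initiator's recent session IPs from the SIEM and compare them to the IP on the alert. The query_siem function below is a stand-in for your SOAR platform's SIEM query action; the logic, not the API, is the point.

```python
from collections import Counter


def query_siem(initiator: str) -> list[str]:
    """Placeholder for a SOAR/SIEM query returning the initiator's recent session source IPs."""
    return ["198.51.100.7", "198.51.100.7", "198.51.100.7", "203.0.113.42"]


def enrich_alert(alert: dict, history_top_n: int = 3) -> dict:
    """Attach the initiator's most common IPs and flag any mismatch with the alert IP."""
    history = Counter(query_siem(alert["technician"]))
    common_ips = [ip for ip, _ in history.most_common(history_top_n)]
    alert["initiator_common_ips"] = common_ips
    alert["ip_matches_history"] = alert["source_ip"] in common_ips
    return alert


print(enrich_alert({"technician": "it.helpdesk@example.com", "source_ip": "192.0.2.99"}))
```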


Next, an improvement could be made to the automation workflow to decrease ticket triage time significantly. Since a majority of the alerts are Confirmed Activity through Slack, what if we could automate the confirmation process? This is a perfect use case for a SlackBot. So, you implement a new addition to the workflow that reaches out to the initiator via Slack, giving them a 10 minute window to verify the initiated session and requiring them to verify with MFA.


Now, any ticket that comes in will first ask the initiator for verification, which, if given, will automatically close the ticket and assign it a Confirmed Activity classification. On top of that, you now know that any VIP Screen Share alert that does reach the security team will either be annotated to say the 10 minute verification window expired, or to say the session was not verified (SOUND THE ALARM!). And in either case, you'll have the added context attached to the alert as well.
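
The glue logic in the automation platform ends up being a small decision tree. Below is a sketch of that decision step only; send_verification_prompt, wait_for_response, and close_ticket are hypothetical stand-ins for your SOAR's Slack and ticketing actions, and the MFA challenge is assumed to happen inside the Slack flow itself.

```python
VERIFICATION_WINDOW_SECONDS = 10 * 60  # the 10 minute window described above


def send_verification_prompt(user: str, session_id: str) -> None:
    """Placeholder for the SlackBot message + MFA challenge sent to the session initiator."""


def wait_for_response(session_id: str, timeout: int) -> str | None:
    """Placeholder: returns 'verified', 'denied', or None if the window expires."""
    return "verified"


def close_ticket(ticket_id: str, classification: str, note: str) -> None:
    """Placeholder for your ticketing system's close action."""
    print(f"Closed {ticket_id} as {classification}: {note}")


def handle_screen_share_ticket(ticket: dict) -> None:
    send_verification_prompt(ticket["initiator"], ticket["session_id"])
    response = wait_for_response(ticket["session_id"], VERIFICATION_WINDOW_SECONDS)

    if response == "verified":
        close_ticket(ticket["id"], "Confirmed Activity", "Initiator verified the session via Slack + MFA.")
    elif response is None:
        ticket["note"] = "10 minute verification window expired; manual triage required."
    else:
        ticket["note"] = "Session explicitly not verified by initiator; escalate immediately."


handle_screen_share_ticket({"id": "SOC-1234", "initiator": "it.helpdesk@example.com", "session_id": "abc123"})
```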


As you can see, creating metrics around your alerts is enormously beneficial to the detection lifecycle that every Detection Engineering team and SOC will go through. It takes the guesswork and anecdotal gut feelings out of the process and lets you look at the data to figure out what could be better. On top of that, it will not only help in the continuous fight against alert fatigue, but also help you make the most of your money (and data).


If you’re enjoying the Cybersec Cafe “Solving the SOC” series, consider subscribing as there’s so much more to solve within the SOC, and to cover within the Information Security space!



