- Every alert that the system generates would warrant the team taking a look at it for further action. The alert could be something that needs remediation right away, or something that can be addressed a bit later, but every alert is worth the look.
- The system would not generate alerts for behavior that the security team would deem normal.
The most comprehensive log that AWS produces is its infrastructure access log, CloudTrail. So if the system can do noise-free alerting for CloudTrail, that would be a big leap forward.
The Goal: Our goal is to deterministically catch early indications of security threats such as compromised credentials, whether it is an internal user or a third-party integration role that gets compromised. We are not trying to catch users doing actions they are not supposed to do; that is governed by IAM roles. What we are trying to find are indicators of compromise that point, with a high degree of confidence, to something that warrants further investigation.
Indicators in CloudTrail
Let's first look at the indicators available in CloudTrail and their fidelity.
- User role (who is the user?) — The role which made the infrastructure call; the attribute is the ARN of the user. The role could be an assumed role, a normal user, or, god forbid, root.
- User action (what did the user do?) — The indicator here is eventSource:eventName. Whether the user did a particular action before is a good indicator, but the security team should not waste time on that alone, as the user can technically do all actions permitted by the role. We need something more.
- Source IP of the action — The indicator here is sourceIPAddress: whether the user or role is doing actions from an IP address it has never used before. On the surface this looks like a good indicator, but it is not. IP addresses change all the time, especially if you have users working from coffee shops or services such as AWS Lambda making calls, and AWS services change IP addresses constantly. (That an address belongs to AWS can be derived, but attacks from AWS happen all the time as well, so knowing it is an AWS IP address is not deterministic.) We need something more.
The secret ingredient: userAgent.
- User agent — The indicator here is the agent through which the request was made, such as the AWS Management Console, an AWS service, the AWS SDKs, or the AWS CLI. It also indicates the automation system, such as whether the action is done through Terraform or boto3. Sounds like a pretty good input into the system.
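All four indicators above live in a single CloudTrail record. A minimal sketch of pulling them out, with field names per the CloudTrail record format but the values made up for illustration:

```python
import json

# A trimmed CloudTrail record; field names follow the CloudTrail record
# format, but the values here are fabricated for illustration.
record = json.loads("""
{
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetObject",
  "sourceIPAddress": "203.0.113.7",
  "userAgent": "Boto3/1.7.61 Python/2.7.12 Botocore/1.5.65",
  "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/Datadog/i-0abc"}
}
""")

indicators = {
    "arn": record["userIdentity"]["arn"],                         # who
    "action": record["eventSource"] + ":" + record["eventName"],  # what
    "sourceIPAddress": record["sourceIPAddress"],                 # from where
    "userAgent": record["userAgent"],                             # through what
}
print(indicators["action"])  # s3.amazonaws.com:GetObject
```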
Summary of indicators:

| Indicator | CloudTrail field | Fidelity |
| --- | --- | --- |
| User role | userIdentity.arn | Identifies who made the call |
| User action | eventSource:eventName | Useful, but not enough on its own |
| Source IP | sourceIPAddress | Weak: IP addresses change all the time |
| User agent | userAgent | Strong: stable and few per role |
The System and the Algorithm
Over the last few months, we worked with several customers to develop an algorithm that uses the indicators above to get to noise-free alerting. Here are the observations that led to the algorithm:
We will take the example of DataDog, a popular third-party integration role, to illustrate the observations.
Roles use a very limited set of userAgents: A given role uses a very limited number of userAgents, typically 1 or 2, with a high end of 6. So it is very easy to develop baselines based on userAgents. DataDog uses 4 userAgents.
>>> len(ctdf[ctdf.arn.str.contains('Datadog', na=False)].userAgent.unique().tolist())
4

Roles typically do tens of actions (eventNames): Though the role grants them wider scope, a typical role does only a few actions. It is quite easy to develop baselines on user actions, especially if we can categorize them into Read (Get) vs Write (Put, Attach) actions.
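The Read vs Write bucketing can be as simple as looking at the verb prefix of the eventName. A sketch, where the prefix list is an illustrative assumption rather than an AWS-defined taxonomy:

```python
# Rough Read vs Write bucketing of CloudTrail eventNames by verb prefix.
# The prefix list is an illustrative assumption, not an AWS-defined taxonomy.
READ_PREFIXES = ("Get", "List", "Describe", "Head", "Lookup")

def categorize(event_name: str) -> str:
    """Classify a CloudTrail eventName as Read or Write by its verb prefix."""
    return "Read" if event_name.startswith(READ_PREFIXES) else "Write"

for name in ["GetObject", "HeadObject", "PutObject", "AttachRolePolicy"]:
    print(name, categorize(name))
```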
Roles use a large number of dynamic IP address subnets: A given role uses tens if not hundreds of IP address subnets (thousands of IP addresses). If it is a third-party role, it could use non-AWS IP addresses as well.
DataDog uses about 10 actions, mostly Get actions, and 64 IP address subnets.
>>> ctdf[ctdf.arn.str.contains('Datadog')].eventName.unique().tolist()
['GetSendStatistics', 'GetSendQuota', 'GetBucketTagging', 'Decrypt', 'GetTrailStatus', 'HeadObject', 'GetObject']
>>> ctdf[ctdf.arn.str.contains('Datadog')].userAgent.unique().tolist()
['Boto3/1.7.61 Python/2.7.12 Linux/3.13.0-141-generic Botocore/1.5.65+dd.0', '[Boto3/1.7.61 Python/2.7.12 Linux/3.13.0-141-generic Botocore/1.5.65+dd.0]', 'lambda.amazonaws.com', 'AWS Internal']
>>> len(ctdf[ctdf.arn.str.contains('Datadog')].sourceIPAddress.unique().tolist())
64

The Algorithm: If the role does an action (eventName) that it never did before, using a userAgent that it never used before, alert.
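The rule above can be sketched in plain Python, reading it as: alert when both the eventName and the userAgent are new for that role in the baseline. The event-tuple shape is a hypothetical simplification of a CloudTrail record:

```python
from collections import defaultdict

# Sketch of the alerting rule, reading it as: alert when BOTH the eventName
# and the userAgent are new for the role. Events are hypothetical
# (arn, eventName, userAgent) tuples; field names follow CloudTrail.
def build_baseline(events):
    actions, agents = defaultdict(set), defaultdict(set)
    for arn, event_name, user_agent in events:
        actions[arn].add(event_name)
        agents[arn].add(user_agent)
    return actions, agents

def should_alert(baseline, event):
    actions, agents = baseline
    arn, event_name, user_agent = event
    return event_name not in actions[arn] and user_agent not in agents[arn]

baseline = build_baseline([
    ("arn:aws:iam::1:role/Datadog", "GetObject", "Boto3/1.7.61"),
])
# New action through a known agent: noisy on its own, so no alert.
print(should_alert(baseline, ("arn:aws:iam::1:role/Datadog", "PutObject", "Boto3/1.7.61")))   # False
# New action through a never-seen agent: alert.
print(should_alert(baseline, ("arn:aws:iam::1:role/Datadog", "PutObject", "terraform/0.12")))  # True
```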
Now comes the question of what "never" means. How many days of data do we need to aggregate to net a good number of roles, their actions, and their userAgents?
In our observations working with our wonderful customers, the minimum number of days is 10 and the maximum is 30, and we set 15 days of aggregation as the default in our product. Our customers are starting to use this system to examine indicators of not just security issues but operational issues as well; "whether the role did something it does not normally do" is a powerful early indicator of operational issues that are about to become fires.
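The trailing aggregation window can be sketched as follows, using the 15-day default mentioned above. The event-tuple layout is hypothetical, with eventTime as the first element:

```python
from datetime import datetime, timedelta

# Sketch of the trailing aggregation window (15 days by default, per the
# post): keep only events whose eventTime falls within the last N days of
# the data. Events are hypothetical (eventTime, arn, eventName, userAgent)
# tuples for illustration.
def baseline_window(events, days=15):
    latest = max(e[0] for e in events)
    cutoff = latest - timedelta(days=days)
    return [e for e in events if e[0] >= cutoff]

# One synthetic event per day across January: only the trailing 15 days
# (plus the latest day itself) make it into the baseline.
events = [(datetime(2024, 1, d), "role/Datadog", "GetObject", "Boto3")
          for d in range(1, 32)]
print(len(baseline_window(events)))  # 16
```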
Cloud security with the speed of serverless and the convenience of Slack: