Objective

  • The goal of this project is to develop an intelligent AI/ML-based system to help reduce alert fatigue in Security Information and Event Management (SIEM) systems.
  • Modern SIEM tools generate thousands of alerts daily, many of which are: Repetitive, False positives, Low-risk or irrelevant (noise)
  • This overwhelming number of alerts can distract analysts from focusing on real threats.
  • Suppress or filter out low-priority alerts
  • The AI model will help identify which alerts are important and which can be ignored or deprioritized.
  • The overall result is improved efficiency in threat detection and reduced noise for security teams.

Tools & Libraries Used:

None

Prerequisites:

None

AI/ML Inclusion Workflow:

None

Use Case Scenario:

A SOC (Security Operations Center) analyst faces alert overload from the SIEM system. Most alerts are low-risk or duplicates from repetitive sources. This model helps reduce noise by automatically tagging alerts as "Ignore", "Investigate", or "Critical" based on trained behavior patterns.

Actors:

  • SIEM System: Generates raw alerts from various sources.
  • SOC Analyst: Reviews and triages alerts.
  • AI/ML Model: Learns from alert history to assist in filtering.

Folder Structure:

siem_noise_reduction_ai_ml/

├── train_model.py # Model training script

├── analyze_alert.py # Prediction script

├── requirements.txt # List of dependencies

├── dataset/

│ └── siem_alert_logs.csv # Synthetic training dataset

├── models/

│ └── alert_classifier.pkl # Trained ML model

│ └── confusion_matrix.png # Model evaluation image

Step 1: Downloading & Using the Dataset

To train the model, we used a publicly available dataset from Kaggle titled cybersecurity_attacks, which contains approximately 40,000 SIEM-style alert records with 25 structured fields commonly seen in enterprise environments.

From this dataset, we extracted the most relevant fields and applied labeling logic to classify each alert as either important or noise. The processed dataset closely resembles real-world data used in SOC operations.

Label Distribution:

  • 60% of alerts were labeled as noise (typically low-risk or repetitive events such as port scans)
None
  • 40% of alerts were labeled as important (critical incidents like ransomware, data exfiltration, malware callbacks)
None

Features Used for Model Training:

All fields used in training were categorical in nature and converted to numeric format using label encoding to ensure compatibility with the machine learning algorithm.

Labeling Strategy:

Since the dataset does not include predefined "noise" or "important" labels, we applied a deterministic logic for annotation:

if Severity == 'Low' and Alert_Type in ['Port Scan', 'Login Attempt', 'Ping Sweep']:

Label = 'noise'

else:

Label = 'important'

This helped simulate real analyst triage behavior and distinguish between high-value alerts and repetitive noise.

Practical Use Case:

This dataset represents a typical export format from tools like Splunk, QRadar, or Elastic SIEM, where alert logs can be extracted as CSV files. The same classification logic can be applied to real-world SIEM logs to assist in reducing alert fatigue and prioritizing high-risk incidents.

Step 2: Create Training Code (train_model.py)

This script is responsible for training the machine learning model using the prepared dataset. It performs the following operations:

  • Loads and preprocesses the synthetic SIEM alert dataset
  • Encodes categorical fields into numeric form using LabelEncoder
  • Splits the dataset into training and testing subsets
  • Trains a Random Forest classifier to differentiate between important and noise alerts
  • Evaluates the model using classification metrics
  • Saves the trained model and confusion matrix visualization for future use

This script must be executed once to prepare the model before performing alert classification or running predictions.

SCRIPT : train_model.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
# Load dataset
df = pd.read_csv("dataset/cybersecurity_attacks_clean.csv")
# Rename columns to simplify references
df = df.rename(columns={
"Attack Type": "Attack_Type",
"Severity Level": "Severity",
"IDS/IPS Alerts": "IDS_Alerts",
"Protocol": "Protocol"
})
# Keep required fields only and drop missing rows
df = df[["Attack_Type", "Severity", "Protocol", "IDS_Alerts"]].dropna()
# Updated labeling logic to ensure better class balance
def label_alert(row):
low_keywords = ["login attempt", "port scan", "ping sweep"]
suspicious_keywords = ["suspicious", "unauthorized", "scan", "probe"]
severity = str(row["Severity"]).lower()
attack = str(row["Attack_Type"]).lower()
# Noise if low severity or known harmless patterns
if severity == "low":
return "noise"
if any(kw in attack for kw in low_keywords):
return "noise"
if any(kw in attack for kw in suspicious_keywords) and severity in ["low", "medium"]:
return "noise"
# Otherwise, treat as important
return "important"
# Apply labeling
df["Label"] = df.apply(label_alert, axis=1)
# Show label distribution
print("Label Distribution:\n", df["Label"].value_counts())
# Encode features
features = ["Attack_Type", "Severity", "Protocol", "IDS_Alerts"]
encoders = {}
for col in features:
encoders[col] = LabelEncoder()
df[col] = encoders[col].fit_transform(df[col])
# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Label"]) # 0 = noise, 1 = important
# Save label encoder too (for decoding later)
encoders["Label"] = label_encoder
# Prepare input features
X = df[features]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Evaluate
y_pred = clf.predict(X_test)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
os.makedirs("models", exist_ok=True)
plt.savefig("models/confusion_matrix.png")
plt.close()
# Save model and encoders
joblib.dump(clf, "models/alert_classifier.pkl")
for col in encoders:
joblib.dump(encoders[col], f"models/encoder_{col}.pkl")
print("\nModel and encoders saved successfully.")

Output:

None

The screenshot above shows the console output after running the train_model.py script. This script is responsible for training the machine learning model to classify SIEM alerts as either important or noise.

Key Highlights from the Output:

Label Distribution:

  • Important Alerts: 13,369 entries
  • Noise Alerts: 6,581 entries
  • This shows that the model was trained on a realistic mix of critical vs routine alerts.

Classification Report:

  • Precision, recall, and F1-score are all 1.00 for both classes.
  • This indicates perfect classification on the test set during evaluation.
  • Accuracy: 100% on 3990 validation records.

Model Saved: The trained model and encoders were saved successfully in the /models folder for future use during real-time inference.

This confusion matrix was generated by train_model.py and saved as visuals/confusion_matrix.png. It visually confirms how well the machine learning model performed in classifying SIEM alerts.

None

Interpretation of the Matrix:

  • True Negatives (2674) These are normal (noise) alerts that were correctly identified as noise.
  • True Positives (1316) These are important alerts that were correctly detected by the model.
  • False Positives (0) No noise alerts were incorrectly flagged as important.
  • False Negatives (0) No important alerts were missed.

Step 3: Create Prediction Code (analyze_alert.py):

This script is used to perform alert classification using the trained machine learning model. It accepts a list of simulated alert entries and returns predictions indicating whether each alert is likely to be important or noise.

Specifically, the script performs the following tasks:

  • Loads the trained model file (alert_classifier.pkl)
  • Loads the original dataset to recreate label encoders for consistency
  • Encodes each input alert's categorical fields using the same encoding scheme used during training
  • Uses the trained model to classify each alert as either "Important" or "Noise"
  • Outputs a readable summary of the classification results

This module is useful for validating model behavior on new or simulated alert entries and can be extended to support real-time or batch processing pipelines in a production SIEM environment.

SCRIPT : analyze_alert.py

import joblib
import pandas as pd
import os
# Sample input alerts for analysis
samples = [
{
"Attack_Type": "Ransomware",
"Severity": "High",
"Protocol": "TCP",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "Port Scan",
"Severity": "Low",
"Protocol": "UDP",
"IDS_Alerts": "No"
},
{
"Attack_Type": "SQL Injection",
"Severity": "Medium",
"Protocol": "HTTP",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "Phishing",
"Severity": "High",
"Protocol": "SMTP",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "Brute Force",
"Severity": "Medium",
"Protocol": "TCP",
"IDS_Alerts": "No"
},
{
"Attack_Type": "Data Exfiltration",
"Severity": "High",
"Protocol": "FTP",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "Login Attempt",
"Severity": "Low",
"Protocol": "TCP",
"IDS_Alerts": "No"
},
{
"Attack_Type": "Ping Sweep",
"Severity": "Low",
"Protocol": "ICMP",
"IDS_Alerts": "No"
},
{
"Attack_Type": "Malware Callback",
"Severity": "High",
"Protocol": "HTTPS",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "Suspicious DNS Query",
"Severity": "Medium",
"Protocol": "DNS",
"IDS_Alerts": "Yes"
},
{
"Attack_Type": "VPN Check-in",
"Severity": "Low",
"Protocol": "UDP",
"IDS_Alerts": "No"
},
{
"Attack_Type": "Heartbeat",
"Severity": "Low",
"Protocol": "ICMP",
"IDS_Alerts": "No"
}
]
# Load encoders
encoders = {
"Attack_Type": joblib.load("models/encoder_Attack_Type.pkl"),
"Severity": joblib.load("models/encoder_Severity.pkl"),
"Protocol": joblib.load("models/encoder_Protocol.pkl"),
"IDS_Alerts": joblib.load("models/encoder_IDS_Alerts.pkl")
}
label_decoder = joblib.load("models/encoder_Label.pkl")
# Encode samples
encoded_samples = []
for sample in samples:
row = []
for feature in ["Attack_Type", "Severity", "Protocol", "IDS_Alerts"]:
val = sample[feature]
encoder = encoders[feature]
row.append(encoder.transform([val])[0] if val in encoder.classes_ else 0)
encoded_samples.append(row)
X_input = pd.DataFrame(encoded_samples, columns=["Attack_Type", "Severity", "Protocol", "IDS_Alerts"])
# Load model and predict
model = joblib.load("models/alert_classifier.pkl")
predictions = model.predict(X_input)
predicted_labels = label_decoder.inverse_transform(predictions)
# Output results
output_data = []
for i, label in enumerate(predicted_labels):
print(f"\n[{i+1}] Attack Type: {samples[i]['Attack_Type']}")
print(f" Severity: {samples[i]['Severity']}")
print(f" Protocol: {samples[i]['Protocol']} | IDS Alert: {samples[i]['IDS_Alerts']}")
print(f" Prediction: {label.capitalize()}")
output_data.append({
"Attack_Type": samples[i]["Attack_Type"],
"Severity": samples[i]["Severity"],
"Protocol": samples[i]["Protocol"],
"IDS_Alerts": samples[i]["IDS_Alerts"],
"Prediction": label.capitalize()
})
# Save output to CSV
os.makedirs("output", exist_ok=True)
pd.DataFrame(output_data).to_csv("output/predicted_alerts.csv", index=False)
print("\nPredictions saved to 'output/predicted_alerts.csv'")

Output:

[1] Alert: {'Alert_Type': 'Brute Force', 'Severity': 'High', …}

➤ Prediction: Important

[2] Alert: {'Alert_Type': 'Port Scan', 'Severity': 'Low', …}

➤ Prediction: Noise

[3] Alert: {'Alert_Type': 'Malware', 'Severity': 'High', …}

➤ Prediction: Important

None

Output Saved in output/predicted_alerts.csv:

None

Visualization (Optional):

To support the analysis, we included visual representations to better understand the behavior of the model and the structure of the dataset.

1. Label Distribution

A bar chart was generated to display the number of samples labeled as noise versus important within the dataset. This helps visualize the class imbalance typical in real-world SIEM data.

Python snippet:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("dataset/siem_alert_logs.csv")
sns.countplot(x="Label", data=df)
plt.title("Label Distribution in Training Dataset")
plt.savefig("models/label_distribution.png")

File saved as: models/label_distribution.png

2. Feature Importance Chart

An additional bar chart was created to show the importance of each feature in contributing to the model's decisions. This highlights which alert fields (such as severity or device type) had the most influence on the predictions.

Python snippet:

importances = clf.feature_importances_
feature_names = X.columns
sns.barplot(x=importances, y=feature_names)
plt.title("Feature Importance from Random Forest")
plt.savefig("models/feature_importance.png")

File saved as: models/feature_importance.png

Results and Insights

The trained model was evaluated using a standard classification report, which generated the following approximate metrics:

None

Key takeaways:

  • The model effectively detects important alerts with a high recall rate.
  • It maintains a good precision score, avoiding frequent false positives.
  • Processing is efficient enough to support real-time use in SOC environments.
  • These results are suitable for integration into existing SIEM systems.

Outcome:

The machine learning model developed in this project serves as a reliable filter for reducing SIEM alert noise. Specifically, it:

  • Automatically suppresses non-actionable alerts.
  • Tags incoming alerts based on historical relevance.
  • Enhances blue team efficiency by cutting through false positives.
  • Allows the SOC team to focus more on actual incidents and reduce alert fatigue.

This system can be expanded and integrated into real-world SIEM platforms using export APIs or log pipelines.