January 4, 2025
Detecting Phishing Websites with Machine Learning: A Cybersecurity Perspective
By Maria Singh, Master of Science Candidate in Cybersecurity at Purdue Global University
According to the Zscaler ThreatLabz 2024 Phishing Report, phishing attacks against the finance and insurance industry surged by 393% year-over-year, comprising 27.8% of overall phishing incidents. Phishing remains one of the most significant threats to the financial sector because it targets unsuspecting users to steal sensitive information. Traditional detection methods are becoming insufficient as attackers evolve their tactics. In my recent research project, I explored how machine learning (ML) can improve phishing detection accuracy, focusing on decision tree models to mitigate risks effectively.
Here's a step-by-step breakdown of how I applied machine learning to tackle this critical cybersecurity challenge.
1. Problem Statement: Why Detecting Phishing Websites Matters
Phishing attacks compromise login credentials, financial data, and personal information. These attacks can cause substantial economic loss and reputational damage, especially in the financial sector. My research focused on undetected phishing websites (False Negatives) that slip through detection systems and pose a high-impact, high-likelihood risk.
Key question: How can we leverage machine learning to reduce undetected phishing attacks?
2. The Power of Machine Learning: Why Decision Trees?
A decision tree model was chosen for this project because it is intuitive, interpretable, and highly accurate for binary classification problems like phishing detection. Unlike complex models, decision trees offer clear insights into which website features indicate phishing.
Why Decision Trees?
- Accuracy: Achieved a 91.19% accuracy rate in detecting phishing websites.
- Interpretability: Easy to understand and explain key predictors.
- Scalability: Can be integrated across platforms (email filters, browser extensions, etc.).
3. Dataset: Phishing Websites Dataset
The dataset used for building the phishing detection model is the Phishing Websites Dataset from the UCI Machine Learning Repository, curated by Mohammad and McCluskey (2012). This dataset contains various features that help identify whether a website is legitimate or a phishing attempt. The key features include:
- SSLfinal_State: Indicates the status of SSL/TLS certificates on the website.
- URL_of_Anchor: Measures whether the anchor tags within a webpage point to trusted or external domains.
- Prefix_Suffix: Checks whether the domain name in the URL contains a prefix or suffix separated by a hyphen, a pattern often seen in phishing websites.
This dataset has been widely used in academic research to develop machine-learning models for phishing detection.
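To make the Prefix_Suffix feature concrete, here is a minimal Python sketch of how such a check could be computed from a raw URL. It is not the dataset authors' extraction code. The function name prefix_suffix, the example URLs, and the -1 (phishing-like) / 1 (legitimate-like) encoding are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed encoding for this sketch: -1 = phishing-like, 1 = legitimate-like.
PHISHING, LEGITIMATE = -1, 1


def prefix_suffix(url: str) -> int:
    """Flag a hyphen in the host name, e.g. 'secure-bank.example.com'."""
    host = urlparse(url).hostname or ""
    return PHISHING if "-" in host else LEGITIMATE


# Hypothetical URLs, used only to exercise the check.
print(prefix_suffix("https://secure-paypal.example.com/login"))
print(prefix_suffix("https://www.example.com/login"))
```

Only the host name is inspected, so a hyphen in the path or query string does not trigger the flag.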
Citation for the Dataset:
Mohammad, R., & McCluskey, L. (2012). Phishing Websites Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C51W2X
4. Key Features That Help Detect Phishing Websites
The model identified two critical features that play a pivotal role in detecting phishing websites:
1. SSLfinal_State: Indicates whether the website has a valid SSL/TLS certificate.
- Phishing websites often lack proper SSL certificates.
- This feature was the root node in the decision tree.
2. URL_of_Anchor: Measures if the anchor tags in a webpage point to the same domain or external domains.
- Phishing websites often use deceptive anchor links.
These features helped the decision tree classify websites with high accuracy.
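As an illustration of the idea behind URL_of_Anchor, the sketch below counts how many anchor tags on a page point to a host other than the page's own. It is a simplified Python approximation, not the dataset's exact definition. The class AnchorCollector, the function external_anchor_ratio, and the sample HTML are hypothetical, and the mapping from this ratio to the dataset's encoded values is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_anchor_ratio(html: str, page_host: str) -> float:
    """Fraction of anchors whose host differs from the page's host."""
    collector = AnchorCollector()
    collector.feed(html)
    if not collector.hrefs:
        return 0.0
    external = 0
    for href in collector.hrefs:
        host = urlparse(href).hostname
        # Relative links have no host and are counted as same-domain.
        if host is not None and host != page_host:
            external += 1
    return external / len(collector.hrefs)


# Hypothetical page with one internal and one external link.
sample = '<a href="/help">Help</a><a href="https://evil.example.net/x">Login</a>'
print(external_anchor_ratio(sample, "bank.example.com"))  # 0.5
```

A higher ratio means more links lead away from the site, which is the signal this feature is meant to capture.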
5. Building the Model: Step-by-Step Approach
To build the model, I followed a structured process:
- Data Collection: The phishing websites dataset from the UCI Machine Learning Repository was used, as shown in Figure 1.
Figure 1: Dataset Uploaded
- Data Preparation: Ensured the dataset was clean and formatted correctly, as shown in Figure 2.
Figure 2: Data Preparation
- Model Selection: The decision tree was created using the rpart() function in R, as shown in Figure 3.
- Training and Testing: Split the dataset into 70% training and 30% testing to evaluate performance.
Figure 3: Model Selection
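The 70/30 split in the steps above can be sketched in a few lines. The project itself was done in R, so this Python version is only an analogue. The function name train_test_split, the seed value, and the placeholder rows are assumptions for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for dataset records.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 70 30
```

Shuffling before cutting matters because the raw file may be ordered by class, which would otherwise leave one class over-represented in the test set.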
6. Visualizing the Decision Tree
The decision tree model in Figure 4 starts by checking if a website has a valid SSL certificate. If not, it is classified as phishing with an 88% probability. If the certificate is valid, the model checks the URL_of_Anchor to ensure it points to trustworthy domains.
Figure 4: Decision Tree
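One reason decision trees are easy to explain is that the top of the tree reads like nested if-statements. The Python sketch below mirrors only the first two checks described above. The root threshold of 0.5 comes from the model output discussed in this article. The second threshold (ANCHOR_THRESHOLD) and the direction of that second split are placeholders, not values taken from the fitted tree.

```python
# Placeholder threshold for the second split; NOT taken from the fitted tree.
ANCHOR_THRESHOLD = 0.5


def classify(site: dict) -> str:
    """Walk the first two levels of a tree shaped like the one in Figure 4."""
    # Root split: an SSLfinal_State value below 0.5 goes to the phishing branch.
    if site["SSLfinal_State"] < 0.5:
        return "phishing"
    # Second split (assumed direction): low URL_of_Anchor values look phishing-like.
    if site["URL_of_Anchor"] < ANCHOR_THRESHOLD:
        return "phishing"
    return "legitimate"


# Hypothetical encoded records (-1 = phishing-like, 1 = legitimate-like).
print(classify({"SSLfinal_State": -1, "URL_of_Anchor": 1}))
print(classify({"SSLfinal_State": 1, "URL_of_Anchor": 1}))
```

The real tree has more levels and attaches a class probability to each leaf, so this is a reading aid rather than a replacement for the model.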
Key Highlights from the Decision Tree Output
1. Variable Importance
The most important variables for classifying websites as phishing or legitimate are ranked based on their contribution to the splits in the decision tree. Here is the ranking from the model output:
- SSLfinal_State (44 points): The most critical predictor. This variable checks whether the website uses a valid SSL/TLS certificate, often missing in phishing websites.
- URL_of_Anchor (30 points): Measures if the anchor tags within the webpage point to internal or external domains. Phishing websites often use external links to redirect users to malicious sites.
- web_traffic (9 points): This indicates the website's traffic level. Legitimate websites generally have higher web traffic.
- having_Sub_Domain (7 points): Checks if the domain name has a subdomain, which can be a characteristic of phishing websites.
- Domain_registration_length and Request_URL (5 points each): These are less significant but contribute to the overall model accuracy.
2. Primary Splits
The primary splits in the decision tree show the decision points based on feature values:
- The first node splits on SSLfinal_State. A website with an SSLfinal_State value of less than 0.5 is classified as phishing with a higher probability.
- The next significant splits are based on URL_of_Anchor, web_traffic, having_Sub_Domain, and Prefix_Suffix.
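To show how a tree decides which split is best, here is a minimal Python sketch that scores a candidate split by its reduction in Gini impurity. As far as I know, Gini is the default criterion rpart uses for classification trees, but this is not rpart's implementation. The function names and the four toy rows are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(rows, labels, feature, threshold):
    """Impurity reduction from sending rows with feature < threshold to the left."""
    left = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Toy data in which SSLfinal_State separates the two classes perfectly.
rows = [{"SSLfinal_State": -1}, {"SSLfinal_State": -1},
        {"SSLfinal_State": 1}, {"SSLfinal_State": 1}]
labels = ["phishing", "phishing", "legitimate", "legitimate"]
print(split_gain(rows, labels, "SSLfinal_State", 0.5))  # 0.5
```

The feature and threshold with the largest gain become the split at that node, which is why a strong predictor such as SSLfinal_State ends up at the root.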
3. Surrogate Splits
Surrogate splits are used when the value of the primary split variable is missing for an observation. In this model, URL_of_Anchor and web_traffic are often used as surrogate splits, meaning they serve as backup criteria when SSLfinal_State is unavailable.
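The fallback behavior of surrogate splits can be sketched as follows. This is a simplified Python illustration, not rpart's internals. The function route, the surrogate thresholds, and the default direction are assumptions, and real surrogates may also send rows in the opposite direction from the primary split.

```python
def route(row, primary, surrogates, default="left"):
    """Return 'left' or 'right' for one row at a single tree node.

    Each split is a (feature, threshold) pair; values below the threshold go left.
    The primary split is tried first, then each surrogate in order.
    """
    for feature, threshold in [primary, *surrogates]:
        value = row.get(feature)
        if value is not None:
            return "left" if value < threshold else "right"
    # Nothing usable: fall back to an assumed default direction.
    return default


primary = ("SSLfinal_State", 0.5)
# Hypothetical surrogate thresholds, for illustration only.
surrogates = [("URL_of_Anchor", 0.5), ("web_traffic", 0.5)]

row = {"SSLfinal_State": None, "URL_of_Anchor": -1, "web_traffic": 1}
print(route(row, primary, surrogates))  # left
```

Because the first available variable decides the direction, a record with a missing SSL value can still be classified instead of being dropped.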
- Feature Importance for Model Accuracy: The decision tree model shows SSLfinal_State and URL_of_Anchor as the most influential features. These variables significantly improve the model's accuracy in distinguishing between phishing and legitimate websites.
- Interpretable and Actionable Insights: Decision tree models are easy to interpret. The model shows that SSL certificates and anchor tag analysis are critical for detecting phishing websites. These insights can be shared with cybersecurity teams to update security policies and training programs.
Confusion Matrix Results
- Accuracy: 91.19%
- Sensitivity: 90.06% (correctly identified phishing websites).
- Specificity: 92.10% (correctly identified legitimate websites).
Figure 5: Confusion Matrix Output
These metrics highlight the balance between detecting phishing websites and minimizing false positives.
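For readers who want to see how these three numbers come out of a confusion matrix, here is a small Python sketch. The counts are made up for illustration and are not the counts from Figure 5. Phishing is treated as the positive class, so a false negative is an undetected phishing website.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, sensitivity, and specificity from confusion matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all sites classified correctly
        "sensitivity": tp / (tp + fn),   # share of phishing sites that were caught
        "specificity": tn / (tn + fp),   # share of legitimate sites that were cleared
    }


# Hypothetical counts: 100 phishing sites and 100 legitimate sites.
m = confusion_metrics(tp=90, fn=10, fp=8, tn=92)
print(m)
```

With these made-up counts, 10 phishing sites go undetected, which is exactly the kind of error sensitivity measures.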
7. Real-World Incident Example
AI-Generated Phishing Scams Targeting Corporate Executives
In 2024, a surge of AI-generated phishing scams targeted corporate executives, including those in the financial sector. Cybercriminals utilized advanced AI tools to craft highly personalized phishing emails, making them more convincing and challenging to detect. These emails often contained personal details gleaned from online profiles, enhancing their legitimacy. The consequences were severe, leading to unauthorized access to sensitive information and significant financial losses. The FBI reported that business email compromise scams, a form of phishing, have cost over $50 billion since 2013, with AI-driven attacks contributing to the recent surge.
8. Business Use Case
Implementing Machine Learning for Phishing Detection in Financial Institutions
Financial institutions can bolster their security posture by integrating machine learning-based phishing detection systems. Here's how:
- Data Collection: Aggregate historical data on phishing attempts, including email content, URLs, and metadata.
- Model Training: Employ supervised learning algorithms to train models distinguishing between legitimate and phishing communications.
- Real-Time Monitoring: Deploy the trained model to monitor incoming emails and web traffic in real-time, flagging suspicious activities for further investigation.
- Continuous Learning: Regularly update the model with new data to adapt to evolving phishing tactics.
By implementing such systems, financial institutions can more effectively detect and prevent phishing attacks, safeguarding sensitive information and maintaining customer trust. Machine learning models can analyze vast amounts of transaction data in real time, identifying anomalies that may indicate fraudulent activity.
9. Continuous Improvement
Evolving Machine Learning Models to Combat New Phishing Tactics
Phishing tactics are continually evolving, necessitating the advancement of machine learning models. Strategies for continuous improvement include:
- Incorporating New Features: Integrate additional data points, such as behavioral biometrics and device fingerprints, to enhance detection capabilities.
- Adversarial Training: Expose models to simulated phishing attacks to improve resilience against sophisticated threats.
- Cross-Industry Collaboration: Share threat intelligence across industries to enrich the dataset and improve model accuracy.
By adopting these strategies, machine learning models can stay ahead of emerging phishing techniques, providing robust defense mechanisms for financial institutions. Machine learning algorithms' adaptability allows them to learn from new data and adjust to evolving fraud patterns, making them a powerful tool in the fight against financial fraud.
10. Conclusion: The Future of ML in Cybersecurity
Machine learning is transforming cybersecurity by enabling proactive threat detection. While traditional approaches struggle with evolving phishing tactics, decision tree models offer a scalable and interpretable solution. However, it is essential to recognize that this base model provides a strong foundation but has opportunities for refinement and improvement. Future advancements could include integrating more complex features, refining detection thresholds, and employing ensemble learning techniques to reduce false negatives and adapt to evolving threats. We can build more robust and resilient cybersecurity defenses against phishing attacks by continuously improving these models.
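As a small illustration of what refining detection thresholds means, the Python sketch below counts false negatives and false positives at two flagging thresholds. The scores and labels are made up, and the function names are hypothetical. It assumes the model outputs a phishing probability for each site, with label 1 meaning phishing and 0 meaning legitimate.

```python
def false_negatives(scores, labels, threshold):
    """Phishing sites (label 1) whose score falls below the flagging threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)


def false_positives(scores, labels, threshold):
    """Legitimate sites (label 0) whose score reaches the flagging threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)


# Made-up phishing probabilities and true labels.
scores = [0.95, 0.80, 0.45, 0.30, 0.60, 0.35, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0]

for t in (0.5, 0.25):
    print(t, false_negatives(scores, labels, t), false_positives(scores, labels, t))
```

On this toy data, lowering the threshold from 0.5 to 0.25 removes both false negatives but adds one false positive, which is the trade-off any threshold change has to weigh.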
As a Master of Science candidate in Cybersecurity at Purdue Global University, I am committed to exploring innovative ways to make cyberspace safer for businesses and individuals.
References
To ensure my research is well-grounded, I referenced academic papers and industry reports, including works by Mohammad & McCluskey (2012) and Verma & Das (2017).
Ahmed, A. N. (2024, November 10). What is confusion matrix in machine learning? The model evaluation tool explained. DataCamp. https://www.datacamp.com/tutorial/what-is-a-confusion-matrix-in-machine-learning
Alkhalil, Z., Hewage, C., Nawaf, L., & Khan, I. (2021, March 8). Phishing attacks: A recent comprehensive study and a new anatomy. Frontiers. https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2021.563060/full
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Routledge. https://doi.org/10.1201/9781315139470
Dhamija, R., Tygar, J. D., & Hearst, M. (2006). Why phishing works. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 581–590. https://doi.org/10.1145/1124772.1124861
Ma, J., Saul, L. K., Savage, S., & Voelker, G. M. (2009). Beyond blacklists: Learning to detect malicious websites from suspicious URLs. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://cseweb.ucsd.edu/~jtma/papers/beyondbl-kdd2009.pdf
Mohammad, R., & McCluskey, L. (2012). Phishing Websites Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C51W2X
Pykes, K. (2022, October 19). Cohen's kappa explained. Builtin. https://builtin.com/data-science/cohens-kappa
Rokach, L., & Maimon, O. (2014, October). Data mining with decision trees: Theory and applications. World Scientific Publishing. 81(2), 2–4. https://doi.org/10.1142/9097
Verma, R., & Das, A. (2017). What's in a URL: Fast feature extraction and malicious URL detection. Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics, 55–63. https://doi.org/10.1145/3041008.3041014
https://www.tookitaki.com/compliance-hub/fraud-detection-using-machine-learning-in-banking?
I appreciate your feedback and comments to help me improve my research and model.