Data Science 101 — Overfitting & Under-fitting: How to detect them in our models

A guide for answering the popular data science interview question about overfitting & underfitting

Ashley Ha

~4 min read · February 23, 2023 (Updated: April 8, 2023) · Free: No

Your Guide for the Data Science Interview — Overfitting & Underfitting

Hey, all! 🙋🏼‍♀️Ashley here.

I had a fantastic suggestion recently from a LinkedIn connection for deep-diving into our next Data Science interview question.

What is overfitting & underfitting, & how can we detect them in our models?

So, here's how I would answer this question in a data science interview:

"Overfitting occurs when a model is too complex & fits the training data too closely, noise included, resulting in poor performance on new, unseen data. This can happen when a model has too many features relative to the amount of data or when it is trained on a small dataset.

On the other hand, underfitting happens when a model is too simple & doesn't capture the underlying patterns in the data. This results in poor performance on both the training & test sets. This can happen when a model is too rigid or when it is not trained for long enough.

So, to construct an effective machine learning model, it is important to find the balance between overfitting & underfitting. Techniques such as regularization, feature selection, & cross-validation can help with this."

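Understanding the bias-variance tradeoff helps us avoid both mistakes: an underfit model has high bias (it is too simple to capture the pattern), while an overfit model has high variance (it changes too much with the particular training sample it saw).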
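To make the tradeoff concrete, here is a minimal sketch that fits polynomials of increasing degree to noisy data. The sine-shaped data, the sample sizes, and the degrees 1, 4, and 12 are all assumptions chosen for illustration, not part of the interview answer itself.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical data: a sine wave plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def fit_and_score(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1 is likely too simple, degree 12 likely too flexible for 30 points.
errors = {degree: fit_and_score(degree) for degree in (1, 4, 12)}
for degree, (train_mse, test_mse) in errors.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, so it cannot tell us when to stop. The test error is the number to watch: a high error on both sets suggests underfitting, while a low training error paired with a noticeably higher test error suggests overfitting.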

Methods to detect & address overfitting & underfitting

Cross-validation: this involves splitting the data into training & validation sets, training the model on one part & evaluating it on the other, & then repeating the process over several different splits. If the model performs well on the training folds but poorly on the validation folds, it is overfitting. If it performs poorly on both the training & validation folds across all splits, it is underfitting.

Image — Author: Ashley Ha
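Here is a rough sketch of that check using scikit-learn's `cross_validate`. The synthetic dataset and the two decision trees (one unpruned, one limited to depth 1) are assumptions picked to make the contrast easy to see, and `mean_cv_scores` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (an assumption standing in for a real dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

def mean_cv_scores(model, X, y, folds=5):
    """Return (mean training accuracy, mean validation accuracy) over the folds."""
    result = cross_validate(model, X, y, cv=folds, return_train_score=True)
    return result["train_score"].mean(), result["test_score"].mean()

# An unpruned tree can memorize the training folds; a depth-1 tree cannot.
deep_train, deep_val = mean_cv_scores(DecisionTreeClassifier(random_state=0), X, y)
stump_train, stump_val = mean_cv_scores(
    DecisionTreeClassifier(max_depth=1, random_state=0), X, y
)

print(f"unpruned tree: train={deep_train:.2f}  validation={deep_val:.2f}")
print(f"depth-1 tree:  train={stump_train:.2f}  validation={stump_val:.2f}")
```

A large gap between the training and validation scores points to overfitting. Two scores that sit close together but are both low point to underfitting.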

Plotting a learning curve: this is a useful visualization that plots the model's performance on the training set & validation set as a function of the amount of training data. If the model is overfitting, training performance stays high while validation performance levels off well below it or gets worse, leaving a persistent gap between the two curves. On the other hand, an underfitting model performs poorly on both sets: the two curves settle at a low score, and adding more training data does not improve it.

Image — Author: Ashley Ha
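The numbers behind such a plot can be computed with scikit-learn's `learning_curve`. This is only a sketch: the synthetic data, the unpruned decision tree, and the helper name `learning_curve_means` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption standing in for a real dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

def learning_curve_means(model, X, y):
    """Return training-set sizes with mean train/validation scores per size."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5
    )
    # Each score array has one row per size and one column per CV fold.
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)

sizes, train_mean, val_mean = learning_curve_means(
    DecisionTreeClassifier(random_state=0), X, y
)
for n, train_score, val_score in zip(sizes, train_mean, val_mean):
    print(f"n={n:3d}  train={train_score:.2f}  validation={val_score:.2f}")
```

Passing `sizes`, `train_mean`, & `val_mean` to any plotting library gives the two curves described above. For this unpruned tree, the training curve should stay well above the validation curve at every size, which is the overfitting pattern.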

Regularization: regularization is a technique that can prevent overfitting by adding a penalty term to the model's loss function, which discourages overly complex fits. The strength of the penalty matters: if the regularization parameter is set too high, the model may underfit, & if it's set too low, the model may overfit. So if a model is underfitting, reducing the regularization parameter lets it become more complex & fit the data better!
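As a small sketch of how the regularization parameter shifts a model between the two failure modes, the example below fits ridge regression on polynomial features at three penalty strengths. The sine-shaped data, the degree of 10, and the three `alpha` values are assumptions chosen for illustration, and `ridge_scores` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: a sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

def ridge_scores(alpha, degree=10):
    """Fit a polynomial ridge model; return (train R^2, test R^2)."""
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),  # alpha is the regularization strength
    )
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

results = {alpha: ridge_scores(alpha) for alpha in (1e-4, 1.0, 1e4)}
for alpha, (train_r2, test_r2) in results.items():
    print(f"alpha={alpha:g}  train R^2={train_r2:.2f}  test R^2={test_r2:.2f}")
```

With a very large `alpha` the coefficients are shrunk toward zero & the model scores poorly even on its own training data, which is underfitting. With a very small `alpha` the penalty barely constrains the model, so it is free to fit the noise. In practice the value is usually picked by cross-validation rather than by hand.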

Feature selection: overfitting can occur when the model is trained on too many features, so removing irrelevant or redundant features can help prevent it. On the other hand, fixing underfitting may require adding more relevant features to the model to improve its performance.
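One possible way to try this out is a univariate filter such as scikit-learn's `SelectKBest`. In the sketch below, the synthetic dataset (5 informative features out of 50), the choice of `k=5`, & the k-nearest-neighbors classifier are all assumptions for illustration. Whether selection actually helps depends on the data & the model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data: 5 informative features hidden among 45 noise features.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

all_features = KNeighborsClassifier()
# Keeping the selector inside the pipeline means it is refit on each
# training fold, so the validation fold never influences which features are kept.
selected = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("knn", KNeighborsClassifier()),
])

score_all = cross_val_score(all_features, X, y, cv=5).mean()
score_selected = cross_val_score(selected, X, y, cv=5).mean()
print(f"all 50 features: validation accuracy={score_all:.2f}")
print(f"top 5 features:  validation accuracy={score_selected:.2f}")

# Inspect which columns the selector keeps when fit on the full dataset.
selected.fit(X, y)
kept = selected.named_steps["select"].get_support(indices=True)
print("kept feature indices:", kept)
```

Comparing the two cross-validated scores shows whether dropping features helped on this particular dataset.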

Ensemble methods: these methods combine multiple models, trained on different subsets of the data or with different algorithms. Bagging-style ensembles (such as random forests) average many models trained on different subsets of the data, which reduces variance & so helps with overfitting. Boosting-style ensembles add simple models one after another, which increases the overall model's capacity & so helps combat underfitting.
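Both effects can be sketched with scikit-learn. The synthetic data & the four models below are assumptions picked for illustration: a random forest is compared against a single unpruned tree, & boosted depth-1 trees are compared against a single depth-1 tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption standing in for a real dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

models = {
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "boosted stumps": GradientBoostingClassifier(
        max_depth=1, n_estimators=100, random_state=0
    ),
}

# Store (mean training accuracy, mean validation accuracy) per model.
scores = {}
for name, model in models.items():
    result = cross_validate(model, X, y, cv=5, return_train_score=True)
    scores[name] = (result["train_score"].mean(), result["test_score"].mean())
    train_acc, val_acc = scores[name]
    print(f"{name:24s} train={train_acc:.2f}  validation={val_acc:.2f}")
```

On data like this, the forest would be expected to generalize better than the single deep tree, & the boosted stumps to fit the training data better than a single stump. The exact numbers will vary with the dataset & settings.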

I hope this helped you feel more confident about this Data Science interview question!

Happy learning,

⏤Ashley

Check out my previous blog post where I deep dive into another popular interview question all about: Handling Unbalanced Data


As always, feedback is greatly appreciated! I read every single comment & I am constantly working to improve my content & research. I would love to collaborate with others in the field! Please feel free to reach out to me at ashleyha@berkeley.edu or connect on LinkedIn https://www.linkedin.com/in/ashleyeastman/ or Instagram @ashleybee.tech