In my last post, I shared how Chapter One of Hands-On Machine Learning introduced me to the world of algorithms, data, and curiosity. This continuation goes a little deeper into what can go wrong in machine learning. I spent time writing these notes by hand, and they reminded me that the most important lessons often come from mistakes, both ours and the model's.
1. Insufficient Quantity of Training Data
Machine learning needs a lot of examples to work well. With too little data, the algorithm struggles to find meaningful patterns. It is like trying to learn a new language from only a few sentences; there just isn't enough variety to generalize.
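To see this for myself, I tried a tiny experiment (my own sketch, not from the book) using scikit-learn's synthetic make_moons data: the same model trained on 20 examples versus 2,000, both scored on the same held-out data.

```python
# My own toy illustration: identical model, different amounts of training data.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=4000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Same algorithm, trained on 20 examples vs. the full 2,000.
small_acc = LogisticRegression().fit(X_train[:20], y_train[:20]).score(X_test, y_test)
large_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

print(f"accuracy with   20 examples: {small_acc:.2f}")
print(f"accuracy with 2000 examples: {large_acc:.2f}")
```

With only 20 sentences of the "language," so to speak, the model simply hasn't seen enough variety to draw a reliable boundary.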
2. Non-representative Training Data
For a model to generalize well, its training data has to represent the real-world situations it will face later. If the sample is too small or collected in a biased way, the model may learn patterns that don't hold true elsewhere. This is called sampling bias; a well-known special case is non-response bias, where the people who don't answer a survey differ systematically from those who do.
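A small made-up example helped me internalize this. Suppose we estimate average daily screen time, but our survey only reaches people with landline phones (all names and numbers below are invented for illustration):

```python
# Hypothetical survey data: estimating average screen time, but the
# sampling method (landline calls) only reaches part of the population.
population = [
    {"age": 22, "screen_hours": 6.0, "has_landline": False},
    {"age": 25, "screen_hours": 5.5, "has_landline": False},
    {"age": 34, "screen_hours": 4.0, "has_landline": False},
    {"age": 58, "screen_hours": 2.0, "has_landline": True},
    {"age": 63, "screen_hours": 1.5, "has_landline": True},
    {"age": 71, "screen_hours": 1.0, "has_landline": True},
]

true_mean = sum(p["screen_hours"] for p in population) / len(population)

sample = [p for p in population if p["has_landline"]]  # biased sampling frame
sample_mean = sum(p["screen_hours"] for p in sample) / len(sample)

print(f"true mean:   {true_mean:.2f} h")   # 3.33 h
print(f"sample mean: {sample_mean:.2f} h") # 1.50 h
```

The sample is internally consistent, yet its estimate is less than half the true value, because the way it was collected is correlated with the quantity being measured.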
3. Poor Quality Data
If the training data is full of errors, missing values, or random noise, it becomes difficult for the model to learn the underlying relationships. Data quality is everything. I wrote in my notes: "You need to spend significant time cleaning your data; it's worth it."
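As a reminder to myself of what that cleaning looks like in miniature, here is a sketch (my own made-up column, not the book's): drop values that are clearly invalid, then fill the gaps with the median of what remains.

```python
from statistics import median

# Made-up "age" column with typical problems: missing values (None)
# and obvious entry errors (negative or impossibly large ages).
raw_ages = [25, None, 31, -1, 200, 40, None, 28]

def is_valid(age):
    """An age is usable if it is present and physically plausible."""
    return age is not None and 0 < age < 120

fill_value = median([a for a in raw_ages if is_valid(a)])  # median of [25, 31, 40, 28]
cleaned = [a if is_valid(a) else fill_value for a in raw_ages]

print(cleaned)  # [25, 29.5, 31, 29.5, 29.5, 40, 29.5, 28]
```

Whether to drop, fill, or flag bad values depends on the dataset, but the point stands: the model only ever sees what survives this step.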
4. Irrelevant Features
A system can only learn from what it is given. If the dataset has too many irrelevant features or not enough useful ones, the model becomes confused. Feature engineering is the art of selecting, combining, or creating features that matter most. It involves feature selection, feature extraction, and sometimes even gathering new data to strengthen learning.
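A minimal example of feature extraction, using made-up district rows of my own (the column names just echo the kind of housing table the book works with later):

```python
# Made-up district rows. "total_rooms" alone says little about a district,
# but rooms per household is a ratio a model can actually use.
districts = [
    {"total_rooms": 900,  "households": 150, "median_price": 210_000},
    {"total_rooms": 4200, "households": 600, "median_price": 340_000},
    {"total_rooms": 1500, "households": 500, "median_price": 120_000},
]

for d in districts:
    # Combine two raw columns into one more informative feature.
    d["rooms_per_household"] = d["total_rooms"] / d["households"]

print([round(d["rooms_per_household"], 1) for d in districts])  # [6.0, 7.0, 3.0]
```

Nothing new was measured here; the information was already in the table, just not in a form the model could easily exploit.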
5. Overfitting the Training Data
Overfitting happens when a model performs perfectly on the training data but fails to generalize to new data. It means the model has memorized instead of learned. Some ways to prevent this include simplifying the model, gathering more training data, cleaning up noisy samples, or applying regularization to constrain how flexible the model is allowed to be.
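To make regularization concrete, here's a sketch I wrote (my own setup, not from the book): a wildly flexible degree-15 polynomial fit to 20 noisy points from a simple quadratic, once without and once with a ridge penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_train = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)
y_train = 0.5 * X_train.ravel() ** 2 + rng.normal(0, 1, 20)  # noisy quadratic
X_test = np.linspace(-2.9, 2.9, 200).reshape(-1, 1)
y_test = 0.5 * X_test.ravel() ** 2                           # the true curve

degree = 15  # far more flexible than the data requires
plain = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                      StandardScaler(), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0))

plain.fit(X_train, y_train)
ridge.fit(X_train, y_train)

print("unregularized test MSE:", mean_squared_error(y_test, plain.predict(X_test)))
print("ridge test MSE:        ", mean_squared_error(y_test, ridge.predict(X_test)))
```

The unconstrained polynomial chases the noise in the 20 training points; the ridge penalty pulls the coefficients toward zero and keeps the curve closer to the true shape.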
6. Underfitting the Training Data
Underfitting is the opposite problem: the model is too simple to capture the structure of the data, so it struggles to find meaningful patterns. To fix this, we can choose a more powerful model, feed it better features through feature engineering, or reduce overly strict constraints such as strong regularization.
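A quick sketch of the "choose a more powerful model" fix, on data I generated from a clean quadratic: a straight line cannot bend to follow it, but adding a squared feature lets the same linear algorithm fit it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2  # clearly non-linear, noise-free for clarity

linear = LinearRegression().fit(X, y)                               # underfits
quad = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
quad_r2 = quad.score(X, y)
print("linear R^2:   ", linear_r2)  # near 0: a line can't capture the curve
print("quadratic R^2:", quad_r2)    # near 1.0
```

The lesson I took away: underfitting isn't the data's fault; the model simply lacks the capacity (or the features) to express the pattern.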
7. The Balance Between Simplicity and Complexity
One of my favorite takeaways from this chapter is that machine learning is a balancing act. The goal is to fit the training data well while keeping the model simple enough to generalize. Finding this balance is what makes a good model stand the test of time.
8. Hyperparameters and Validation
Hyperparameters control how the learning algorithm behaves. They are set before training begins and determine things like regularization strength or learning rate. To find the best values, we use hold-out validation, splitting data into training, validation, and test sets. We train on the training set, tune on the validation set, and finally test once at the end to measure generalization.
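Here is roughly how I picture that workflow in code (my own sketch on synthetic data; the alpha candidates are arbitrary): split off a test set first, carve a validation set out of the rest, tune on validation, and touch the test set exactly once.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)

# Hold out the test set first, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]  # hyperparameter values to try
best_alpha, best_mse = None, float("inf")
for alpha in candidates:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))  # tune on validation
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Retrain on train+validation with the chosen value, then test once.
final = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
print("chosen alpha:", best_alpha)
print("final test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```

The discipline is in that last step: the test score is only an honest estimate of generalization because it played no part in choosing the hyperparameter.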
9. Representativeness Matters
Both the validation and test sets must represent the kind of data we expect to see in production. If they are not representative, the model's real-world performance will likely disappoint, even if the test results look great on paper.
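One practical tool here is stratified splitting. In this toy sketch of mine (an imbalanced label with only 10% positives), passing stratify=y to scikit-learn's train_test_split guarantees the test set keeps the same class proportions as the whole dataset:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 examples of class 0, 10 of class 1.
y = [0] * 90 + [1] * 10
X = [[i] for i in range(100)]

# Plain random split: the rare class's share of the test set can drift.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=7)
# Stratified split: class proportions are preserved by construction.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=7,
                                         stratify=y)

print("plain test counts:     ", Counter(y_test_plain))
print("stratified test counts:", Counter(y_test_strat))  # 18 zeros, 2 ones
```

With only 100 examples the drift from a plain split can be large in relative terms; stratifying makes the held-out sets look like the data the model will actually face.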
Wrapping Up
This section of Chapter One made me realize that machine learning is as much about discipline as creativity. Cleaning data, choosing features, and tuning models may sound routine, but they form the foundation of every successful system. Learning to build reliable models feels a lot like learning itself. We test, we adjust, and we grow.
Next up is Chapter Two, End-to-End Machine Learning Project, where I'll bring these ideas to life and start connecting theory with practice.
I can't wait to share how it all unfolds.