Disagree and Commit: Why "Get More Data" Is a Waste of Money

Sep 25, 2025

We've all heard the mantra: "Get more data". In AI, data is often treated as the ultimate commodity. But what if I told you that hoarding data without a strategy for your model's capacity is like building a library with no plan for the shelves? You'll end up with a pile of books and no way to properly organize or access the knowledge within.
The true art of model building lies in understanding the delicate dance between Memorization and Generalization - a classic manifestation of the bias-variance tradeoff.
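For readers who want the textbook form of that tradeoff, the classic decomposition of expected squared error (a standard identity, quoted here for reference rather than derived) is:

```
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2
```

where \sigma^2 is the irreducible noise in the data. Memorization lives in the variance term; an over-simple model lives in the bias term.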
Let's break this down simply:
Small Dataset + Small NN → OK. A simple model can learn the limited patterns without much room to memorize noise. It's a balanced, if limited, solution.
Small Dataset + Large NN → ⚠️ Overfitting. The large network has too much capacity. It memorizes the tiny training set perfectly, including all its noise and outliers, and will fail miserably on new data. High variance.
Large Dataset + Small NN → ⚠️ Underfitting. The simple model doesn't have enough capacity to learn the complex patterns in the vast amount of data. It fails to capture the underlying truth, resulting in poor performance. High bias.
Large Dataset + Large NN → OK. Ample data provides enough information for the large network to learn generalizable patterns rather than just memorize. This is the sweet spot.
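If you want to see all four regimes on your own machine, here is a minimal sketch using scikit-learn. The task, dataset sizes, network widths, and noise level are all illustrative assumptions, not recommendations:

```python
# Sweep the 2x2 grid: {small, large} dataset x {small, large} network,
# and compare train vs. test R^2 to watch each regime emerge.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy 1-D regression task: y = sin(3x) + Gaussian noise.
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, size=n)
    return X, y

for n in (40, 4000):                   # small vs. large dataset
    for hidden in ((4,), (256, 256)):  # small vs. large network
        X, y = make_data(n)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=5000,
                             random_state=0).fit(X_tr, y_tr)
        print(f"n={n:<5} hidden={str(hidden):<10} "
              f"train R2={model.score(X_tr, y_tr):.2f}  "
              f"test R2={model.score(X_te, y_te):.2f}")
```

A large gap between train and test R2 in the small-data/large-network cell is overfitting; low scores on both splits in the large-data/small-network cell is underfitting.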
Think of it this way: data is a source of information, and during training a model learns how to translate that data into useful structure. If your dataset is small, or full of highly similar, redundant samples, even a large model will simply memorize it, leading to overfitting. This is precisely what your training curves will tell you: a widening gap between training and validation loss.
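That diagnostic is easy to automate. Below is a hedged sketch of a gap monitor; the threshold and patience values are arbitrary assumptions you would tune for your own training loop:

```python
def overfitting_alarm(train_losses, val_losses, gap_ratio=0.2, patience=3):
    """Return the first epoch at which the relative train/validation gap
    exceeded `gap_ratio` for `patience` consecutive epochs, else None."""
    streak = 0
    for epoch, (tr, va) in enumerate(zip(train_losses, val_losses)):
        gap = (va - tr) / max(tr, 1e-12)  # relative generalization gap
        streak = streak + 1 if gap > gap_ratio else 0
        if streak >= patience:
            return epoch - patience + 1   # epoch where the streak began
    return None

# Example: train loss keeps falling while validation loss turns around.
train = [1.0, 0.6, 0.4, 0.25, 0.15, 0.08]
val   = [1.1, 0.7, 0.55, 0.60, 0.70, 0.85]
print(overfitting_alarm(train, val))  # -> 2
```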
The intuitive response? "My model is overfitting; I need more data." But what if you can't get more data? Or what if your data itself is the problem?
This is where the groundbreaking Chinchilla scaling laws from DeepMind (Hoffmann et al., 2022) changed the game for Large Language Models (LLMs). The key finding was that behemoths like GPT-3 were significantly undertrained: massive models starved of data. Chinchilla showed that for a given compute budget, it is almost always better to train a smaller model on more data than a larger model on less.
The lesson? Throwing parameters at a problem is not a strategy. You must have a proportional amount of data to inform those parameters.
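To make that lesson concrete, here is a back-of-the-envelope sketch of the Chinchilla trade-off. It leans on two commonly cited approximations, both rough heuristics rather than exact constants: training compute C ≈ 6·N·D (N parameters, D tokens), and a compute-optimal ratio of roughly 20 tokens per parameter:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a FLOP budget into a model size and a token count,
    assuming C = 6 * N * D and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# GPT-3-scale budget: 6 * 175e9 params * 300e9 tokens ~= 3.15e23 FLOPs.
n, d = chinchilla_optimal(3.15e23)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e9:.0f}B tokens")  # ~51B on ~1025B
```

Run GPT-3's own budget through this rule of thumb and you get a model roughly a third the size trained on more than three times the tokens: exactly the "smaller model, more data" direction Chinchilla pointed to.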

Why Does This Matter to You? The Cost of Getting It Wrong

This isn't just academic. For anyone in a safety-critical domain - medical imaging, autonomous driving, financial fraud detection - this understanding is non-negotiable. But there's a massive commercial imperative, too.
Indiscriminately using heavy models has a direct bottom-line impact:
- Unreliable AI: An overfit model is an unreliable product, leading to poor user experiences and eroding trust.
- Sky-High Costs: Massive computational requirements inflate development budgets for training and hyperparameter tuning.
- Operational Burden: Deploying and serving a gigantic model requires expensive infrastructure, increasing inference costs and latency.
The result? You price yourself out of the market. In today's competitive landscape, efficiency isn't just a technical goal - it's a business survival strategy. The company that achieves the same performance with a leaner, more efficient model wins.
The takeaway: Stop thinking about data and model size as separate entities. They are in constant dialogue. Your job is to listen. Tune your model's capacity to the information content of your data to build AI that is not just accurate on a test set, but is truly reliable, cost-effective, and competitive in the wild.
