There is a lingering belief from classical machine learning that bigger models overfit and therefore generalize poorly. That intuition comes from the bias-variance trade-off, but it no longer holds in the modern, heavily overparameterized regime. This is shown empirically by phenomena like double descent, where test error first improves, then worsens as the model gains just enough capacity to fit the training data, and then improves again as capacity keeps growing, so that very high-complexity models end up outperforming lower-complexity ones. Why this happens remains counterintuitive for most people, so I aim to address it here.

Capacity Theory: The theory states that when models are much larger than their training data, they have extra capacity not just for memorizing but also for exploring different structures. They can find more generalizable structures that are simpler than those required for memorization. Due to regularization, the model favors these simpler structures over pure memorization.
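To make the double-descent picture concrete, here is a minimal sketch, not taken from the original discussion, of a random-feature regression on synthetic data; the dataset, the ReLU feature map, and the specific feature counts are illustrative assumptions. The minimum-norm least-squares fit that `np.linalg.lstsq` returns in the underdetermined regime plays the role of the regularization described above: as the number of features grows past the number of training points, the test error typically peaks near that threshold and then descends again.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 50, 1000, 10          # sample sizes and input dimension
w_true = rng.normal(size=d)                # fixed linear target shared by train and test

def make_data(n):
    """Noisy linear target: y = x . w_true + noise."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def random_relu_features(X, W):
    """Project inputs through fixed random weights, then apply a ReLU."""
    return np.maximum(X @ W, 0.0)

# Sweep the number of random features: model capacity grows left to right,
# crossing the point where the model can exactly fit the 50 training points.
for n_features in [5, 10, 25, 45, 50, 55, 75, 150, 500, 2000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # fixed random projection
    Phi_tr = random_relu_features(X_tr, W)
    Phi_te = random_relu_features(X_te, W)

    # lstsq returns the minimum-norm least-squares solution when the system is
    # underdetermined (n_features > n_train); that implicit preference for small
    # weights is what drives the second descent in test error.
    beta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)

    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"features={n_features:5d}  test MSE={test_mse:8.3f}")
```

Fixed random features keep the trainable part linear, which makes the point where capacity matches the training set easy to locate; a qualitatively similar curve has been reported for neural networks when width is swept instead.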