For a video on this, click here. I recently came across an intriguing paper (https://arxiv.org/html/2406.06489v1) that tested various machine learning models, including a transformer-based language model, on out-of-distribution (OOD) prediction tasks. The authors found that simply making neural networks larger doesn't improve their performance on these OOD tasks, and may even make it worse. They argue that scaling up models isn't the route to genuine understanding beyond the training data. This finding contrasts with many studies on "grokking," where neural networks suddenly begin to generalize well after extended training. According to the new paper, the generalization seen in grokking is too simplistic and doesn't represent true OOD generalization. However, I have a ...