What’s the difference between predicting the next video frame and predicting reality?
Currently, videos look a lot like this. Until the models grasp physics and causality and can model natural human behavior, they will keep looking like this. That understanding is a necessary condition for modeling what happens next in a given video.

Put another way: IF AND WHEN we have Turing-test-level video generation, we also have a reality predictor. To the model, there is no difference between predicting what happens next from a video frame and predicting what happens next in a car crash, based on visual input.

Now, if inference time gets low enough (and the trend is that inference time and cost drop very quickly), we can predict frames faster than they happen, i.e. real-time prediction into the future. A back-of-the-envelope sketch of this follows below.

People often mention quantum uncertainty to me. However, this uncertainty is not large enough...
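To make the faster-than-real-time claim above concrete, here is a minimal sketch of the arithmetic. The frame rate and per-frame inference times are illustrative assumptions, not benchmarks of any real model, and `lead_growth` is just a hypothetical helper name.

```python
# Back-of-the-envelope check of the "faster than real time" condition.
# The model outruns reality once its per-frame inference time drops
# below the real frame interval (1 / fps). All numbers here are
# illustrative assumptions, not measurements of any actual model.

def lead_growth(fps: float, infer_s_per_frame: float) -> float:
    """Seconds of future gained per wall-clock second of prediction.

    > 0: the predictor runs ahead of real time (its lead keeps growing).
    = 0: break-even; inference exactly keeps pace with reality.
    < 0: the predictor falls further behind the present as it runs.
    """
    video_s_per_wall_s = (1.0 / infer_s_per_frame) / fps
    return video_s_per_wall_s - 1.0

# Hypothetical: 30 fps video, 10 ms per predicted frame -> runs ahead.
print(lead_growth(fps=30.0, infer_s_per_frame=0.010))  # ~ +2.33
# Hypothetical: 30 fps video, 50 ms per predicted frame -> falls behind.
print(lead_growth(fps=30.0, infer_s_per_frame=0.050))  # ~ -0.33
```

The break-even point is simply one frame of inference per frame interval, i.e. `infer_s_per_frame = 1 / fps`; every speedup beyond that is what would turn a video model into a real-time predictor in the sense above.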