There is something from the demo that still gnaws at me. Mira said that the 4o model reasons across voice (audio), text, and vision (video), yet I still don't see any indication of this for API usage and consumption whatsoever. First, I am asking: is this a model consolidation from an API perspective for creators, or is it something available internally only to ChatGPT-4o itself? I will use audio and video as examples. Text already has an iterative streaming feature, so that is the kind of feature set I am looking for, one that correlates with the demo and its output capabilities (see the streaming sketch at the end of this post).

Audio

Audio falls under Speech-to-Text (STT) and Text-to-Speech (TTS). For this concern we are talking about the 'whisper' model via the API docs, and more specifically STT, because that would be the i...
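For reference, this is the kind of text-side capability I mean, a minimal sketch using the current openai Python SDK (v1.x assumed; the model name, file name, and prompt are placeholders of my own, not anything from the demo):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Text: the existing iterative streaming feature, tokens arrive as chunks.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe what you can hear and see."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

# Audio today: a separate Whisper STT call, not 4o reasoning over raw audio.
# "speech.mp3" is a hypothetical local file.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```

In other words, the only way I can wire audio into 4o right now appears to be transcribing it with Whisper first and passing the resulting text in, which is a pipeline of separate models rather than the single model reasoning across modalities that the demo showed.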