ChatGPT-5 Test Series - Part 2: Exploring Multimodality
🕓 Read Time: ~4 minutes
First things first: What is “multimodality” in AI?
Simply put, multimodality means that an AI can work with more than one type of input and output, not just text. With previous ChatGPT models, input meant - voice, text, and images. Output meant voice, text, and images, plus video depending on your location. Plus a broad support for file uploads — spreadsheets, PDFs, text files, which makes working across formats seamless.
With GPT-5, multimodality now includes inputs in one or more of the following formats: text, images, voice, audio, and video (with restrictions in format and size). For example, you can upload a chart and ask for analysis, drop in an audio clip for transcription or insights, or even integrate video for context-aware reasoning. As for output, not much has changed, really.
As for output, not much has changed, really. And that’s worth underscoring. In my view, output capabilities in GPT-5 are basically the same as before. I haven’t noticed improvements in generated images compared to the April–May 2025 update. The same holds for video and other formats. Even OpenAI’s claim that file exports (like PowerPoint decks) are better doesn’t hold up in my tests. So, if you were hoping for a big leap forward on the output side, you might be underwhelmed... I certainly was.
OpenAI sells improved multimodality as a big shift. To me, it's sort of more like a few incremental upgrades, not nearly what I would have expected from an all new model.
Anyways, here's what I noticed during me extensive tests of the new features:
The Good: where multimodality shines
-
Seamless integration of text, images, audio, and video in one session.
-
Improved reasoning over images and visuals — it can analyze charts, diagrams, and images with much more precision than before.
-
Smart routing: GPT-5 can delegate tasks to specialized sub-models (like a built-in team), which boosts accuracy in research, coding, and technical problem-solving.
-
Developer-ready: API access allows for embedding multimodal features into apps, from voice-driven tools to visual learning assistants.
For educators, consultants, and designers, this unlocks genuinely new workflows.
The Bad: where it stumbles
-
Inconsistent outputs: I have experienced multiple instances of GPT-5 losing track of the original request in complex multimodal tasks.
-
Auto-routing quirks: sometimes it picks the wrong “mode” and delivers weaker results. Quick fix: select the model you want.
-
Usage caps: heavy users bump against tighter limits.
-
Opaque mechanics: little transparency around how routing and sub-models work. And for those who know me, I like my AI to be as transparent as possible (if such a thing even exists in AI).
The Ugly: what’s frustrating
-
More than once I found myself switching back to ChatGPT-4.1 and found it actually outperforming GPT-5. Mostly for structured, high-precision outputs (like tables).
-
Usage limites and format restrictions for audio and video inputs drive me nuts. I don't want to build in an intermediate step of converting my inputs. Back to Notebook.LM for some of this... still a lot faster.
🔍 My take so far
Testing GPT-5’s multimodality has been interesting. On the strength side, I’ve found it useful when working across formats, like combining text and visuals to plan say a new learning module. The weaknesses, though, are real: it can still derail if you push it too far. And usage limits always kick in when you least expect (and need) them.
For me, the key lesson is this: multimodality expands what’s possible, if it actually works. Just because it can handle images, audio, and video doesn’t mean it always handles them well.
Key Takeaway
GPT-5’s multimodality is a step forward. But it comes with quirks. I expect OpenAI to keep improving these features, so it's worth to keep testing them.
Til next time,
Elena
P.S. You saw me call ChatGPT’s PowerPoint output capabilities underwhelming. My go-to alternative is Gamma.App. Personally, I find it far more powerful for visualizing content and generating client presentations. I’m considering hosting a hands-on workshop to show how I use it in practice and share a solid workflow you can use for yourself. Would that be useful? If yes, just hit reply, I’m gauging interest before I set anything up.