← Back to all posts

ChatGPT-5 Test Series - Part 2: Exploring Multimodality

by Elena Jäger
Sep 21, 2025
Connect

🕓 Read Time: ~4 minutes

First things first: What is “multimodality” in AI?

Simply put, multimodality means that an AI can work with more than one type of input and output, not just text. With previous ChatGPT models, input meant - voice, text, and images. Output meant voice, text, and images, plus video depending on your location. Plus a broad support for file uploads — spreadsheets, PDFs, text files, which makes working across formats seamless.

With GPT-5, multimodality now includes inputs in one or more of the following formats: text, images, voice, audio, and video (with restrictions in format and size). For example, you can upload a chart and ask for analysis, drop in an audio clip for transcription or insights, or even integrate video for context-aware reasoning. As for output, not much has changed, really.

As for output, not much has changed, really. And that’s worth underscoring. In my view, output capabilities in GPT-5 are basically the same as before. I haven’t noticed improvements in generated images compared to the April–May 2025 update. The same holds for video and other formats. Even OpenAI’s claim that file exports (like PowerPoint decks) are better doesn’t hold up in my tests. So, if you were hoping for a big leap forward on the output side, you might be underwhelmed... I certainly was.

OpenAI sells improved multimodality as a big shift. To me, it's sort of more like a few incremental upgrades, not nearly what I would have expected from an all new model.

Anyways, here's what I noticed during me extensive tests of the new features:

The Good: where multimodality shines

  • Seamless integration of text, images, audio, and video in one session.

  • Improved reasoning over images and visuals — it can analyze charts, diagrams, and images with much more precision than before.

  • Smart routing: GPT-5 can delegate tasks to specialized sub-models (like a built-in team), which boosts accuracy in research, coding, and technical problem-solving.

  • Developer-ready: API access allows for embedding multimodal features into apps, from voice-driven tools to visual learning assistants.

For educators, consultants, and designers, this unlocks genuinely new workflows.

The Bad: where it stumbles

  • Inconsistent outputs: I have experienced multiple instances of GPT-5 losing track of the original request in complex multimodal tasks.

  • Auto-routing quirks: sometimes it picks the wrong “mode” and delivers weaker results. Quick fix: select the model you want.

  • Usage caps: heavy users bump against tighter limits.

  • Opaque mechanics: little transparency around how routing and sub-models work. And for those who know me, I like my AI to be as transparent as possible (if such a thing even exists in AI).

The Ugly: what’s frustrating

  • More than once I found myself switching back to ChatGPT-4.1 and found it actually outperforming GPT-5. Mostly for structured, high-precision outputs (like tables).

  • Usage limites and format restrictions for audio and video inputs drive me nuts. I don't want to build in an intermediate step of converting my inputs. Back to Notebook.LM for some of this... still a lot faster.

🔍 My take so far

Testing GPT-5’s multimodality has been interesting. On the strength side, I’ve found it useful when working across formats, like combining text and visuals to plan say a new learning module. The weaknesses, though, are real: it can still derail if you push it too far. And usage limits always kick in when you least expect (and need) them.

For me, the key lesson is this: multimodality expands what’s possible, if it actually works. Just because it can handle images, audio, and video doesn’t mean it always handles them well.

Key Takeaway
GPT-5’s multimodality is a step forward. But it comes with quirks. I expect OpenAI to keep improving these features, so it's worth to keep testing them. 

Til next time,
Elena

P.S. You saw me call ChatGPT’s PowerPoint output capabilities underwhelming. My go-to alternative is Gamma.App. Personally, I find it far more powerful for visualizing content and generating client presentations. I’m considering hosting a hands-on workshop to show how I use it in practice and share a solid workflow you can use for yourself. Would that be useful? If yes, just hit reply, I’m gauging interest before I set anything up.

On navigating the AI tool maze, even when you should know better
⏱️ Read time: ~3 min  Even I lose track. More often than I'd like to admit. I've been working with AI since the early days of generative AI. I've tested tools, built workflows, advised clients, and helped teams integrate AI into how they work. And yet, I keep finding myself mid-task asking: wait, which tool should I actually be using right now? It feels like a beginner moment. And a very human ...
Barely Prepared to Briefed: Monica's Meeting Prep Workflow
⏱️ Read time: ~3 min Let me tell you about Monica. Monica is a Leadership and Team Coach. Sharp, experienced, and genuinely good at what she does. But like most coaches I know, she had a meeting prep problem. Not a laziness problem. A time problem. Prep happened when it happened, which meant sometimes it was thorough, sometimes it was a quick scan on the way to the call, and sometimes it was mo...
Why The #1 AI Skill Has Nothing to Do With The Tool You're Using
⏱️ Read time: ~3 min  The AI space is moving fast. Faster than most of us can keep up with. Claude is doing things that felt impossible even a few months ago. AI agents are now within reach for non-technical people. New tools drop every week, each one more impressive than the last. And yet, most people still aren't getting real value from AI. Not because the tools aren't good enough. Because th...

Not signed up yet?
Do it right here:

© 2026 Future of Work
Privacy Policy Home

JOIN THE VIP LIST

Name of Free Resource

Get started today before this once in a lifetime opportunity expires. Get started today before this once in a lifetime opportunity expires.