
ChatGPT-5 Test Series - Part 2: Exploring Multimodality

by Elena Jäger
Sep 21, 2025

🕓 Read Time: ~4 minutes

First things first: What is “multimodality” in AI?

Simply put, multimodality means that an AI can work with more than one type of input and output, not just text. With previous ChatGPT models, input meant voice, text, and images. Output meant voice, text, and images, plus video depending on your location. On top of that came broad support for file uploads (spreadsheets, PDFs, text files), which made working across formats seamless.

With GPT-5, multimodality now includes inputs in one or more of the following formats: text, images, voice, audio, and video (with restrictions in format and size). For example, you can upload a chart and ask for analysis, drop in an audio clip for transcription or insights, or even integrate video for context-aware reasoning.
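For the technically curious: here is roughly what a mixed text-and-image request looks like through the API. This is only a minimal sketch using OpenAI's official Python client; the "gpt-5" model identifier and the chart URL are placeholders I made up for illustration, so swap in whatever your account actually exposes.

```python
# Minimal sketch: sending text plus an image in a single request.
# Assumes the official `openai` Python package; the model name and
# image URL below are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/revenue-chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same pattern extends to several images in one message: you simply append more content parts to the user turn.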

As for output, not much has changed, really. And that’s worth underscoring. In my view, output capabilities in GPT-5 are basically the same as before. I haven’t noticed improvements in generated images compared to the April–May 2025 update. The same holds for video and other formats. Even OpenAI’s claim that file exports (like PowerPoint decks) are better doesn’t hold up in my tests. So, if you were hoping for a big leap forward on the output side, you might be underwhelmed... I certainly was.

OpenAI sells improved multimodality as a big shift. To me, it feels more like a handful of incremental upgrades, not nearly what I would have expected from an all-new model.

Anyways, here's what I noticed during my extensive tests of the new features:

The Good: where multimodality shines

  • Seamless integration of text, images, audio, and video in one session.

  • Improved visual reasoning: it can analyze charts, diagrams, and photos with much more precision than before.

  • Smart routing: GPT-5 can delegate tasks to specialized sub-models (like a built-in team), which boosts accuracy in research, coding, and technical problem-solving.

  • Developer-ready: API access allows for embedding multimodal features into apps, from voice-driven tools to visual learning assistants (see the sketch after this list).

For educators, consultants, and designers, this unlocks genuinely new workflows.
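To make that "developer-ready" point concrete, here's a minimal sketch of a voice-driven flow: transcribe an audio clip, then hand the transcript to the chat model for analysis. Again, the file name and the "gpt-5" identifier are my own placeholders; whisper-1 is OpenAI's hosted transcription model, and your setup may differ.

```python
# Minimal sketch of a voice-driven tool built on the API.
# File name and the "gpt-5" model identifier are placeholders.
from openai import OpenAI

client = OpenAI()

# Step 1: speech to text (whisper-1 is OpenAI's hosted transcription model).
with open("meeting_clip.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: reason over the transcript with the chat model.
reply = client.chat.completions.create(
    model="gpt-5",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": "List the action items from this meeting:\n" + transcript.text,
    }],
)
print(reply.choices[0].message.content)
```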

The Bad: where it stumbles

  • Inconsistent outputs: I have experienced multiple instances of GPT-5 losing track of the original request in complex multimodal tasks.

  • Auto-routing quirks: sometimes it picks the wrong “mode” and delivers weaker results. Quick fix: select the model you want.

  • Usage caps: heavy users bump against tighter limits.

  • Opaque mechanics: little transparency around how routing and sub-models work. And those who know me know I like my AI to be as transparent as possible (if such a thing even exists in AI).

The Ugly: what’s frustrating

  • More than once I switched back to ChatGPT-4.1 and found it actually outperforming GPT-5, mostly for structured, high-precision outputs (like tables).

  • Usage limits and format restrictions for audio and video inputs drive me nuts. I don't want to build in an intermediate step of converting my inputs. Back to NotebookLM for some of this... still a lot faster.

🔍 My take so far

Testing GPT-5’s multimodality has been interesting. On the strength side, I’ve found it useful when working across formats, like combining text and visuals to plan, say, a new learning module. The weaknesses, though, are real: it can still derail if you push it too far. And usage limits always kick in when you least expect (and need) them.

For me, the key lesson is this: multimodality expands what’s possible, if it actually works. Just because it can handle images, audio, and video doesn’t mean it always handles them well.

Key Takeaway
GPT-5’s multimodality is a step forward. But it comes with quirks. I expect OpenAI to keep improving these features, so it's worth continuing to test them.

Til next time,
Elena

P.S. You saw me call ChatGPT’s PowerPoint output capabilities underwhelming. My go-to alternative is Gamma.App. Personally, I find it far more powerful for visualizing content and generating client presentations. I’m considering hosting a hands-on workshop to show how I use it in practice and share a solid workflow you can use for yourself. Would that be useful? If yes, just hit reply; I’m gauging interest before I set anything up.

