← Back to all posts

ChatGPT-5 Test Series - Part 2: Exploring Multimodality

by Elena Jäger
Sep 21, 2025
Connect

🕓 Read Time: ~4 minutes

First things first: What is “multimodality” in AI?

Simply put, multimodality means that an AI can work with more than one type of input and output, not just text. With previous ChatGPT models, input meant - voice, text, and images. Output meant voice, text, and images, plus video depending on your location. Plus a broad support for file uploads — spreadsheets, PDFs, text files, which makes working across formats seamless.

With GPT-5, multimodality now includes inputs in one or more of the following formats: text, images, voice, audio, and video (with restrictions in format and size). For example, you can upload a chart and ask for analysis, drop in an audio clip for transcription or insights, or even integrate video for context-aware reasoning. As for output, not much has changed, really.

As for output, not much has changed, really. And that’s worth underscoring. In my view, output capabilities in GPT-5 are basically the same as before. I haven’t noticed improvements in generated images compared to the April–May 2025 update. The same holds for video and other formats. Even OpenAI’s claim that file exports (like PowerPoint decks) are better doesn’t hold up in my tests. So, if you were hoping for a big leap forward on the output side, you might be underwhelmed... I certainly was.

OpenAI sells improved multimodality as a big shift. To me, it's sort of more like a few incremental upgrades, not nearly what I would have expected from an all new model.

Anyways, here's what I noticed during me extensive tests of the new features:

The Good: where multimodality shines

  • Seamless integration of text, images, audio, and video in one session.

  • Improved reasoning over images and visuals — it can analyze charts, diagrams, and images with much more precision than before.

  • Smart routing: GPT-5 can delegate tasks to specialized sub-models (like a built-in team), which boosts accuracy in research, coding, and technical problem-solving.

  • Developer-ready: API access allows for embedding multimodal features into apps, from voice-driven tools to visual learning assistants.

For educators, consultants, and designers, this unlocks genuinely new workflows.

The Bad: where it stumbles

  • Inconsistent outputs: I have experienced multiple instances of GPT-5 losing track of the original request in complex multimodal tasks.

  • Auto-routing quirks: sometimes it picks the wrong “mode” and delivers weaker results. Quick fix: select the model you want.

  • Usage caps: heavy users bump against tighter limits.

  • Opaque mechanics: little transparency around how routing and sub-models work. And for those who know me, I like my AI to be as transparent as possible (if such a thing even exists in AI).

The Ugly: what’s frustrating

  • More than once I found myself switching back to ChatGPT-4.1 and found it actually outperforming GPT-5. Mostly for structured, high-precision outputs (like tables).

  • Usage limites and format restrictions for audio and video inputs drive me nuts. I don't want to build in an intermediate step of converting my inputs. Back to Notebook.LM for some of this... still a lot faster.

🔍 My take so far

Testing GPT-5’s multimodality has been interesting. On the strength side, I’ve found it useful when working across formats, like combining text and visuals to plan say a new learning module. The weaknesses, though, are real: it can still derail if you push it too far. And usage limits always kick in when you least expect (and need) them.

For me, the key lesson is this: multimodality expands what’s possible, if it actually works. Just because it can handle images, audio, and video doesn’t mean it always handles them well.

Key Takeaway
GPT-5’s multimodality is a step forward. But it comes with quirks. I expect OpenAI to keep improving these features, so it's worth to keep testing them. 

Til next time,
Elena

P.S. You saw me call ChatGPT’s PowerPoint output capabilities underwhelming. My go-to alternative is Gamma.App. Personally, I find it far more powerful for visualizing content and generating client presentations. I’m considering hosting a hands-on workshop to show how I use it in practice and share a solid workflow you can use for yourself. Would that be useful? If yes, just hit reply, I’m gauging interest before I set anything up.

Does The AI Race Have a Finish Line?
🕓 Read Time: ~3.5 min   For a long time, ChatGPT was the tool. Everyone had a license, everyone was using it, and it felt like the obvious choice. Then Gemini and Claude came along and put ChatGPT to a test. Then Claude seemed to just take over in most categories. Better models, better features, Claude Code, Skills, Artifacts, Cowork before anyone else was doing it. A lot of people, myself...
Work slop is real. And it's getting embarrassing.
⏱️ Read Time: ~ 2 min This week I want to share something fun with you. Fun, yet totally relevant for each and everyone of us. The internet is full of terrible content. Things that sound good but make absolutely no sense at all. And it seems to be getting worse. Last week, I received two unsolicited messages that made me laugh out loud and then made me a little sad for the people who sent them....
AI is Wasting Your Time.
    ⏱️ Read time: ~2 - 3 min This week's edition might sting a little... Join me for a short experiment? You won't regret it! Ready? Let's go: Scroll through LinkedIn for five minutes right now. I'll wait. AI workflows. Automation stacks. "I built this in Claude Code in 47 minutes." Founders proudly showing off systems that run their entire business while they sleep. It looks impressive...

Not signed up yet?
Do it right here:

© 2026 Future of Work
Privacy Policy Home

JOIN THE VIP LIST

Name of Free Resource

Get started today before this once in a lifetime opportunity expires. Get started today before this once in a lifetime opportunity expires.