9.1 Multimodal AI — More Than Words

🎯 Core Goals

Introduce multimodal AI capabilities and their current limitations.
Clarify that understanding and generating media are very different skills — and not all models do both.

Modern AI can process images, audio, and video — not just text. But “understanding” an image and “creating” an image involve completely different systems. Don’t assume every LLM can do everything.

Beyond the Text Box

For most of the history of AI chat tools, the conversation ran through a single channel: you type, it types back. That has changed dramatically.

Modern AI systems can:

See images: Upload a photo and ask “What’s in this picture?” or “Is there anything wrong with this diagram?”
Read documents: Upload a PDF or spreadsheet and ask questions about it
Listen to audio: Transcribe speech, translate spoken language, answer voice questions
Watch video: Describe what’s happening, identify objects, summarize content
Generate images: Create illustrations from text descriptions (though this is usually a separate model)

This is called multimodal AI: working with multiple types of media.

⚠️ Not every LLM handles every modality — always verify the specific model's capabilities before building a workflow around them.
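One way to make that verification habit concrete is a small capability lookup. The sketch below is only an illustration: the model names and the `CAPABILITIES` table are hypothetical, and in practice you would fill such a table in from each provider's documentation.

```python
# Hypothetical capability table: the model names and entries are made up.
CAPABILITIES = {
    "text-only-model": {"input": {"text"}, "output": {"text"}},
    "vision-model": {"input": {"text", "image"}, "output": {"text"}},
}


def supports(model: str, *, needs_input: set[str], needs_output: set[str]) -> bool:
    """True only if the model accepts every required input type
    and can produce every required output type."""
    caps = CAPABILITIES.get(model)
    if caps is None:
        return False  # unknown model: assume nothing
    return needs_input <= caps["input"] and needs_output <= caps["output"]


# Understanding an image is not the same as generating one.
print(supports("vision-model", needs_input={"image"}, needs_output={"text"}))
print(supports("vision-model", needs_input={"text"}, needs_output={"image"}))
print(supports("text-only-model", needs_input={"image"}, needs_output={"text"}))
```

The first check passes and the other two fail, which mirrors the point of this lesson: input and output modalities have to be checked separately.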

Not All Models Are Equal

Here’s where many people get confused: just because a product interface can handle images doesn’t mean the underlying LLM can.

Some LLMs are text-only: They process text in, text out. Full stop.
Some are vision + text: They can understand images but can’t generate them
Some access separate image generation models: When you ask for an image, they call a different model (like DALL-E or Stable Diffusion) behind the scenes — it’s not the LLM itself drawing

When a chat interface generates an image for you, it is usually handing off to a specialized image model. In that setup the LLM is the orchestrator, not the artist.

“The AI drew this for me” is a common misconception. Most language models cannot natively generate images — they send a request to a separate image generation model. Understanding this distinction matters when you’re evaluating AI tools for a business use case.
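The hand-off described above can be sketched in a few lines. Everything here is a stand-in: `language_model`, `image_model`, and the keyword check in `wants_image` are hypothetical placeholders, not any real product's API, and the routing is deliberately crude.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    kind: str         # "text" or "image"
    produced_by: str  # which component did the final work
    content: str


def language_model(prompt: str) -> str:
    """Stand-in for a text-only LLM call (hypothetical)."""
    return f"[text answer to: {prompt}]"


def image_model(description: str) -> str:
    """Stand-in for a separate image generation model (hypothetical)."""
    return f"[image rendered from: {description}]"


def wants_image(prompt: str) -> bool:
    """Crude keyword check, purely for illustration."""
    return prompt.lower().startswith(("draw", "generate an image", "create an image"))


def chat(prompt: str) -> Reply:
    if wants_image(prompt):
        # The LLM only writes the description; a different model draws.
        description = language_model(f"Write an image prompt for: {prompt}")
        return Reply("image", "image_model", image_model(description))
    return Reply("text", "language_model", language_model(prompt))


print(chat("Draw a cat wearing a hat").produced_by)
print(chat("What is multimodal AI?").produced_by)
```

From the user's side both requests go into the same chat box, which is why the separate image model is easy to overlook.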

The Middleman Problem

You upload a PDF to your favorite LLM. It answers your questions about it. But what actually happened behind the scenes?

In many cases, the LLM didn’t “read” your PDF directly. The platform ran an intermediate tool first — something that extracts text from the PDF, transcribes the audio, or pulls the transcript from a video. The LLM then works with whatever that tool produced.

This matters because the intermediate step can lose or distort information:

PDFs: Tables lose their formatting, charts get converted to garbled text or skipped entirely, multi-column layouts get scrambled, images embedded in the document may be stripped out
YouTube / video links: The LLM may only see the title, description, and auto-generated transcript — not the actual video frames. You might expect it to comment on what’s shown on screen, but it only “knows” what was said
Audio files: Transcription errors, missed words, speaker confusion — whatever the transcription tool gets wrong becomes the LLM’s blind spot
Images inside documents: Diagrams, flowcharts, and graphs may be poorly described or completely ignored during the extraction step
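To see how a table can lose its meaning, here is a toy sketch. The table data and the `naive_extract` function are invented for illustration. Real extraction tools are more sophisticated, but they can fail in a similar way.

```python
# A small table as it appears in the document (illustrative data).
table = [
    ["Region", "Q1", "Q2"],
    ["North", "1 200", "950"],
    ["South", "", "1 100"],  # the Q1 cell is blank for South
]


def naive_extract(rows):
    """Toy extractor: joins the non-empty cells of each row with spaces,
    discarding the column boundaries."""
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)


flat = naive_extract(table)
print(flat)
# Region Q1 Q2
# North 1 200 950
# South 1 100
```

Once the column boundaries are gone, the last line is ambiguous: it could mean Q2 was 1 100, or Q1 was 1 and Q2 was 100. The LLM has no way to tell which reading matches the original table.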

What did the platform actually see? Before you trust an LLM’s response about an uploaded file, ask yourself: did the platform understand my file natively, or did it run a tool first? If there was an intermediate step, what might have been lost or distorted?
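The same question can be asked of a video link. The sketch below assumes a hypothetical intermediate tool, `extract_for_llm`, that passes along only the text fields. The `VideoUpload` class and the sample data are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VideoUpload:
    title: str
    description: str
    transcript: str
    frames: list[str] = field(default_factory=list)  # what is shown on screen


def extract_for_llm(video: VideoUpload) -> str:
    """Hypothetical intermediate tool: keeps the text fields, drops the frames."""
    return "\n".join([
        f"Title: {video.title}",
        f"Description: {video.description}",
        f"Transcript: {video.transcript}",
    ])


video = VideoUpload(
    title="Quarterly results",
    description="Walkthrough of the Q3 numbers",
    transcript="As you can see on this chart, the trend is clear.",
    frames=["bar chart: revenue fell 12% in Q3"],
)

llm_input = extract_for_llm(video)
print(llm_input)
print("fell 12%" in llm_input)  # False: the chart never reached the LLM
```

The speaker points at a chart, but the only thing that reaches the LLM is the sentence about pointing at it. Any answer about the actual numbers would be a guess.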

We’ll dig deeper into OCR (optical character recognition) and document processing later.

When the Middleman Gets It Very Wrong

Here’s a striking real-world example of how badly this can go.

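“Vegetative electron microscopy” is a nonsensical, non-existent scientific term, yet it has appeared in dozens of published academic papers. How? It appears to trace back to a digitization error: scanned papers were misread during text extraction, and the garbled text later ended up in the data used to train AI models.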

Many academic papers use a two-column layout. Normally, you read down the left column first, then the right. But some text extraction tools read horizontally, line by line across both columns, joining the end of each left-column line to the start of the right-column line beside it.

On one particular line, the left column ended with “…vegetative” and the right column started with “Electron microscopy…” — and just like that, a new fake term was born. It became a known “fingerprint” of AI-generated and paper-mill content, spreading through the scientific literature.
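The misreading is easy to reproduce. The sketch below uses a few illustrative lines of two-column text and compares the two reading orders. The function names are made up for this example, and real extraction tools work on page coordinates rather than tidy lists of lines.

```python
from itertools import zip_longest

# Each list holds the printed lines of one column (illustrative text).
left_column = [
    "techniques. In previous work on",
    "the reproductive and vegetative",
    "organs of flowering plants, we",
]
right_column = [
    "histological staining methods.",
    "Electron microscopy was used to",
    "examine the ultrastructure of",
]


def read_by_column(left, right):
    """Correct order: all of the left column, then all of the right."""
    return " ".join(left + right)


def read_across(left, right):
    """Faulty order: each printed line is read straight across both columns."""
    rows = zip_longest(left, right, fillvalue="")
    return " ".join(f"{l} {r}".strip() for l, r in rows)


correct = read_by_column(left_column, right_column)
garbled = read_across(left_column, right_column)

term = "vegetative electron microscopy"
print(term in correct.lower())  # False
print(term in garbled.lower())  # True
```

Neither column contains the fake term on its own. It only appears when the lines are stitched together in the wrong order.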

The lesson: basic layout-reading errors can slip into any data pipeline, including the ones that feed AI models. If the intermediate step misreads your document, the LLM receives garbled input and confidently works with it.

How a Two-Column Paper Gets Misread

The sample page below shows how "vegetative electron microscopy" was born. Its two columns are printed side by side, so each printed line holds one line from column 1 and one line from column 2.

Sample page: Morphological Analysis of Plant Cell Structures
J. Robertson, K. Tanaka, Dept. of Botany, Vol. 42, pp. 112-118

Column 1:
The structural characteristics of
plant tissues have been widely
studied using various microscopic
techniques. In previous work on
the reproductive and vegetative
organs of flowering plants, we
observed distinct patterns of
cell differentiation during the
early developmental stages. The
results indicate significant

Column 2:
Studies of cell wall composition
have previously relied on
conventional light microscopy and
histological staining methods.
Electron microscopy was used to
examine the ultrastructure of
the cell wall matrix. Samples
were prepared using standard
fixation protocols and embedded
in epoxy resin. Thin sections

Correct reading (down column 1, then down column 2): "…the reproductive and vegetative organs of flowering plants, we…"

Horizontal reading (straight across both columns): "…the reproductive and vegetative Electron microscopy was used to…"

What the LLM received on that line:

the reproductive and vegetative Electron microscopy was used to

📝 Key Concepts

Multimodal: Works with multiple media types — text, images, audio, video
Understanding vs. generating: Very different capabilities, often in different models
Not universal: Not every LLM handles every modality — verify before assuming
Image generation is usually separate: A different specialized model does the drawing
Intermediate processing: Many platforms run tools (transcription, OCR, format conversion) before the LLM sees your content — errors in this step become the LLM’s blind spots
Format loss: PDFs can lose table formatting, charts can be stripped, video may only contribute a transcript — check what actually reaches the LLM
The middleman can garble: Digitization errors (like reading two-column layouts wrong) can create entirely new, nonsensical terms — and the LLM will confidently use them