Multimodal AI

Models that natively process more than one input type — text, images, audio, or video.

Definition
Multimodal AI refers to models trained to understand, and in some cases generate, content across more than one modality. They can read a screenshot, describe a chart, transcribe audio, or analyze a short video, which lets agents act on what users actually see and say, not just on typed text.
Example
A QA agent takes a screenshot of a broken UI, reads the error text in the image, locates the offending React component, and proposes a fix — all in one pass.
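
To make the "one pass" idea concrete, here is a minimal sketch of how the screenshot and the instruction might be bundled into a single request. The `Part` schema, the `build_qa_request` helper, and the JSON layout are assumptions made for illustration and do not correspond to any specific vendor's API; real multimodal APIs differ in field names and structure.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class Part:
    """One piece of a multimodal message (hypothetical schema, not a real API)."""
    kind: str  # "text" or "image"
    data: str  # plain text, or base64-encoded bytes for an image
    media_type: str = "text/plain"


def build_qa_request(screenshot_bytes: bytes, instruction: str) -> str:
    """Bundle a screenshot and an instruction into one JSON request body."""
    parts = [
        Part(
            kind="image",
            data=base64.b64encode(screenshot_bytes).decode("ascii"),
            media_type="image/png",
        ),
        Part(kind="text", data=instruction),
    ]
    # Both modalities travel in the same message, so the model can relate
    # the error text in the image to the instruction without a separate
    # OCR step in between.
    return json.dumps({"parts": [vars(p) for p in parts]})


# Placeholder bytes stand in for a real PNG screenshot.
fake_screenshot = b"\x89PNG\r\n\x1a\n"
body = build_qa_request(
    fake_screenshot,
    "Read the error in this screenshot and name the React component likely at fault.",
)
```

The point of the sketch is the shape of the input: image and text arrive together as parts of one message, rather than the image being converted to text first and passed along separately.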