Basics of "Multi-modality"

Precise Definition of Multimodal Models

A multimodal model is a machine learning model that meets the following criteria:

  1. Input: Directly accepts inputs from two or more distinct modalities (e.g., text, image, audio, video) without preliminary conversion to a single modality.
  2. Architecture: Contains specific components or mechanisms designed to process each modality, as well as to integrate information across modalities.
  3. Joint Representation: Learns to create a unified or aligned representation of the input modalities within its internal layers.
  4. End-to-End Training: Can be trained end-to-end on multimodal data, allowing it to learn cross-modal interactions directly.
  5. Output: Capable of producing outputs that reflect an understanding or integration of multiple input modalities, though the output itself may be in a single modality.
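The five criteria above can be made concrete with a toy sketch. This is pure Python with no ML framework; every function, the 4-dimensional vectors, and the element-wise-sum fusion are illustrative assumptions, not any real model's design:

```python
def encode_text(text):
    """Modality-specific encoder (criterion 2): map text to a fixed-size vector."""
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch) / 1000.0
    return vec

def encode_image(pixels):
    """Modality-specific encoder for a flat list of pixel intensities."""
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels), min(pixels), float(len(pixels))]

def fuse(text_vec, image_vec):
    """Joint representation (criterion 3): combine aligned vectors element-wise."""
    return [t + v for t, v in zip(text_vec, image_vec)]

def model(text, pixels):
    """Accepts two modalities directly (criterion 1), emits one output (criterion 5)."""
    joint = fuse(encode_text(text), encode_image(pixels))
    return sum(joint)  # stand-in for a task head, e.g. a matching score

score = model("a cat", [0.1, 0.9, 0.5, 0.5])
```

In a real model the encoders and fusion layer would be learned jointly by backpropagating through the whole pipeline, which is what criterion 4 (end-to-end training) requires.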

Truly Multimodal Models

Based on this definition, here are some models that can be considered truly multimodal:

  1. GPT-4V: Processes both images and text directly.
  2. CLIP (Contrastive Language-Image Pre-training): While its primary output is separate embeddings, it learns joint representations of images and text.
  3. Flamingo: A few-shot learning model that can process interleaved visual and textual inputs.
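CLIP's core idea from item 2 can be sketched in a few lines: score image-text pairs by the cosine similarity of their embeddings. The vectors below are made up for illustration; a real CLIP produces them with a contrastively trained vision encoder and text encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

image_embedding = [0.9, 0.1, 0.0]          # pretend output of the image encoder
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1],   # pretend text-encoder outputs
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Zero-shot classification: the caption whose embedding is closest wins.
best = max(captions, key=lambda c: cosine(image_embedding, captions[c]))
```

Because both encoders map into the same embedding space, comparing modalities reduces to comparing vectors, which is what makes CLIP's representations "joint" even though its outputs are separate embeddings.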

Looking at the market today, we realize there are very few end-to-end trained multimodal models (though we do predict they will be the "end game"). GPT-4V and Claude Sonnet are perhaps the only ones, and their architectures are not open-sourced. The majority of open-source models, including the newly unveiled Llama 3.1 405B, are instead multimodal "systems": the base LLM is "chained" after an encoder/adapter/Conformer (e.g., in Llama 3.1 405B) at the input layer, with a generator at the output.

"Multimodal Models" in the colloquial sense

As explained above, today we can loosely define "multimodal models" as artificial intelligence systems designed to process, understand, and generate information across multiple types of data or "modalities." These modalities can include text, images, audio, video, and other forms of sensory input.

Architecture:

As a highly simplified summary, multimodal "systems" typically involve:

  • Separate encoders for each modality
  • Fusion mechanisms to combine information from different modalities
  • Decoder(s) for generating outputs
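The three components above can be wired into a minimal pipeline sketch. This illustrates the "system" pattern described earlier, where a text LLM sits behind a separate image encoder and an adapter that projects visual features into the LLM's input space. Every function here is a hypothetical stand-in, not a real API:

```python
def image_encoder(pixels):
    """Modality-specific encoder (e.g. a ViT); here, crude pooling into 3 features."""
    third = len(pixels) // 3
    return [sum(pixels[i * third:(i + 1) * third]) for i in range(3)]

def adapter(features, scale=0.5):
    """Fusion mechanism: project encoder features into 'soft tokens' the LLM consumes."""
    return [f * scale for f in features]

def llm(prompt_tokens, soft_tokens):
    """Stand-in for the frozen base LLM / decoder: reports what it received."""
    return (f"LLM saw {len(prompt_tokens)} text tokens "
            f"and {len(soft_tokens)} visual tokens")

out = llm(["describe", "this"], adapter(image_encoder([0.2] * 9)))
```

The key design point is that the base LLM itself is unchanged: only the encoder and adapter translate the new modality into something the LLM already understands, which is much cheaper than end-to-end multimodal training but limits how deeply the modalities interact.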

Key Characteristics:

  1. Multiple Input Types: Can handle various forms of data simultaneously or in combination.
  2. Integrated Processing: Able to fuse information from different modalities to form a comprehensive understanding.
  3. Cross-Modal Learning: Can learn relationships and patterns across different types of data.
  4. Flexible Output: Capable of generating responses or outputs in one or more modalities, potentially different from the input modality.
  5. Transfer Learning: Often able to apply knowledge gained in one modality to tasks in another.

Base Modalities

  • Text
  • Images
  • Audio
  • Video
  • Time-series data
  • Tactile/sensor data (futuristic)