DeepMind's JetFormer: Unified multimodal models without modeling constraints

Recent advances in training large-scale multimodal models are driven by efforts to eliminate modeling constraints and unify architectures across domains. Despite these advances, many existing models still rely on individually trained components, such as modality-specific encoders and decoders.

In a new paper, “JetFormer: An Autoregressive Generative Model of Raw Images and Text,” the Google DeepMind research team introduces JetFormer, an autoregressive decoder-only Transformer designed to model raw data directly. The model maximizes the likelihood of raw data without relying on any separately pre-trained components, and it can both understand and generate text and images.

The team summarizes JetFormer’s key innovations as follows:

Leveraging normalizing flows for image representation: The key insight behind JetFormer is its use of a powerful normalizing flow, called a “jet,” to encode images into a latent representation suitable for autoregressive modeling. Autoregressive modeling of raw image patches encoded as pixels has traditionally been impractical because of the complexity of their structure. JetFormer’s flow model addresses this by providing a lossless, invertible representation that integrates directly with the multimodal model. At inference time, the invertibility of the flow allows images to be decoded directly.

Guiding the model toward high-level information: To strengthen the model’s focus on important high-level information, the researchers employ two strategies.

Progressive Gaussian noise augmentation: Gaussian noise is added during training and gradually reduced, encouraging the model to prioritize overall features early in the learning process.

Managing redundancy in image data: JetFormer allows redundant dimensions in natural images to be selectively excluded from the autoregressive model. As an alternative, the team explores principal component analysis (PCA) to reduce dimensionality without sacrificing important information.
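To make the two ideas above more concrete, here is a minimal NumPy sketch of a generic affine coupling layer, a common building block of normalizing flows, together with a decaying noise schedule. It is an illustration under stated assumptions, not the architecture or schedule from the paper: the tiny untrained conditioner network, the function names, the linear decay, and the starting noise level are all hypothetical. The sketch only shows why a flow of this kind is exactly invertible and how a noise level can shrink over training.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_coupling_params(dim, hidden=16):
    """Random, untrained weights for a tiny conditioner (hypothetical)."""
    half = dim // 2
    return {
        "w1": rng.normal(scale=0.1, size=(half, hidden)),
        "w_scale": rng.normal(scale=0.1, size=(hidden, dim - half)),
        "w_shift": rng.normal(scale=0.1, size=(hidden, dim - half)),
    }


def _conditioner(x_a, params):
    # Predict a log-scale and shift for the second half from the first half.
    h = np.tanh(x_a @ params["w1"])
    log_scale = np.tanh(h @ params["w_scale"])  # bounded for stability
    shift = h @ params["w_shift"]
    return log_scale, shift


def coupling_forward(x, params):
    """Map x -> z; the first half passes through unchanged."""
    half = x.shape[-1] // 2
    x_a, x_b = x[..., :half], x[..., half:]
    log_scale, shift = _conditioner(x_a, params)
    z_b = x_b * np.exp(log_scale) + shift
    # Log-determinant of the Jacobian, used in the change-of-variables
    # formula: log p(x) = log p(z) + log|det dz/dx|.
    log_det = log_scale.sum(axis=-1)
    return np.concatenate([x_a, z_b], axis=-1), log_det


def coupling_inverse(z, params):
    """Map z -> x exactly, reusing the same conditioner."""
    half = z.shape[-1] // 2
    z_a, z_b = z[..., :half], z[..., half:]
    log_scale, shift = _conditioner(z_a, params)
    x_b = (z_b - shift) * np.exp(-log_scale)
    return np.concatenate([z_a, x_b], axis=-1)


def noise_std(step, total_steps, start_std=0.5):
    """Noise level decaying linearly to zero (assumed schedule and scale)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_std * (1.0 - frac)


# Usage: encode a batch of flattened "patches", then decode them again.
x = rng.normal(size=(4, 8))
params = make_coupling_params(8)
z, log_det = coupling_forward(x, params)
x_rec = coupling_inverse(z, params)
print(np.allclose(x_rec, x))
```

Because the first half of the input passes through unchanged, the inverse can recompute the same scale and shift and undo the transform exactly, which is the property that lets a flow serve as a lossless encoder and decoder. In a training loop, Gaussian noise with standard deviation `noise_std(step, total_steps)` would be added to the images before encoding.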

The team evaluated JetFormer on two challenging tasks: class-conditional image generation on ImageNet and web-scale multimodal generation. The results show that, when trained on large-scale data, JetFormer is competitive with less flexible models and performs well on both image and text generation tasks. Its ability to be trained end to end further underscores its flexibility and effectiveness.

JetFormer represents a significant step toward simpler multimodal architectures by unifying the modeling of text and images. Its use of normalizing flows and its emphasis on prioritizing high-level features point toward fully end-to-end generative modeling. The work lays a foundation for further exploration of unified multimodal systems and for a more integrated and efficient approach to AI model development.

The paper “JetFormer: An Autoregressive Generative Model of Raw Images and Text” is available on arXiv.

Author: Hekate He | Editor: Chain Zhang
