Multimodal language models can accept both text prompts and images as input.

Vision (Image-to-Text)

Pass both text and image content to a multimodal model in a single user message:
import { generateText } from 'ai';
import { openai, MODEL } from './client';

async function analyzeImage() {
  // A 1x1 pixel PNG, Base64-encoded (raw Base64, without a data: URL prefix).
  const base64Image =
    'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg==';

  const response = await generateText({
    model: openai(MODEL),

    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text: 'Describe this image.',
          },
          {
            type: 'image',
            image: base64Image,
            mimeType: 'image/png',
          },
        ],
      },
    ],
  });

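  // Print the model's text description of the image.
  console.log(response.text);
}

// Run the example and surface any request errors.
analyzeImage().catch(console.error);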
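
The message shape above can be assembled by a small helper when the image may arrive either as a URL or as Base64 data. The following is a minimal sketch, not part of the SDK: `buildVisionMessage` and the local `TextPart`, `ImagePart`, and `UserMessage` types are hypothetical stand-ins that mirror the example, and the rule "strings starting with http(s) are URLs" is an assumption.

```typescript
// Local stand-ins for the message shapes used in the example above.
// These are assumptions for illustration, not types imported from 'ai'.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string; mimeType?: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical helper: builds one user message from a prompt and an image source.
function buildVisionMessage(
  prompt: string,
  image: string,
  mimeType: string = 'image/png',
): UserMessage {
  // Assumption: http(s) strings are URLs; anything else is raw Base64 data.
  const isUrl = /^https?:\/\//i.test(image);

  // Only Base64 data gets an explicit mimeType in this sketch.
  const imagePart: ImagePart = isUrl
    ? { type: 'image', image }
    : { type: 'image', image, mimeType };

  return {
    role: 'user',
    content: [{ type: 'text', text: prompt }, imagePart],
  };
}

const fromUrl = buildVisionMessage('Describe this image.', 'https://example.com/cat.png');
const fromBase64 = buildVisionMessage('Describe this image.', 'iVBORw0KGgo=', 'image/png');
console.log(JSON.stringify([fromUrl, fromBase64], null, 2));
```

The returned object would then be passed as an element of the messages array in the generateText call.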

Parameters

When using vision models with generateText, the following parameters are supported:
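model (LanguageModel, required): The model instance to use for generation.
messages (Message[], required): Array of message objects representing the conversation history. For vision, include a content part with type: 'image' that carries the image data (a URL or a Base64 string) and its mimeType.
temperature (number): Controls sampling randomness, from 0.0 to 2.0. Lower values make output more predictable.
maxTokens (number): The maximum number of tokens to generate.
topP (number): Nucleus sampling. Restricts sampling to the smallest set of tokens whose cumulative probability reaches this value.
topK (number): Limits sampling to the K most probable tokens.
presencePenalty (number): Penalizes tokens that have already appeared, which encourages the model to move to new topics.
frequencyPenalty (number): Penalizes tokens according to how often they have appeared, which discourages repeated words.
seed (number): Seed for sampling. Requests deterministic generation; identical output is not guaranteed.
stopSequences (string[]): Custom sequences that stop the model from generating further text.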
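
The optional parameters above can be collected in one settings object and checked before a request is sent. The following is a minimal sketch under stated assumptions: `SamplingSettings` and `validateSettings` are hypothetical names, not SDK exports. Only the temperature range (0.0 to 2.0) comes from the list above; the topP and maxTokens checks are assumptions based on their definitions.

```typescript
// Hypothetical settings shape that mirrors the optional parameters listed above.
interface SamplingSettings {
  temperature?: number;
  maxTokens?: number;
  topP?: number;
  topK?: number;
  presencePenalty?: number;
  frequencyPenalty?: number;
  seed?: number;
  stopSequences?: string[];
}

// Hypothetical helper: returns a list of problems, empty when the settings look usable.
function validateSettings(settings: SamplingSettings): string[] {
  const problems: string[] = [];
  const { temperature, maxTokens, topP } = settings;

  // Range taken from the parameter list above.
  if (temperature !== undefined && (temperature < 0 || temperature > 2)) {
    problems.push('temperature must be between 0.0 and 2.0');
  }
  // Assumption: topP is a probability, so it must lie in [0, 1].
  if (topP !== undefined && (topP < 0 || topP > 1)) {
    problems.push('topP must be between 0 and 1');
  }
  // Assumption: a token limit is a positive whole number.
  if (maxTokens !== undefined && (!Number.isInteger(maxTokens) || maxTokens < 1)) {
    problems.push('maxTokens must be a positive integer');
  }
  return problems;
}

const settings: SamplingSettings = {
  temperature: 0.2,
  maxTokens: 256,
  stopSequences: ['\n\n'],
};
console.log(validateSettings(settings));
```

If the list comes back empty, the settings can be spread into the call next to model and messages, for example `generateText({ model, messages, ...settings })`.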