Overview
The Describe Image Content action analyzes an image and generates a textual description of its contents. It can be used to extract specific information from images, understand image contents for further processing, or answer queries about particular aspects of an image. The main input is an image URL, and advanced options like model selection and a query input let you tailor the output. The action outputs a natural language description focused on either the overall image or the queried elements. This makes it useful for tasks like extracting text from infographics or slides, categorizing images based on their contents, answering questions about images, and generating additional metadata for products or assets based on their visuals.
Usage Examples
Extracting data from a PowerPoint slide or infographic - This example shows how the Describe Image Content action can extract information from visually rich images like PowerPoint slides or infographics. When text is copied and pasted from these types of images, the relationships and context between different text elements are often lost. With this action, the model can infer the context directly from the image itself, and the query input can be used to focus the output on specific elements within the image.
Extracting detail from an image and categorizing the image - This example combines the Describe Image Content action with the Categorize Text action. First, text details are extracted from the input image. Then that extracted text is fed into the Categorize Text action to classify the image into different categories, similar to a "hot dog or not hot dog" type image classification task. This allows for semi-structured categorization of different images based on their contents.
Answering specific questions about images - A simple but powerful use case is using this action to answer specific questions about an image's contents, such as "Does this image contain a person?", "Does it contain a dog?", or "Does it contain a hot dog?". The model can analyze the image and provide a text response indicating if the specified object is present.
Describing attributes of a product in verbose detail - For product images, this action can generate detailed descriptions of all the product's visible attributes and characteristics. This verbose text output could then be used to create additional keywords, search queries, or contextual information about the product to improve findability in product information management (PIM) systems.
Generating a new image from an existing one - When combined with the Generate Image action (enterprise only), an existing image can be analyzed to extract a high-quality text description using Describe Image Content. That description can then be used as a prompt to generate an entirely new image in a specified style with the Generate Image action.
Inputs
Image URL - The URL of the publicly accessible image that needs to be described. This is the main required input.
https://example.com/product_image.jpg
Model Selector - An advanced input that allows you to choose which specific model or engine should be used for describing the image. Different models may have different capabilities in extracting details from images.
default
Query to answer - An advanced input that lets you provide additional context or a specific question about the image that you want the description to focus on.
What is the text on the product label?
Advanced Inputs
Base64 text - Instead of providing a public image URL, you can provide the image data encoded as Base64 text. This can be used if you cannot host the image publicly.
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...
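The following is a minimal Python sketch of one way to produce Base64 text in the same data URI form as the sample value above. The function name image_to_data_uri is hypothetical and is not part of the action; the sketch only assumes a local image file whose extension identifies its type.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Return the image file at `path` encoded as a Base64 data URI.

    Hypothetical helper for illustration; it is not part of the action itself.
    """
    # Guess the MIME type from the file extension (e.g. ".png" -> "image/png").
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"{path!r} does not look like an image file")

    # Base64-encode the raw bytes and decode to ASCII so the result is plain text.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Calling image_to_data_uri("product_image.png") returns a string beginning with data:image/png;base64, which can then be supplied as the Base64 text input.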
Outputs
The Describe Image Content action primarily outputs a textual description of the contents of the provided image. The nature and level of detail in this description can vary depending on the specific input parameters used.
If a specific "query to answer" is provided as an input parameter, the output description will be focused on answering that particular query or question related to the image contents. For example, if the query asks what text is shown on a product label in the image, the output will zero in on extracting and providing that specific text rather than a general description.
Troubleshooting
Uploading an image instead of providing a URL - The Describe Image Content action requires a publicly accessible image URL as input; uploading an image directly is not supported. When calling the action via the API, you can instead supply the image as Base64 text. If you are not familiar with Base64 encoding, it is usually easier to create a public URL by uploading the image to a file sharing service such as Dropbox or Google Drive.
Image description does not provide enough detail - If the output lacks specific details you need, the image quality may be too low or the selected model may not be well suited to extracting that detail. The models rely on image resolution to infer text and other granular elements. Using a higher-resolution image, a more capable model, and the "query to answer" input can help extract more specific details.
Image description is hallucinated - If the output seems completely unrelated to the actual image, it likely means the provided image URL is not accessible or scrapable by the model. Even if the URL is invalid, the model will still attempt to generate a response, leading to hallucinated or fabricated descriptions. Confirm that your image URL is public and accessible.
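One way to confirm that a URL is public and returns an actual image is to fetch its first few bytes without any login and compare them against well-known image file signatures. The Python sketch below illustrates that idea; the names sniff_image_type and check_image_url are hypothetical, and the check is only an approximation of what the model can or cannot access.

```python
import urllib.request

# Leading bytes ("magic numbers") of common image formats.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes):
    """Return the image format suggested by the leading bytes, or None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def check_image_url(url: str, timeout: float = 10.0):
    """Fetch the first bytes of `url` anonymously; return the detected format or None."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": "image-url-check"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return sniff_image_type(response.read(16))
    except (OSError, ValueError):
        # OSError covers network, HTTP, and timeout errors; ValueError covers malformed URLs.
        return None
```

A result of None suggests that the URL is unreachable or returns something other than raw image data, such as an HTML login or preview page.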
File sharing service URL issues - When using file sharing services like Dropbox or Google Drive to create a public image URL, the default share link may open the image in a preview or editor view instead of returning the raw image content. To get a direct image URL, modify the share link using the URL pattern that the service provides for accessing the raw image file.
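As an illustration for Dropbox specifically, share links commonly end in dl=0, which opens a preview page, and Dropbox supports a raw=1 query parameter that renders the file directly. The Python sketch below rewrites a share link on that assumption; the helper name dropbox_raw_url is hypothetical, other services use different patterns, and the result should be checked against the service's current documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dropbox_raw_url(share_url: str) -> str:
    """Rewrite a Dropbox share link so it requests the raw file (raw=1).

    Hypothetical helper; assumes Dropbox's raw=1 query parameter behaves as described.
    """
    parts = urlsplit(share_url)
    host = parts.hostname or ""
    if host != "dropbox.com" and not host.endswith(".dropbox.com"):
        raise ValueError("not a Dropbox link")

    # Drop any existing dl/raw flags but keep other parameters (e.g. rlkey).
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("dl", "raw")
    ]
    query.append(("raw", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Opening the rewritten link in a private browser window is a quick way to confirm that it shows only the image before passing it to the Image URL input.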
Related Actions
Generate Image - This enterprise-only action generates new images from a text prompt. By combining it with the Describe Image Content action, you can get a high-quality text description of an existing image, use that description as the prompt, and have a new image generated in a specific style. This allows you to create new images that are variations on an original image.
Categorize Text - This action can be combined with the Describe Image Content action to create semi-structured data from images. By first using Describe Image Content to extract text from an image, you can then feed that text into the Categorize Text action to classify the image into categories based on its contents. This allows you to systematically categorize and organize a set of images.