OpenAI model generates whimsical images from text prompts
Image credit: OpenAI
AI research laboratory OpenAI has shared a multimodal AI system, dubbed DALL.E, which combines natural language processing and computer vision to generate images from text captions.
DALL.E uses a 12-billion-parameter version of GPT-3, a model for generating extremely human-like text. Like GPT-3, DALL.E is a Transformer language model: a type of deep learning model used in natural language processing that does not have to process sequential data in order, which allows more parallelisation and reduces training times.
DALL.E could be described in very simple terms as doing for images what GPT-3 does for text: it generates images based on a short text prompt. Its portmanteau name was inspired by the artist Salvador Dalí and Pixar's fictional robot WALL.E.
The model was trained using a dataset of image-text pairs, with both text and compressed image received as a single stream of data containing up to 1,280 symbols (tokens). It was trained using maximum likelihood to generate all of these tokens one after the other. This process allows the model to generate images from scratch, but also to regenerate any rectangular region of an existing image in a way that is consistent with the caption.
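As a rough illustration of the training objective described above, the following Python sketch joins caption tokens and image tokens into one stream and computes the negative log-likelihood that maximum-likelihood training would minimise. The token values, the vocabulary size and the uniform `toy_model` are placeholders invented for this example; they are not OpenAI's actual tokeniser or network.

```python
import math

MAX_STREAM_LEN = 1280  # upper bound on stream length quoted in the article


def build_stream(text_tokens, image_tokens, max_len=MAX_STREAM_LEN):
    """Concatenate caption and image tokens into a single sequence."""
    stream = list(text_tokens) + list(image_tokens)
    if len(stream) > max_len:
        raise ValueError(f"stream has {len(stream)} tokens, limit is {max_len}")
    return stream


def negative_log_likelihood(stream, next_token_probs):
    """Sum of -log p(token_t | tokens before t) over the stream.

    Maximising the likelihood is the same as minimising this quantity.
    """
    total = 0.0
    for t, token in enumerate(stream):
        probs = next_token_probs(stream[:t])  # distribution over the vocabulary
        total -= math.log(probs[token])
    return total


VOCAB_SIZE = 16  # hypothetical toy vocabulary


def toy_model(prefix):
    """Placeholder model: ignores the prefix and predicts uniformly."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE


caption = [3, 7, 1]        # made-up text token ids
image = [12, 0, 5, 9, 15]  # made-up compressed-image token ids
stream = build_stream(caption, image)
loss = negative_log_likelihood(stream, toy_model)
print(len(stream), round(loss, 3))
```

Because every token is predicted from the tokens before it, the same setup can in principle be conditioned on a caption plus part of an image and asked to fill in the rest, which is the intuition behind regenerating a region of an existing image.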
OpenAI shared a post demonstrating DALL.E’s ability to generate and manipulate objects, with results including fantasy objects such as “a cube with the texture of porcupine”; “a snail made out of a harp”; or “an armchair in the shape of an avocado”. The generated images were of varying quality, ranging from near-photorealism to cartoons of various styles, to simple geometric shapes, to blocky voxel art.
“We’ve found that it has a diverse set of capabilities, including creating anthropomorphised versions of animals and objects, combining unrelated concepts in plausible ways, rendering text and applying transformations to existing images,” OpenAI said in the blog post.
The blog post acknowledges that DALL.E can occasionally be fragile, with even very minor changes to a prompt sometimes producing bizarre changes in the output. Understanding why this happens is a challenge, given the 'black box' nature of these models.
For each prompt, the model produces 512 images, which are filtered and narrowed down to the 32 best options by a second multimodal model called CLIP. CLIP was trained with 400 million image-text pairs scraped from the internet. The model has similarities to GPT-2 and GPT-3, learning to perform a range of tasks with high accuracy, including action recognition and optical character recognition.
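The filtering step can be pictured as a simple rerank: score every candidate against the caption and keep the highest-scoring ones. In the Python sketch below, `dummy_score` is a hypothetical stand-in for a CLIP-style image-text similarity score, and the integer "images" are dummy placeholders; only the figures 512 and 32 come from the article.

```python
import heapq


def rerank(candidates, caption, score_fn, keep=32):
    """Return the `keep` candidates with the highest score, best first."""
    return heapq.nlargest(keep, candidates, key=lambda img: score_fn(img, caption))


def dummy_score(image, caption):
    """Hypothetical scorer: prefers image ids close to the caption length."""
    return -abs(image - len(caption))


candidates = list(range(512))  # stand-ins for 512 generated images
caption = "an armchair in the shape of an avocado"
best = rerank(candidates, caption, dummy_score)
print(len(best), best[0])
```

Swapping `dummy_score` for a real similarity model would leave the selection logic unchanged, since the rerank only needs a function that maps an image and a caption to a number.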
However, testing found that CLIP still had room to improve on certain image recognition tasks, such as satellite imagery classification and lymph node tumour detection. OpenAI researchers also detected bias in the model: people under the age of 20 were the most likely to be categorised as non-human or criminal, and men were more likely than women to be categorised as criminals. Some label data in the dataset was also found to be inappropriately gendered.
OpenAI was founded in San Francisco in 2015 by a team including industrialist Elon Musk, with the mission of creating ethical AI that benefits all of humanity. In 2019, the laboratory received a $1bn investment from Microsoft.
The OpenAI blog post said: “We recognise that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyse how models like DALL.E relate to societal issues like economic impact on certain work processes and professions; the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.”
Additional details about the development of DALL.E are expected to be published in an upcoming paper.