Why the design industry should embrace the text-to-image creative revolution as a new medium
The explosion of new text-to-image models over the last 18 months, such as DALL-E 2 and Midjourney, has put the power to create amazing imagery and designs into the hands of anyone and everyone, and that’s admittedly frightening. In fact, some members of the design community have taken to social media to air their concerns that their jobs may be under threat.
But we should greet the arrival of this technology with excitement and inquisitiveness—the same sort of curiosity which often inspires us to think and work creatively. These technological advancements are not something to be feared; they’re something to be savoured and explored. Anything that can support and augment the creative process is a good thing.
And that’s why we owe it to our field to get on board now—and get behind the inherent possibility of this creative revolution.
How it works
Put simply, text-to-image models allow a user to write a text prompt that is then transformed into an image. To build one, you gather millions of images and feed them into a neural network so that it learns what those images contain. You then link this with another neural network that understands language. The text-to-image model then learns the intersection of these two data sets. As part of this process, it also learns things such as style and aesthetics, as well as how someone might caption an image.
Writing these prompts, sometimes referred to as prompt engineering, has evolved rapidly as users find new words and phrases that add different aesthetic qualities to the outcome. There are prompt creation spreadsheets, numerous articles, prompt marketplaces, and there’s even an 82-page Prompt Book! I prefer to think of this skill as prompt crafting: building and refining your prompts until you craft an outcome you’re happy with.
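As a rough illustration of prompt crafting, the sketch below assembles a prompt from a subject plus a list of aesthetic modifiers, so each round of refinement means adding or swapping a modifier. The function name `craft_prompt` and the example modifiers are hypothetical and not tied to any particular tool.

```python
def craft_prompt(subject, modifiers=()):
    """Join a subject with optional style modifiers into one prompt string.

    Hypothetical helper for illustration; real tools accept free text and
    do not require this structure.
    """
    # Drop empty entries and stray whitespace so refinements stay tidy.
    parts = [subject.strip()] + [m.strip() for m in modifiers if m.strip()]
    return ", ".join(parts)


# First attempt: the bare idea.
draft = craft_prompt("old man discovers the meaning of life")

# Refinement: same subject, with assumed example modifiers added.
refined = craft_prompt(
    "old man discovers the meaning of life",
    ["oil painting", "dramatic lighting"],
)

print(draft)
print(refined)
```

Keeping the subject separate from the modifiers makes it easy to compare how one idea changes as you try different styles.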
“Open” access encourages adoption, for our collective benefit
Stability AI, the company behind Stable Diffusion, released its model for the world to do with as it pleases. This has been seen as an irresponsible move by some, who fear that it will lead to people making “bad” images, and welcomed by others as a truly open-source approach to text-to-image models. While the debate continues, the model remains out there.
Understanding its limitations
Interestingly, we have already seen certain aesthetic qualities and idiosyncrasies in the output of these various tools. When these tools started to appear, my creative colleagues often commented that you can easily tell which text-to-image model was used to create which output. Midjourney, for example, has been trained on a dataset of images that are ‘aesthetically pleasing’ (which does raise the question of who decides what is an aesthetically pleasing image), while DALL-E produces more photorealistic and varied image styles. Midjourney’s aesthetic is consequently more identifiable, both tonally and figuratively. Since the launch of Midjourney version 5, the level of photorealism has become stunning.
The reality remains that these differences are a direct result of the image datasets that the model is trained on, combined with the way in which rules are applied to the resulting image construction. In other words, our human touch—going back to how we designed and trained each tool—has created distinctions.
The experiment below shows the results for “Old man discovers the meaning of life.” The difference between the two is quite clear, with Midjourney providing a far more stylised and identifiable aesthetic. The differences are less pronounced now, due to Midjourney’s increased photorealism.
The output of these tools is inextricably linked to the dataset of images they were trained on, so if you train a model on a set of images that is biased in some way, you will get output that is biased. Users have put this to the test with DALL-E, for example, by entering a simple prompt to see what it generates. When prompted with “CEO,” DALL-E previously responded with white, middle-aged men, and “Nurse” returned results of women, highlighting the hidden (unconscious) bias within the dataset. These biases are still present: the same prompt will return the same result.
At the time, OpenAI applied what it described as a “technique” to include people of diverse backgrounds, but this appears to have been achieved by secretly appending “black” and “women” to the user’s prompt. This ‘fix’ appears to have been removed, but the bias in the data is still there nonetheless.
Prompt: “pixel art of a person holding a text sign that says.” Image credit: Richard x DALL-E
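To see why a prompt that ends in “a text sign that says” can expose this kind of intervention, here is a minimal sketch. It assumes, purely for illustration, that a service appends one term to the end of the user’s prompt before generating the image. The function `append_hidden_term` and the list `HIDDEN_TERMS` are hypothetical and do not describe OpenAI’s actual implementation.

```python
import random

# Assumed example terms, taken from the behaviour described above.
HIDDEN_TERMS = ["black", "women"]


def append_hidden_term(user_prompt, rng):
    """Return the prompt with one hidden term added to the end (hypothetical)."""
    return f"{user_prompt} {rng.choice(HIDDEN_TERMS)}"


rng = random.Random(0)  # seeded only to make the sketch repeatable
user_prompt = "pixel art of a person holding a text sign that says"
final_prompt = append_hidden_term(user_prompt, rng)

# The model sees a prompt that ends with the hidden term, so the sign in the
# image is likely to display it, which is how users could notice the change.
print(final_prompt)
```

Because the appended term lands exactly where the sign’s text would go, the generated image can end up displaying the word the user never typed.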
Validating the quality of the input images is no easy task, as each platform must be trained on an insanely large number of images. Google’s Parti, for example, is a 20-billion-parameter model trained on billions of images. That said, more work needs to be done to reduce biases within these datasets so that the output is a fair reflection of society, rather than just quietly tweaking the user’s prompt to make the output appear more inclusive.
Something new in our toolbox
Text-to-image models are one expression of AI: a new medium we can work with, as opposed to merely a technology. And just as has always been the case, a new medium creates new ways of working by augmenting the creative process, not replacing it. You need well-crafted prompts and rounds of refinement to create something with both beauty and meaning. In that way, text-to-image tools can’t replace human creativity; rather, they empower our collective creativity. And I, for one, appreciate empowerment.