Text-to-image generation is a fascinating and challenging task that aims to create realistic and diverse images from natural language descriptions. This task has many potential applications in fields such as design, art, advertising, and education. However, text-to-image generation also poses many technical difficulties, such as modeling complex and multimodal data, capturing long-range dependencies, and ensuring coherence and consistency between text and image.
Recently, a new paradigm for text-to-image generation has emerged, based on latent diffusion models (LDMs). Diffusion models are a class of generative models that learn to transform samples from a simple prior distribution (such as Gaussian noise) into samples from a complex data distribution (such as natural images) through a series of learned denoising steps. LDMs make this process efficient by running the diffusion in the compressed latent space of a pretrained autoencoder rather than directly in pixel space. Diffusion models have shown impressive results in image generation, matching or surpassing the quality of generative adversarial networks (GANs) and variational autoencoders (VAEs).
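To make the forward and reverse processes concrete, here is a minimal, self-contained DDPM-style sketch in PyTorch. It is illustrative only: the linear beta schedule and the placeholder noise predictor are assumptions for the sketch, not the actual machinery of Stable Diffusion or DeciDiffusion.

import torch

# Toy DDPM-style diffusion (illustrative only; not Stable Diffusion or DeciDiffusion code)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_diffuse(x0, t):
    # Closed-form sample of x_t given x_0: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

@torch.no_grad()
def reverse_step(model, x_t, t):
    # One denoising step: the network predicts the added noise, which is then removed
    eps = model(x_t, t)                    # model: a trained noise predictor (a U-Net in practice)
    a, a_bar = alphas[t], alpha_bars[t]
    mean = (x_t - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + betas[t].sqrt() * z

# Sampling walks the chain backwards from pure noise
x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    x = reverse_step(lambda xt, t: torch.zeros_like(xt), x, t)  # dummy predictor, for illustration only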
One of the most prominent LDMs for text-to-image generation is Stable Diffusion, an open-source model developed by the CompVis group and released in collaboration with Stability AI and Runway. Stable Diffusion uses a CLIP Transformer encoder to turn the text prompt into a sequence of token embeddings, which condition the diffusion process through cross-attention. Stable Diffusion can generate high-quality images at 512x512 resolution from diverse and complex text prompts, such as “a cat wearing a hat” or “a painting of a woman in a red dress”.
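As a rough illustration of that conditioning step, the snippet below encodes a prompt with the CLIP text encoder that Stable Diffusion v1 uses; the per-token hidden states it produces are what the diffusion U-Net cross-attends to.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 conditions on CLIP ViT-L/14 text embeddings
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a cat wearing a hat"], padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    # Shape: (batch, 77, 768); the U-Net cross-attends to these per-token states
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state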
However, Stable Diffusion also has some limitations. First, it is computationally expensive to train and deploy: training reportedly cost on the order of 600,000 US dollars on large subsets of the LAION image-text dataset, and generating a single image requires tens of denoising steps even with fast samplers, which results in high latency and low throughput. Second, it suffers from some quality issues, such as distorted fine details (hands and faces are notorious) and semantic inconsistency on compositional prompts.
To address these challenges, Deci, a leading AI company that specializes in optimizing deep learning models for inference efficiency, has developed DeciDiffusion, a novel text-to-image LDM that is faster and better than Stable Diffusion. DeciDiffusion is based on several architectural innovations and advanced training techniques that enable it to achieve equal or higher quality than Stable Diffusion in 40% fewer iterations. Combined with Deci’s inference SDK, Infery, DeciDiffusion can generate images in under a second on affordable NVIDIA A10G GPUs, which is 3 times faster than Stable Diffusion.
DeciDiffusion’s main contributions are as follows:
- It uses AutoNAC, Deci’s proprietary neural architecture search engine, to design an optimal architecture for the diffusion network. AutoNAC automatically searches for the best combination of convolutional layers, residual blocks, attention modules, normalization methods, and activation functions that maximize the model’s performance while minimizing its computational cost.
- It introduces a novel attention mechanism called DeciAttention, which is more efficient and effective than the standard self-attention used by Stable Diffusion. DeciAttention reduces the computational complexity of attention from quadratic to linear in sequence length by using hashing and clustering techniques, and it improves attention quality with dynamic routing and gating mechanisms that adapt to the input data (a general sketch of the linear-attention idea appears after this list).
- It employs a new training strategy called DeciTraining, which consists of two stages: pre-training and fine-tuning. In the pre-training stage, DeciDiffusion is trained on a large-scale dataset of 400 million image-text pairs using contrastive learning and knowledge distillation. In the fine-tuning stage, DeciDiffusion is further trained on a smaller dataset of 40 million image-text pairs using adversarial learning and style transfer. This strategy allows DeciDiffusion to learn both general and specific features from different data sources.
- It leverages Deci’s inference SDK, Infery, to optimize DeciDiffusion for deployment on various hardware platforms. Infery applies various techniques such as quantization, pruning, sparsification, fusion, and compilation to reduce the model’s size, latency, memory usage, and power consumption.
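Deci has not published DeciAttention’s internals in this post, so the following is only a generic sketch of the linear-attention idea referenced above, not DeciAttention itself. Applying a positive feature map to queries and keys (here ELU+1, following the “linear transformers” literature) lets the key-value product be computed once, reducing cost from quadratic to linear in sequence length.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # O(n) attention via a kernel feature map (illustrative; not DeciAttention).
    # q, k, v: (batch, seq_len, dim). Softmax attention costs O(n^2) in seq_len;
    # here phi(q) @ (phi(k)^T v) is computed in O(n * dim^2) instead.
    phi_q = F.elu(q) + 1                              # positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', phi_k, v)       # sum over positions: phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)

# Example: 2 sequences of length 1024 with 64-dim heads
q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
out = linear_attention(q, k, v)                       # (2, 1024, 64)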
DeciDiffusion has demonstrated remarkable results in text-to-image generation. It can generate realistic and diverse images from various domains such as animals, landscapes, portraits, cartoons, logos, and abstract art. It can also handle complex and creative text prompts such as “a dragon playing chess with a unicorn” or “a logo for a company called Deci that specializes in AI”. Moreover, it can generate images of up to 512x512 resolution with fine details and sharp edges.
DeciDiffusion’s superior performance has been verified by several quantitative and qualitative evaluations. For instance, DeciDiffusion outperforms Stable Diffusion on standard metrics such as FID (where lower is better), Inception Score (IS), and CLIP score. Furthermore, DeciDiffusion has received more positive feedback than Stable Diffusion from human evaluators on aspects such as realism, diversity, coherence, and overall preference.
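As an example of what one of these metrics captures, the snippet below computes a CLIP-based image-text similarity for a single generated image. This is a minimal sketch of the general recipe, not Deci’s evaluation harness, and the image filename is hypothetical.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# CLIP score: cosine similarity between image and prompt embeddings
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")   # hypothetical generated image
prompt = "a dragon playing chess with a unicorn"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_score = (img_emb * txt_emb).sum().item()   # higher means better text-image match
print(f"CLIP score: {clip_score:.3f}")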
DeciDiffusion is a breakthrough in text-to-image generation that opens up new possibilities for generative AI applications. By combining state-of-the-art LDMs with cutting-edge optimization techniques, DeciDiffusion offers a fast and high-quality solution for transforming text into images. DeciDiffusion is also an example of Deci’s vision to democratize AI by making it more accessible and affordable for everyone.
DeciDiffusion is available as a public model on Hugging Face, where you can try it out for yourself. You can also check out Deci’s blog post for more details and examples of DeciDiffusion’s amazing capabilities.
How to use DeciDiffusion
To use DeciDiffusion, you need to install the following Python packages:
pip install diffusers transformers torch
Then, you can use the following code snippet to load the model and generate an image from a text prompt:
from diffusers import StableDiffusionPipeline
import torch

# Run on a GPU if available; the float16 weights below are intended for GPU inference
device = 'cuda' if torch.cuda.is_available() else 'cpu'

checkpoint = "Deci/DeciDiffusion-v1-0"

# Load the pipeline with DeciDiffusion's custom pipeline code, then swap in its efficient U-Net
pipeline = StableDiffusionPipeline.from_pretrained(checkpoint, custom_pipeline=checkpoint, torch_dtype=torch.float16)
pipeline.unet = pipeline.unet.from_pretrained(checkpoint, subfolder='flexible_unet', torch_dtype=torch.float16)
pipeline = pipeline.to(device)

# Generate an image from a text prompt
img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars']).images[0]
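You can then save the result, and the standard StableDiffusionPipeline arguments apply for tuning generation. The step count and guidance scale below are illustrative values, not Deci’s recommended settings; check the model card for what works best with DeciDiffusion:

img.save('astronaut.png')
# num_inference_steps and guidance_scale are standard pipeline arguments; these values are illustrative
img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars'], num_inference_steps=30, guidance_scale=7.5).images[0]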
Demo link
You can also try out DeciDiffusion online using the Hugging Face Spaces demo link. Just enter your text prompt and click on the “Generate” button to see the result. You can also download the generated image or share it with others.
I hope you enjoy using DeciDiffusion and exploring its possibilities. If you have any questions or feedback, please feel free to comment.