This content originally appeared on DEV Community and was authored by krish
Since the COVID-19 pandemic, "Artificial Intelligence" has gained significant popularity, and you might have heard of Generative AI and how it has been used for both illegal and artistic purposes.
In this blog, we will explore what Generative AI is, delve into stable diffusion, and learn how to set it up and use it to its maximum potential.
What is Generative AI
Generative AI refers to algorithms that can create new content, such as images, text, and music, by learning patterns from existing data. Unlike traditional AI, which focuses on recognizing patterns and making decisions, generative AI models can produce original, realistic outputs, pushing the boundaries of creativity and innovation.
What is Stable Diffusion
Stable diffusion is a concept in generative modeling, particularly within the field of Generative AI. It refers to a class of algorithms designed to generate data (such as images) by iteratively refining random noise into a coherent and meaningful output.
This process involves a sequence of steps where the model gradually denoises the initial random input, guided by a learned probability distribution of the target data. This technique is particularly effective at producing high-quality and diverse outputs, making it a popular choice for applications requiring realistic image synthesis, such as digital art creation, video game design, and even scientific visualization.
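To make that loop a little more concrete, here is a deliberately simplified toy sketch. Real samplers (DDPM, DDIM, DPM++ and friends) use carefully derived update rules and a trained noise-prediction network; the code below only shows the general shape of the process, and the update rule is made up for clarity.

```python
# Toy illustration of iterative denoising: start from pure noise and repeatedly
# subtract the noise a model predicts. The update rule here is invented for
# clarity; real diffusion samplers are mathematically more involved.
import torch

def toy_denoise(noise_predictor, steps: int = 30, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)                        # begin with random noise
    for t in reversed(range(steps)):              # walk the noise schedule backwards
        predicted_noise = noise_predictor(x, t)   # model guesses the noise left in x
        x = x - predicted_noise / steps           # remove a small fraction of it
    return x                                      # ideally now a coherent latent
```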
We will be looking into the image generation side of this generative model, using stable-diffusion-webui for a more user-friendly interface.
Setting up Stable Diffusion
Download the zip from here. You can find it on the GitHub repository under the name sd.webui.zip.

- Download the sd.webui.zip archive.
- Extract the zip.
- Run update.bat first. This updates the webui and installs all the necessary libraries as well as an example checkpoint for Stable Diffusion.
- Run run.bat to start the web server and head over to http://127.0.0.1:7860/ to experience your newly set up Stable Diffusion.

Every time you want to run the SD webui, use run.bat only.
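The webui is all you need for the rest of this post, but if you ever prefer a scripted setup, roughly the same pipeline can be driven from Python with Hugging Face's diffusers library. The sketch below is an illustration under that assumption, not part of the webui setup; the model id is a placeholder you may need to swap for a checkpoint you actually have.

```python
# Minimal sketch of text-to-image generation with the diffusers library.
# This is an alternative to the webui, not part of its setup; the model id
# below is illustrative -- substitute any Stable Diffusion 1.5-style checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")  # use .to("cpu") and drop the fp16 dtype if you have no GPU

image = pipe(
    prompt="green forest, fantasy, detailed eyes, 1girl, elf ears",
    negative_prompt="worst quality, low quality, bad anatomy",
    num_inference_steps=30,
).images[0]
image.save("sample.png")
```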
Example
To understand Stable Diffusion, let us start with an example that includes all the technical details, and then break it down slowly. (If you don't like anime, don't worry: these prompts can produce realistic humans too with the right checkpoint.)
Positive Prompt: (overhead view), (green forest), dynamic angle, ultra-detailed, illustration, (fantasy:1.4), (detailed eyes), (cowboy shot:1.3), (1girl:1.4, elf ears), (blue high elf clothes:1.3), (<big> butterfly wings:1.4), (shiny silver hair:1.2 , long hair length, french braid:1,2, hair between eyes, white flower hair clip:1.22), ((blue starry eyes)), (standing), (melancholic lighting), (lonely atmosphere:1.3), (rain falling:1.3), (slight mist:1.33), white fog, volumetric volume, (camera angle: cowboy shot), (framing: cowboy shot), (focus: girl), (portrait orientation), hands behind the back,
Negative Prompt: (worst quality, low quality:1.4), monochrome, zombie, (interlocked fingers:1.2), bad hands, bad anatomy, extra fingers,
Checkpoint: counterfeitV30_v30
Sampling Steps: 30 | DPM++ 2M Karras
Seed: 3079411257
Resolution: 512x512
Hi Res: Latent, 2x Upscaling, 0.5 strength, and 0 steps.
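As an aside, if you start the webui with the optional --api flag (for example by adding it to the COMMANDLINE_ARGS line in webui-user.bat), the same settings can be submitted programmatically. The field names below follow the AUTOMATIC1111 txt2img API as I understand it and may differ between versions, so verify them against the interactive docs at http://127.0.0.1:7860/docs on your install.

```python
# Hedged sketch: reproducing the example above through the webui's optional API.
# Requires the webui to be launched with --api; field names are assumptions
# based on the AUTOMATIC1111 API and may differ between versions.
import base64
import requests

payload = {
    "prompt": "(overhead view), (green forest), dynamic angle, ultra-detailed, ...",  # full positive prompt goes here
    "negative_prompt": "(worst quality, low quality:1.4), monochrome, zombie, ...",   # full negative prompt goes here
    "steps": 30,
    "sampler_name": "DPM++ 2M Karras",  # newer versions may split this into "DPM++ 2M" plus "scheduler": "karras"
    "seed": 3079411257,
    "width": 512,
    "height": 512,
    "cfg_scale": 7,
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
image_b64 = response.json()["images"][0]     # images come back base64-encoded
with open("example.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```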
Let us break down what exactly is going on in this example!
Basic Terminologies
Checkpoints
A checkpoint refers to a saved state of the model at a particular point during training. Checkpoints capture the model’s parameters and configurations, allowing users to resume training from that point or to use the model for generating images. They are crucial for preserving progress, experimenting with different training stages, and fine-tuning the model to achieve specific results. By using checkpoints, users can leverage pre-trained models or revert to earlier states if needed, making them an essential tool for managing and optimizing machine learning workflows.
Think of checkpoints as pre-trained models that will be used for generating images. Each checkpoint is trained on a different kind of dataset. Anime checkpoints like counterfeit, meinapastel, and ponyXL are trained on anime images, whereas models like realistic vision, epicrealism, and realisticstockphotos are trained on realistic images.
I have created a page that lists most of the well-known checkpoints, LoRAs, and embeddings. Check out the list there: link
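If you script against the webui's API (same --api assumption as above), checkpoints can also be listed and switched programmatically. Endpoint and field names here are assumptions based on the AUTOMATIC1111 API, so double-check them on your install's /docs page.

```python
# Hedged sketch: listing and switching checkpoints through the webui API.
# Endpoint and field names are assumptions; confirm at http://127.0.0.1:7860/docs.
import requests

BASE = "http://127.0.0.1:7860"

# Checkpoints the webui currently knows about (files in models/Stable-diffusion).
models = requests.get(f"{BASE}/sdapi/v1/sd-models").json()
print([m["model_name"] for m in models])

# Ask the webui to load a different checkpoint; the name must match the list above.
requests.post(f"{BASE}/sdapi/v1/options",
              json={"sd_model_checkpoint": "counterfeitV30_v30"})
```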
Positive Prompt
A positive prompt in Stable Diffusion serves as the guiding instruction for the model, detailing the attributes and features that should be present in the generated image. By providing specific and descriptive text, you can steer the output toward what you have in mind, enhancing the model’s ability to produce images that align with the given context or theme.
Our example packs in a lot of detail to define exactly what we need: (overhead view), (green forest), dynamic angle, ultra-detailed, illustration, (fantasy:1.4), (detailed eyes), (cowboy shot:1.3), (1girl:1.4, elf ears), (blue high elf clothes:1.3), (<big> butterfly wings:1.4), (shiny silver hair:1.2 , long hair length, french braid:1,2, hair between eyes, white flower hair clip:1.22), ((blue starry eyes)), (standing), (melancholic lighting), (lonely atmosphere:1.3), (rain falling:1.3), (slight mist:1.33), white fog, volumetric volume, (camera angle: cowboy shot), (framing: cowboy shot), (focus: girl), (portrait orientation), hands behind the back,
Things you need to understand:
- You can write any prompt like the normal text you would send to an AI model like ChatGPT, but that won't always work as intended.
- Using brackets puts extra emphasis on your words. E.g. (green forest) specifies that we want our background to be a green forest. Notice how I have used ((blue starry eyes)) with 2 brackets; that adds even more weight to my words. You can add as many brackets as you want, but it won't be as helpful as adding real numerical weights.
- Adding numerical weights helps the model understand priority. (fantasy:1.4) adds 1.4 as the weight of the word fantasy. By default, each word separated by commas has a weight of 1. You should play around 1.1 to 1.6; going above this makes the model forget about the other prompt tokens (prompt tokens are the comma-separated individual values). A small helper for keeping this syntax consistent is sketched below, after this list.
- Be more descriptive with <> angular brackets. In (<big> butterfly wings:1.4), big is an adjective that adds more information to the scene. It also works without angular brackets, but that takes the model's attention away from other parts of the prompt. An example of this could be something like (A big room <white walls with wall clock>, open window <wooden frame window with a waterfall and castle outside>); notice how I added description without making it a true prompt token.
- Using brackets for combining descriptors is good practice: use a single bracket and put all details of a specific part of the body, clothes, or environment in one prompt token, e.g. (detailed eyes, blue starry eyes), so the model doesn't have to work with multiple tokens for the same area.
- Order of elements matters. If you want to add something major to the prompt, it is recommended to place it at the start of your prompt chain. Usually the background, clothes, and pose should be defined at the start of your prompt.
- Avoid overloading prompts. Adding instructions for your image is fine, but adding too much detail works against you. Keep it around 30 to 150 prompt tokens (you can see the count in the top right corner of your prompt box).
- Context matters. I have seen a lot of AI artists adding 4K, 8K, illustration, hyper detailed, to their prompts, and it doesn't actually work; it mostly gives personal satisfaction and adds a little variance to the image. Adding fantasy, detailed eyes, detailed face, actually works though.
Experiment with your weights and prompt to get a good image.
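If you assemble prompts in a script, here is a tiny helper for the (token:weight) syntax described above. It is purely illustrative; the function and its behavior are my own convenience sketch, not part of the webui.

```python
# Illustrative helper for building weighted prompt tokens in the (token:weight)
# syntax. This is not part of the webui -- just a convenience sketch.
def weighted(token: str, weight: float = 1.0) -> str:
    """Wrap a prompt token as (token:weight) when the weight is not 1."""
    return token if weight == 1.0 else f"({token}:{weight})"

tokens = [
    weighted("green forest"),                            # default weight of 1
    weighted("fantasy", 1.4),                            # emphasised
    weighted("detailed eyes, blue starry eyes", 1.2),    # combined descriptors in one token
]
print(", ".join(tokens))
# green forest, (fantasy:1.4), (detailed eyes, blue starry eyes:1.2)
```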
Negative Prompt
The negative prompt acts as a filter to exclude undesired elements from the generated images. By specifying what to avoid, such as certain colors, objects, or styles, users can refine the output and prevent the inclusion of unwanted features. This helps in achieving a more accurate and focused result according to the user’s preferences.
Our example mentions (worst quality, low quality:1.4), monochrome, zombie, (interlocked fingers:1.2), bad hands, bad anatomy, extra fingers,
These words, or types of images, are what the model will try not to generate. Every rule mentioned for the positive prompt works here as well. Add weights or LoRAs to make the image more refined.
Sampling Methods, Type, and Steps
Sampling methods define the techniques used by the model to navigate through the latent space and generate images. Methods such as DPM++ 2M or Euler determine how the model transitions from noise to a coherent image, affecting both the quality and diversity of the generated results. Different methods offer various trade-offs between accuracy, speed, and artistic style.
Try experimenting with different sampling methods on your own. These are the ones I like and use most of the time, in order of preference.
- DPM++ 2M
- Euler A
- DPM++ 2M SDE
Understanding the type is like choosing which shirt to buy in a supermarket: they all look the same and do roughly the same thing, but mean very different things once you look deeper. I would suggest using karras if you are going for an anime-ish art style, SGM Uniform for realistic images, and of course Automatic if you don't want to play with it.
Sampling steps are the iterative stages through which the model refines an image from initial noise to a detailed final output. Each step represents a phase in the diffusion process where the model makes adjustments to the image. More sampling steps generally lead to higher quality and more detailed images, as the model has more opportunities to refine and perfect the output. By default, the number of steps is set to 20. Many artists raise it to 30 or 45 for better quality (I personally recommend 30 steps).
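For comparison, in the diffusers library the sampler is called a scheduler. Below is a hedged sketch of selecting something close to DPM++ 2M with a Karras schedule; the class and option names come from diffusers as I understand them, and the checkpoint path is a placeholder.

```python
# Hedged sketch: picking a DPM++ 2M-style sampler with a Karras schedule in
# diffusers. The checkpoint path is a placeholder; verify class/option names
# against your installed diffusers version.
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_single_file("path/to/your_checkpoint.safetensors")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    use_karras_sigmas=True,   # roughly corresponds to the "Karras" schedule type
)
image = pipe("green forest, fantasy", num_inference_steps=30).images[0]
image.save("scheduler_test.png")
```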
Seed
It is a random number generated by the model that can be recycled and reused to create the exact same image again. Click on the green recycle icon if you want to use the seed again. This is helpful if you like a certain addition the model made to the image and want to reuse it in other images, such as the way the character stands, the dress fabric/color, or hair/face details.
CFG Scale
CFG scale (Classifier-Free Guidance scale) adjusts how strongly the model follows the prompt. A higher CFG scale makes the model adhere more closely to the prompt, producing more accurate results, while a lower scale allows for more creative and varied outputs, with less strict adherence to the prompt. Default value: 7, suggested value: 7.
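Both of these map directly onto the webui's API payload if you script your generations (same --api assumption as before). Keeping the seed fixed while tweaking the prompt is the programmatic equivalent of the recycle button; the field names are, again, assumptions to verify against your install.

```python
# Sketch: reusing a seed and setting the CFG scale through the webui API
# (assumes --api; field names as in the earlier txt2img sketch).
import requests

payload = {
    "prompt": "green forest, (fantasy:1.4), 1girl, elf ears, red dress",
    "seed": 3079411257,   # recycled seed -- tends to preserve pose/composition
    "steps": 30,
    "cfg_scale": 7,       # higher = stricter prompt adherence, lower = more creative
}
requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
```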
High Res Fix
High Res Fix is a technique used to improve the quality of high-resolution images generated. It involves refining and correcting details to reduce artifacts and enhance sharpness and clarity. This method ensures that images retain high fidelity and visual coherence, especially when generating outputs at larger dimensions or when fine details are crucial.
What are Upscalers, Hires Steps, Denoising Strength, and Upscaling?
Similar to how we had different sampling methods, we have different upscaling methods (or upscalers) too! By default, Latent is used for upscaling, and it is probably the best one. Use R-ESRGAN 4x+ Anime6B if you are going for a more anime look. Use a VAE as well (they are like plugins that come into play after upscaling and can be added via the Settings tab) to add more color and vibrancy to the image.
Hires steps work similarly to our sampling steps and should be less than or equal to the sampling steps. When you install a checkpoint from civit.ai, the checkpoint creator usually adds a few sample images or dev notes that mention what hires steps and denoising strength should be used. If no information exists, use the standard 0 steps and 0.5 denoising strength. (NOTE: 0 hires steps basically means it will just use the same number of sampling steps.)
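For completeness, the High Res Fix settings from the example can be expressed as extra fields on the txt2img payload. The field names below are assumptions based on the AUTOMATIC1111 API, so check them on your install's /docs page before relying on them.

```python
# Hedged sketch: the example's High Res Fix settings as txt2img payload fields.
# Field names are assumptions from the AUTOMATIC1111 API; verify before use.
import requests

payload = {
    "prompt": "green forest, (fantasy:1.4), 1girl, elf ears",
    "steps": 30,
    "width": 512,
    "height": 512,
    # High Res Fix block, mirroring the example settings above:
    "enable_hr": True,
    "hr_upscaler": "Latent",        # try "R-ESRGAN 4x+ Anime6B" for an anime look
    "hr_scale": 2,                  # 2x upscaling
    "denoising_strength": 0.5,      # de-noising strength
    "hr_second_pass_steps": 0,      # 0 = fall back to the sampling step count
}
requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
```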
Conclusion
With this, you should be able to create almost any good AI-generated image in Stable Diffusion. If you want to go deeper into other aspects of SD like VAEs, embeddings, ControlNet, extensions, and img2img generation, then let me know in the comments and I would be happy to work on PART 2!
Krish Sharma