Tuesday, May 2, 2023

Replicating MidJourney within Stable Diffusion

MidJourney allows me to produce crazy amazing portraits. There is an open source alternative to MidJourney called "Stable Diffusion".

Over the last 5 days I challenged myself to reproduce the look of MidJourney in Stable Diffusion. I'm liking the result.

MidJourney image:


Stable Diffusion image: 


Stable Diffusion is an open source text-to-image program you run on your local computer via Python. So... if you can get it to do what you want, it is free, with the caveat that the resulting image will be 512x512. This may get a bit technical, but I wanted to help anyone trying to get Stable Diffusion to produce decent images. It is very doable.

This tutorial is decent, but it is light on optimization:

https://www.youtube.com/watch?v=Bdl-jWR3Ukc
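
For anyone curious what "run via Python" looks like in practice, here is a minimal sketch using the diffusers library (the web UI in the video wraps the same model; the hub id shown was the standard one for v1.5 at the time of writing, and a CUDA GPU is assumed):

```python
# Minimal Stable Diffusion v1.5 text-to-image via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # the standard v1.5 hub id at the time
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "portrait of a girl, detailed, dramatic lighting",
    negative_prompt="worst quality, low quality, blurry",  # see the much longer negative prompt below
    width=512, height=512,              # v1.5 is trained at 512x512
    num_inference_steps=20,
    guidance_scale=7.0,
).images[0]
image.save("portrait.png")
```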


Here are some things to understand:

First off, you won't produce great images without a negative prompt. I stumbled onto this negative prompt and it was a game changer:

easynegative, badhandv4.pt, bad quality, normal quality, worst quality, (((duplicate))), bad art, mutated, extra limbs, extra legs, extra arms, bad anatomy, (blurry image:1.1), (blurry picture:1.1), (worst quality, low quality:1.4), (out of frame), duplication, (folds:1.7), lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, (extra arms), (extra legs), fused fingers, too many fingers, long neck, username, watermark, signature, monochrome, deformed legs, face out of frame, head out of frame, head cropped, face cropped, same face twins
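
Two notes if you ever script generation instead of using the web UI: 'easynegative' and 'badhandv4' are textual-inversion embeddings rather than plain words (in the web UI they just need to sit in the embeddings folder; in diffusers they would have to be loaded explicitly with pipe.load_textual_inversion), and it is handy to keep the whole prompt in one file so every tool reads the same text. A trivial sketch, with a made-up filename:

```python
# Save the long negative prompt once so the web UI and any scripts share it.
NEGATIVE_PROMPT = "easynegative, badhandv4.pt, bad quality, normal quality, worst quality, ..."  # paste the full text above
with open("negative_prompt.txt", "w", encoding="utf-8") as f:
    f.write(NEGATIVE_PROMPT)
```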

AI images are produced by feeding your text into a sophisticated prebuilt knowledge pathway called a model. The model is a file several GB in size and comes in two varieties: checkpoint (old style, .ckpt) and safetensors (new style). The one commonly used is Stable Diffusion v1.5 (v1-5-pruned.ckpt), which is 7GB. Models are built by sifting through millions of images that are tagged with relevant information. This isn't something the average person can do; it costs companies millions of dollars and a ton of processing power.
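
(A quick aside to make the two formats concrete: here is a small sketch of opening each one from Python. The safetensors package is assumed to be installed, and the safetensors filename is just an illustration.)

```python
# .ckpt files are pickled PyTorch checkpoints; .safetensors files are a plain
# tensor container with no executable code, which is why they are preferred.
import torch
from safetensors.torch import load_file

ckpt = torch.load("v1-5-pruned.ckpt", map_location="cpu")  # pickle-based; weights usually sit under ckpt["state_dict"]
tensors = load_file("consistent-factor.safetensors", device="cpu")  # nothing but named tensors
print(len(tensors), "tensors in the safetensors file")
```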

However...

You can use AI tools to graft new images into an existing model that replace a concept. This is the hard part. You are trying to "lightly" replace a concept without overdoing it. If you do it wrong Stable Diffusion ignores your prompt text and only produces images like what you give it. This is called overfitting. If you do it right you can get images to change their appearance based on the text of the prompt.                                                                               

The tool I use to graft images into an existing model is Dreambooth. I will assume you have installed Stable Diffusion and Dreambooth (see the YouTube video). You also need a graphics card with at least 12GB of VRAM. I have the GeForce RTX 3080 Ti, which has 12GB of VRAM.

It is important to understand that VRAM capacity is your enemy. It has caused me a lot of grief finding the perfect settings.

You will need a good model to start with. v1.5 raw is not good enough. There are people who graft generic images onto v1.5 for you to start from.

I recommend https://civitai.com/models/9114/consistent-factor 

Forget what the YouTube video tells you. You will need 75 512x512 images that represent your subject in a variety of settings and poses. You can do it with fewer, but VRAM will force you to train at a smaller resolution. My settings train at 384x384 with 12GB of VRAM (which is really good). There are sites that let you quickly resize images to 512x512 (see the YouTube video for details on this).

I built 75 images in MidJourney in the style I wanted, resized them all to 512x512, and put them in one directory.
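
If you would rather not rely on one of those resizing sites, a small Pillow sketch does the same job (folder names are placeholders; it assumes PNG sources and center-crops anything that isn't square):

```python
# Resize/crop a folder of training images to 512x512 with Pillow.
from pathlib import Path
from PIL import Image, ImageOps

src = Path("midjourney_raw")      # folder of original images (placeholder name)
dst = Path("training_images")     # the single directory Dreambooth will read
dst.mkdir(exist_ok=True)

for i, path in enumerate(sorted(src.glob("*.png"))):
    img = Image.open(path).convert("RGB")
    img = ImageOps.fit(img, (512, 512), Image.LANCZOS)  # center-crop, then resize
    img.save(dst / f"{i:03d}.png")
```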

Once you have 75 images continue...

1. Download the Consistent Factor 4GB safetensors file and put it in your models/Stable-diffusion directory.

2. Go to the Dreambooth tab and create a new model based on that safetensors model. Once it is built, click 'load settings'.

3. Once it is loaded, click 'performance wizard' on the settings tab. Switch to the concepts tab and click 'training wizard (person)'.

4. Click 'save settings' and 'load settings', because I don't trust this tool.

'settings' tab entries of interest:

> Training Steps Per Image (Epochs) = 150

> Save model frequency = 25 (creates a safetensors model every 25 epochs)

> Save preview frequency = 5 (shows previews every 5 epochs)

> batch size = 1

> Learning rate = 0.000005

> Max Resolution = 384 (if you run out of memory slide this down and retry until it works)

'concepts' tab entries of interest:

> we are only doing 1 concept

> Dataset Directory = your 75 image directory

> Classification Dataset Directory = create a directory for this (dreambooth will fill it for you)

> Instance prompt = girl

> Class prompt = girl

> Sample image prompt = girl in a field

> Sample negative prompt = the one I gave above

> Class images per instance = 4 (300 ideal images / 75 provided images = 4)

> Number of samples to generate = 4

> Sample CFG scale = 7

> Sample Steps = 20


Once this is all entered, click 'save settings' and 'load settings'.
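
For reference, here is roughly how the same numbers map onto the Hugging Face DreamBooth example script (examples/dreambooth/train_dreambooth.py, launched through accelerate). That is a different tool than the web-UI extension used in this post, so treat this purely as a sketch of which knob is which; the paths, the converted base model, and the step count are assumptions:

```python
# Approximate mapping of the settings above onto diffusers' train_dreambooth.py.
# Requires accelerate + diffusers and the example script in the working
# directory; the base model must be in diffusers format (the single
# safetensors file from civitai would need converting first).
import subprocess

subprocess.run([
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path=./consistent-factor-diffusers",  # assumed converted base model
    "--instance_data_dir=./training_images",    # the 75 images
    "--class_data_dir=./class_images",          # the script fills this, like the extension does
    "--instance_prompt=girl",
    "--class_prompt=girl",
    "--with_prior_preservation", "--prior_loss_weight=1.0",
    "--num_class_images=300",                   # 4 class images per instance x 75 images
    "--resolution=384",                         # lower this if you run out of VRAM
    "--train_batch_size=1",
    "--learning_rate=5e-6",
    "--lr_scheduler=constant", "--lr_warmup_steps=0",
    "--max_train_steps=11250",                  # ballpark: 150 steps per image x 75 images
    "--output_dir=./dreambooth_out",
], check=True)
```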


The commonly accepted minimum number of images to train on is 300. When you have fewer, you ideally want your image count to divide evenly into 300. You then take the quotient (4 in my case: 300 / 75) and put that as Class images per instance.
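
As arithmetic (rounding up when the division isn't even is my own assumption):

```python
# Class images per instance = target of ~300 total images / instance images you have.
import math

instance_images = 75
target_total = 300                  # commonly accepted minimum
print(math.ceil(target_total / instance_images))  # -> 4
```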


Our goal is to get Max Resolution as high as possible. 

> 75 images + a learning rate of 0.000005 + Class images per instance of 4 is the magic combination

> If you have more images, training needs more VRAM

> If you have fewer images, you need more class images, which also needs more VRAM and slows down processing

> If you train slower (ex: 0.000002), you need more VRAM
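
If you are not sure how much VRAM your card actually has, a quick check from Python (assuming PyTorch is installed):

```python
# Print the name and total VRAM of the first CUDA GPU.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of VRAM")
else:
    print("No CUDA GPU detected")
```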


An epoch means one training pass over your images, i.e. one training step per image. With 'Training Steps Per Image (Epochs)' set to 150, training runs for 150 epochs, so each image gets trained on 150 times.

> Each epoch can take anywhere from 45 to 75 seconds, depending on the complexity of your 75 images.
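
To set expectations, simple arithmetic on those numbers says a full 150-epoch run is a few hours, and the first 25-epoch save arrives in well under an hour:

```python
# Rough training-time estimate from the figures above (45-75 seconds per epoch).
low_s, high_s = 45, 75
print(f"First save (25 epochs): {25 * low_s / 60:.0f} to {25 * high_s / 60:.0f} minutes")     # ~19 to 31 minutes
print(f"Full run (150 epochs): {150 * low_s / 3600:.1f} to {150 * high_s / 3600:.1f} hours")  # ~1.9 to 3.1 hours
```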


After every 25 epochs, your model will be saved in models\Stable-diffusion

> Once you have this model (your goal) you can test it (see below)


Click 'Train' and let it go until it reaches epoch 26, then click 'cancel' and let it wrap up.

> It will have saved a safetensors model after 25 epochs.

> Go to txt2img tab

> refresh the checkpoint list and pick your model/checkpoint

> prompt = girl in a field

> negative prompt = what I gave above

> click 'restore faces'

> slide batch size to 8

> click 'generate'
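
If you prefer to script this sanity check instead of clicking through the web UI, a rough equivalent with diffusers might look like this (assuming a recent diffusers version with from_single_file; 'restore faces' is a web-UI feature, GFPGAN/CodeFormer, and is not replicated here; the checkpoint filename is hypothetical, and negative_prompt.txt is the file saved in the earlier sketch):

```python
# Scripted version of the txt2img test: load the DreamBooth checkpoint saved
# at epoch 25 and generate a batch of 8 images.
# Note: unlike the web UI, plain diffusers truncates prompts past 77 CLIP
# tokens and does not parse the (word:1.4) weighting syntax.
import torch
from pathlib import Path
from diffusers import StableDiffusionPipeline

NEGATIVE = open("negative_prompt.txt").read()   # the long negative prompt from earlier

pipe = StableDiffusionPipeline.from_single_file(
    "models/Stable-diffusion/girl_dreambooth_epoch_25.safetensors",  # hypothetical filename
    torch_dtype=torch.float16,
).to("cuda")

out_dir = Path("outputs/txt2img-images")
out_dir.mkdir(parents=True, exist_ok=True)

result = pipe(
    "girl in a field",
    negative_prompt=NEGATIVE,
    num_images_per_prompt=8,        # batch size 8; lower it if you run out of VRAM
    num_inference_steps=20,
    guidance_scale=7.0,
)
for i, img in enumerate(result.images):
    img.save(out_dir / f"test_{i}.png")
```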


If your model is good, the person will be in a variety of poses in the style you want. If your model overfits, your text will have no impact on the result (which may still be ok). If your model is crappy, you will see weird artifacts/glitches.


NOTE: all images you produce via txt2img tab are stored in outputs\txt2img-images


If you have a bad model it likely means your images are too different from each other.

You have 3 options to fix this:

1. You can try adding/removing words from the prompt to see if it gets better

2. You can use this model as a starting point and run another 25 epochs for another save

3. Replace your images

25 epochs is ideal because if it works, the model is likely flexible and not overfitted. The more epochs you run, the more the concept of 'girl' gets hardwired to your images, ignoring other words in your prompt.

Once you have this working you can do whatever you want. This is how people do deep fakes. You could replace 15 images in your 75-image set with images of a famous person, build a new model, and they will show up in your results.