Hi there,
I’m a complete amateur in design and painting, but I got kinda hooked on Stable Diffusion, because it (theoretically) lets me create images from my imagination without needing digital painting skills.
I’ve poured a couple of weekends of free time into learning how to use SD, and by now I’m somewhat familiar with making useful prompts and using ControlNet, inpainting and upscaling.
But now I’m a bit at a loss about how to refine my workflow further. Right now I can either get really good images that only roughly resemble the scene I was going for (letting the model / LoRAs do the heavy lifting), or an image that is composed exactly as I want (leaning heavily on ControlNet) but is very poorly executed in the details, with all sorts of distorted faces, ugly hands and so on.
Basically, with a vaguer prompt the image comes out great, but the more specific I get, the more the generation feels “strangled” by the prompt and ControlNet, and it doesn’t seem to result in usable images …
How do you approach this? Generating hundreds of images in the hope that one of them gets your envisioned scene right? Making heavy use of Photoshop/GIMP for post-processing (<- I want to avoid this)? Or painstakingly inpainting all the small details until it fits?
Edit: Just to add a thought here: I’ve just started to realise how limited most models are in what they “recognise”. Everyday items are covered pretty well, e.g. prompting “smartphone” or “coffee machine” produces very good results, but things like “screwdriver” already get dicey, and with specialised terms like “halberd” it’s completely hopeless. Seems I will need to go through with training my own LoRA as discussed in the other thread …
Well, I’m a noob too, been using SD for a month now. There’s a lot to tackle, but let me walk through some points of my workflow as I understand it.
First off, you have to ask yourself how complex the scene is going to be. One person alone is usually no problem; multiple people are tricky.
Then there’s the LoRA vs. no-LoRA question.
I try to use as few LoRAs as possible, since they add another layer of weight balancing on top of the prompt balancing you already have to do. LoRAs can deform your composition a lot, so avoid them where you can. There are also plenty of LoRAs that only work well with certain checkpoints, or that contain the opposite of what you want to generate (anime vs. realistic vs. CGI, for example). If you have to use a LoRA from the opposite style, set its weight in the LoRA tag low, like 0.3 or even lower, to avoid its style bleeding into and oversaturating your composition, and at the same time increase the weight of the prompt keyword the LoRA uses. I try to avoid going above 1.9 on keyword weights, as that seems to cause artifacts; I do better by removing, adding or shifting keywords. Sometimes the most important one isn’t far enough up the list. Using stuff like BREAK to separate certain elements can help too.
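If you find yourself juggling lots of keyword weights, it can help to audit them programmatically. Here’s a minimal sketch (my own helper idea, not part of any extension) that pulls the explicit `(keyword:weight)` pairs out of an A1111-style prompt; it ignores the `(word)` / `[word]` shorthand forms.

```python
import re

# Extract explicit "(keyword:weight)" pairs from an A1111-style prompt
# so you can see at a glance how your emphasis is balanced.
WEIGHT_RE = re.compile(r"\(([^():]+):([\d.]+)\)")

def extract_weights(prompt: str) -> dict[str, float]:
    return {kw.strip(): float(w) for kw, w in WEIGHT_RE.findall(prompt)}

prompt = "portrait, (black hair:1.3), (halberd:1.6) BREAK castle background"
print(extract_weights(prompt))  # {'black hair': 1.3, 'halberd': 1.6}
```

A quick scan like this makes it obvious when one keyword has crept above ~1.9 while the rest sit at 1.0.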
So far I’ve found the “Latent Couple” extension plus the “Composable LoRA” extension give me good results with multiple people and multiple LoRAs. You can enable ControlNet alongside them as well. There’s even a Latent Couple helper tool that makes it easier to select which parts of the image should be person A and person B. I haven’t tried more than 4 people yet, but there’s almost no hard limit, I guess.
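Conceptually, what regional-prompt tools like Latent Couple do is let each sub-prompt’s denoising prediction apply only inside its own mask, then blend the masked predictions into one latent. This is a toy sketch of that idea with stand-in arrays, not the extension’s actual implementation:

```python
import numpy as np

# Toy illustration of regional blending: person A's prediction fills the
# left half of the frame, person B's fills the right, masks sum to 1.
H, W = 4, 4
pred_a = np.full((H, W), 1.0)   # stand-in for the "person A" prediction
pred_b = np.full((H, W), 2.0)   # stand-in for the "person B" prediction

mask_a = np.zeros((H, W))
mask_a[:, :2] = 1.0             # left half of the frame
mask_b = 1.0 - mask_a           # right half

blended = mask_a * pred_a + mask_b * pred_b
print(blended[0])  # [1. 1. 2. 2.]
```

This is also why the helper tool matters: the quality of the result depends heavily on how cleanly the masks carve up the frame.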
You are generally on a good track (meaning you picked the right balance of weights and prompts) when faces get fixed automatically by hires fix (2x resolution), i.e. without enabling the Restore Faces option. Some checkpoints are just bad at faces, or the combination with your LoRA is, so it’s a bit of a pain searching for and testing a different one. I have like 40 checkpoints now and I seem to download more instead of fewer. Haha. But maybe learning one and sticking to it is smarter, as some like or dislike certain prompts (usually described on the checkpoint’s Civitai page).
Increasing CFG can help, or adding more prompt keywords. If you use LoRAs, look into their details: I use the Civitai Helper extension and click the small exclamation mark (!) to see a LoRA’s trigger words (often more than are shown in the example images or the LoRA’s description on Civitai). There you often find words that trigger the LoRA more strongly, giving it more weight in the generation. For example, if the training data contained more images of, say, a woman with black hair, then it’s easier to generate a woman with black hair than to force blonde hair.
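For intuition on what raising CFG actually does: classifier-free guidance pushes the final noise prediction away from the unconditional prediction toward the prompt-conditioned one, scaled by the CFG value. A one-line sketch with toy numbers (not real model outputs):

```python
import numpy as np

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditioned one. Higher cfg_scale = follow the prompt harder.
def cfg_combine(uncond, cond, cfg_scale):
    return uncond + cfg_scale * (cond - uncond)

uncond = np.array([0.0, 0.0])
cond = np.array([1.0, -1.0])
print(cfg_combine(uncond, cond, 7.5))  # [ 7.5 -7.5]
```

This is also why very high CFG fries images: the prediction is extrapolated far past what the model actually output.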
I usually generate 4 images at once at low steps and low resolution first, like 512x512. If at least 1 in 4 images is similar to what I want, I move on to hires fix and later img2img with Ultimate SD Upscale + ControlNet. (I’m not a fan of upscaling in the Extras tab.)
In the end I generate maybe 20 to 100 images until the composition is nearly where I want it. And I prefer not to use ControlNet, so I can easily reuse the prompts I used. Nothing is worse than having to hunt again for the right ControlNet (depth, canny, reference and more) and the weights just to get close to a previous result.
If it’s about hitting a certain position, you can combine multiple ControlNets at lower weights. That’s usually smart when you want an exact pose.
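The intuition behind stacking ControlNets at lower weights: each one contributes a conditioning residual scaled by its own weight, and the scaled residuals simply add up, so two nets at ~0.5 each steer about as hard as one at full strength. A toy sketch (stand-in arrays, not real residual tensors):

```python
import numpy as np

# Stacking ControlNets: each contributes a residual scaled by its weight.
residual_pose = np.array([1.0, 0.0])    # e.g. an OpenPose ControlNet
residual_depth = np.array([0.0, 1.0])   # e.g. a depth ControlNet

combined = 0.6 * residual_pose + 0.4 * residual_depth
print(combined)  # [0.6 0.4]
```

Keeping the per-net weights low is what stops the nets from fighting each other (and the prompt) for control of the image.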
Well, I hope this helped a little. Cheers!
Very solid rundown, and I agree with most of this.
In particular - Latent Couple and Composable LoRA are amazing tools, and the OP should definitely look into them.
I would love to hear the answer to this one. I have also found it’s hard to get exactly what I want. People end up having too many limbs or something.
It’s somewhat easy to get something generic.
If you are getting too many limbs, it’s normally because of something you did. Too many fingers is just SD being SD, but if you get an extra leg, then most of the time SD is confused by a ControlNet, or you are using a weird LORA.
I too hate having to use GIMP to fiddle with images. Really, the best approach depends more on how you think about things than anything else. With enough effort you can convince SD to do almost anything; it just depends which phase of the process you are working in.
My workflow is something like this:
- Generate figures using a general prompt and a controlnet to set the composition in the frame. Reroll the seed until I get something nice, then freeze the seed.
- If the composition is not quite right, spit the resulting image out into the OpenPose editor, detect the pose, then move the figure to fix whatever the problem is (repeating as necessary, back and forth between text2img and OpenPose)
- Turn on ADetailer, and at least set it to give me an appropriate face (no big foreheads)
- Start refining the prompt, adding more words and more details. Use the prompt history extension so you never lose a good one. At this stage every generation should be very similar to the previous one, and that’s the point, so you can do constant better or worse comparisons. You can use X/Y plots to try out different values and so on here.
- Add LORA, but expect to be disappointed. Often they will fry the image unless strongly controlled.
- If the LORA doesn’t need to be in the whole scene I use latent couples and composable LORA to keep it to the specific area that it needs to be in.
- Alternatively I use InpaintAnything to segment the image, then inpaint that way rather than generating again.
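The “constant better or worse comparisons” step above can be sketched as a plain parameter sweep, which is essentially what the X/Y plot script does: freeze the seed and walk a grid of settings, one image per combination. `generate()` here is a hypothetical stand-in for the actual txt2img call:

```python
from itertools import product

# Hypothetical stand-in for a txt2img call; returns a label per image.
def generate(seed, cfg, steps):
    return f"seed={seed} cfg={cfg} steps={steps}"

seed = 1234  # frozen seed so only the swept settings change
grid = [generate(seed, cfg, steps)
        for cfg, steps in product([5.0, 7.5, 10.0], [20, 30])]
print(len(grid))  # 6 combinations: 3 CFG values x 2 step counts
```

Because the seed is fixed, every difference between grid cells is attributable to the setting you changed, which is what makes the comparison meaningful.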
Sometimes you get better results regenerating a whole new image, sometimes you get better results from inpainting. The gods of SD are fickle.
For things like hands and feet, I generally prefer to fix them in the text2img phase, because they are the very devil to get right after the fact. At a minimum I want 4 fingers and a thumb on each hand.