Advanced Flux Dreambooth LoRA Training with 🧨 diffusers

Community Article · Published October 21, 2024

Just merged: an advanced version of the diffusers Dreambooth LoRA training script! Inspired by techniques and contributions from the community, we added new features to maximize flexibility and control. We encourage you to experiment, and share your insights with us so we can keep it growing together 🤗

Acknowledgements 🙌🏻: ostris, simo ryu, the last ben, bghira, Ron Mokady, Rinon Gal, Nataniel Ruiz, Jonathan Fischoff

What’s New?

Pivotal Tuning (and more)

In addition to CLIP text encoder fine-tuning, pivotal tuning is now also supported 🧨 (following the pivotal tuning feature we also had for SDXL training, based on simo ryu's cog-sdxl; read more on pivotal tuning here).

But that’s not all -

(1) Apply it to CLIP only, or both CLIP & T5:

Flux uses two text encoders - CLIP & T5. By default, --train_text_encoder_ti performs pivotal tuning for CLIP only; you can activate it for both encoders by adding --enable_t5_ti, e.g.

--train_text_encoder_ti
--enable_t5_ti

Motivation - pivotal tuning with CLIP alone can lead to better convergence and training of the transformer while keeping the run relatively lightweight (in terms of speed and memory required for training); adding T5, however, might have an impact on expressiveness & prompt adherence.

Specifically for faces, we observed that adding pivotal tuning with CLIP is beneficial, but we think the potential for styles is there as well, and it's worth experimenting with T5!

[image: face samples generated with vs. without pivotal tuning]

These were trained on 4 images, with the following configs (identical aside from the use of pivotal tuning):

--dataset_name=linoyts/linoy_face
--instance_prompt=TOK
--output_dir=linoy_flux_v19
--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"
--mixed_precision=bf16
--optimizer=prodigy
--train_transformer_frac=1
*--train_text_encoder_ti*
--train_text_encoder_ti_frac=.25
--weighting_scheme=none
--resolution=768
--train_batch_size=1
--guidance_scale=1
--repeats=1
--learning_rate=1.0
--gradient_accumulation_steps=1
--rank=16
--max_train_steps=1000
--push_to_hub

(2) Control how new tokens are initialized:

Similar to the OG textual inversion work, you can now specify a concept of your choosing as the starting point for the training of the newly inserted tokens by adding --initializer_concept.

Motivation - the idea is that by choosing a concept initializer that resembles the trained concept, we can better harness the prior knowledge of the model (see the sketch after the config below for intuition on how this initialization works).

We noticed some pretty interesting behaviours when using an initializer concept, and we think it's definitely worth more exploration, e.g.

[image: tarot card samples generated with vs. without an initializer concept]

These were trained on the multimodalart/1920-raider-waite-tarot-public-domain dataset, with the following configs (identical aside from the use of an initializer concept):

--dataset_name=multimodalart/1920-raider-waite-tarot-public-domain
--instance_prompt="a trtcrd tarot card"
--caption_column=caption
--token_abstraction=trtcrd
--output_dir=tarot_card_flux_v12
--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"
*--initializer_concept="tarot card"*
--mixed_precision=bf16
--optimizer=prodigy
--train_transformer_frac=1
--train_text_encoder_ti
--train_text_encoder_ti_frac=1.
--weighting_scheme=none
--resolution=768
--train_batch_size=1
--guidance_scale=1
--learning_rate=1.0
--gradient_accumulation_steps=1
--rank=16
--max_train_steps=750
--push_to_hub
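
To build intuition for what --initializer_concept does, here is a minimal sketch (not the training script's exact implementation - the repo id and token names are illustrative) of initializing newly inserted tokens from an existing concept's CLIP embeddings:

import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Flux's CLIP-L tokenizer & text encoder (illustrative; the training script handles this internally)
tokenizer = CLIPTokenizer.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="text_encoder")

new_tokens = ["<s0>", "<s1>"]  # the tokens inserted for the concept abstraction (e.g. trtcrd)
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# look up the embeddings of the initializer concept and average them
init_ids = tokenizer("tarot card", add_special_tokens=False).input_ids
embedding_weight = text_encoder.get_input_embeddings().weight
init_embed = embedding_weight[init_ids].mean(dim=0)

# seed the new token slots with the averaged embedding instead of a random init
with torch.no_grad():
    for tok in new_tokens:
        embedding_weight[tokenizer.convert_tokens_to_ids(tok)] = init_embed.clone()

Starting from embeddings that already "mean" something close to the target concept gives pivotal tuning a warmer start than a random or rare-token initialization.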

(3) Pure textual inversion:

To support the full range from pivotal tuning to textual inversion, we introduce --train_transformer_frac, which controls the fraction of epochs for which the transformer LoRA layers are trained. By default, --train_transformer_frac=1; to trigger a pure textual inversion run, set --train_transformer_frac=0, e.g.

--train_text_encoder_ti
--train_text_encoder_ti_frac=1
--train_transformer_frac=0

Values between 0 and 1 are supported as well!

Motivation - Flux's base knowledge is so wide that pure textual inversion could be enough for some concepts - removing the need for a chunky LoRA & potentially mitigating some of the issues with prompt adherence.
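
Since a pure textual inversion run produces no transformer LoRA, inference only needs the learned token embeddings. A minimal sketch of what that could look like - assuming the embeddings were saved to a safetensors file with a clip_l key and pushed to a placeholder repo (check the README for the exact filename, keys, and inserted token names your run produces):

import torch
from diffusers import FluxPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# placeholder repo id & filename - use whatever your training run pushed/saved
emb_path = hf_hub_download(repo_id="your-username/your-flux-ti", filename="your-flux-ti_emb.safetensors")
state_dict = load_file(emb_path)

# insert the learned tokens into the CLIP text encoder
# (if --enable_t5_ti was used, load the T5 embeddings analogously into text_encoder_2 / tokenizer_2)
pipe.load_textual_inversion(
    state_dict["clip_l"],
    token=["<s0>", "<s1>"],
    text_encoder=pipe.text_encoder,
    tokenizer=pipe.tokenizer,
)

image = pipe("a photo of <s0><s1>", num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("pure_ti_sample.png")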

Target Modules

To provide fine-grained control over the trained modules, we've added --lora_layers, allowing you to specify exactly which modules you want to train - both the type of transformer layers and the blocks LoRA training is applied to.

Motivation - it was recently shown (by the last ben & other community members) that focusing on as few as two blocks - one DiT and one MMDiT - can be enough to get a good-quality LoRA. Another aspect is the type of layers we use - for many concepts, training on the attention layers only seems to be enough to achieve great results while keeping LoRA size minimal.

For example, you can target attention layers only like this:

--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"

Want to train a broader set of modules? You can specify them all in a comma-separated list 🎯
Want to train specific blocks? Add the prefix single_transformer_blocks.i (e.g. single_transformer_blocks.i.attn.to_k) to target the i'th DiT block, and the prefix transformer_blocks.i (e.g. transformer_blocks.i.attn.to_k) to target the i'th MMDiT block.

For example, the following config will apply LoRA training to the attention layers of DiT block #7 only:

--lora_layers="single_transformer_blocks.7.attn.to_k,single_transformer_blocks.7.attn.to_q,single_transformer_blocks.7.attn.to_v,single_transformer_blocks.7.attn.to_out.0"

➡️ Check out more module examples here
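
If you want to discover valid module names for --lora_layers yourself, you can list the Flux transformer's submodules. A small sketch - it only fetches the model config (no weights are materialized, thanks to accelerate's init_empty_weights), but it assumes you have access to the FLUX.1-dev repo:

import torch.nn as nn
from accelerate import init_empty_weights
from diffusers import FluxTransformer2DModel

config = FluxTransformer2DModel.load_config("black-forest-labs/FLUX.1-dev", subfolder="transformer")
with init_empty_weights():
    transformer = FluxTransformer2DModel.from_config(config)

# print the linear layers inside the attention modules of the first MMDiT and first DiT block;
# these names (or layer-type suffixes like attn.to_k, as in the examples above) are what --lora_layers expects
for name, module in transformer.named_modules():
    if isinstance(module, nn.Linear) and ".attn." in name and name.startswith(("transformer_blocks.0.", "single_transformer_blocks.0.")):
        print(name)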

Let’s Train

For more examples, installation instructions, and inference code snippets, please check out this README.
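
And as a quick preview of the inference side, here's a minimal sketch of loading a LoRA trained with this script into FluxPipeline (the repo id and prompt are placeholders; if you trained with pivotal tuning, also load the saved embeddings as shown in the README):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# placeholder repo id - use the repo created by --push_to_hub (or a local output directory)
pipe.load_lora_weights("your-username/your-flux-lora")

image = pipe(
    "a photo of TOK",  # for pivotal tuning runs, prompt with the inserted tokens instead (e.g. <s0><s1>)
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lora_sample.png")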

What's next?

We plan to release a more extensive analysis of training features and best practices for Flux soon, and we welcome community members to contribute & be part of it! 🤗 This is an experimental resource 👩‍🔬 Our goal is to add to it and improve it iteratively, so feel free to share your results & let's build it together 🚀

Happy training 🌈✨