VampDiffusion-V1

A From-Scratch Anime Diffusion Model with Dual Mixture-of-Experts

Warning

This is a live research document. It is updated as the experiment progresses. Results, architecture decisions, and hyperparameters are subject to change. Nothing here is final.

Status: Phase 1 Training - Active
Team: Four researchers, private
Compute: 2× RTX 4090 (48 GB VRAM total)
Dataset: Danbooru2023 + Danbooru2024
Architecture: Diffusion Transformer + Dual Mixture-of-Experts

Abstract

We present VampDiffusion-V1, a from-scratch text-to-image diffusion model targeting anime and illustration generation. The model is trained entirely without pretrained diffusion weights, using only a frozen pretrained VAE encoder for latent space compression. Our primary contribution is a dual Mixture-of-Experts (MoE) system embedded within a Diffusion Transformer (DiT) backbone that enables native style and quality control - eliminating the need for LoRA adapters, HiRes fix post-processing, or external upscalers. Style and quality are first-class architectural concerns routed in a single forward pass.

The model is trained on approximately 3.0 million filtered images from the Danbooru2023 and Danbooru2024 datasets. We document our full methodology, mathematical formulation, and experimental observations here as the work proceeds.

1. Introduction

The dominant paradigm for anime image generation relies heavily on fine-tuned derivatives of Stable Diffusion - models like NAI Diffusion, Animagine XL, and countless community checkpoints all trace their weights back to a common pretrained ancestor. This inheritance provides a strong initialization, but it also means every model in this lineage shares the same architectural decisions, the same latent space, and the same foundational biases.

When researchers or studios want different styles or quality levels, the standard solution is layered: LoRA adapters for style, HiRes fix for detail, external upscalers for resolution, and negative prompts as a catch-all correction mechanism. This stack works, but it is architecturally inelegant. Style and quality are afterthoughts bolted onto a model that was never designed to reason about them natively.

VampDiffusion-V1 takes a different position: style and quality should be first-class architectural properties, handled by dedicated expert networks that activate at inference time based on explicit conditioning signals. No adapter loading. No post-processing pass. One model, one forward pass.

We make no claims about achieving state-of-the-art FID scores or outperforming models with orders of magnitude more compute. This is a constrained experiment - two consumer GPUs, a volunteer research team, and a genuine curiosity about whether the expert routing hypothesis holds up in practice.

2. Background

2.1 Denoising Diffusion Probabilistic Models

Diffusion models learn to reverse a gradual noising process. Given a data distribution \(q(\mathbf{x}_0)\), the forward process adds Gaussian noise over \(T\) timesteps:

\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right) \]

where \(\{\beta_t\}_{t=1}^{T}\) is a fixed noise schedule. Using the reparameterization with \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\), we can sample \(\mathbf{x}_t\) at any timestep directly from \(\mathbf{x}_0\):

\[ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1 - \bar{\alpha}_t)\mathbf{I}\right) \]

which means:

\[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]

The model learns the reverse process \(p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\), parameterized as:

\[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\right) \]

Training maximizes the evidence lower bound (ELBO) on the data log-likelihood. Ho et al. (2020) show that this can be simplified to a denoising objective:

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\mathbf{x}_t, t\right)\right\|^2\right] \]

The model \(\boldsymbol{\epsilon}_\theta\) learns to predict the noise added to a clean image at each timestep.

2.2 V-Prediction Parameterization

VampDiffusion-V1 uses v-prediction rather than noise prediction. The prediction target is defined as:

\[ \mathbf{v}_t = \sqrt{\bar{\alpha}_t}\,\boldsymbol{\epsilon} - \sqrt{1 - \bar{\alpha}_t}\,\mathbf{x}_0 \]

The training objective becomes:

\[ \mathcal{L}_v = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\mathbf{v}_t - \mathbf{v}_\theta(\mathbf{x}_t, t)\right\|^2\right] \]

V-prediction has two practical advantages for our setting:

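Zero-terminal SNR compatibility: At \(t = T\), \(\bar{\alpha}_T \to 0\), so \(\mathbf{x}_T\) approaches pure noise \(\boldsymbol{\epsilon}\). Noise prediction becomes degenerate there, because the target is simply the input and carries no information about the image, whereas the v-prediction target reduces to \(\mathbf{v}_T = -\mathbf{x}_0\) and stays meaningful.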
Training stability: V-prediction provides more uniform loss magnitudes across timesteps, preventing the model from being dominated by easy low-noise timesteps.

We implement zero-terminal SNR by setting \(\bar{\alpha}_T = 0\) exactly, following Lin et al. (2023), which allows the model to learn the full noise-to-image mapping without boundary artifacts.
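As a minimal sketch of the definitions above (plain Python on flat lists rather than tensors; the function names are illustrative and not taken from the project's code), the v target and its inversion can be written as follows. At \(\bar{\alpha}_t = 0\) the noisy sample is pure noise and the target reduces to \(-\mathbf{x}_0\).

```python
import math


def noisy_sample(x0, eps, alpha_bar):
    """x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps, elementwise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def v_target(x0, eps, alpha_bar):
    """v_t = sqrt(abar) * eps - sqrt(1 - abar) * x0, elementwise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def recover_x0_eps(xt, v, alpha_bar):
    """Invert the two definitions: x0 = a*x_t - b*v and eps = b*x_t + a*v."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0 = [a * x - b * w for x, w in zip(xt, v)]
    eps = [b * x + a * w for x, w in zip(xt, v)]
    return x0, eps
```

The inversion relies only on \(\bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1\), which is why a v-prediction network can be converted to an \(\mathbf{x}_0\) or \(\boldsymbol{\epsilon}\) estimate at any timestep.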

2.3 Latent Diffusion Models

Training directly in pixel space at 1024×1024 resolution is computationally prohibitive on consumer hardware. Following Rombach et al. (2022), we operate in a learned latent space.

A pretrained VAE with encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\) compresses images:

\[ \mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0), \qquad \hat{\mathbf{x}}_0 = \mathcal{D}(\mathbf{z}_0) \]

We use the SDXL VAE, which provides an 8× spatial compression factor. A 1024×1024 image becomes a \(128 \times 128 \times 4\) latent tensor. The diffusion process runs entirely in this latent space:

\[ \mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(\mathbf{x}_0), t, \boldsymbol{\epsilon}}\!\left[\left\|\mathbf{v}_t - \mathbf{v}_\theta(\mathbf{z}_t, t, \mathbf{c})\right\|^2\right] \]

where \(\mathbf{c}\) is a conditioning vector (tag embeddings + quality token). The VAE weights are frozen throughout all training phases.

Implementation Note

We pre-encode all 3.0 million training images through the VAE encoder once before training begins, storing the latent tensors on disk. This removes VAE inference from the training loop entirely, reducing per-step compute by approximately 20% and simplifying the data pipeline considerably.

2.4 Diffusion Transformers (DiT)

The original diffusion models used UNet architectures with convolutional residual blocks, skip connections, and self-attention at multiple resolutions. Peebles & Xie (2023) showed that a pure transformer backbone - the Diffusion Transformer (DiT) - matches or exceeds UNet performance while scaling more predictably with model size and compute.

A DiT processes image latents as sequences of patches. Given a latent \(\mathbf{z} \in \mathbb{R}^{H \times W \times C}\), we patchify with patch size \(p\):

\[ \mathbf{z}_{\text{patches}} \in \mathbb{R}^{\frac{HW}{p^2} \times (p^2 C)} \]

Each patch is linearly projected to a hidden dimension \(d\), positional embeddings are added, and the sequence is processed by a stack of transformer blocks.
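The patchification step can be sketched in plain Python on a nested list (a real implementation would use a tensor reshape). The ordering of values inside each patch is an assumption here; any fixed ordering works as long as the linear projection is trained with it. The function name is illustrative.

```python
def patchify(latent, p):
    """Split an H x W x C nested list into (H*W / p^2) patches of length p*p*C."""
    height, width = len(latent), len(latent[0])
    if height % p or width % p:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for r in range(0, height, p):        # top-left row of each patch
        for c in range(0, width, p):     # top-left column of each patch
            patch = []
            for dr in range(p):
                for dc in range(p):
                    patch.extend(latent[r + dr][c + dc])  # append the C channels
            patches.append(patch)
    return patches
```

For a \(128 \times 128 \times 4\) latent and \(p = 2\), this yields 4096 patches of 16 values each, which the linear projection then maps to the hidden dimension \(d\).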

The model is conditioned on the timestep \(t\), together with the tag and quality embeddings, through adaptive layer norm (adaLN):

\[ \text{adaLN}(\mathbf{h}, \mathbf{c}) = \gamma(\mathbf{c}) \odot \text{LayerNorm}(\mathbf{h}) + \beta(\mathbf{c}) \]

where \(\gamma(\mathbf{c})\) and \(\beta(\mathbf{c})\) are learned linear projections of the conditioning signal \(\mathbf{c}\), which combines the timestep, tag, and quality embeddings (see Section 3.3). This allows the conditioning signal to modulate every layer's normalization rather than being injected only at cross-attention layers.

We chose DiT over UNet for three reasons specific to our expert system design:

MoE compatibility: Transformer FFN layers are the natural insertion point for expert networks. UNet's convolutional blocks are not.
Conditioning flexibility: adaLN conditioning is cleaner to extend with additional conditioning signals (quality token) than UNet's channel-concatenation approach.
Scaling behavior: As we scale VampDiffusion to V2 and beyond, DiT's known scaling laws give us a more predictable path.

3. Architecture

3.1 Full Model Overview

VampDiffusion-V1
│
├── VAE Encoder (frozen, SDXL)              84M params
│   └── Compresses 1024×1024 → 128×128×4
│
├── Tag Encoder                             ~125M params
│   ├── Danbooru vocabulary (~120k tags)
│   ├── Learned tag embeddings
│   └── 12-layer transformer
│
├── Quality Token Embedding                 ~0.5M params
│   └── Learnable embedding table (4 entries → d_model)
│
├── DiT Backbone                            ~600M params
│   ├── Patch embedding (p=2, d=1024)
│   ├── Positional embeddings (learned)
│   ├── 28 transformer blocks
│   │   ├── Blocks 0,2,4,...  → Standard DiT blocks
│   │   ├── Blocks 1,5,9,...  → Style Expert blocks
│   │   └── Blocks 3,7,11,... → Quality Expert blocks
│   └── Output projection
│
├── Style Expert System                     ~400M params total
│   ├── 8 expert FFN networks
│   ├── Style router (linear projection)
│   └── Load balancing auxiliary head
│
└── Quality Expert System                   ~200M params total
    ├── 4 expert FFN networks
    ├── Quality router (linear projection)
    └── Load balancing auxiliary head

Total: ~1.3B parameters
Active per forward pass: ~875M parameters
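The active-parameter figure can be reproduced from the component sizes above. The sketch below assumes that every non-expert component runs on each forward pass and that each expert system's parameters are split evenly across its experts (routers and auxiliary heads are folded into those totals). The frozen VAE is left out of both sums. All names are illustrative.

```python
# Approximate component sizes in millions of parameters, from the overview above.
DENSE_M = {"tag_encoder": 125.0, "quality_embedding": 0.5, "dit_backbone": 600.0}
# (total params in millions, number of experts, experts active per forward pass)
EXPERT_SYSTEMS = {"style": (400.0, 8, 2), "quality": (200.0, 4, 1)}


def total_params_m():
    """All trainable parameters in millions, excluding the frozen VAE."""
    return sum(DENSE_M.values()) + sum(t for t, _, _ in EXPERT_SYSTEMS.values())


def active_params_m():
    """Parameters used in one forward pass, assuming equally sized experts."""
    active_experts = sum(t * k / n for t, n, k in EXPERT_SYSTEMS.values())
    return sum(DENSE_M.values()) + active_experts


if __name__ == "__main__":
    print(f"total  ~{total_params_m():.1f}M")
    print(f"active ~{active_params_m():.1f}M")
```

Under these assumptions the style system contributes 2/8 of its parameters and the quality system 1/4, which matches the ~875M active figure.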

3.2 Tag Encoder

The tag encoder is a core differentiator. Existing anime models use CLIP or T5 text encoders, which were trained on natural-language text (image captions in the case of CLIP). Danbooru images are not captioned - they are tagged, and the tag vocabulary has its own structure, syntax, and semantic relationships that differ significantly from natural language.

Vocabulary construction: We extract all unique tags from Danbooru2023 and Danbooru2024, retaining tags that appear in at least 100 images. This gives a vocabulary of approximately 120,000 tags. We add standard special tokens [PAD], [UNK], [BOS], [EOS].

Input representation: Each image's tag set has no inherent order (it is a set, not a sentence), so we impose a canonical one: tags are sorted by dataset frequency (most common first) and truncated to a maximum of 128 tokens. Because this ordering is consistent, learned positional embeddings let the model weight earlier (more general) tags differently from later (more specific) tags.

The encoder is a standard transformer:

\[ \mathbf{E}_{\text{tag}} = \text{TransformerEncoder}\!\left(\text{Embed}(\mathcal{T}) + \mathbf{P}\right) \]

where \(\mathcal{T}\) is the tag token sequence and \(\mathbf{P}\) are positional embeddings. The output is a sequence of contextualized tag embeddings \(\mathbf{E}_{\text{tag}} \in \mathbb{R}^{128 \times d_{\text{enc}}}\).

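A pooled summary embedding is computed as an attention-weighted sum, with the softmax taken over the 128 token positions:

\[ \mathbf{e}_{\text{tag}} = \sum_{i=1}^{128} w_i \mathbf{E}_{\text{tag},i}, \quad w_i = \frac{\exp\!\left(\mathbf{W}_{\text{pool}} \mathbf{E}_{\text{tag},i}\right)}{\sum_{j=1}^{128} \exp\!\left(\mathbf{W}_{\text{pool}} \mathbf{E}_{\text{tag},j}\right)} \]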

This pooled embedding \(\mathbf{e}_{\text{tag}} \in \mathbb{R}^{d}\) feeds into both the adaLN conditioning signal and the style router.
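A minimal sketch of this pooling step, in plain Python: `w_pool` is a hypothetical stand-in for \(\mathbf{W}_{\text{pool}}\), treated as a single scoring vector so that each token receives one scalar score. Padding positions are not masked here, which a real implementation would need to handle.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(tokens, w_pool):
    """Pool a list of d-dim token vectors into one d-dim vector.

    Each token is scored by a dot product with w_pool; the scores are
    normalized across tokens and used as weights in a weighted sum.
    """
    scores = [sum(w * x for w, x in zip(w_pool, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(wt * tok[j] for wt, tok in zip(weights, tokens)) for j in range(dim)]
    return pooled, weights
```

With an all-zero scoring vector every token gets the same weight and the result is the plain mean; training the scoring vector lets the pool emphasize the tags that matter most for conditioning.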

3.3 Standard DiT Block

Each standard DiT block applies:

\[ \mathbf{h}' = \mathbf{h} + \alpha_1 \cdot \text{SelfAttn}\!\left(\text{adaLN}(\mathbf{h}, \mathbf{c})\right) \]

\[ \mathbf{h}'' = \mathbf{h}' + \alpha_2 \cdot \text{FFN}\!\left(\text{adaLN}(\mathbf{h}', \mathbf{c})\right) \]

where \(\alpha_1, \alpha_2\) are scalar gates initialized to zero (ensuring identity at initialization), and \(\mathbf{c} = f(\mathbf{e}_t, \mathbf{e}_{\text{tag}}, \mathbf{e}_{\text{quality}})\) is the combined conditioning signal.

The FFN is a standard two-layer MLP with GELU activation:

\[ \text{FFN}(\mathbf{x}) = \mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 \]

with an expansion ratio of 4× (hidden dimension \(4d\)).

3.4 Style Expert Blocks

Style expert blocks replace the FFN in every fourth block, starting at block 1 (blocks 1, 5, 9, 13, 17, 21, 25), with a sparse Mixture-of-Experts layer.

Expert Networks

We define 8 style experts \(\{E_i^S\}_{i=0}^{7}\), each an independent FFN:

\[ E_i^S(\mathbf{x}) = \mathbf{W}_{2,i}^S \cdot \text{GELU}(\mathbf{W}_{1,i}^S \mathbf{x} + \mathbf{b}_{1,i}^S) + \mathbf{b}_{2,i}^S \]

Each expert has its own independent weight matrices. The experts are identical in architecture but will specialize through training to handle different aesthetic domains.

Style Router

The style router takes the pooled tag embedding \(\mathbf{e}_{\text{tag}}\) and the timestep embedding \(\mathbf{e}_t\) as input:

\[ \mathbf{r}^S = \mathbf{W}_r^S \cdot \text{concat}(\mathbf{e}_{\text{tag}}, \mathbf{e}_t) + \mathbf{b}_r^S, \quad \mathbf{r}^S \in \mathbb{R}^8 \]

The router selects the top-\(k\) experts (\(k=2\) for style):

\[ \text{TopK}(\mathbf{r}^S, k) = \text{softmax}\!\left(\text{keep top-}k\text{ of }\mathbf{r}^S,\text{ set rest to } -\infty\right) \]

Let \(\mathcal{K} = \text{top-2 indices of } \mathbf{r}^S\). The style expert output is:

\[ \text{MoE}^S(\mathbf{x}) = \sum_{i \in \mathcal{K}} g_i^S \cdot E_i^S(\mathbf{x}) \]

where \(g_i^S = \text{TopK}(\mathbf{r}^S, 2)_i\) are the gating weights, normalized by a softmax over the two selected experts only so that they sum to 1. This is soft routing - both selected experts receive a weighted contribution, and gradients flow through the gating weights back to the router.
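The routing rule can be sketched as follows, in plain Python with experts represented as ordinary callables. This is an illustration of top-\(k\) gating with the softmax restricted to the selected logits, not the project's implementation; `top_k_gate` and `moe_forward` are hypothetical names, and ties are broken toward the lower expert index as a side effect of the sort.

```python
import math


def top_k_gate(logits, k=2):
    """Return {expert_index: gate}, with a softmax over the top-k logits only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never run."""
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out
```

Masking the unselected logits to \(-\infty\) before the softmax is equivalent to taking the softmax over the selected logits alone, which is what `top_k_gate` does directly.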

Load Balancing Loss for Style Experts

A critical failure mode in sparse MoE models is expert collapse - the router learns to always send tokens to the same 1–2 experts, leaving the rest undertrained. We prevent this with an auxiliary load balancing loss (Fedus et al., 2022):

Because the router input is the pooled tag embedding and the timestep embedding, there is one routing decision per image. For a batch of \(B\) routing decisions, let \(f_i\) be the fraction routed to expert \(i\), and \(p_i\) be the mean router probability for expert \(i\):

\[ f_i = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}[i \in \mathcal{K}_b], \qquad p_i = \frac{1}{B} \sum_{b=1}^{B} \text{softmax}(\mathbf{r}_b^S)_i \]

The load balancing loss is:

\[ \mathcal{L}_{\text{balance}}^S = \alpha_S \cdot N_S \cdot \sum_{i=0}^{7} f_i \cdot p_i \]

where \(N_S = 8\) and \(\alpha_S = 0.01\). This term is minimized when all experts receive equal load.
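A small sketch of this loss, computed from per-decision routing records in plain Python. The input layout (a list of selected-expert sets and a list of full router distributions) is an assumption made for illustration.

```python
def load_balance_loss(topk_indices, router_probs, num_experts, alpha=0.01):
    """alpha * N * sum_i f_i * p_i over a batch of routing decisions.

    topk_indices: one set of selected expert indices per routing decision.
    router_probs: one full softmax distribution (length N) per routing decision.
    """
    n = len(topk_indices)
    # f_i: fraction of decisions in which expert i was selected
    f = [sum(i in sel for sel in topk_indices) / n for i in range(num_experts)]
    # p_i: mean router probability assigned to expert i
    p = [sum(probs[i] for probs in router_probs) / n for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))
```

As a sanity check with top-2 routing over 8 experts: a perfectly balanced batch (every expert selected equally often, uniform router probabilities) gives 0.02, while a batch where every decision picks experts 0 and 1 with probability 0.5 each gives 0.08, so collapse is penalized four times as heavily.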

3.5 Quality Expert Blocks

Quality expert blocks appear in blocks 3, 7, 11, 15, 19, 23, 27. The structure mirrors style experts but with key differences.

Quality Token Embedding

The quality conditioning is explicit and discrete - the user specifies a quality tier 0–3 at inference time:

\[ \mathbf{e}_{\text{quality}} = \mathbf{W}_q[q], \quad q \in \{0, 1, 2, 3\}, \quad \mathbf{W}_q \in \mathbb{R}^{4 \times d} \]

This embedding is added to the adaLN conditioning signal alongside the timestep and tag embeddings.

Quality Expert Networks and Router

We define 4 quality experts \(\{E_i^Q\}_{i=0}^{3}\). The quality router takes only the quality token embedding as input:

\[ \mathbf{r}^Q = \mathbf{W}_r^Q \cdot \mathbf{e}_{\text{quality}} + \mathbf{b}_r^Q, \quad \mathbf{r}^Q \in \mathbb{R}^4 \]

Quality routing is top-1 hard routing (\(k=1\)):

\[ j^* = \arg\max_j(\mathbf{r}^Q), \qquad \text{MoE}^Q(\mathbf{x}) = E_{j^*}^Q(\mathbf{x}) \]

Only one expert activates. This is intentional: quality tiers are mutually exclusive. An image cannot simultaneously be "draft" and "ultra quality." A hard switch is semantically correct here.

The straight-through estimator (Bengio et al., 2013) is used to pass gradients through the argmax during training.

Load Balancing Loss for Quality Experts

\[ \mathcal{L}_{\text{balance}}^Q = \alpha_Q \cdot N_Q \cdot \sum_{i=0}^{3} f_i^Q \cdot p_i^Q \]

with \(N_Q = 4\) and \(\alpha_Q = 0.01\).

3.6 Combined Training Objective

\[ \mathcal{L}_{\text{total}} = \mathcal{L}_v + \lambda_S \mathcal{L}_{\text{balance}}^S + \lambda_Q \mathcal{L}_{\text{balance}}^Q + \lambda_{\text{spec}} \mathcal{L}_{\text{specialization}} \]

where:

\(\mathcal{L}_v\) - v-prediction diffusion loss (primary objective)
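\(\mathcal{L}_{\text{balance}}^S\) - style expert load balancing (\(\lambda_S = 0.01\), applied on top of the internal \(\alpha_S = 0.01\), for an effective weight of \(10^{-4}\))
\(\mathcal{L}_{\text{balance}}^Q\) - quality expert load balancing (\(\lambda_Q = 0.01\), likewise on top of the internal \(\alpha_Q = 0.01\))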
\(\mathcal{L}_{\text{specialization}}\) - expert specialization loss, active in Phase 2 only (\(\lambda_{\text{spec}} = 0\) in Phase 1)

3.7 Positional Embeddings and Patchification

Input latents have shape \(128 \times 128 \times 4\). With patch size \(p = 2\):

\[ N_{\text{patches}} = \frac{128}{2} \times \frac{128}{2} = 4096 \text{ patches} \]

We use 2D learned positional embeddings - separate embeddings for row and column position that are summed:

\[ \mathbf{P}_{r,c} = \mathbf{P}_r^{\text{row}} + \mathbf{P}_c^{\text{col}}, \quad \mathbf{P}^{\text{row}} \in \mathbb{R}^{64 \times 1024}, \quad \mathbf{P}^{\text{col}} \in \mathbb{R}^{64 \times 1024} \]

This factorized form reduces positional embedding parameters from \(4096 \times 1024\) to \(128 \times 1024\), and generalizes better to non-square aspect ratios.

3.8 Attention Mechanism

We use standard multi-head self-attention with \(H = 16\) heads, \(d_h = 64\):

\[ \text{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V} \]

At 4096 patches, the attention matrix is \(4096 \times 4096\) - about 16.8 million entries per head. We use Flash Attention 2 (Dao, 2023), which restructures the attention computation to avoid materializing the full \(N \times N\) matrix in HBM:

\[ \text{Memory: } O(N^2) \to O(N), \quad \text{Compute: same} \]

Flash Attention 2 is a non-negotiable dependency for fitting our model in 24 GB per GPU.
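To put the memory claim in numbers, the sketch below counts the bytes a naive implementation would need just to materialize the attention scores, assuming bf16 (2 bytes per element) and the head count and sequence length given above. It ignores everything else attention stores, so it is a back-of-envelope lower bound for the naive case, not a measurement.

```python
def attention_scores_bytes(seq_len=4096, heads=16, batch=1, bytes_per_elem=2):
    """Bytes for the full (batch, heads, N, N) score tensor of one attention layer."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    per_sample = attention_scores_bytes()
    per_gpu_batch = attention_scores_bytes(batch=32)
    print(per_sample / 2**20, "MiB per sample per layer")
    print(per_gpu_batch / 2**30, "GiB per layer at batch size 32")
```

Under these assumptions the score tensor is 512 MiB per sample per layer, or 16 GiB for a single layer at the per-GPU batch size of 32, which alone would consume most of a 24 GB card.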

4. Dataset Pipeline

4.1 Raw Data

Source	Raw Count	Coverage
Danbooru2023	~6,500,000	Uploaded through Dec 2023
Danbooru2024	~1,500,000	Jan 2024 additions
Total raw	~8,000,000	-

4.2 Filtering Pipeline

Stage 1 - Resolution Filter

\[ \text{Keep if: } \min(W, H) \geq 512 \]

Removes ~22% of images. Remaining: ~6.2M images.

Stage 2 - Danbooru Score Filter

Keep images with score tag ≥ 3. Remaining: ~4.1M images.

Stage 3 - CLIP Aesthetic Score

\[ s_{\text{aesthetic}} = \mathbf{w}_{\text{aes}}^\top \mathbf{f}_{\text{CLIP}}(\mathbf{x}) \]

Keep images with \(s_{\text{aesthetic}} \geq 4.5\). Remaining: ~3.5M images.

Stage 4 - Tag Density Filter

\[ |\{t \in \mathcal{T}_{\text{img}} : t \notin \mathcal{T}_{\text{meta}}\}| \geq 5 \]

Remaining: ~3.2M images.

Stage 5 - Near-Duplicate Removal

Using perceptual hashing (pHash), for any pair with Hamming distance \(d_H < 8\), keep the higher-resolution version. Remaining: ~3.0M images.
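The near-duplicate rule can be sketched as a greedy pass over images sorted by resolution. This assumes the perceptual hashes are already available as integers and that "higher-resolution" means larger pixel count; the pairwise scan is quadratic and only meant to show the rule, so a dataset of this size would need an index over the hashes instead. The record layout and function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedupe(images, threshold=8):
    """Keep one image per near-duplicate group, preferring higher resolution.

    images: list of (image_id, phash_int, width, height) tuples.
    An image is dropped if its hash is within `threshold` bits (exclusive)
    of an image that has already been kept.
    """
    kept = []
    by_resolution = sorted(images, key=lambda im: im[2] * im[3], reverse=True)
    for img in by_resolution:
        if all(hamming(img[1], k[1]) >= threshold for k in kept):
            kept.append(img)
    return kept
```

Visiting images in descending resolution order means that whenever two images collide, the one already in `kept` is the larger of the pair.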

4.3 Quality Tier Labeling

Quality tier labels \(q \in \{0, 1, 2, 3\}\) are assigned from Danbooru score tags:

\[ q(\mathbf{x}) = \begin{cases} 3 & \text{if score:9 or score:8} \\ 2 & \text{if score:7 or score:6} \\ 1 & \text{if score:5 or score:4} \\ 0 & \text{if score:3} \end{cases} \]

Quality Tier	Danbooru Score	Count	%
QE-3 (Ultra)	8–9	~310,000	10.3%
QE-2 (High)	6–7	~780,000	26.0%
QE-1 (Standard)	4–5	~1,320,000	44.0%
QE-0 (Draft)	3	~590,000	19.7%

The class imbalance is intentional. In Phase 2 and Phase 3, we apply targeted oversampling of QE-3 images to strengthen the ultra quality expert.
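The tier mapping and the table percentages can be checked with a few lines of Python. The mapping above is defined only for scores 3 to 9, so this sketch raises an error outside that range rather than guessing how other scores are handled; the names are illustrative.

```python
def quality_tier(score):
    """Map a Danbooru score in [3, 9] to a quality tier 0-3."""
    if not 3 <= score <= 9:
        raise ValueError("mapping is only defined for scores 3 through 9")
    if score >= 8:
        return 3
    if score >= 6:
        return 2
    if score >= 4:
        return 1
    return 0


# Approximate per-tier counts from the table above.
TIER_COUNTS = {3: 310_000, 2: 780_000, 1: 1_320_000, 0: 590_000}


def tier_shares(counts):
    """Percentage of the dataset in each tier, rounded to one decimal."""
    total = sum(counts.values())
    return {tier: round(100.0 * n / total, 1) for tier, n in counts.items()}
```

The counts sum to 3.0M, and `tier_shares(TIER_COUNTS)` reproduces the percentage column of the table.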

4.4 Style Cluster Labeling

Style labels are derived from a tag-matching heuristic:

Expert	ID	Primary Tags
Dark/Gothic	0	`gothic`, `dark`, `horror`, `blood`, `dark_background`, `skull`
Soft Pastel	1	`soft_focus`, `pastel_colors`, `shoujo`, `watercolor`, `dreamy`
Bold Lineart	2	`manga`, `monochrome`, `lineart`, `sketch`, `doujinshi_style`
Painterly	3	`oil_painting`, `impasto`, `painterly`, `traditional_media`
Flat Color	4	`flat_color`, `vector_art`, `cel_shading`, `limited_palette`
Realistic	5	`realistic`, `semi-realistic`, `photo_background`, `detailed_background`
Chibi	6	`chibi`, `super_deformed`, `cute`, `mini_character`
Cyberpunk	7	`neon_lights`, `cyberpunk`, `glitch`, `hologram`, `sci-fi`

For each image, tag matches per cluster are counted: \(s_i(\mathbf{x}) = |\mathcal{T}_{\text{img}} \cap \mathcal{C}_i|\). If \(\max_i s_i(\mathbf{x}) \geq 2\), the image receives style label \(\ell^S(\mathbf{x}) = \arg\max_i s_i(\mathbf{x})\).

Style	Count	% of labeled
Unlabeled	~1.6M	-
Bold Lineart	~340k	24%
Flat Color	~260k	18%
Soft Pastel	~210k	15%
Dark/Gothic	~180k	13%
Chibi	~150k	11%
Realistic	~130k	9%
Painterly	~90k	6%
Cyberpunk	~40k	3%
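The tag-matching heuristic can be sketched as follows, using the cluster tag lists from the table above. The tie-breaking rule (lowest expert ID wins) is an assumption, since the heuristic as described does not say how ties are resolved.

```python
STYLE_CLUSTERS = {
    0: {"gothic", "dark", "horror", "blood", "dark_background", "skull"},
    1: {"soft_focus", "pastel_colors", "shoujo", "watercolor", "dreamy"},
    2: {"manga", "monochrome", "lineart", "sketch", "doujinshi_style"},
    3: {"oil_painting", "impasto", "painterly", "traditional_media"},
    4: {"flat_color", "vector_art", "cel_shading", "limited_palette"},
    5: {"realistic", "semi-realistic", "photo_background", "detailed_background"},
    6: {"chibi", "super_deformed", "cute", "mini_character"},
    7: {"neon_lights", "cyberpunk", "glitch", "hologram", "sci-fi"},
}


def style_label(tags, min_matches=2):
    """Return the best-matching style ID, or None if no cluster has enough matches."""
    tag_set = set(tags)
    scores = {i: len(tag_set & cluster) for i, cluster in STYLE_CLUSTERS.items()}
    # Highest match count wins; ties go to the lowest ID (assumed rule).
    best = max(scores, key=lambda i: (scores[i], -i))
    return best if scores[best] >= min_matches else None
```

For example, `style_label(["gothic", "skull", "1girl"])` returns 0, while an image with only a single matching tag stays unlabeled and returns `None`.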

4.5 Latent Pre-Encoding

All 3.0M images are encoded through the frozen SDXL VAE before training:

\[ \mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0), \quad \mathbf{z}_0 \in \mathbb{R}^{128 \times 128 \times 4} \]

Each latent in float16: \(128 \times 128 \times 4 \times 2\text{ B} \approx 128\text{ KB}\). Total for 3.0M images: \(\approx 384\text{ GB}\), stored on a RunPod network volume in 3,000 shards of 1,000 latents each.
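The storage arithmetic can be reproduced directly. The sketch assumes exactly 3.0M latents with no per-shard metadata or compression; the names are illustrative.

```python
def latent_bytes(h=128, w=128, c=4, bytes_per_elem=2):
    """Size in bytes of one float16 latent tensor."""
    return h * w * c * bytes_per_elem


NUM_IMAGES = 3_000_000
TOTAL_BYTES = NUM_IMAGES * latent_bytes()

if __name__ == "__main__":
    print(latent_bytes() / 2**10, "KiB per latent")
    print(TOTAL_BYTES / 10**9, "GB (decimal)")
    print(TOTAL_BYTES / 2**30, "GiB (binary)")
```

Each latent is exactly 128 KiB. The 384 GB figure corresponds to 3.0M × 128 with mixed binary and decimal prefixes; in strict units the total is about 393 GB (decimal) or about 366 GiB (binary), all of which fit on the 500 GB volume.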

5. Training

5.1 Phase 0 - Infrastructure and Preprocessing

Duration: Weeks 1–2 | Status: Complete

Dataset download and integrity verification
Filtering pipeline execution (stages 1–5)
Quality tier and style cluster labeling
Tag vocabulary extraction and tokenizer training
VAE pre-encoding of all 3.0M images (~18 hours on a single RTX 4090)
Sharded latent dataset construction
RunPod environment setup and W&B initialization

5.2 Phase 1 - Base Pretraining

Duration: Weeks 3–10 (estimated) | Target steps: 500,000–800,000 | Status: In Progress (~Step 48,000)

Phase 1 trains the full model end-to-end on the complete 3.0M dataset. The MoE routers are active but receive no style or quality supervision: only the diffusion and load-balancing losses apply (\(\lambda_{\text{spec}} = 0\)), so routing starts out close to uniform.

Expert specialization requires the backbone to first develop a shared representation of anime images. If we push specialization from step 0, experts diverge before they have learned anything useful.

Parameter	Value
Base learning rate	\(1 \times 10^{-4}\)
LR schedule	Cosine decay with 2000-step warmup
Optimizer	AdamW (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\))
Weight decay	\(0.01\)
Batch size (per GPU)	32
Gradient accumulation	4 steps
Effective batch size	256
Mixed precision	bf16
Gradient clipping	1.0 (global norm)
EMA decay	0.9999

Learning rate schedule:

\[ \eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\pi \cdot \frac{t - t_{\text{warmup}}}{T - t_{\text{warmup}}}\right)\right) \]
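A sketch of the full schedule, including the warmup segment that the formula above does not cover. Here \(t\) and \(T\) are optimizer steps, not diffusion timesteps. Three things are assumptions: the warmup is linear, \(\eta_{\min} = 0\), and \(T\) is taken as the upper target of 800,000 steps.

```python
import math


def lr_at(step, total_steps, warmup_steps=2000, lr_max=1e-4, lr_min=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumed linear ramp from lr_max / warmup_steps up to lr_max.
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at lr_min past the end of training
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

With these settings the rate peaks at \(10^{-4}\) when warmup ends at step 2,000, passes \(5 \times 10^{-5}\) halfway through the decay, and reaches `lr_min` at the final step.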

EMA:

\[ \theta_{\text{EMA}} \leftarrow \lambda \cdot \theta_{\text{EMA}} + (1 - \lambda) \cdot \theta, \quad \lambda = 0.9999 \]

Timestep sampling uses log-normal sampling (Karras et al., 2022):

\[ \log t \sim \mathcal{N}(\mu_t, \sigma_t^2), \quad \mu_t = -1.2, \quad \sigma_t = 1.2 \]

Distributed training uses PyTorch DDP across 2 GPUs. Effective batch size: \(32 \times 2 \times 4 = 256\) samples per optimizer step.

5.3 Phase 2 - Expert Specialization

Duration: Weeks 11–14 (estimated) | Target steps: ~200,000 | Status: Pending

Phase 2 freezes the DiT backbone and tag encoder, training only the expert FFN weights and router projections.

Style specialization loss for labeled images:

\[ \mathcal{L}_{\text{spec}}^S = -\sum_{\mathbf{x} \in \mathcal{D}_{\text{labeled}}} \log \text{softmax}(\mathbf{r}^S_{\mathbf{x}})_{\ell^S(\mathbf{x})} \]

Quality specialization loss:

\[ \mathcal{L}_{\text{spec}}^Q = -\sum_{\mathbf{x} \in \mathcal{D}} \log \text{softmax}(\mathbf{r}^Q_{\mathbf{x}})_{q(\mathbf{x})} \]

Reduced learning rate: \(\eta_2 = 1 \times 10^{-5}\)

Ultra-quality oversampling: QE-3 images are oversampled 3× to compensate for the smaller pool (310k vs 1.32M for QE-1).

5.4 Phase 3 - Joint Fine-tuning

Duration: Weeks 15–18 (estimated) | Target steps: ~150,000 | Status: Pending

Phase 3 unfreezes all weights and trains end-to-end on a curated top-500k subset, selected by:

\[ \text{score}_{\text{rank}} = 0.5 \cdot s_{\text{aesthetic}} + 0.3 \cdot \mathbf{1}[q = 3] + 0.2 \cdot \frac{|\mathcal{T}_{\text{img}}|}{|\mathcal{T}_{\text{max}}|} \]

Learning rate: \(\eta_3 = 5 \times 10^{-6}\)

The key hypothesis tested in Phase 3 is whether the routing system correctly composes: a prompt for gothic, dark, masterpiece should simultaneously activate Expert 0 (Dark/Gothic) via the style router AND QE-3 (Ultra) via the quality router, with both routing decisions interacting in the same forward pass.

5.5 Noise Schedule

Phase 1 uses a linear schedule:

\[ \beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}(\beta_{\text{end}} - \beta_{\text{start}}) \]

with \(\beta_{\text{start}} = 0.00085\), \(\beta_{\text{end}} = 0.012\), \(T = 1000\). Zero-terminal SNR is enforced by setting \(\beta_T = 1\).
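The schedule and its cumulative products can be sketched in plain Python. The sketch follows the description above literally, overriding only the final \(\beta_T\); note that Lin et al. (2023) describe a rescaling of the whole schedule instead, so this should be read as an illustration of the stated rule rather than of their algorithm. Function names are illustrative.

```python
def linear_betas(T=1000, beta_start=0.00085, beta_end=0.012, zero_terminal=True):
    """beta_1 .. beta_T on a linear ramp, optionally forcing beta_T = 1."""
    betas = [
        beta_start + (t - 1) / (T - 1) * (beta_end - beta_start)
        for t in range(1, T + 1)
    ]
    if zero_terminal:
        betas[-1] = 1.0  # makes alpha_bar_T exactly zero
    return betas


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar); zero at the terminal step."""
    return alpha_bar / (1.0 - alpha_bar)
```

With the override in place, `alpha_bars(linear_betas())[-1]` is exactly 0, so the terminal SNR is zero, while the second-to-last step still retains a small amount of signal.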

Phase 3 will experiment with a cosine schedule (Nichol & Dhariwal, 2021):

\[ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \quad s = 0.008 \]

6. Compute and Infrastructure

6.1 VRAM Budget (Per GPU)

Component	Memory
Model weights (bf16, ~875M active params)	~1.75 GB
Optimizer states (AdamW: 2× params)	~3.5 GB
Gradients	~1.75 GB
Activations (with gradient checkpointing)	~8.0 GB
Flash Attention workspace	~2.0 GB
Latent batch buffer (32 × 128×128×4)	~0.5 GB
EMA model copy	~1.75 GB
CUDA context + overhead	~4.0 GB
Total	~23.25 GB

Gradient Checkpointing

Without gradient checkpointing, storing all intermediate activations for backprop through 28 transformer blocks at sequence length 4096 would require ~40 GB per GPU. Gradient checkpointing recomputes activations during the backward pass, trading 30–40% additional compute for a ~5× reduction in activation memory (from ~40 GB to the ~8 GB budgeted above).

6.2 Training Speed Estimates

Phase	Target Steps	Estimated Hours
Phase 1	800,000	1,333–1,600
Phase 2	200,000	333–400
Phase 3	150,000	250–300
Total	1,150,000	~1,916–2,300
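The hour estimates in the table are consistent with a throughput of roughly 500 to 600 optimizer steps per hour (6 to 7.2 seconds per step). That throughput is inferred from the table rather than stated in it, so the sketch below treats it as an assumption.

```python
PHASE_STEPS = {"Phase 1": 800_000, "Phase 2": 200_000, "Phase 3": 150_000}


def hours_range(steps, fast_steps_per_hour=600.0, slow_steps_per_hour=500.0):
    """(best-case, worst-case) wall-clock hours for a given number of steps."""
    return steps / fast_steps_per_hour, steps / slow_steps_per_hour


if __name__ == "__main__":
    for name, steps in PHASE_STEPS.items():
        best, worst = hours_range(steps)
        print(f"{name}: {best:,.0f} to {worst:,.0f} hours")
    best, worst = hours_range(sum(PHASE_STEPS.values()))
    print(f"Total: {best:,.0f} to {worst:,.0f} hours")
```

At that rate the full 1.15M-step plan comes to roughly 80 to 96 days of continuous training on the two GPUs.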

6.3 RunPod Configuration

GPU:            2× NVIDIA RTX 4090
VRAM:           48 GB total
System RAM:     64 GB
Storage:        500 GB Network Volume
Docker Image:   pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
Python:         3.11
Key Libraries:  torch==2.3.0, flash-attn==2.5.8, diffusers==0.27.0,
                transformers==4.40.0, wandb, accelerate, datasets

Checkpoint strategy:

Full checkpoint (weights + optimizer + EMA) every 5,000 steps → /network_volume/checkpoints/full/
EMA-only checkpoint every 1,000 steps → /network_volume/checkpoints/ema/
Sample grid (4×4 images from fixed prompts) every 5,000 steps → W&B Artifacts

7. Evaluation

7.1 Fréchet Inception Distance (FID)

\[ \text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right) \]

We evaluate on a held-out set of 10,000 Danbooru images not seen during training, tracked separately per quality tier:

\[ \text{FID}_q = \text{FID}(\mathcal{D}_q^{\text{real}}, \mathcal{G}_q^{\text{generated}}), \quad q \in \{0, 1, 2, 3\} \]

If quality expert routing is working correctly: \(\text{FID}_3 < \text{FID}_2 < \text{FID}_1 < \text{FID}_0\).

7.2 CLIP Score

\[ \text{CLIP Score} = \mathbb{E}\left[\frac{\mathbf{f}_{\text{img}} \cdot \mathbf{f}_{\text{text}}}{\|\mathbf{f}_{\text{img}}\| \|\mathbf{f}_{\text{text}}\|}\right] \]

Since our tag encoder is Danbooru-native, CLIP score is an imperfect but useful cross-model comparison baseline.

7.3 Expert Utilization Metrics

Style expert entropy:

\[ H_S = -\sum_{i=0}^{7} \bar{g}_i^S \log \bar{g}_i^S \]

Maximum entropy (\(H_S = \log 8 \approx 2.08\)) means uniform routing. Decreasing entropy after Phase 2 indicates experts are differentiating.
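The entropy metric is straightforward to compute from a vector of mean gate values. The sketch uses natural logarithms, which is what the \(\log 8 \approx 2.08\) figure implies, and treats \(0 \log 0\) as 0; the function name is illustrative.

```python
import math


def routing_entropy(mean_gates):
    """Shannon entropy in nats of a routing distribution over experts."""
    return -sum(g * math.log(g) for g in mean_gates if g > 0.0)


UNIFORM = [1 / 8] * 8           # every expert used equally
COLLAPSED = [1.0] + [0.0] * 7   # all traffic to a single expert
```

`routing_entropy(UNIFORM)` equals \(\log 8\) and `routing_entropy(COLLAPSED)` is 0, which bracket the range the metric can take for 8 experts.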

Style routing accuracy (primary Phase 2 metric):

\[ \text{Acc}^S = \frac{1}{|\mathcal{D}_{\text{labeled}}^{\text{val}}|} \sum_{\mathbf{x} \in \mathcal{D}_{\text{labeled}}^{\text{val}}} \mathbf{1}\!\left[\arg\max_i g_i^S(\mathbf{x}) = \ell^S(\mathbf{x})\right] \]

Expected to rise from ~12.5% (random baseline for 8 classes) to 55–70% by end of Phase 2.

Expert collapse alert threshold: \(f_i > 0.7\) for any single expert triggers a W&B alert and may require adjusting \(\alpha_S\).

8. Preliminary Observations

Info

These observations are early and should not be interpreted as results. Phase 1 is approximately 6–10% complete at time of writing.

Loss behavior: The diffusion loss decreased from ~0.98 at step 1 to ~0.31 at step 48,000. The rate has slowed as expected - the first 20,000 steps saw most of the rapid improvement.

Expert load balancing: Style expert routing frequencies at step 48,000 range from 0.09 to 0.16 (uniform expected: 0.125). Routers have begun developing mild preferences even without explicit specialization supervision.

Sample quality: At step 48,000, generated samples show coherent structure (recognizable faces, plausible limb positions, consistent backgrounds) but significant detail artifacts - blurry linework, inconsistent shading, color bleeding at edges. This is within expected range for this stage.

Router gradient norms: Style and quality router gradients (~0.02–0.05) are roughly 5× lower than backbone block gradients (~0.15–0.25). Expected - shallow linear projections receive indirect gradient signal from expert outputs.

9. Discussion

9.1 Why Train From Scratch

Research integrity: Starting from SDXL or any derivative entangles our results with decisions made by those teams. We cannot know which improvements come from our expert system versus the pretrained weights. Training from scratch gives a clean baseline.

Architectural freedom: Fine-tuning an existing model with an embedded MoE system would require surgical modification of a pretrained architecture - new MoE layers would be randomly initialized while the rest is pretrained, creating difficult-to-control training dynamics.

9.2 Limitations

Compute constraints: Two RTX 4090s are modest hardware for a 1.3B parameter model. We do not claim to outperform models with larger compute budgets on raw FID.
Style label quality: Tag-based style labeling is noisy. A dedicated visual style classifier would improve Phase 2 specialization.
Quality label calibration: Danbooru scores carry recency and fandom biases. Score-based quality labels are imperfect proxies for aesthetic quality.
VAE lock-in: Using the frozen SDXL VAE couples our latent space to Stability AI's architectural decisions.

9.3 Future Directions (VampDiffusion-V2, Speculative)

Native VAE trained on anime images (breaking SDXL dependency)
Extended expert pool - potentially 16 style experts
Preference alignment via DPO using VampDevAI user feedback
Resolution scaling to 2048×2048 with tiled attention
Auxiliary RLHF stage for tag adherence

10. Team

VampDiffusion-V1 is developed by a team of four researchers operating under the Vampelium Dev Research Division. Individual identities are not disclosed.

Codename	Role
Vampelium	Architecture Lead - DiT design, MoE system, training loop
Death Executioner	Training Infrastructure - RunPod, DDP, checkpointing, monitoring
Kmax	Dataset & Pipeline - filtering, labeling, pre-encoding, data loading
Rem	Evaluation & Tooling - metrics, sample grids, analysis, tooling

11. Experiment Log

Date	Hash	Entry
2026-04-01	`#a1f3c2`	Project initialized. Architecture finalized after three rounds of VRAM budget analysis.
2026-04-14	`#b29e11`	Dataset filtering pipeline complete. 3.06M images retained from 8.0M raw.
2026-04-22	`#d48c32`	Tag vocabulary built. 118,442 unique tags after frequency pruning.
2026-05-02	`#f84a30`	VAE pre-encoding complete. 384 GB latent cache written to network volume.
2026-05-08	`#77c1a0`	Docker environment finalized. DDP confirmed working across both GPUs.
2026-05-10	`#cc72de`	Phase 1 training started. Initial loss: 0.982.
2026-05-16	`#9b1f55`	Step 48,000. Loss: 0.31. Expert routing frequencies stable. Samples coherent but soft.

References

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
Song, Y., et al. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024.
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022.
Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021.
Karras, T., et al. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022.
Lin, S., et al. (2023). Common Diffusion Noise Schedules and Sample Steps are Flawed. WACV 2024.
Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432.
Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.

VampDiffusion-V1 is an internal research experiment by Vampelium Dev. Last updated: 2026-05-16.