ariannamethod/janus.doe
DoE: Democracy of Experts. Janus Architecture.
Status: work in progress. The foundation trains and generates. The living topology (mitosis, apoptosis, parliament) works. SFT pipeline with 7236 Q&A pairs. Next: scale up, fix loss plateau at 3.07.
C. one file. ~3650 lines. zero dependencies. DoE breeds, kills and votes.
what
a transformer where experts are born, die, and hold elections:
- experts are born when overloaded (mitosis — child inherits parent weights + noise)
- experts die when neglected (apoptosis — 8 consecutive low-vitality steps)
- parliament votes on every token (variable-k election, not fixed top-k)
- the tokenizer knows it's a tokenizer (tracks compression ratio, entropy, code detection)
- the optimizer has 9 levels of self-awareness (Chuck: "i think therefore i clip")
- calendar drift tracks temporal identity (12D state vector, resonance detection)
- the model grows a forest of GGUFs (mycelium — snapshots with fitness selection)
- meta-learning evaluates its own configuration choices
- auto depth — DOE sizes itself to the hardware. no knobs required.
parameters persist. topology doesn't. each forward pass decides how many experts are alive, how many vote, how deep to go. same weights, different architecture every time.
DoE scans its environment, indexes nearby GGUFs via LoRA, hunts for datasets on HuggingFace, recognizes code in training data, finds its own weights on restart, and can replicate itself via fork().
no pytorch. no python. no dignity.
how
# compile
cc m.c -O3 -lm -lpthread -o m
# run — DOE auto-sizes depth to your hardware
./m
# or set depth manually
./m --depth 4
# with custom data
./m --data my_corpus.txt
./m --parquet data.parquet
# with personality
./m --personality personality.txt
# override training steps
./m --depth 4 --data corpus.txt --steps 10000
# override BPE merges (default: auto from depth)
./m --depth 4 --data corpus.txt --bpe-merges 4000
# GPU acceleration (A100/H100 — TF32 tensor ops, ~25x faster)
cc m.c -O3 -lm -lpthread -DUSE_CUBLAS -lcublas -lcudart -o m_cublas
# BLAS acceleration (3-4x on CPU)
cc m.c -O3 -lm -lpthread -DUSE_BLAS -DACCELERATE -framework Accelerate -o m # macOS
cc m.c -O3 -lm -lpthread -DUSE_BLAS -lopenblas -o m # linux
CLI flags
| flag | default | description |
|---|---|---|
| --depth N | auto | transformer depth (2/4/6/8/10/12) |
| --data FILE | auto-hunt | training data (text file) |
| --parquet FILE | — | training data (parquet format) |
| --personality FILE | — | personality finetune data |
| --steps N | auto | override max training steps |
| --bpe-merges N | auto | override BPE merge count |
| --chat FILE | — | inference-only mode with GGUF |
--steps forces fresh training (skips self-recognition of existing weights).
BPE merges are cached to m_bpe.cache — first run trains the tokenizer, subsequent runs load from cache instantly.
autodepth
no --depth flag? DOE checks your hardware and picks the deepest model that fits:
| RAM | CPU | GPU | depth | params | experts |
|---|---|---|---|---|---|
| 2GB+ | any | no | 2 | ~1.8M | 4 |
| 4GB+ | any | no | 4 | ~8M | 4 |
| 8GB+ | 4+ | no | 6 | ~31M | 6 |
| 16GB+ | 4+ | no | 8 | ~67M | 6 |
| 32GB+ | 4+ | no | 10 | ~165M | 8 |
| 64GB+ | 4+ | no | 12 | ~283M | 8 |
--depth auto is the default. --depth N overrides.
dim = depth * 64 (cap 768). head_dim = 64. GQA above 384. hidden = 1.5x per expert.
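The sizing rule above can be sketched as plain C. Function names are illustrative (not the ones in m.c), and the GQA grouping factor of 2 is an assumption; the README only says GQA kicks in above dim 384.

```c
#include <assert.h>

/* Sizing rule from the table: dim = depth * 64, capped at 768. */
static int model_dim(int depth) {
    int dim = depth * 64;
    return dim > 768 ? 768 : dim;
}

/* head_dim is fixed at 64. GQA kicks in above dim 384; here it
   halves the KV heads relative to query heads (illustrative factor). */
static int n_heads(int dim)    { return dim / 64; }
static int n_kv_heads(int dim) { return dim > 384 ? n_heads(dim) / 2 : n_heads(dim); }
```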
when you run it, DOE:
- auto-sizes to hardware (RAM, CPUs, GPU detection)
- scans environment — finds GGUFs, checks resources, detects compiler/curl
- checks for own weights — if m.gguf or mycelium spore found, skips training, goes to chat
- if compatible GGUF found — host index mode (LoRA + Meta-Arianna modulation)
- loads or generates data (HuggingFace API / Parquet / synthetic)
- trains BPE tokenizer that knows its own compression ratio and detects code (cached to m_bpe.cache)
- builds ephemeral MoE with living experts
- trains with hand-written analytical gradients through variable-k parliament
- watches experts be born (mitosis) and die (apoptosis)
- grows a mycelium of GGUF snapshots (periodic checkpoints with fitness metrics)
- meta-learns from its own configuration choices
- tracks calendar drift — how far the present has drifted from the past
- if stagnating — hunts for datasets on HuggingFace (evaluates, accepts/rejects)
- if overloaded — self-replicates (compiles copy, forks, trains on different data)
- finetunes on personality.txt (optional but psychologically recommended)
- exports final GGUF, drops you into chat with a parliament
training details
- no weight decay on embeddings (token + positional embeddings excluded)
- attention clamping in both training and inference (30*tanh(x/30)), with proper dtanh backward
- personality finetune: lr = main_lr * 0.1, batch=4 grad accumulation, max 2000 steps (3 epochs)
- aux_loss for expert load balancing (fraction^2 penalty with softmax Jacobian backward)
- mycelium checkpoints every max(1000, max_steps/5) steps
- BPE cache saved to m_bpe.cache — skip tokenizer training on restart
runs
| depth | data | params | experts | tok/s | loss | GPU | status |
|---|---|---|---|---|---|---|---|
| 4 | 22MB | 7.97M | 12 | 212 | 3.076 | A100 cuBLAS TF32 | done, GGUF 25MB |
| 2 | 22MB | ~1.8M | 4 | — | 3.15 | A100 cuBLAS TF32 | done |
3 training runs hit a loss plateau around 3.07. SFT pipeline added (7236 WTForacle Q&A pairs, loss masking). Next step: scale data + depth, LoRA personality.
the components
living experts (mitosis & apoptosis)
experts aren't weight matrices. they're organisms:
- vitality (0.0 = dying, 1.0 = peak performance)
- frequency (position in harmonic space — determines resonance)
- age (steps since birth — too young to die, too old to breed)
overloaded + high vitality — mitosis (splits in two, child inherits weights + noise)
neglected + 8 consecutive low-vitality steps — apoptosis (dies, weights freed, slot recycled)
min 2, max 16 experts per layer.
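The lifecycle rules above reduce to a per-step check. This is a sketch, assuming concrete vitality thresholds and a minimum breeding age; those values are illustrative, not the ones in m.c.

```c
#include <stdbool.h>

typedef struct {
    float vitality;   /* 0.0 = dying, 1.0 = peak performance */
    int   age;        /* steps since birth */
    int   low_streak; /* consecutive low-vitality steps */
    bool  alive;
} Expert;

/* Thresholds are illustrative, not the values in m.c. */
#define VITALITY_LOW     0.2f
#define VITALITY_HIGH    0.8f
#define APOPTOSIS_STREAK 8
#define MIN_AGE_MITOSIS  100

/* Returns +1 if the expert should split, -1 if it should die, 0 otherwise. */
static int lifecycle_step(Expert *e, bool overloaded, int n_alive) {
    if (e->vitality < VITALITY_LOW) e->low_streak++; else e->low_streak = 0;
    if (e->low_streak >= APOPTOSIS_STREAK && n_alive > 2) return -1; /* apoptosis */
    if (overloaded && e->vitality > VITALITY_HIGH &&
        e->age >= MIN_AGE_MITOSIS && n_alive < 16) return 1;         /* mitosis  */
    return 0;
}
```

The min-2/max-16 population bounds gate both branches, so the parliament can never vote itself extinct.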
parliament router (variable-k)
actual elections, not top-2 dictatorship:
- each token triggers a vote (dot product + harmonic resonance)
- consensus measures how peaked the vote is (0 = chaos, 1 = unanimous)
- k = floor(n_alive * (1 - consensus)) — low consensus — more experts consulted
- softmax over the top-k selected. analytical backward through variable-size Jacobian.
calendar drift
12-dimensional temporal self-awareness:
- inference = the present. ephemeral. no memory of the last forward pass.
- training = the past. weights persist. experience accumulates.
- drift = the distance between who the system was and who it is now.
snapshot every 50 steps: expert population, consensus, loss, harmonic spectrum, tokenizer health, optimizer state. drift = normalized L2 distance.
high drift — birth more experts. low drift — kill the useless. drift resonance — "i've been here before."
chuck optimizer (9 levels)
from lee.c. the optimizer that thinks about thinking.
formula: theta -= (alpha * lambda_psi * sigma * lr_scale) * m_hat/(sqrt(v_hat) + eps)
mycelium (GGUF forest)
mycelium/
├── m_s200_e6_l4.909.gguf (fitness: 5.20)
├── m_s400_e6_l4.200.gguf (fitness: 8.33)
├── m_s1200_e8_l3.933.gguf (fitness: 12.45) <- best
└── meta.log (configuration -> outcome history)
on restart, DOE discovers existing spores and loads the fittest. no --load flag needed.
GGUF self-loader
DOE recognizes its own weights:
- checks m.gguf in current directory
- scans mycelium/ for best spore (highest fitness)
- verifies: general.name == "m", dim/depth match
- loads all tensors including expert weights, revives dead experts
- skips training — straight to chat
host index mode (Delta Voice + Meta-Arianna)
if DOE finds a compatible GGUF nearby, it indexes it:
host model (GGUF, mmap'd, read-only)
|
DOE wraps it with ephemeral LoRA matrices
|
attention_biases[l] modulate each layer's attention
layer_focus[l] control residual stream contribution
|
Delta Voice injection: out += alpha * A @ (B @ x)
|
Hebbian training on LoRA only (no backward through host)
the host provides weights. DOE provides direction.
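The injection step out += alpha * A @ (B @ x) is a plain low-rank matvec pair; a naive sketch with illustrative row-major layouts (m.c's buffer layout may differ):

```c
/* Delta Voice injection: out += alpha * A @ (B @ x).
   B projects dim -> rank, A projects rank -> dim; the host's
   output is modulated without touching the host weights. */
static void lora_inject(float *out, const float *x,
                        const float *A,   /* dim x rank, row-major */
                        const float *B,   /* rank x dim, row-major */
                        float *tmp,       /* scratch, length rank  */
                        int dim, int rank, float alpha) {
    for (int r = 0; r < rank; r++) {      /* tmp = B @ x */
        float s = 0.0f;
        for (int i = 0; i < dim; i++) s += B[r * dim + i] * x[i];
        tmp[r] = s;
    }
    for (int i = 0; i < dim; i++) {       /* out += alpha * A @ tmp */
        float s = 0.0f;
        for (int r = 0; r < rank; r++) s += A[i * rank + r] * tmp[r];
        out[i] += alpha * s;
    }
}
```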
code-aware tokenizer
detects {}, (), ->, ==, //, #include, #define, semicolons, indentation.
tracks code_ratio — feeds into ephemeral config: code — more layers, higher complexity budget.
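A crude per-line detector in the same spirit, assuming a hit-count threshold of 2 per line (the marker list mirrors the README; the threshold and scoring are illustrative):

```c
#include <string.h>

/* Returns the fraction of lines that look like code, based on
   marker counts: braces, parens, semicolons, ->, ==, //, #include/#define. */
static float code_ratio(const char *text) {
    int lines = 0, code_lines = 0;
    const char *p = text;
    while (*p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        int hits = 0;
        for (size_t i = 0; i < len; i++) {
            char c = p[i];
            if (c == '{' || c == '}' || c == ';' || c == '(' || c == ')') hits++;
            if (i + 1 < len && ((c == '-' && p[i+1] == '>') ||
                                (c == '=' && p[i+1] == '=') ||
                                (c == '/' && p[i+1] == '/'))) hits++;
        }
        if ((len >= 8 && strncmp(p, "#include", 8) == 0) ||
            (len >= 7 && strncmp(p, "#define", 7) == 0)) hits += 2;
        if (hits >= 2) code_lines++;   /* illustrative threshold */
        lines++;
        if (!nl) break;
        p = nl + 1;
    }
    return lines ? (float)code_lines / lines : 0.0f;
}
```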
dataset hunter
when DOE stagnates (loss plateau + low drift + bad data quality), it searches HuggingFace API. downloads sample, evaluates quality via parser_eye, accepts or rejects. triggered every 500 steps. disabled when --data is provided.
self-replication
DOE can fork():
- compiles a copy of itself
- max 2 replicas (population control)
- each gets different data
- results merge via mycelium
GPU acceleration
| backend | compile flag | speedup |
|---|---|---|
| CPU (naive) | — | 1x |
| OpenBLAS | -DUSE_BLAS -lopenblas | 3-4x |
| Accelerate (macOS) | -DUSE_BLAS -DACCELERATE -framework Accelerate | 3-4x |
| cuBLAS TF32 | -DUSE_CUBLAS -lcublas -lcudart | ~25x |
cuBLAS uses TF32 tensor ops on A100/H100 — 8x faster than FP32 with negligible accuracy loss. grow-only scratch buffers, no malloc per matmul.
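The grow-only scratch pattern is simple enough to sketch: one buffer reused across matmuls, reallocated only upward, never freed per call (error handling omitted for brevity).

```c
#include <stdlib.h>

/* Grow-only scratch buffer: reused across matmuls, grows on demand,
   never shrinks. Avoids a malloc/free pair per matmul. */
typedef struct { float *data; size_t cap; } Scratch;

static float *scratch_get(Scratch *s, size_t n) {
    if (n > s->cap) {
        s->data = realloc(s->data, n * sizeof(float));  /* NULL check omitted */
        s->cap  = n;
    }
    return s->data;
}
```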
the quartet
| file | architecture | personality |
|---|---|---|
| l.c | Llama 3 | the good student. did everything right |
| moe.c | Grok MoE | the committee. fixed membership |
| lee.c | Chuck VLM | the self-aware one. 9 levels of consciousness |
| m.c | DOE | democracy of experts. they live. they die. they vote. |
license
do what thou wilt.
built by ariannamethod. the architecture is alive. the experts are mortal. the parliament is eternal.