№ 28: KAN Models Organize Thought — MLP Models Don't

We trained five transformer language models with identical architectures but different activation functions, then removed each layer one at a time. The results reveal a fundamental architectural difference: KAN models develop specialized layers with measurable three-phase processing. MLP models distribute everything uniformly. The activation function basis determines whether a network thinks in organized stages or in diffuse parallel processing.

The Experiment

We trained five models on TinyStories (134M tokens), all with the same transformer skeleton: d_model=128, 8 attention heads, GPT-2 tokenizer. The only differences were the feedforward activation function and whether AttnRes (attention over depth blocks, from Kimi's March 2026 paper) was applied.

Model              Activation           AttnRes   Layers   Loss
Chebyshev KAN      Learned polynomial   No        16       2.51
KAN+AttnRes        B-spline (LUT=256)   Yes       16       2.43
KAN+AttnRes Deep   B-spline (LUT=256)   Yes       32       2.40
MLP                GELU (fixed)         No        16       2.16
MLP+AttnRes        GELU (fixed)         Yes       16       2.38

Note: MLP achieves the lowest loss — it's the best language model. But loss isn't the whole story.
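To make the comparison concrete, here is a minimal PyTorch-style sketch of the two feedforward flavours, written from the descriptions above rather than from the actual training code. The names ChebyshevKANLayer, mlp_ffn, and kan_ffn are ours; the B-spline (LUT=256) variants would swap the polynomial basis for a lookup-table spline but follow the same shape.

    import torch
    import torch.nn as nn

    class ChebyshevKANLayer(nn.Module):
        """Hypothetical sketch of a KAN-style layer: every edge applies a
        learned curve expressed in a Chebyshev polynomial basis, instead of
        a fixed GELU applied after a plain linear map."""
        def __init__(self, d_in, d_out, degree=4):
            super().__init__()
            self.degree = degree
            # one learned coefficient per (input dim, output dim, basis function)
            self.coeffs = nn.Parameter(torch.randn(d_in, d_out, degree + 1) * 0.01)

        def forward(self, x):
            x = torch.tanh(x)                    # squash into [-1, 1], the Chebyshev domain
            T = [torch.ones_like(x), x]          # T0(x) = 1, T1(x) = x
            for _ in range(2, self.degree + 1):
                T.append(2 * x * T[-1] - T[-2])  # recurrence: T_n = 2x*T_{n-1} - T_{n-2}
            basis = torch.stack(T, dim=-1)       # (..., d_in, degree + 1)
            return torch.einsum('...ik,iok->...o', basis, self.coeffs)

    def mlp_ffn(d_model):
        # the fixed-activation baseline: GELU between two linear maps
        return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))

    def kan_ffn(d_model, degree=4):
        # same overall shape, but every scalar nonlinearity is learned
        return nn.Sequential(ChebyshevKANLayer(d_model, 4 * d_model, degree),
                             ChebyshevKANLayer(4 * d_model, d_model, degree))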

The Ablation: Remove One Layer at a Time

For each model, we replaced each layer with an identity transformation (skipping it), generated text from the same prompt with the same random seed, and compared the output to the baseline.
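A minimal sketch of that loop, assuming a HuggingFace-style generate() and that the transformer blocks are reachable as model.blocks (both are assumptions; the real module paths may differ):

    import torch

    @torch.no_grad()
    def ablate_layers(model, tokenizer, prompt, max_new_tokens=200, seed=0):
        """Single-layer ablation sketch: swap each block for an identity and
        regenerate from the same prompt and seed. Assumes each block takes a
        single tensor argument; adapt if the real blocks take masks etc."""
        torch.manual_seed(seed)
        ids = tokenizer.encode(prompt, return_tensors='pt')
        baseline = tokenizer.decode(model.generate(ids, max_new_tokens=max_new_tokens)[0])

        results = {}
        for i in range(len(model.blocks)):
            original = model.blocks[i]
            model.blocks[i] = torch.nn.Identity()  # skip this layer entirely
            torch.manual_seed(seed)                # same sampling randomness as baseline
            out = tokenizer.decode(model.generate(ids, max_new_tokens=max_new_tokens)[0])
            model.blocks[i] = original             # restore before the next ablation
            results[i] = (out, out == baseline)
        return baseline, results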

The results split cleanly along the KAN/MLP divide:

MLP (16 layers): Uniformly Robust

Remove any single layer from the MLP model and the output remains coherent. The worst that happens is a character name change (Lily → Timmy) or a detail shift (park → garden). No layer is individually essential. Every layer does a little bit of everything.

KAN+AttnRes (16 layers): Fragile and Specialized

Remove layers 0, 1, 2, 8, or 15 and you get gibberish or a completely different story. The KAN model has essential layers — remove the wrong one and the processing pipeline collapses. But this fragility isn't a weakness. It's evidence of specialization.

Three Phases of Thought

The 32-layer KAN+AttnRes model reveals the full anatomy. Its layers organize into three distinct phases:

Phase 1: Perception (layers 0–3)

All essential. Removing any one produces a completely different story or broken grammar. These layers process the raw input tokens — "what am I looking at?"

Phase 2: Cogitation (layers 4–15)

Massively redundant. Remove any single layer in this range and the output barely changes. The model builds semantic understanding here — "what do these tokens mean?" — but distributes the work so broadly that no individual layer is critical. The exception is layer 12, near the hand-off from cogitation to generation, which appears uniquely essential.

Phase 3: Generation (layers 16–31)

Each layer adds flavor. Removing one doesn't break the story but changes the details:

  • Skip layer 16 → different setting (village instead of park)
  • Skip layer 18 → different detail (bugs instead of food)
  • Skip layer 22 → different plot (secret instead of play)
  • Skip layer 25 → different mood (sad boy instead of happy girl)
  • Skip layer 27 → surreal output ("the sky was a little girl")

The deepest layers (28–31) handle output quality: removing them causes repetition and "adjective soup" — the model cycles through attributes without committing to a word.

Layer 31 Is the Editor

We captured what each layer predicts during generation. After the text "She loved to play," the deep layers deliberate:

Layer 24: and(0.63)  ,(0.32)              ← "continue or pause?"
Layer 28: and(0.78)  with(0.08)           ← alternatives emerge  
Layer 30: and(0.42)  with(0.31)  in(0.21) ← three-way competition
Layer 31: with(0.48)  outside(0.38)       ← DECISION

Layer 31 is the tiebreaker. It takes the deliberation from layers 24–30 and publishes a decision. Its AttnRes attention pattern puts 99% on the current representation — it doesn't look back at earlier blocks. It has already absorbed what it needs. Without layer 31, the model produces adjective soup: attributes without commitment.
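The per-layer predictions above come from a logit-lens-style probe: push each layer's hidden state through the final norm and the unembedding matrix, then read off the top next-token candidates. A rough sketch, assuming entry points named model.embed, model.blocks, model.final_norm, and model.lm_head (our names, not necessarily the project's):

    import torch

    @torch.no_grad()
    def layer_predictions(model, tokenizer, text, top_k=3):
        """Print what each layer would predict for the next token if decoding
        stopped there. A sketch, not the project's exact tooling."""
        ids = tokenizer.encode(text, return_tensors='pt')
        x = model.embed(ids)                       # assumed embedding entry point
        for i, block in enumerate(model.blocks):
            x = block(x)
            logits = model.lm_head(model.final_norm(x[:, -1]))  # project last position
            probs = torch.softmax(logits, dim=-1)
            top = torch.topk(probs, top_k)
            tokens = [tokenizer.decode([int(t)]) for t in top.indices[0]]
            print(f"Layer {i:2d}:", list(zip(tokens, top.values[0].tolist())))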

The Semantic Bell Curve

We measured semantic discrimination at each layer: do same-category word pairs (dog/cat, fire/flame) have higher cosine similarity than different-category pairs (dog/car, fire/tree)?

                    No AttnRes           With AttnRes
KAN activations:    peaks mid, fades     GROWS through deep layers
MLP activations:    negative throughout  still negative

The 32-layer KAN+AttnRes shows the complete arc:

Layers  0–3:   negative    (perception — no semantic content)
Layers  4–15:  RISING      (cogitation — meaning builds)
Layer   15:    PEAK +0.010 (maximum semantic discrimination)
Layers 16–23:  declining   (generation — semantics being consumed)
Layers 24–31:  NEGATIVE    (output — semantics fully transformed)

The semantic signal is built during cogitation, peaks at layer 15, then is consumed by the generation layers to produce next-token predictions. The deep layers are organized by prediction utility, not semantic category. MLP models show no semantic discrimination at any depth — the signal was never built.
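For reference, a minimal sketch of the discrimination score as described: mean cosine similarity of same-category pairs minus mean similarity of different-category pairs, computed from hidden states at a given layer. It reuses the assumed model.embed / model.blocks entry points from the earlier sketches; the real probe may average subword tokens or use captured activations instead.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def discrimination_score(model, tokenizer, same_pairs, diff_pairs, layer):
        """Positive score: same-category pairs are closer than different-category
        pairs at this depth. Negative: no usable semantic separation."""
        def embed(word):
            ids = tokenizer.encode(" " + word, return_tensors='pt')
            x = model.embed(ids)
            for block in model.blocks[:layer + 1]:
                x = block(x)
            return x[0, -1]                        # hidden state of the last subword

        def mean_sim(pairs):
            sims = [F.cosine_similarity(embed(a), embed(b), dim=0) for a, b in pairs]
            return torch.stack(sims).mean()

        return (mean_sim(same_pairs) - mean_sim(diff_pairs)).item()

    # e.g. discrimination_score(model, tokenizer,
    #          same_pairs=[("dog", "cat"), ("fire", "flame")],
    #          diff_pairs=[("dog", "car"), ("fire", "tree")], layer=15)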

Why This Matters

The difference between KAN and MLP isn't just about accuracy or loss. It's about how the network organizes its processing.

"KANs organize thought processes in a measurable way compared to MLPs, which are diffuse."

MLP's uniform distribution is fault-tolerant but opaque. Every layer does everything; nothing is individually interpretable. KAN's specialization is structured and measurable. We can identify perception, cogitation, and generation phases. We can watch the model deliberate between alternatives. We can see layer 31 make the final editorial decision.

The activation function basis — learned curves (KAN) vs fixed ReLU/GELU (MLP) — determines whether a network develops organized thought or diffuse parallel processing. Both produce language. Only one produces visible thought.

What's Next

A 64-layer KAN+AttnRes model is training now. The question: does doubling depth again produce more redundancy (the same phases stretched over more layers) or emergent functionality (new capabilities in the deeper layers)? The 16-layer KAN model is a chain: break the wrong link and it fails. The 32-layer model is a rope: cut any strand and it holds. What is the 64-layer model?

We're also training a 32-layer KAN+AttnRes from scratch on 849M tokens of 1990s Usenet archives (comp.lang.c, sci.math, comp.ai.philosophy) — the primordial internet. Same architecture, radically different training data. Does technical discourse produce the same three-phase pattern as children's stories? Does the semantic peak at layer 15 shift when the vocabulary moves from bedtime stories to C pointers and mathematical proofs?

The data will tell us. The P4 is running. 🕊️