Nandan91/relu-revival-normfree
PyTorch implementation of normalization-free LLMs investigating entropic behavior to find desirable activation functions
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
Welcome to the official repository for our paper, "ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models", accepted to the ATTRIB@NeurIPS'24 workshop. Our paper is available on arXiv.
Abstract
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and the computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement.
We discover a key issue with GELU, where early layers experience entropic overload, resulting in the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties (specialization in input space and intra-class selectivity) lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.
Figure: Attention head entropy heatmaps for baseline and normalization-free GPT-2 models with GELU and ReLU activations, highlighting entropic overload in the early layers of the GELU-based normalization-free model.
Overview
This repository contains the codebase and resources for exploring the entropic characteristics of normalization-free transformers, particularly focusing on the use of ReLU activation in large language models (LLMs). Our approach sheds light on the advantages of removing normalization layers for efficient model training and inference in transformer architectures, making this research particularly relevant for efficient private inference tasks.
Installation
To get started with the repository, clone the repository and install dependencies as follows:
git clone https://github.com/Nandan91/relu-revival-normfree
cd relu-revival-normfree
pip install -r requirements.txt
Repository Structure
The repository is organized into directories to simplify exploration and usage:
.
├── normfree_transformers/                    # Core model and utility files
│   ├── config/                               # Configuration files for model and training
│   │   ├── config.yaml                       # General configuration file
│   │   ├── model/
│   │   │   └── default.yaml                  # Default model configuration
│   │   └── train/
│   │       └── train.yaml                    # Training-specific configuration
│   ├── model_utils.py                        # Model architecture utilities
│   └── train_utils.py                        # Training utilities
├── scripts/
│   ├── experiments/                          # Experimentation scripts for model training
│   │   ├── train_baseline.sh                 # Script for training baseline model
│   │   └── train_normfree.sh                 # Script for training normalization-free model
│   └── plotting/                             # Visualization and analysis scripts
│       ├── plot_headwise_entropy_heatmap.py  # Head-wise entropy heatmap generator
│       ├── plot_layerwise_entropy.py         # Layer-wise entropy analysis plotter
│       └── plot_layerwise_nan_counts.py      # Tracks NaNs in layers for instability detection
├── requirements.txt                          # Dependencies
└── run_clm.py                                # Main script for running LLM training
Project Components
Model Architecture & Configuration
The normfree_transformers/ directory contains:
- Configuration files in YAML format for easy adjustments to model and training parameters.
- Model utilities (model_utils.py): Contains modular functions to define the normalization-free architecture with desirable activation functions.
- Training utilities (train_utils.py): Helper functions for training the model with advanced monitoring and logging.
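To make the activation choice concrete, here is a minimal pure-Python sketch of the two activations compared in this work. It is not the code in model_utils.py: the function names `relu` and `gelu` are illustrative, scalars stand in for tensors, and GELU is written in its exact erf-based form.

```python
import math


def relu(x: float) -> float:
    """ReLU: passes positive inputs through and outputs exactly zero otherwise."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Compare both activations on a few sample pre-activations.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}")
```

ReLU maps every negative pre-activation to exactly zero, whereas the smoother GELU lets small negative values through.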
Experimentation
Experiment scripts for training normalization-free and baseline models are available in scripts/experiments/:
- train_baseline.sh: Trains a standard transformer model.
- train_normfree.sh: Trains our novel normalization-free transformer model.
Analysis & Visualization
The scripts/plotting/ directory provides tools for analyzing entropy dynamics and training stability:
- plot_headwise_entropy_heatmap.py: Generates entropy visualizations across attention heads in a fully trained model.
- plot_layerwise_entropy.py: Performs layer-wise entropy analysis, helpful for observing entropic patterns during training.
- plot_layerwise_nan_counts.py: Identifies instability by tracking NaN counts in each layer during training.
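As a rough illustration of the bookkeeping behind per-layer NaN tracking, the sketch below counts NaN entries per layer. It is a stand-in under stated assumptions, not the logic of plot_layerwise_nan_counts.py: the helper `count_nans_per_layer`, the layer names, and the dict-of-lists input are all hypothetical, and plain Python floats stand in for tensors.

```python
import math
from typing import Dict, List


def count_nans_per_layer(activations: Dict[str, List[float]]) -> Dict[str, int]:
    """Return the number of NaN entries in each layer's flattened values."""
    return {
        layer: sum(1 for v in values if math.isnan(v))
        for layer, values in activations.items()
    }


if __name__ == "__main__":
    nan = float("nan")
    # Hypothetical flattened activations for two layers at one training step.
    step_activations = {
        "layer_0": [0.1, nan, 0.3, nan],
        "layer_1": [0.5, 0.2, 0.9, 0.4],
    }
    print(count_nans_per_layer(step_activations))
```

Logging such counts at regular training steps would show which layers become unstable first.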
Usage
All models were trained and evaluated on the CodeParrot dataset (2.1B training tokens) using a single NVIDIA RTX 3090 GPU (24GB VRAM).
Training a Model
To start training a normalization-free model:
bash scripts/experiments/train_normfree.sh
To train a baseline model:
bash scripts/experiments/train_baseline.sh
Perplexity Evaluation
The baseline GPT-2 (0.988 and ≈1.014 respectively, after 32K training steps (device batch size = 128).
| Model | GPT-2 (T=128) Eval PPL | GPT-2 (T=128) +Δ(%) | Pythia-70M (T=128) Eval PPL | Pythia-70M (T=128) +Δ(%) | Pythia-70M (T=256) Eval PPL | Pythia-70M (T=256) +Δ(%) |
|---|---|---|---|---|---|---|
| SM+LN+G | 2.688 | 0.00 | 3.512 | 0.00 | 3.054 | 0.00 |
| SM+LN+R | 2.757 | 2.53 | 3.590 | 2.22 | 3.107 | 1.73 |
| SM+G | 3.197 | 18.92 | 4.086 | 16.35 | 3.570 | 16.87 |
| SM+R | 2.936 | 9.20 | 3.736 | 6.36 | 3.273 | 7.17 |
Evaluation Perplexity (PPL) and Relative Change (+Δ%) for GPT-2 and Pythia-70M models across different configurations (SM = softmax attention, LN = LayerNorm, G = GELU, R = ReLU). Normalization-free models with ReLU activations outperform their GELU counterparts.
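The arithmetic behind these numbers can be sketched as follows. This is an illustrative calculation rather than the repository's evaluation code: the helper names are hypothetical, perplexity is taken as the exponential of the mean token-level cross-entropy, and +Δ(%) is assumed to be the relative PPL increase over the SM+LN+G baseline.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity as the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_nll)


def relative_change_pct(ppl: float, baseline_ppl: float) -> float:
    """Relative PPL increase over a baseline, in percent (assumed +Delta definition)."""
    return 100.0 * (ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # GPT-2 (T=128) perplexities copied from the table: SM+LN+G, SM+G, SM+R.
    baseline, normfree_gelu, normfree_relu = 2.688, 3.197, 2.936
    print(f"SM+R vs. SM+LN+G: {relative_change_pct(normfree_relu, baseline):.2f}%")
    # Improvement of ReLU over GELU within the LayerNorm-free setting.
    print(f"SM+R vs. SM+G: {-relative_change_pct(normfree_relu, normfree_gelu):.2f}%")
```

Because the inputs are the rounded table entries, the recomputed percentages can differ from the reported +Δ(%) values in the second decimal place.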
Visualization
To analyze entropic characteristics, generate entropy heatmaps or layer-wise plots using the scripts/plotting/ tools.
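For orientation, the quantity behind such plots can be sketched as the Shannon entropy of each attention row, averaged over query positions within a head. This is a simplified illustration, not the repository's plotting code: the helpers `softmax`, `row_entropy`, and `head_entropy` are hypothetical names, nested lists stand in for tensors, and no causal mask is applied.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def row_entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of one attention distribution; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def head_entropy(score_rows: List[List[float]]) -> float:
    """Mean entropy over all query positions of a single attention head."""
    return sum(row_entropy(softmax(r)) for r in score_rows) / len(score_rows)


if __name__ == "__main__":
    T = 4
    # Equal scores give uniform attention, which reaches the maximum entropy log(T).
    uniform_head = [[0.0] * T for _ in range(T)]
    # One dominant score per row gives near-one-hot attention and entropy near zero.
    peaked_head = [[10.0 if j == i else 0.0 for j in range(T)] for i in range(T)]
    print(head_entropy(uniform_head), math.log(T))
    print(head_entropy(peaked_head))
```

A head whose entropy stays close to log(T) attends almost uniformly over the context; a concentration of such heads in early layers is the kind of pattern the heatmaps are meant to expose.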
Citation
If you find our work valuable for your research, please consider citing us:
@article{jha2024relusrevival,
title={ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models},
author={Jha, Nandan Kumar and Reagen, Brandon},
journal={2nd Workshop on Attributing Model Behavior at Scale (NeurIPS)},
year={2024}
}
Contact
For any inquiries or further information, feel free to reach out to the primary author:
- Nandan Kumar Jha: nj2049@nyu.edu
Acknowledgments
This code builds on the Simplified Transformers codebase.