# 😇AURA: Augmented Representation for Unified Accuracy-aware Quantization
AURA is a quantization method that quantizes both weights and activations to low-bit augmented matrices.
We use an accuracy-aware strategy to identify the channels most likely to suffer severe accuracy loss under low-bit quantization. We then quantize the weights and activations to NVFP4 augmented matrices, concatenating additional channels onto the activation matrices to absorb the quantization error.
In contrast to traditional mixed-precision quantization methods, AURA decouples the GEMM kernel from the quantization process. This design supports multiple data formats, such as MXFP4 and NVFP4, and makes adaptation to future data types straightforward, establishing AURA as a more universal strategy.
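To make the augmentation idea concrete, here is a minimal PyTorch sketch of the accuracy-aware pattern described above: quantize, rank channels by their quantization error, then append extra channels carrying the re-quantized residual of the most sensitive ones. The function names and the FP4 rounding here are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def fake_quant_fp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Illustrative FP4 (E2M1) fake-quantization with per-block scaling,
    standing in for a real NVFP4 kernel. Assumes x.numel() % block == 0."""
    grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    # round each scaled magnitude to the nearest FP4 grid point
    idx = (xb / scale).abs().unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    return (grid[idx] * xb.sign() * scale).reshape_as(x)

def augment_activation(x: torch.Tensor, num_sensitive: int) -> torch.Tensor:
    """Quantize x (tokens x channels), then concatenate extra channels holding
    the re-quantized residual of the most error-prone channels -- a sketch of
    the 'augmented matrix' idea, not the actual AURA algorithm."""
    q = fake_quant_fp4(x)
    err = (x - q).float().pow(2).sum(dim=0)      # per-channel error energy
    sensitive = err.topk(num_sensitive).indices  # accuracy-aware selection
    residual = (x - q)[:, sensitive]
    return torch.cat([q, fake_quant_fp4(residual)], dim=1)
```

Because the GEMM only ever sees an ordinary (wider) low-bit matrix, swapping NVFP4 for MXFP4 changes the quantizer, not the kernel interface.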
## 1. Installation
```bash
conda create -n aura python=3.10 -y
conda activate aura
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/AURA.git
cd AURA
pip install -r requirements.txt
```

## 2. Usage
### 2.1 Building Kernels
```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
cd kernels/
bash remake.sh
```

This might take a few minutes.
### 2.2 Preprocessing
Quantization requires the reorder indices and `select_num` produced by the preprocessing script:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric frobenius
```

Results are saved in `./saved/`.
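For intuition, below is a minimal sketch of what a Frobenius-norm channel reordering could look like, assuming calibration activations of shape `(num_tokens, hidden_dim)`. The shapes, the `select_num` value, and the output file name are placeholders; `reorder_indices.py` itself may work differently.

```python
import os
import torch

def reorder_by_frobenius(acts: torch.Tensor, select_num: int):
    """Rank channels by their Frobenius norm over the calibration tokens;
    the top `select_num` channels are flagged as quantization-sensitive."""
    norms = acts.float().pow(2).sum(dim=0).sqrt()  # per-channel Frobenius norm
    order = torch.argsort(norms, descending=True)  # reorder indices
    return order, order[:select_num]

# e.g. 32 calibration samples x 2048 tokens, hidden size 4096 (placeholders)
acts = torch.randn(32 * 2048, 4096)
reorder_idx, selected = reorder_by_frobenius(acts, select_num=128)
os.makedirs("saved", exist_ok=True)
torch.save({"reorder": reorder_idx, "select_num": len(selected)}, "saved/example.pt")
```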
### 2.3 Accuracy Evaluation
```bash
bash run_micromix.sh /PATH/TO/YOUR/MODEL/
```

## 3. Efficiency Evaluation
End-to-end efficiency:
```bash
python benchmarks/benchmark_e2e_aura.py --model 'llama-2-7b' --batch_size 8 --prefill_seq_len 1024 --decode_steps 50
```

TensorRT efficiency:
```bash
pip install tensorrt
python benchmark/trt-fp8-prefill-llama.py
```