# 😇AURA: Augmented Representation for Unified Accuracy-aware Quantization
AURA is a quantization method that quantizes both weights and activations to low-bit augmented matrices.
We use an accuracy-aware strategy to identify the channels most likely to suffer severe accuracy loss under low-bit quantization. We then quantize the weights and activations to NVFP4 augmented matrices, concatenating additional channels onto the activation matrices to absorb the quantization error.
In contrast to traditional mixed-precision quantization methods, AURA decouples the GEMM kernel from the quantization process. This design supports multiple data formats, such as MXFP4 and NVFP4, and makes adaptation to future data types straightforward, establishing AURA as a more universal strategy.
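To make the augmentation idea concrete, here is a minimal PyTorch sketch of the accuracy-aware pattern described above: quantize, rank channels by their quantization error, then append extra channels carrying the re-quantized residual of the most sensitive ones. The function names and the FP4 rounding here are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def fake_quant_fp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Illustrative FP4 (E2M1) fake-quantization with per-block scaling,
    standing in for a real NVFP4 kernel. Assumes x.numel() % block == 0."""
    grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    # round each scaled magnitude to the nearest FP4 grid point
    idx = (xb / scale).abs().unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    return (grid[idx] * xb.sign() * scale).reshape_as(x)

def augment_activation(x: torch.Tensor, num_sensitive: int) -> torch.Tensor:
    """Quantize x (tokens x channels), then concatenate extra channels holding
    the re-quantized residual of the most error-prone channels -- a sketch of
    the 'augmented matrix' idea, not the actual AURA algorithm."""
    q = fake_quant_fp4(x)
    err = (x - q).float().pow(2).sum(dim=0)      # per-channel error energy
    sensitive = err.topk(num_sensitive).indices  # accuracy-aware selection
    residual = (x - q)[:, sensitive]
    return torch.cat([q, fake_quant_fp4(residual)], dim=1)
```

Because the GEMM only ever sees an ordinary (wider) low-bit matrix, swapping NVFP4 for MXFP4 changes the quantizer, not the kernel interface.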
## 1. Installation
```bash
conda create -n aura python=3.10 -y
conda activate aura
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/AURA.git
cd AURA
pip install -r requirements.txt
```

## 2. Usage
### 2.1 Building Kernels
```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
cd kernels/
bash remake.sh
```

This might take a few minutes.
### 2.2 Preprocessing
Quantization requires the reorder indices and `select_num` produced by the preprocessing script:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric frobenius
```

Results are saved in `./saved/`.
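For intuition, below is a minimal sketch of what a Frobenius-norm channel reordering could look like, assuming calibration activations of shape `(num_tokens, hidden_dim)`. The shapes, the `select_num` value, and the output file name are placeholders; `reorder_indices.py` itself may work differently.

```python
import os
import torch

def reorder_by_frobenius(acts: torch.Tensor, select_num: int):
    """Rank channels by their Frobenius norm over the calibration tokens;
    the top `select_num` channels are flagged as quantization-sensitive."""
    norms = acts.float().pow(2).sum(dim=0).sqrt()  # per-channel Frobenius norm
    order = torch.argsort(norms, descending=True)  # reorder indices
    return order, order[:select_num]

# e.g. 32 calibration samples x 2048 tokens, hidden size 4096 (placeholders)
acts = torch.randn(32 * 2048, 4096)
reorder_idx, selected = reorder_by_frobenius(acts, select_num=128)
os.makedirs("saved", exist_ok=True)
torch.save({"reorder": reorder_idx, "select_num": len(selected)}, "saved/example.pt")
```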
### 2.3 Accuracy Evaluation
```bash
bash run_micromix.sh /PATH/TO/YOUR/MODEL/
```

## 3. Efficiency Evaluation
End-to-end efficiency:
```bash
python benchmarks/benchmark_e2e_aura.py --model 'llama-2-7b' --batch_size 8 --prefill_seq_len 1024 --decode_steps 50
```

TensorRT efficiency:
```bash
pip install tensorrt
python benchmark/trt-fp8-prefill-llama.py
```