"Graph Convolutions Enrich the Self-Attention in Transformers!" NeurIPS 2024
Graph Convolutions Enrich the Self-Attention in Transformers!
Jeongwhan Choi1*,
Hyowon Wi2*,
Jayoung Kim2,
Yehjin Shin2,
Kookjin Lee3,
Nathaniel Trask4,
Noseong Park2
1Yonsei University, 2KAIST, 3Arizona State University, 4University of Pennsylvania
📢 News!
- Mar 3, 2025: 📰 alphaXiv has generated a blog post about our paper!
- Dec 11, 2024: We presented our work at NeurIPS 2024! 🎉
  - 🖼️ See our poster
  - 📄 Read the paper
  - 📽️ Watch the video presentation and slides
- Dec 9, 2024: Our source code is now available!
- Dec 3, 2024: 🏆 With this work, Jeongwhan Choi and Hyowon Wi won a 2024 Qualcomm Innovation Fellowship!
- Oct 23, 2024: 🏆 With this work, Jeongwhan Choi and Hyowon Wi were selected as Qualcomm Innovation Fellowship finalists in the field of AI/ML.
- Sep 26, 2024: Our paper has been accepted to NeurIPS 2024! 🎉
Introduction
- Graph Filter-based Self-Attention (GFSA) is a novel approach to enhance the self-attention mechanism in Transformers.
- By redesigning self-attention from a graph signal processing (GSP) perspective, GFSA mitigates the oversmoothing problem and improves performance across various domains.
Key Features:
- Easily integrates with existing Transformer models
- Improves performance with minimal computational overhead
- Delivers consistent gains across tasks in multiple domains, including vision, language, graphs, speech, and code
Tasks and Directories
The detailed guidance is included in the README.md of each subdirectory:
- 🖼️ Image Classification → ./Image
- 📚 Natural Language Understanding → ./NLP
- 🧠 Causal Language Modeling → ./NLP
- 📈 Graph Regression → ./Graph
- 🎙️ Speech Recognition → ./Speech
- 💻 Code Classification → ./Code
Implementation Example with Pseudocode
GFSA's core implementation is shown in the following pseudocode:

```python
import torch

def GFSA(att, K):
    """
    Graph Filter-based Self-Attention

    Args:
        att: original self-attention matrix of shape (batch, h, n, n)
        K: order of the high-order term

    Notes:
        w_0, w_1 can be set in two ways:
        1) as learnable parameters, or
        2) fixed as hyperparameters (w_0=0, w_1=1)

    Returns:
        gf_att: GFSA attention matrix of shape (batch, h, n, n)
    """
    _, h, n, _ = att.shape

    # Initialize per-head weights
    w_0 = torch.zeros(h)  # identity term weight
    w_1 = torch.ones(h)   # first-order term weight
    w_K = torch.zeros(h)  # high-order term weight

    I = torch.eye(n)[None, None, ...]  # broadcastable identity

    # Approximate att^K with a first-order Taylor expansion:
    # att^K ≈ att + (K-1) * (att^2 - att)
    att_K = att + (K - 1) * (torch.matmul(att, att) - att)

    # Combine the identity, first-order, and high-order terms
    gf_att = w_0[None, :, None, None] * I + \
             w_1[None, :, None, None] * att + \
             w_K[None, :, None, None] * att_K
    return gf_att
```

Key Implementation Features
- Weight Initialization: w_0, w_1 can be either learnable parameters or fixed hyperparameters
- High-order Term: uses a Taylor approximation to reduce computational cost
- Minimal Parameters: adds only a small number of parameters compared to base models
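To make the filter concrete, here is a minimal single-head sketch in plain NumPy; `gfsa_numpy` is a hypothetical illustration of the graph-filter combination and Taylor approximation, not the repository's per-head PyTorch implementation:

```python
import numpy as np

def gfsa_numpy(att, K, w0=0.0, w1=1.0, wK=0.5):
    """Single-head GFSA sketch: gf_att = w0*I + w1*att + wK*att_K.

    att is an (n, n) attention (row-stochastic) matrix; att^K is
    approximated by the first-order Taylor expansion
    att^K ≈ att + (K-1) * (att^2 - att).
    """
    n = att.shape[0]
    I = np.eye(n)
    att_K = att + (K - 1) * (att @ att - att)  # high-order term
    return w0 * I + w1 * att + wK * att_K

# Sanity check: with w0=0, w1=0, wK=1 and K=2 the
# approximation reduces to the exact square att @ att.
att = np.array([[0.7, 0.3],
                [0.4, 0.6]])
out = gfsa_numpy(att, K=2, w0=0.0, w1=0.0, wK=1.0)
assert np.allclose(out, att @ att)
```

The weights here are fixed scalars for clarity; in the model they are per-head values that can also be learned.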
Integration Example
```python
from models.attention import GFSA

# Replace the original self-attention with GFSA
attention_output = GFSA(
    att=attention_scores,  # original attention matrix
    K=3,                   # order of the high-order term
)
```

Citation
If you use this code for your research, please cite our paper:
@inproceedings{choi2024gfsa,
title={Graph Convolutions Enrich the Self-Attention in Transformers!},
author={Jeongwhan Choi and Hyowon Wi and Jayoung Kim and Yehjin Shin and Kookjin Lee and Nathaniel Trask and Noseong Park},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=ffNrpcBpi6}
}
MIT License
