rolandtannous/gotorchx
GoTorchX is a comprehensive Go binding for PyTorch that provides high-performance deep learning capabilities in idiomatic Go.
GoTorchX
GoTorchX implements PyTorch high-level APIs, including modules and functionals, in idiomatic Go. This enables deep learning programming in Go and Go+. We are following the original design and philosophy of the gotorch package while exploring architectural changes that could improve Go's performance with PyTorch. This project is the revival and continued development of the original GoTorch project.
Attribution
This project is based on gotorch by Yi Wang.
Copyright 2020 Yi Wang.
Modified and extended by Roland Tannous and GravityQ in 2024.
Please refer to the LICENSE and NOTICE documents for detailed copyright and licensing information.
Recent Developments
Since reviving the project in 2024, we've made significant improvements:
- Upgraded to libtorch 2.5.1 with dynamic CPU/GPU package selection and CUDA compatibility for versions 11.8, 12.1, 12.4, and 12.5
- Implemented advanced attention mechanisms including MultiHeadAttention and Flash Attention with comprehensive CUDA support
- Added fundamental mathematical operations including Power (Pow), Square root (Sqrt), Natural logarithm (Log), and Exponential (Exp)
- Enhanced tensor shape manipulation with Unsqueeze, Reshape, and Cat (Concatenate) operations
- Added named tensor dimensions for more semantic clarity in tensor operations
- Implemented comprehensive reduction operations including Max, Min, and Prod with various modes
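To make the shape semantics of Unsqueeze, Reshape, and Cat concrete, here is a minimal sketch in plain Go that computes only the resulting shapes, following PyTorch's conventions. It does not call GoTorchX. The function names (`unsqueezeShape`, `reshapeShape`, `catShape`) are hypothetical and chosen for illustration, and negative `dim` values, which PyTorch accepts, are left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// unsqueezeShape returns the shape after inserting a size-1 dimension at dim.
// Assumption: dim must be in [0, len(shape)]; negative dims are not handled.
func unsqueezeShape(shape []int64, dim int) ([]int64, error) {
	if dim < 0 || dim > len(shape) {
		return nil, errors.New("unsqueeze: dim out of range")
	}
	out := make([]int64, 0, len(shape)+1)
	out = append(out, shape[:dim]...)
	out = append(out, 1)
	out = append(out, shape[dim:]...)
	return out, nil
}

// numel returns the number of elements implied by a shape.
func numel(shape []int64) int64 {
	n := int64(1)
	for _, s := range shape {
		n *= s
	}
	return n
}

// reshapeShape checks that target holds the same number of elements as shape
// and resolves at most one -1 entry, as torch.reshape does.
func reshapeShape(shape, target []int64) ([]int64, error) {
	out := append([]int64(nil), target...)
	infer, known := -1, int64(1)
	for i, s := range out {
		switch {
		case s == -1 && infer < 0:
			infer = i
		case s < 0:
			return nil, errors.New("reshape: invalid target dimension")
		default:
			known *= s
		}
	}
	total := numel(shape)
	if infer >= 0 {
		if known == 0 || total%known != 0 {
			return nil, errors.New("reshape: cannot infer -1 dimension")
		}
		out[infer] = total / known
	} else if known != total {
		return nil, errors.New("reshape: element count mismatch")
	}
	return out, nil
}

// catShape returns the shape of concatenating two tensors along dim.
// All dimensions other than dim must match.
func catShape(a, b []int64, dim int) ([]int64, error) {
	if len(a) != len(b) || dim < 0 || dim >= len(a) {
		return nil, errors.New("cat: rank mismatch or dim out of range")
	}
	out := append([]int64(nil), a...)
	for i := range a {
		if i == dim {
			out[i] = a[i] + b[i]
		} else if a[i] != b[i] {
			return nil, errors.New("cat: non-concatenated dimensions differ")
		}
	}
	return out, nil
}

func main() {
	s, _ := unsqueezeShape([]int64{2, 3}, 0)
	fmt.Println(s) // [1 2 3]
	r, _ := reshapeShape([]int64{2, 3, 4}, []int64{6, -1})
	fmt.Println(r) // [6 4]
	c, _ := catShape([]int64{2, 3}, []int64{5, 3}, 0)
	fmt.Println(c) // [7 3]
}
```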
Current Status & Roadmap
Core Infrastructure
✅ Basic tensor operations and management
✅ CPU/GPU device support
✅ Memory management and GC integration
✅ Error handling system for C++/Go interop
✅ Basic type system aligned with libtorch
Tensor Operations
Current coverage:
✅ Basic arithmetic operations (add, subtract, multiply, etc.)
✅ Matrix operations (mm, matmul)
✅ Shape manipulation (reshape, view)
✅ Basic indexing and slicing
❓ Advanced indexing capabilities (need verification)
❓ Complex number support (need verification)
❌ Sparse tensor operations (appears missing)
Neural Network Components
Modules implemented:
✅ Linear layers
✅ Convolution layers
✅ BatchNorm
✅ Basic containers (Sequential)
❌ RNN/LSTM/GRU layers (appears missing)
❌ Transformer layers (being added in recent PR)
❌ Embedding layers (appears missing)
Optimizers
✅ Basic optimizers (SGD)
❓ Advanced optimizers status unclear (Adam, RMSprop, etc.)
❌ Learning rate schedulers (appears missing)
Loss Functions
❓ Status of common loss functions unclear
❓ Custom loss function support unclear
Data Loading & Processing
✅ Basic image loading
✅ Common transforms
❌ Dataset abstractions (appears limited)
❌ DataLoader parallelization (appears missing)
Serialization & Model Management
✅ Basic model saving/loading
❓ Checkpoint management (need verification)
❓ State dict handling (need verification)
Easy Switch (Adopted from original package README)
Writing deep learning systems in Go is as efficient as in Python. The DCGAN training programs in GoTorchX and PyTorch call similar APIs, have similar program structure, and have a similar number of lines. Go+ has a syntax similar to Python. The Go+ compiler translates Go+ programs into Go source programs. It is a joy to write Go+ programs that call Go packages like GoTorchX.
We plan to build a translator that migrates existing PyTorch models written in Python to GoTorchX.
Benefits (Adopted from original package README)
Higher runtime efficiency. Go programs run as efficiently as C++.
Training and prediction in the same language. No longer training in Python and online prediction in C++. All in Go/Go+. No TensorFlow graphs or PyTorch tracing.
Same data processing code for training and prediction. No need to wrap OpenCV functions into TensorFlow operators in C++ for prediction and Python for training.
Supports many machine learning paradigms, including adversarial, reinforcement, and imitation learning -- those we cannot split into training and prediction.
Same program for edge and cloud. GoTorchX programs compile and run on phones and self-driving cars as they do on servers and desktops.
Technical Considerations and Bottlenecks
Updated Assessment of Gotorch Challenges in 2024
Examining the key issues mentioned in the original abandonment of gotorch, here's how the situation has evolved:
1. CGO Performance
Original Issue: The project was initially abandoned partly due to cgo performance concerns, with the original author noting that Go's team wasn't prioritizing cgo performance improvements.
2024 Status: Largely Resolved
- cgo call overhead has improved approximately 17x since 2015
- Current overhead is about 40 ns per call on a single core, or about 4 ns per call amortized when calls are spread across cores
- Throughput scales effectively up to 16 cores, reaching roughly 250 million calls per second (one call per 4 ns)
- Contrary to earlier indications, the Go team has significantly improved cgo performance
This performance level is now adequate for most ML workloads and no longer represents a significant bottleneck for projects like gotorch.
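The arithmetic behind these figures can be sketched in a few lines of Go. The 40 ns and 4 ns values come from the list above; the 10 µs duration for a native operation is an illustrative assumption, not a measurement of any libtorch kernel, and the helper names are hypothetical.

```go
package main

import "fmt"

// callsPerSecond converts an amortized per-call cost in nanoseconds
// into a throughput in calls per second.
func callsPerSecond(perCallNs float64) float64 {
	return 1e9 / perCallNs
}

// overheadFraction returns the share of total time spent crossing the
// cgo boundary for an operation whose native work takes workNs.
func overheadFraction(cgoNs, workNs float64) float64 {
	return cgoNs / (cgoNs + workNs)
}

func main() {
	// 4 ns amortized per call corresponds to the 250 million calls/second figure.
	fmt.Printf("%.0f calls/s\n", callsPerSecond(4))

	// Assumed example: a native op that runs for 10 µs behind a 40 ns cgo call.
	fmt.Printf("cgo share of call time: %.2f%%\n", 100*overheadFraction(40, 10_000))
}
```

Under that assumption the boundary crossing accounts for well under 1% of each call, which is why per-call overhead matters mainly for very small, very frequent operations.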
2. GPU Memory Management
Original Issue: Coordination between Go's garbage collector and GPU memory release was identified as a critical challenge, particularly for training workloads.
2024 Status: Partially Improved
- Go's GC has added memory limits (GOMEMLIMIT) for better resource management
- GC latency has improved with shorter pauses and better concurrency
- The fundamental challenge remains: Go's GC still isn't designed to coordinate with external resource management systems like CUDA
- No native mechanism exists for GPU memory coordination with Go's GC cycles
This remains the most significant technical challenge for a project like gotorch, particularly for training workloads where timely GPU memory release is critical.
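One common way to work around this gap is to release external memory explicitly and keep a finalizer only as a safety net. The sketch below illustrates that pattern with simulated device memory (a byte counter); it is not GoTorchX's implementation, and `deviceBuffer`, `newDeviceBuffer`, and `liveBytes` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// liveBytes simulates device memory currently allocated outside the Go heap.
var liveBytes atomic.Int64

// deviceBuffer is a hypothetical handle to externally managed memory.
// The Go GC sees only this small struct, not the memory it stands for.
type deviceBuffer struct {
	size   int64
	closed atomic.Bool
}

// newDeviceBuffer "allocates" size bytes and registers a finalizer as a
// fallback. Finalizers run at an unspecified time after the object becomes
// unreachable, so they cannot guarantee timely release.
func newDeviceBuffer(size int64) *deviceBuffer {
	liveBytes.Add(size)
	b := &deviceBuffer{size: size}
	runtime.SetFinalizer(b, func(b *deviceBuffer) { b.Close() })
	return b
}

// Close releases the simulated memory immediately. It is safe to call
// more than once; only the first call has an effect.
func (b *deviceBuffer) Close() {
	if b.closed.CompareAndSwap(false, true) {
		liveBytes.Add(-b.size)
		runtime.SetFinalizer(b, nil) // the fallback is no longer needed
	}
}

func main() {
	for step := 0; step < 3; step++ {
		buf := newDeviceBuffer(1 << 20)
		// ... a training step would use buf here ...
		buf.Close() // deterministic release at the end of the step
		fmt.Printf("step %d: live bytes = %d\n", step, liveBytes.Load())
	}
}
```

Because the Go heap holds only a few bytes per handle, the collector has no reason to run when device memory is under pressure, which is why the explicit `Close` call carries the real responsibility in this sketch.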
3. GC Pauses and Inference Performance
Original Issue: Concerns were raised about Go's garbage collector causing irregular pauses that could impact online inference performance.
2024 Status: Significantly Improved
- Go's GC has made substantial strides in latency predictability
- More sophisticated GC pacing and cycle management
- Better tools for profiling and tuning GC behavior
- Explicit memory limits help prevent unexpected OOM conditions
While the GC still causes some pauses, they're shorter and more predictable than in 2020, making Go more viable for latency-sensitive inference workloads.
In summary, the cgo performance issues have been largely resolved, GC predictability for inference has significantly improved, but coordinating GPU memory with Go's GC remains a substantial challenge that would need to be addressed for any revival of the gotorch project.
We are actively investigating a re-architecture of the package that would reduce its reliance on cgo, which would address some of these remaining concerns.
The Tech Stack (Adopted from original package README)
GoTorchX works with the following open-source communities to form Go+Torch.
the Go+ community,
the PyTorch community, and
the TensorFlow XLA ecosystem.
The following figure reveals the stack of technologies.
```text
Go+ applications   # users write DL applications in Go+,
     │             # whose syntax is as concise as Python
 [Go+ compiler]
     ↓
Go source code -→ GoTorchX -→ libtorch -→ pytorch/xla -→ XLA ops
     │
 [Go compiler]
     ↓
executable binary  # x86_64, ARM, CUDA, TPU
                   # Linux, macOS, Android, iOS
```