64 results for “topic:bpe-tokenizer”
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A ridiculously fast BPE (Byte Pair Encoding) tokenizer for Python, implemented in Rust
Implemented GPT from scratch
GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
BPE tokenizer for LLMs in Pure Zig
High-Performance Tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
R-BPE: Improving BPE-Tokenizers with Token Reuse
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports UTF-8 encoding/decoding.
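The encode/decode round-trip such a byte-level tokenizer supports can be sketched in Python (an illustrative sketch; the function names are mine, not taken from the C++ repo):

```python
# Sketch: apply learned merges to UTF-8 bytes, and invert them on decode.
# `merges` maps (left_id, right_id) -> new_id, in the order the merges
# were learned; ids 0-255 are raw bytes.

def encode(text, merges):
    """Apply learned merges, in learned order, to the UTF-8 bytes."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode(tokens, merges):
    """Expand merged ids back into byte strings, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")
```

Because everything bottoms out in raw bytes, decode(encode(text)) round-trips any Unicode input, multi-byte scripts included.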
Tok: my own Tokenizer
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
A parallel, minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
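The core training loop behind such from-scratch implementations fits in a few lines; this is an illustrative sketch (not the listed repo's code), omitting the parallelism:

```python
# Sketch of the basic BPE training loop: repeatedly count adjacent
# pairs and merge the most frequent one into a fresh token id.
from collections import Counter

def get_pair_counts(tokens):
    """Count adjacent token pairs in a sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        tokens = merge(tokens, pair, new_id)
    return merges, tokens
```

Recounting every pair each round makes this O(n) per merge; the faster implementations in this list avoid exactly that full recount.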
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
High performance Byte-Pair Encoding tokenizer for large language models
Visualize HuggingFace Byte-Pair Encoding (BPE) Tokenizer encoding process
Transformer models for humorous text generation, fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
LLM Learning step-by-step.
Byte-Pair Encoding tokenizer built from scratch in Python. The same algorithm used by GPT-2.
A high-performance Byte Pair Encoding (BPE) tokenizer written in Rust with Python bindings, using a doubly-linked list ("chain") structure and a frequency-indexed BTreeMap to efficiently track and apply the most frequent pair merges.
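The incremental strategy this entry describes — update pair counts only around each merge site and keep a frequency index over them — can be approximated in Python, with a max-heap of negated counts (plus lazy invalidation) standing in for the Rust BTreeMap. An illustrative sketch, not the repo's code; note that ties between equally frequent pairs break by pair value here, which other implementations may resolve differently:

```python
# Sketch: incremental BPE training. Instead of recounting all pairs
# each round, only the pairs adjacent to a merge site are adjusted,
# and a heap of (-count, pair) entries tracks the current maximum.
# Stale heap entries are skipped when their count no longer matches.
import heapq
from collections import Counter

def train_incremental(tokens, num_merges):
    counts = Counter(zip(tokens, tokens[1:]))
    heap = [(-c, p) for p, c in counts.items()]
    heapq.heapify(heap)
    next_id = 256
    merges = {}

    def bump(pair, delta):
        """Adjust a pair's count and refresh its heap entry if positive."""
        counts[pair] += delta
        if counts[pair] > 0:
            heapq.heappush(heap, (-counts[pair], pair))

    for _ in range(num_merges):
        pair = None
        while heap:  # pop until the top entry's count is current
            neg, cand = heapq.heappop(heap)
            if counts.get(cand, 0) == -neg:
                pair = cand
                break
        if pair is None:
            break
        merges[pair] = next_id
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                if out:                       # pair to the left changes
                    bump((out[-1], tokens[i]), -1)
                    bump((out[-1], next_id), +1)
                if i + 2 < len(tokens):       # pair to the right changes
                    bump((tokens[i + 1], tokens[i + 2]), -1)
                    bump((next_id, tokens[i + 2]), +1)
                bump(pair, -1)
                out.append(next_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
        next_id += 1
    return merges, tokens
```

The doubly-linked "chain" in the actual repo serves the same purpose as rebuilding `out` here: it lets a merge splice neighbors in O(1) without copying the token sequence.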
Build a lightweight Llama from scratch, based on the Stanford CS336 (2025) course.
Implementation of a Decoder-only Transformer language model from scratch for CS336, featuring a byte-level BPE tokenizer, RoPE, Multi-Head Self-Attention and SwiGLU FFN. Trained on TinyStories with 1.39 Val Loss.
Fast, near-parity C++ BPE token counter for OpenAI encodings
This project implements a Byte Pair Encoding (BPE) tokenizer trained on Kashmiri poetry written in the Latin script. The corpus is derived from the work of Abdul Ahad Azaad, a prominent revolutionary Kashmiri poet of the 20th century.
Tokenizer Chopper is an implementation of a text tokenizer and detokenizer using Byte Pair Encoding (BPE) for modern LLM systems.
NATural LANguage processing
C89, single-header, nostdlib byte pair encoding algorithm