64 results for “topic:bpe-tokenizer”
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A ridiculously fast BPE (Byte Pair Encoding) tokenizer for Python, implemented in Rust
Implemented GPT from scratch
GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
BPE tokenizer for LLMs in Pure Zig
High-Performance Tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
R-BPE: Improving BPE-Tokenizers with Token Reuse
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports UTF-8 encoding/decoding.
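The encode/decode round-trip such a byte-level tokenizer supports can be sketched in Python (an illustrative sketch; the function names are mine, not taken from the C++ repo):

```python
# Sketch: apply learned merges to UTF-8 bytes, and invert them on decode.
# `merges` maps (left_id, right_id) -> new_id, in the order the merges
# were learned; ids 0-255 are raw bytes.

def encode(text, merges):
    """Apply learned merges, in learned order, to the UTF-8 bytes."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode(tokens, merges):
    """Expand merged ids back into byte strings, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")
```

Because everything bottoms out in raw bytes, decode(encode(text)) round-trips any Unicode input, multi-byte scripts included.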
Tok: my own Tokenizer
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
A parallel, minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
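The core training loop behind such from-scratch implementations fits in a few lines; this is an illustrative sketch (not the listed repo's code), omitting the parallelism:

```python
# Sketch of the basic BPE training loop: repeatedly count adjacent
# pairs and merge the most frequent one into a fresh token id.
from collections import Counter

def get_pair_counts(tokens):
    """Count adjacent token pairs in a sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        tokens = merge(tokens, pair, new_id)
    return merges, tokens
```

Recounting every pair each round makes this O(n) per merge; the faster implementations in this list avoid exactly that full recount.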
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
High performance Byte-Pair Encoding tokenizer for large language models
Visualize HuggingFace Byte-Pair Encoding (BPE) Tokenizer encoding process
Transformer models for humorous text generation, fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
LLM Learning step-by-step.
Byte-Pair Encoding tokenizer built from scratch in Python. The same algorithm used by GPT-2.
A high-performance Byte Pair Encoding (BPE) tokenizer written in Rust with Python bindings, using a doubly-linked list ("chain") structure and a frequency-indexed BTreeMap to efficiently track and apply the most frequent pair merges.
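The incremental strategy this entry describes — update pair counts only around each merge site and keep a frequency index over them — can be approximated in Python, with a max-heap of negated counts (plus lazy invalidation) standing in for the Rust BTreeMap. An illustrative sketch, not the repo's code; note that ties between equally frequent pairs break by pair value here, which other implementations may resolve differently:

```python
# Sketch: incremental BPE training. Instead of recounting all pairs
# each round, only the pairs adjacent to a merge site are adjusted,
# and a heap of (-count, pair) entries tracks the current maximum.
# Stale heap entries are skipped when their count no longer matches.
import heapq
from collections import Counter

def train_incremental(tokens, num_merges):
    counts = Counter(zip(tokens, tokens[1:]))
    heap = [(-c, p) for p, c in counts.items()]
    heapq.heapify(heap)
    next_id = 256
    merges = {}

    def bump(pair, delta):
        """Adjust a pair's count and refresh its heap entry if positive."""
        counts[pair] += delta
        if counts[pair] > 0:
            heapq.heappush(heap, (-counts[pair], pair))

    for _ in range(num_merges):
        pair = None
        while heap:  # pop until the top entry's count is current
            neg, cand = heapq.heappop(heap)
            if counts.get(cand, 0) == -neg:
                pair = cand
                break
        if pair is None:
            break
        merges[pair] = next_id
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                if out:                       # pair to the left changes
                    bump((out[-1], tokens[i]), -1)
                    bump((out[-1], next_id), +1)
                if i + 2 < len(tokens):       # pair to the right changes
                    bump((tokens[i + 1], tokens[i + 2]), -1)
                    bump((next_id, tokens[i + 2]), +1)
                bump(pair, -1)
                out.append(next_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
        next_id += 1
    return merges, tokens
```

The doubly-linked "chain" in the actual repo serves the same purpose as rebuilding `out` here: it lets a merge splice neighbors in O(1) without copying the token sequence.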
Build a lightweight Llama from scratch, based on the Stanford CS336 (2025) course.
Implementation of a Decoder-only Transformer language model from scratch for CS336, featuring a byte-level BPE tokenizer, RoPE, Multi-Head Self-Attention and SwiGLU FFN. Trained on TinyStories with 1.39 Val Loss.
Fast, near-parity C++ BPE token counter for OpenAI encodings
This project implements a Byte Pair Encoding (BPE) tokenizer trained on Kashmiri poetry written in the Latin script. The corpus is derived from the work of Abdul Ahad Azaad, a prominent revolutionary Kashmiri poet of the 20th century.
Tokenizer Chopper is an implementation of a text tokenizer and detokenizer using Byte Pair Encoding (BPE) for modern LLM systems.
NATural LANguage processing
C89, single-header, nostdlib byte pair encoding algorithm