"topic:tokenizer" — Search

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.

TypeScript · 749 · 55 · Updated 2 hours ago

bpe, decoder, encoder, gpt-2, gpt-3, gpt-4, gpt-4o, gpt-5, gpt-o1, machine-learning, openai, tokenizer

BLKSerene/Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

Python · 748 · 97 · Updated 1 day ago

corpus, corpus-analysis, corpus-linguistics, corpus-processing, corpus-search, corpus-statistics, corpus-tools, dependency-parser, lemmatizer, linguistics, literature, stopword, tagger, tokenizer, translation

risesoft-y9/Data-Labeling

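Data Labeling is a tool dedicated to processing and annotating text data. Through a simplified, fast text-annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while the algorithm continually reduces the cost and time of manual annotation. The process starts with manual annotation to build a baseline, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any drift, greatly improving annotation accuracy and efficiency. Data Labeling depends on an open-source digital foundation platform for managing personnel and positions.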

Java · 696 · 104 · Updated 6 days ago

chinese, data-annotation-tools, data-annotations, docker, elasticsearch, java, nacos, springboot2, tokenizer, tokenizer-parser, vue3

mathewsanders/Mustard

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Swift · 686 · 18 · Updated 4 weeks ago

substrings, swift, tokenizer

cbaziotis/ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).

Python · 675 · 95 · Updated 1 day ago

nlp, nlp-library, semeval, spell-corrector, spelling-correction, text-processing, text-segmentation, tokenization, tokenizer, word-normalization, word-segmentation

open-korean-text/open-korean-text

Open Korean Text Processor - An Open-source Korean Text Processor

Scala · 657 · 97 · Updated 1 week ago

korean, korean-text-processing, korean-tokenizer, natural-language-processing, text-processing, tokenizer

jflex-de/jflex

The fast scanner generator for Java™ with full Unicode support

Java · 627 · 121 · Updated 1 week ago

bazel-rules, cup, dfa, dfa-minimization, flex, grammar, java, lexer, lexer-generator, lexical-analyzer, maven-plugin, nfa, parsing, regexp, scanner, scanner-generator, tokenizer, yacc

therealoliver/Deepdive-llama3-from-scratch

Achieve the llama3 inference step-by-step, grasp the core concepts, master the process derivation, implement the code.

Jupyter Notebook · 626 · 50 · Updated 1 day ago

attention, attention-mechanism, gpt, inference, kv-cache, language-model, llama, llm-configuration, llms, mask, multi-head-attention, positional-encoding, residuals, rms, rms-norm, rope, rotary-position-encoding, swiglu, tokenizer, transformer

smoothnlp/SmoothNLP (Archived)

Focused on explainable NLP techniques: An NLP Toolset With A Focus on Explainable Inference

Java · 622 · 111 · Updated 4 weeks ago

depedency-parsing, nlp, nlp-pipeline, postagging, python, tokenizer

alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript

Go · 615 · 21 · Updated 1 week ago

text-tokenization, tokenisation, tokenization, tokenize, tokenizer, tokenizing, vocabulary, vocabulary-builder, vocabulary-generator

lindera/lindera

A multilingual morphological analysis library.

Rust · 606 · 53 · Updated 1 week ago

analyzer, library, morphological, multilingual, tokenizer

glayzzle/php-parser

🌿 NodeJS PHP Parser - extract AST or tokens

JavaScript · 562 · 74 · Updated 4 days ago

ast, development, javascript, lexer, parser, php, php-ast, php-parser, static-code-analysis, tokenizer

lydell/js-tokens

Tiny JavaScript tokenizer.

JavaScript · 550 · 40 · Updated 4 weeks ago

ecmascript, javascript, regex, tokenizer

polm/fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

C++ · 514 · 39 · Updated 8 hours ago

cython-wrapper, japanese, mecab, nlp, tokenizer

FoundationVision/UniTok

[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding

Python · 513 · 11 · Updated 5 days ago

autoregressive-models, generative, generative-ai, generative-model, image-generation, image-tokenizer, large-language-models, text-to-image, tokenizer

lionsoul2014/friso

High performance Chinese tokenizer with both GBK and UTF-8 charset support based on MMSEG algorithm developed by ANSI C. Completely based on modular implementation and can be easily embedded in other programs, like: MySQL, PostgreSQL, PHP, etc.

C · 512 · 94 · Updated 3 weeks ago

c, chinese-tokenizer, chinese-word-segmentation, cjk-tokenizer, full-text-search, japanese-tokenizer, korean-tokenizer, php-tokenizer, tokenizer

NLPOptimize/flash-tokenizer

EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING

C++ · 504 · 9 · Updated just now

bert, berttokenizer, cpp, cpp17, deep-learning, flash, huggingface, nlp, pybind11, python, tokenizer, trie, wordpiece, wordpiece-tokenization

leodevbro/vscode-blockman

VSCode extension to highlight nested code blocks

TypeScript · 499 · 20 · Updated 1 week ago

abstract-syntax-tree, ast, highlight-blocks, indentation, parser, tokenizer, vscode-api, vscode-blockman, vscode-extension

hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Python · 495 · 60 · Updated 1 month ago

machine-translation, nlp, tokenizer

CogComp/cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

Java · 479 · 145 · Updated 3 days ago

big-data, cogcomp, data-mining, dependency-parsing, lemmatization, lemmatizer, named-entity-recognition, natural-language-processing, natural-language-understanding, ner, nlp, parts-of-speech-tagging, pos, pos-tagging, relation-extraction, similarity, tokenizer, transliteration

neurosnap/sentences

A multilingual command line sentence tokenizer in Golang

Go · 465 · 41 · Updated 1 week ago

cli, sentence-tokenizer, sentences, tokenizer
