🧹 ollama-japanese-cleaner

Remove tokenization spacing artifacts from Ollama / local LLM Japanese text output.

The Problem

Models served by Ollama (gemma3, llama3, etc.) use subword tokenizers, and their Japanese output often contains stray spaces at token boundaries:

LLM output:  私 の 役 割 は、個 人 の 知 識 管 理 ア シ ス タ ン ト です。
After clean: 私の役割は、個人の知識管理アシスタントです。

Full-width brackets and mixed ASCII are also affected:

LLM output:  「 リード ・ スク リ プト （ 演出 家 ）」 です （ ファイル : name . md ）。
After clean: 「リード・スクリプト（演出家）」です（ファイル:name.md）。

This is especially noticeable in SSE streaming where tokens are sent one by one.
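The core idea can be illustrated with a single regular expression. This is a minimal sketch, not the library's actual rule set: `stripCjkSpaces` and the character ranges are assumptions chosen for illustration, covering hiragana, katakana, and the common kanji block only.

```javascript
// Hypothetical helper, not part of ollama-japanese-cleaner's API.
// Hiragana (U+3040-309F), katakana (U+30A0-30FF) and common kanji (U+4E00-9FFF).
const CJK = '\\u3040-\\u30FF\\u4E00-\\u9FFF';

// Lookbehind/lookahead match the spaces without consuming the neighbouring
// characters, so a run such as "私 の 役" collapses in a single pass.
const SPACE_BETWEEN_CJK = new RegExp(`(?<=[${CJK}]) +(?=[${CJK}])`, 'g');

function stripCjkSpaces(text) {
  return text.replace(SPACE_BETWEEN_CJK, '');
}

console.log(stripCjkSpaces('私 の 役 割 は、個 人 の 知 識 管 理 ア シ ス タ ン ト です。'));
// → 私の役割は、個人の知識管理アシスタントです。
console.log(stripCjkSpaces('plain English text'));
// → plain English text
```

A single rule like this does not touch brackets, punctuation, or mixed ASCII, which is why the library splits the work across several rules.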

Features

Zero dependencies — single file, copy-paste ready
21 cleanup rules — CJK, katakana, numbers, filenames, punctuation, brackets (full-width & half-width), colon, markdown, short English fragments
Safe for English — cjkOnly guard prevents breaking normal English text
Universal — works in Node.js (CJS/ESM) and browsers
Iterative convergence — rules applied until no more changes (max 10 iterations)

Install

# npm
npm install ollama-japanese-cleaner

# or just copy the file
cp ollama-japanese-cleaner.js your-project/

Usage

const { cleanOllamaJapanese } = require('ollama-japanese-cleaner');

const raw = '私 の 役 割 は ア シ ス タ ン ト です。';
console.log(cleanOllamaJapanese(raw));
// → 私の役割はアシスタントです。

Browser

<script src="ollama-japanese-cleaner.js"></script>
<script>
  const cleaned = cleanOllamaJapanese(rawText);
</script>

SSE Streaming

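let text = '';
// Assumes `eventSource` is an open EventSource and `el` is the target DOM element.
eventSource.onmessage = (e) => {
  text += e.data;
  el.textContent = text;                            // raw during stream
};
// 'done' is a custom event type that the server must send; EventSource has no built-in completion event.
eventSource.addEventListener('done', () => {
  el.textContent = cleanOllamaJapanese(text);       // clean after done
});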

Supported Patterns (21 Rules)

#	Category	Before → After
1	Markdown escape	`\_test\_` → `_test_`
2	Kanji / Hiragana / Katakana	`私 の 役 割` → `私の役割`
3	CJK ↔ ASCII	`技術ドキュメント AI` → `技術ドキュメントAI`
4	Digits	`2 0 2 6 年` → `2026年`
5	Dot / Slash / Colon	`. md` → `.md`, `/ path` → `/path`
6	Underscores	`FLUX _ PRIORITY` → `FLUX_PRIORITY`
7	English fragments (1-2 + 3+)	`SK ILL` → `SKILL` (CJK context only)
8	English fragments (2+2)	`eB PF` → `eBPF` (CJK context only)
9	English fragments (1+2)	`e BP` → `eBP` (CJK context only)
10	Single char + digit	`v 1` → `v1`
11	CJK + closing punct	`文書 、解析 。` → `文書、解析。`
12	Opening bracket + CJK	`「 演出` → `「演出`
13	CJK + opening bracket	`です （ファイル` → `です（ファイル`
14	Closing bracket + CJK	`） です` → `）です`
15	CJK + half-width colon	`ファイル :01` → `ファイル:01`
16	ASCII + full-width close	`md ）` → `md）`
17	Full-width open + ASCII	`（ file` → `（file`
18	Bold markers	`** 太字 **` → `**太字**`

Rules 7-9 use iterative convergence: e BP F → eBP F → eBPF (2 iterations).
Rules 11-17 handle both standard brackets (「」（）) and full-width ASCII variants (\uFF08\uFF09).
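Why iteration is needed can be seen with a simpler case. The sketch below uses a hypothetical digit-joining pattern (`DIGIT_GAP`) and a hypothetical helper (`replaceUntilStable`); neither is taken from the library. A global replace cannot reuse a character it has already consumed, so adjacent gaps need more than one pass.

```javascript
// Hypothetical rule for illustration: join two digits separated by one space.
const DIGIT_GAP = /(\d) (\d)/g;

// Re-apply one replacement until the text stops changing (a fixed point)
// or the iteration cap is reached.
function replaceUntilStable(text, pattern, replacement, maxIterations = 10) {
  let current = text;
  let passes = 0;
  for (let i = 0; i < maxIterations; i++) {
    const next = current.replace(pattern, replacement);
    if (next === current) break; // nothing changed: converged
    current = next;
    passes++;
  }
  return { text: current, passes };
}

console.log('2 0 2 6'.replace(DIGIT_GAP, '$1$2'));
// → 20 26   (the "0 2" gap overlaps the first match, so one pass misses it)
console.log(replaceUntilStable('2 0 2 6', DIGIT_GAP, '$1$2'));
// → { text: '2026', passes: 2 }
```

The cap keeps a badly written rule, for example one whose replacement re-creates its own match, from looping forever.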

⚠️ SSE Streaming: Watch for `\r`

Some SSE implementations send \r\n line endings. If you split on \n only, a trailing \r stays at the end of each line. The cleaner will not remove it, because its rules match spaces only.

// ✅ Fix: strip \r before splitting into lines
// (buffer, decoder and value come from your own stream-reading loop)
buffer += decoder.decode(value, { stream: true });
const lines = buffer.replace(/\r/g, '').split('\n');
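When chunks can end in the middle of a line, it may also help to hold back the unfinished tail until the next chunk arrives. The following is a minimal sketch under that assumption; `createLineSplitter` is a hypothetical helper, not something the library exports.

```javascript
// Hypothetical helper: buffers decoded chunks and returns only complete lines,
// with any trailing \r removed.
function createLineSplitter() {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // last element is an unfinished line (or '')
    return lines.map((line) => line.replace(/\r$/, ''));
  };
}

const push = createLineSplitter();
console.log(push('data: 私 の\r\nda'));
// → [ 'data: 私 の' ]
console.log(push('ta: 役 割\r\n'));
// → [ 'data: 役 割' ]
```

Stripping \r per line rather than from the whole buffer leaves any \r inside the payload itself untouched.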

Test

npm test
# 35 passed, 0 failed

API

`cleanOllamaJapanese(text, options?)`

Param	Type	Default	Description
`text`	`string`	—	Raw LLM output text
`options.maxIterations`	`number`	`10`	Max cleanup iterations

Returns: cleaned string

`RULES`

Exported array of rule objects. You can push your own rules:

const { RULES } = require('ollama-japanese-cleaner');
RULES.push({ pattern: /custom/g, replacement: 'fix', phase: 1 });

Each rule has:

pattern — RegExp with /g flag
replacement — replacement string
phase — 0 (apply once) or 1 (apply iteratively)
cjkOnly — optional, if true only applied when CJK chars exist in text
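To make the rule fields concrete, here is a sketch of how a rule list in this shape could be applied. It is an illustration under assumptions, not the library's source: `applyRules`, `HAS_CJK`, and both sample rules are hypothetical, and the real CJK check may cover more ranges.

```javascript
// Assumption: hiragana, katakana and common kanji count as "CJK".
const HAS_CJK = /[\u3040-\u30FF\u4E00-\u9FFF]/;

// Hypothetical rules in the documented shape.
const sampleRules = [
  // phase 0: unescape markdown underscores once
  { pattern: /\\_/g, replacement: '_', phase: 0 },
  // phase 1: join a 1-2 letter fragment with a following 3+ letter fragment
  { pattern: /\b([A-Za-z]{1,2}) ([A-Za-z]{3,})\b/g, replacement: '$1$2', phase: 1, cjkOnly: true },
];

function applyRules(text, rules, maxIterations = 10) {
  // cjkOnly rules are dropped entirely when the text contains no CJK characters.
  const hasCjk = HAS_CJK.test(text);
  const active = rules.filter((rule) => !rule.cjkOnly || hasCjk);
  const runPhase = (input, phase) =>
    active
      .filter((rule) => rule.phase === phase)
      .reduce((acc, rule) => acc.replace(rule.pattern, rule.replacement), input);

  let current = runPhase(text, 0); // phase 0 runs exactly once
  for (let i = 0; i < maxIterations; i++) {
    const next = runPhase(current, 1); // phase 1 repeats until stable
    if (next === current) break;
    current = next;
  }
  return current;
}

console.log(applyRules('\\_test\\_', sampleRules));      // → _test_
console.log(applyRules('技術 SK ILL', sampleRules));      // → 技術 SKILL
console.log(applyRules('I am going home', sampleRules)); // → I am going home
```

Without the `cjkOnly` guard, the fragment rule would turn "am going" into "amgoing", which is the kind of damage to English text the guard is meant to prevent.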

Tested Models

gemma3:12b — production use in Lethe
Should work with any Ollama model that exhibits the same tokenization artifacts

License

MIT

🇯🇵 日本語