unixwzrd/UnicodeFix
Normalizes Unicode to ASCII equivalents and removes Unicode artifacts from AI-generated text from ChatGPT, Anthropic, Google and more.
UnicodeFix - *Wolf Edition v1.2.1* - it solves "problems."
Last updated: 2026-03-06
Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code & docs squeaky clean for real humans.
Ever open up a file and instantly know it came from ChatGPT, Copilot, or one of their AI cousins? (Yeah, so can everyone else now.)
UnicodeFix vaporizes all the weird dashes, curly quotes, invisible space ninjas, and digital "tells" that out you as an AI user - or just make your stuff fail linters and code reviews.
Whether you're a student, a dev, or an open-source rebel: this is your "eraser for AI breadcrumbs."
Yes, it helps students cheat on their homework.
It also makes blog posts and AI-proofed emails look like you sweated over every character.
Nearly a thousand people have grabbed it. Nobody's bought me a coffee yet, but hey… there's a first time for everything.
Two modes (cleaner + auditor)
- Clean mode (default): scrub Unicode artifacts from files or stdin → stdout.
- Audit mode (--report): scan text for anomalies + (optional) semantic metrics. Works for CI gates, pre-commit hooks, and yes - professors looking for shenanigans.
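To picture what "audit mode" means, here is a minimal sketch that counts suspicious characters per category without touching the text. The category names and character sets are assumptions made for illustration; UnicodeFix's real scanner may group things differently.

```python
import unicodedata
from collections import Counter

# Hypothetical category sets for illustration; not UnicodeFix's actual tables.
SMART_QUOTES = "\u2018\u2019\u201c\u201d"
DASHES = "\u2013\u2014"


def audit(text: str) -> Counter:
    """Count suspicious characters per category without modifying the text."""
    counts = Counter()
    for ch in text:
        if ch in SMART_QUOTES:
            counts["smart_quotes"] += 1
        elif ch in DASHES:
            counts["dashes"] += 1
        elif unicodedata.category(ch) == "Cf":  # format chars: ZWSP, ZWJ, bidi controls
            counts["invisible"] += 1
        elif ord(ch) > 127:
            counts["other_non_ascii"] += 1
    return counts


sample = "\u201cHi\u201d \u2014 there\u200b"
print(dict(audit(sample)))  # {'smart_quotes': 2, 'dashes': 1, 'invisible': 1}
```

Clean mode would rewrite those characters; audit mode only reports them, which is what makes it safe for CI gates.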
A combination of Jules and Vincent... plus Winston Wolf. It solves problems.
Why Is This Happening?
Some folks think all this Unicode cruft is a side-effect of generative AI's training data. Others believe it's a deliberate move - baked-in "watermarks" to ID machine-generated text. Either way: these artifacts leave a trail. UnicodeFix wipes it.
Be careful: professors and reviewers may even start planting Unicode honeypots in starter code or essays - UnicodeFix torches those too. In this "AI Arms Race," diff and vimdiff are your night-vision goggles.
Installation
Clone the repository and run the setup script:
git clone https://github.com/unixwzrd/UnicodeFix.git
cd UnicodeFix
# Installs from pyproject.toml.
# Reuses an active non-base Conda env if you already have one.
# Otherwise it creates or reuses a local .venv.
./setup.sh
The setup.sh script:
- Uses pyproject.toml as the single source of truth for dependencies
- Reuses your active non-base Conda environment when one is already active
- Otherwise creates or reuses a local .venv
- Installs the package directly instead of requiring a second manual pip install step
Optional install modes:
./setup.sh --dev # editable install + dev tooling
./setup.sh --nlp # optional NLP/metrics dependencies
See setup.sh for the nitty-gritty. If the executable bit is stripped by your tooling, bash setup.sh --nlp works too.
For serious environment nerds: VenvUtil is my full-featured Python env toolkit.
Usage
Once installed and activated:
(ConnectomeAI) [unixwzrd@xanax: unicodefix]$ cleanup-text --help
usage: cleanup-text [-h] [-i] [-Q] [-D] [--keep-fullwidth-brackets] [-n] [-o OUTPUT] [-t] [-p] [--report] [--csv | --json] [--label LABEL] [--threshold THRESHOLD] [--metrics] [--metrics-help] [--exit-zero] [--no-color] [-q] [infile ...]
Clean Unicode quirks from text. STDIN→STDOUT if no files; otherwise writes .clean files or -o.
positional arguments:
infile Input file(s)
options:
-h, --help show this help message and exit
-i, --invisible Preserve invisible Unicode (ZW*, bidi controls)
-Q, --keep-smart-quotes
Preserve Unicode smart quotes
-D, --keep-dashes Preserve Unicode EN/EM dashes
--keep-fullwidth-brackets
Preserve fullwidth square brackets (【】)
-n, --no-newline Do not add a final newline
-o OUTPUT, --output OUTPUT
Output filename or '-' for STDOUT (only valid with one input)
-t, --temp In-place clean via .tmp swap, then write back
-p, --preserve-tmp With -t, keep the .tmp file after success
--report Audit counts per category (no changes)
--csv With --report, emit CSV (one row per file)
--json With --report, emit JSON
--label LABEL When reading from STDIN ('-'), use this display name in report/CSV
--threshold THRESHOLD
With --report, exit 1 if total anomalies >= N
--metrics Include semantic metrics and imply report mode
--metrics-help Explain metrics and arrows (↑/↓).
--exit-zero Always exit with code 0 (useful for pre-commit reporting)
--no-color Disable ANSI colors (plain output)
-q, --quiet Suppress status lines on stderr
New options
- -Q, --keep-smart-quotes: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
- -D, --keep-dashes: Preserve Unicode dash and hyphen variants. Useful when stylistic punctuation is desired in prose. Default behavior folds non-breaking hyphens and EN-style dashes to -, and EM-style bars to -.
- --keep-fullwidth-brackets: Preserve fullwidth square brackets (【】). By default, they are folded to ASCII [] to keep monospace alignment in terminals and fixed-width tables.
- -R, --report: Audit text for anomalies, human-readable.
- -J, --json: Audit text for anomalies, JSON format.
- -T, --threshold: Fail CI when total anomalies reach the threshold (exit 1 if total anomalies >= N).
- --metrics: Attach experimental semantic metrics (entropy, AI-score, etc.) and implicitly switch to report mode unless you explicitly request cleaned output with -o or -t, in which case the clean output is written and the report is shown on stderr.
- --metrics-help: Print friendly descriptions of each metric and the ↑/↓ hints.
- --exit-zero: Force a zero exit code for report mode (handy for informative hooks/CI jobs).
- -H, --help: Show help message and exit.
- -V, --version: Show version and exit.
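The keep-flags above boil down to "fold to ASCII unless told otherwise." A minimal sketch of that idea using str.translate follows; the mapping tables and the fold() helper are hypothetical and much smaller than whatever UnicodeFix actually ships.

```python
# Hypothetical mapping tables for illustration only; not UnicodeFix's actual tables.
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
DASHES = {"\u2011": "-", "\u2013": "-", "\u2014": "-"}
BRACKETS = {"\u3010": "[", "\u3011": "]"}


def fold(text, keep_quotes=False, keep_dashes=False, keep_brackets=False):
    """Fold typographic characters to ASCII unless a keep_* flag preserves them."""
    table = {}
    if not keep_quotes:
        table.update(QUOTES)
    if not keep_dashes:
        table.update(DASHES)
    if not keep_brackets:
        table.update(BRACKETS)
    # str.maketrans accepts a dict of single-character keys
    return text.translate(str.maketrans(table))


print(fold("\u201cx\u201d \u2013 \u3010y\u3011"))  # "x" - [y]
```

Building the table per call keeps each flag independent, so preserving quotes never affects how dashes or brackets are handled.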
When to preserve invisible characters (-i)
In most code/CI workflows, invisible/bidi controls are accidental and should be removed (default). Rare cases to preserve (-i):
- Linguistic text where ZWJ/ZWNJ influence shaping
- Intentional watermarks/markers in text
- Forensic/debug inspections before deciding what to strip
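For the forensic case, it helps to see what is hiding in the text before stripping it. The sketch below treats every character in Unicode general category Cf (format) as "invisible", which is an assumption; UnicodeFix's own list may be narrower or wider. Both helper names are made up for this example.

```python
import unicodedata


def list_invisible(text: str):
    """Return (index, code point, name) for each format character, for inspection."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(text: str, preserve: bool = False) -> str:
    """Drop format characters (category Cf) unless preserve is set."""
    if preserve:
        return text
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


print(list_invisible("a\u200db"))  # [(1, 'U+200D', 'ZERO WIDTH JOINER')]
print(strip_invisible("a\u200bb\u200dc"))  # abc
```

Inspect first, then decide: a ZWJ inside an emoji sequence or a ZWNJ in Persian text is doing real work, while one in a Python source file almost never is.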
Python API Usage
UnicodeFix provides a clean Python API for programmatic text cleaning and analysis. Import and use the functions directly in your Python code:
from unicodefix.transforms import clean_text, handle_newlines
from unicodefix.scanner import scan_text_for_report
from unicodefix.report import print_human, print_json
from unicodefix.metrics import compute_metrics # Experimental
# Clean text with default settings (aggressive normalization)
cleaned = clean_text("“Hello” — world…")
# Clean with preservation options
cleaned = clean_text(
    text="‘Smart quotes’ and — dashes",
    preserve_quotes=True,     # Keep smart quotes
    preserve_dashes=True,     # Keep em/en dashes
    preserve_invisible=False  # Remove invisible chars (default)
)
# Scan text for anomalies (report mode)
anomalies = scan_text_for_report("“text”\u200b")
# Returns: {'unicode_ghosts': {...}, 'typographic': {...}, ...}
# Generate human-readable report
print_human("file.txt", anomalies)
# Generate JSON report
print_json({"file.txt": anomalies})
# Compute semantic metrics (experimental, requires NLTK)
metrics = compute_metrics("Some text to analyze...")
# Returns: {'entropy': 0.85, 'ai_score': 0.42, ...}
See API Documentation for complete details on all available functions, parameters, and return values.
Brief Examples
Pipe / Filter (STDIN to STDOUT)
cat file.txt | cleanup-text > cleaned.txt
Batch Clean
cleanup-text *.txt
In-Place (Safe) Clean
cleanup-text -t myfile.txt
Preserve Temp File for Backup
cleanup-text -t -p myfile.txt
Audit only (no changes), human-readable
cleanup-text --report foo.txt
Audit as JSON
cleanup-text --report --json foo.txt
Audit with Semantic Metrics (experimental)
cleanup-text --metrics foo.txt
cleanup-text --report --json --metrics foo.txt
--metrics now implies report mode, so the first command prints a human-readable report with metrics and the second emits JSON. If you explicitly request cleaned output with -o or -t, the clean output is still written and the human-readable report is emitted on stderr. Install the optional NLP support with ./setup.sh --nlp.
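For a feel of what "semantic metrics" like entropy and diversity measure, here is a stdlib-only sketch. These are textbook definitions (Shannon entropy over characters, type-token ratio over words), offered as an assumption about the general idea; they are not UnicodeFix's formulas, and the function names are invented for this example.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def type_token_ratio(text: str) -> float:
    """Lexical diversity: distinct lowercase words divided by total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


print(char_entropy("aabb"))  # 1.0
print(type_token_ratio("the cat the dog"))  # 0.75
```

Very uniform, repetitive text pushes both numbers down, which is the kind of signal a heuristic score can build on.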
Report without blocking commits
cleanup-text --report --metrics --exit-zero foo.txt
Emits the full report (and metrics if requested) but always returns exit code 0, so informational pre-commit hooks and dashboards can surface issues without aborting the workflow.
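How --exit-zero and --threshold interact can be sketched as a small decision function. This follows the help text ("exit 1 if total anomalies >= N", "always exit with code 0"); the function name is hypothetical, and returning 0 when no threshold is given is an assumption, not something the docs state.

```python
def report_exit_code(total_anomalies, threshold=None, exit_zero=False):
    """Pick a process exit code for report mode (illustrative, not the real CLI code)."""
    if exit_zero:
        return 0  # --exit-zero: always succeed, report is informational only
    if threshold is not None and total_anomalies >= threshold:
        return 1  # --threshold N: fail when total anomalies >= N
    return 0  # assumption: no threshold means the report never fails the run


print(report_exit_code(3, threshold=1))  # 1
print(report_exit_code(3, threshold=1, exit_zero=True))  # 0
```

Note the >= comparison: a threshold of 1 fails on the very first anomaly.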
Fail CI if anomalies reach threshold
cleanup-text --report --threshold 1 some/*.txt
Using in vi/vim/macvim
:%!cleanup-text
Works great for vi/Vim purists, VS Code hipsters, or anyone who just wants their text to behave like text.
Also handy if you’re trying to slip your AI-generated code past your CS prof without curly quotes giving you away.
You can run it from Vim, VS Code in Vim mode, or as a pre-commit hook. Use it for email, blog posts, whatever. Ignore the naysayers - this is real-world convenience.
See cleanup-text.md for deeper dives and arcane options.
- Make sure your Python environment is activated before launching your editor, or wrap it in a shell script that does it for you.
- Adjust your editor's shell settings as needed for best results.
What's New / What's Cool
CodexExorcism+ Release (Sept 2025)
- CI/CD pipeline hardened: full cross-platform test matrix (Ubuntu/macOS × Python 3.9–3.12) with integration tests + lint + shellcheck.
- Cross-platform verified: tested on GitHub Actions ubuntu-latest and macos-latest runners across Python 3.9–3.12 (see .github/workflows/ci.yml for the current OS images).
- Regression protection: added/strengthened newline preservation validation to prevent accidental line-collapsing failures.
- CLI/report consistency: clarified and aligned filter vs file modes, output handling, and audit/report formatting.
- Scanner improvements: improved anomaly detection and reporting accuracy; cleaner category breakdowns.
- Transform refinements: tightened Unicode cleaning behavior while preserving text structure (EOL handling, whitespace normalization).
- Docs refresh: README + CLI docs updated to better explain “clean mode vs audit mode” and the “forensics” use case.
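The newline-preservation item in the list above can be pictured as a check of roughly this shape. Everything here is a hypothetical stand-in: toy_clean and collapsing_clean are made-up cleaners, and the real test suite may validate line structure differently.

```python
def preserves_line_structure(original: str, cleaned: str) -> bool:
    """True if cleaning kept the same number of newline characters."""
    return original.count("\n") == cleaned.count("\n")


def toy_clean(text: str) -> str:
    # Hypothetical stand-in cleaner: folds curly double quotes only.
    return text.replace("\u201c", '"').replace("\u201d", '"')


def collapsing_clean(text: str) -> str:
    # Deliberately buggy: whitespace normalization that also eats newlines.
    return " ".join(toy_clean(text).split())


doc = "line \u201cone\u201d\nline two\n"
print(preserves_line_structure(doc, toy_clean(doc)))  # True
print(preserves_line_structure(doc, collapsing_clean(doc)))  # False
```

A check like this is cheap to run over every fixture, and it catches the class of bug where a whitespace cleanup quietly turns a file into one long line.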
CodexExorcism+ Release (Sept 2025)
The follow-up release keeps the Unicode exorcism vibe but layers on early-stage semantics:
- Semantic metrics preview – opt into --metrics for entropy, diversity, repetition, and a heuristic AI-likeness score; it automatically switches into report mode.
- Metrics legend on demand – --metrics-help explains every stat plus the ↑/↓ hints.
- Hook-friendly reporting – --exit-zero means pre-commit hooks can flag anomalies without blocking your commit.
- Slimmer all-in-one test harness – tests/test_all.sh derives its run list from data/, handles STDIN/STDOUT quirks, and drops per-scenario diffs/word-count deltas.
Fun fact: Python chokes on "curly quotes" used as string delimiters (SyntaxError), yet happily runs code that carries them inside strings and comments. Your IDE, email client, and browser all sneak these in. UnicodeFix hunts them down and torches them, so your coding homework looks lovingly hand-crafted at 4:37 a.m., rather than LLM spawn.
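You can check how Python itself treats curly quotes with the built-in compile(). This sketch is independent of UnicodeFix; the parses() helper is just a name chosen for the example.

```python
def parses(source: str) -> bool:
    """Return True if Python can compile the given source text."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


curly_delimiters = "print(\u201chello\u201d)"  # curly quotes used as string delimiters
curly_inside = 'print("\u201chello\u201d")'  # curly quotes inside a normal string

print(parses(curly_delimiters))  # False
print(parses(curly_inside))  # True
```

The second case is the sneaky one: the code runs fine, but the curly quotes ride along in your output and your diffs.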
Keep It Fresh
Pull requests/issues always welcome - especially if your AI friend slipped a new weird Unicode gremlin past me. I found a few more while preparing this release too...🙄
Shortcut for macOS
UnicodeFix ships with a macOS Shortcut for direct Finder integration.
Right-click files, pick a Quick Action, and - bam - no terminal required.
To add the Shortcut
- Open the Shortcuts app.
- Choose File -> Import.
- Select the Shortcut in macOS/Strip Unicode.shortcut.
- Edit it to point to the cleanup-text executable in your active environment.
- Relaunch Finder (Cmd+Opt+Esc → select Finder → Relaunch) if needed.
- After setup, right-click files, choose Quick Actions, select Strip Unicode.
What's in This Repository
- src/unicodefix/cli.py - CLI entry point
- src/unicodefix/transforms.py - Unicode normalization logic
- src/unicodefix/scanner.py - Audit/report scanner
- pyproject.toml - Packaging metadata and dependency source of truth
- setup.sh - Unified bootstrap/install script
- bin/uniclean.sh - Shell helper
- data/ - Example test files
- tests/ - Automated test suite for features and regressions
- docs/ - Documentation and screenshots
- LICENSE
- README.md - This file
Testing and CI/CD
UnicodeFix comes with a full, automated test suite:
- Drives every scenario against the canonical file list in data/
- Writes diffs and normalized word-count summaries into test_output/<scenario>/
- Run it with: tests/test_all.sh
- Clean up with: tests/test_all.sh clean
- STDIN/STDOUT coverage skips the binary fixtures (everything else still runs on them)
- Plug into your CI/CD pipeline or just use it as a "paranoia check" before shipping anything
Pro tip: Run the tests before you merge, publish, or email a "final" version.
See docs/test-suite.md for the deep dive.
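If you want the same kind of "diff plus word-count delta" paranoia check from Python, a minimal sketch with difflib looks like this. The diff_summary() helper and its output format are assumptions for illustration; tests/test_all.sh produces its own files and may compute things differently.

```python
import difflib


def diff_summary(original: str, cleaned: str, name: str = "sample.txt"):
    """Return a unified diff and the word-count delta between two texts."""
    diff = "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            cleaned.splitlines(keepends=True),
            fromfile=name,
            tofile=name + ".clean",
        )
    )
    # A clean pass should change characters, not the number of words.
    delta = len(cleaned.split()) - len(original.split())
    return diff, delta


diff, delta = diff_summary("\u201chi\u201d there\n", '"hi" there\n')
print(diff)
print("word-count delta:", delta)
```

A word-count delta of zero alongside a non-empty diff is the happy path: characters were swapped, but no words went missing.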
Contributing
Feedback, bug reports, and patches welcome.
If you've got a better integration path for your favorite platform, let's make it happen.
Pull requests with attitude, creativity, and clean diffs appreciated.
Support This and Other Projects
If UnicodeFix (or my other projects) saved your bacon or made you smile, please consider fueling my caffeine habit and indie dev obsession...
Quite a bit of effort goes into preparing these releases. *One coffee = one more tool released to the wild...*🤔
Thank you for keeping solo development alive!
Changelog
See CHANGELOG.md for the latest drop.
License
Copyright 2025
unixwzrd@unixwzrd.ai
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
