unixwzrd/UnicodeFix

Normalizes Unicode to ASCII equivalents and removes Unicode artifacts from AI-generated text from ChatGPT, Anthropic, Google and more.

UnicodeFix - *Wolf Edition v1.2.1* - it solves "problems."

Last updated: 2026-03-06


Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code & docs squeaky clean for real humans.

Ever open up a file and instantly know it came from ChatGPT, Copilot, or one of their AI cousins? (Yeah, so can everyone else now.)
UnicodeFix vaporizes all the weird dashes, curly quotes, invisible space ninjas, and digital "tells" that out you as an AI user - or just make your stuff fail linters and code reviews.

Whether you're a student, a dev, or an open-source rebel: this is your "eraser for AI breadcrumbs."

Yes, it helps students cheat on their homework.
It also makes blog posts and AI-proofed emails look like you sweated over every character.
Nearly a thousand people have grabbed it. Nobody's bought me a coffee yet, but hey… there's a first time for everything.


Two modes (cleaner + auditor)

  • Clean mode (default): scrub Unicode artifacts from files or stdin → stdout.
  • Audit mode (--report): scan text for anomalies + (optional) semantic metrics. Works for CI gates, pre-commit hooks, and yes - professors looking for shenanigans.

A combination of Jules and Vincent... plus Winston Wolf. It solves problems.


Why Is This Happening?

Some folks think all this Unicode cruft is a side-effect of generative AI's training data. Others believe it's a deliberate move - baked-in "watermarks" to ID machine-generated text. Either way: these artifacts leave a trail. UnicodeFix wipes it.

Be careful: professors and reviewers may even start planting Unicode honeypots in starter code or essays - UnicodeFix torches those too. In this "AI Arms Race," diff and vimdiff are your night-vision goggles.
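
If you would rather see exactly which characters diff is flagging, a few lines of standard-library Python can list every non-ASCII character with its position, code point, and Unicode name. This is a minimal sketch and not part of UnicodeFix; `reveal_non_ascii` is a hypothetical helper name.

```python
import unicodedata


def reveal_non_ascii(text):
    """Return (line, column, codepoint, name) for every non-ASCII character."""
    findings = []
    # Split on "\n" only, so exotic separators such as U+2028 are reported
    # as findings instead of being silently consumed as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


# Sample with curly quotes, a zero-width space, and an em dash (as escapes).
sample = "print(\u201chello\u201d)\u200b\nx = 1 \u2014 2\n"
for finding in reveal_non_ascii(sample):
    print(finding)
```

Run it on a suspicious file's contents before cleaning to see what would be removed and where.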


Installation

Clone the repository and run the setup script:

git clone https://github.com/unixwzrd/UnicodeFix.git
cd UnicodeFix

# Installs from pyproject.toml.
# Reuses an active non-base Conda env if you already have one.
# Otherwise it creates or reuses a local .venv.
./setup.sh

The setup.sh script:

  • Uses pyproject.toml as the single source of truth for dependencies
  • Reuses your active non-base Conda environment when one is already active
  • Otherwise creates or reuses a local .venv
  • Installs the package directly instead of requiring a second manual pip install step

Optional install modes:

./setup.sh --dev   # editable install + dev tooling
./setup.sh --nlp   # optional NLP/metrics dependencies

See setup.sh for the nitty-gritty. If the executable bit is stripped by your tooling, bash setup.sh --nlp works too.

For serious environment nerds: VenvUtil is my full-featured Python env toolkit.


Usage

Once installed and activated:

(ConnectomeAI) [unixwzrd@xanax: unicodefix]$ cleanup-text --help
usage: cleanup-text [-h] [-i] [-Q] [-D] [--keep-fullwidth-brackets] [-n] [-o OUTPUT] [-t] [-p] [--report] [--csv | --json] [--label LABEL] [--threshold THRESHOLD] [--metrics] [--metrics-help] [--exit-zero] [--no-color] [-q] [infile ...]

Clean Unicode quirks from text. STDIN→STDOUT if no files; otherwise writes .clean files or -o.

positional arguments:
  infile                Input file(s)

options:
  -h, --help            show this help message and exit
  -i, --invisible       Preserve invisible Unicode (ZW*, bidi controls)
  -Q, --keep-smart-quotes
                        Preserve Unicode smart quotes
  -D, --keep-dashes     Preserve Unicode EN/EM dashes
  --keep-fullwidth-brackets
                        Preserve fullwidth square brackets (【】)
  -n, --no-newline      Do not add a final newline
  -o OUTPUT, --output OUTPUT
                        Output filename or '-' for STDOUT (only valid with one input)
  -t, --temp            In-place clean via .tmp swap, then write back
  -p, --preserve-tmp    With -t, keep the .tmp file after success
  --report              Audit counts per category (no changes)
  --csv                 With --report, emit CSV (one row per file)
  --json                With --report, emit JSON
  --label LABEL         When reading from STDIN ('-'), use this display name in report/CSV
  --threshold THRESHOLD
                        With --report, exit 1 if total anomalies >= N
  --metrics             Include semantic metrics and imply report mode
  --metrics-help        Explain metrics and arrows (↑/↓).
  --exit-zero           Always exit with code 0 (useful for pre-commit reporting)
  --no-color            Disable ANSI colors (plain output)
  -q, --quiet           Suppress status lines on stderr

New options

  • -Q, --keep-smart-quotes: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
  • -D, --keep-dashes: Preserve Unicode dash and hyphen variants. Useful when stylistic punctuation is desired in prose. Default behavior folds non-breaking hyphens and EN-style dashes to -, and EM-style bars to -.
  • --keep-fullwidth-brackets: Preserve fullwidth square brackets (【】). By default, they are folded to ASCII [] to keep monospace alignment in terminals and fixed-width tables.
  • --report: Audit text for anomalies, human-readable (no changes are made).
  • --json: With --report, emit the audit as JSON.
  • --threshold N: With --report, exit 1 if total anomalies >= N (use this to fail CI).
  • --metrics: Attach experimental semantic metrics (entropy, AI-score, etc.) and implicitly switch to report mode unless you explicitly request cleaned output with -o or -t, in which case the clean output is written and the report is shown on stderr.
  • --metrics-help: Print friendly descriptions of each metric and the ↑/↓ hints.
  • --exit-zero: Force a zero exit code for report mode (handy for informative hooks/CI jobs).
  • -h, --help: Show help message and exit.
  • -V, --version: Show version and exit.
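
The preservation flags above can be pictured as switches over a translation table. The sketch below is a stdlib-only illustration, not UnicodeFix's actual code: the function name `fold_punctuation`, its keyword arguments, and the exact character mappings are assumptions chosen for the example.

```python
# Hypothetical mapping tables; UnicodeFix's real tables cover more characters.
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
DASHES = {"\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-"}
# The two bracket characters shown in the help text.
FULLWIDTH_BRACKETS = {"\u3010": "[", "\u3011": "]"}


def fold_punctuation(text, keep_quotes=False, keep_dashes=False, keep_brackets=False):
    """Fold typographic punctuation to ASCII unless a keep_* switch is set."""
    mapping = {}
    if not keep_quotes:
        mapping.update(QUOTES)
    if not keep_dashes:
        mapping.update(DASHES)
    if not keep_brackets:
        mapping.update(FULLWIDTH_BRACKETS)
    return text.translate(str.maketrans(mapping))


sample = "\u201cquoted\u201d \u2014 \u3010note\u3011"
print(fold_punctuation(sample))  # "quoted" - [note]
```

Each keep switch simply leaves one table out of the mapping, which is why the flags combine freely.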

When to preserve invisible characters (-i)

In most code/CI workflows, invisible/bidi controls are accidental and should be removed (default). Rare cases to preserve (-i):

  • Linguistic text where ZWJ/ZWNJ influence shaping
  • Intentional watermarks/markers in text
  • Forensic/debug inspections before deciding what to strip
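
To make the trade-off concrete, the sketch below strips format-category (Cf) code points, which include the zero-width characters and bidi controls, and shows what that does to text that relies on a joiner. It is an illustration using only `unicodedata`; UnicodeFix's own definition of "invisible" may differ, and `strip_invisible` is a hypothetical name.

```python
import unicodedata


def strip_invisible(text, preserve_invisible=False):
    """Drop Unicode format characters (category Cf) unless asked to keep them."""
    if preserve_invisible:
        return text
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Accidental case: a zero-width space and a bidi override hiding in a config line.
print(strip_invisible("key\u200b=\u202evalue"))

# Intentional case: a Persian word that uses ZWNJ (U+200C) to control shaping.
word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
print(len(word), len(strip_invisible(word)))
print(strip_invisible(word, preserve_invisible=True) == word)
```

The second case is why -i exists: removing the joiner changes how the word renders, even though the remaining letters are identical.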

Python API Usage

UnicodeFix provides a clean Python API for programmatic text cleaning and analysis. Import and use the functions directly in your Python code:

from unicodefix.transforms import clean_text, handle_newlines
from unicodefix.scanner import scan_text_for_report
from unicodefix.report import print_human, print_json
from unicodefix.metrics import compute_metrics  # Experimental

# Clean text with default settings (aggressive normalization)
cleaned = clean_text("\u201cHello\u201d \u2014 world\u2026")  # curly quotes, em dash, ellipsis

# Clean with preservation options
cleaned = clean_text(
    text="\u2018Smart quotes\u2019 and \u2014 dashes",
    preserve_quotes=True,      # Keep smart quotes
    preserve_dashes=True,       # Keep em/en dashes
    preserve_invisible=False    # Remove invisible chars (default)
)

# Scan text for anomalies (report mode)
anomalies = scan_text_for_report("\u201ctext\u201d\u200b")  # curly quotes plus a zero-width space
# Returns: {'unicode_ghosts': {...}, 'typographic': {...}, ...}

# Generate human-readable report
print_human("file.txt", anomalies)

# Generate JSON report
print_json({"file.txt": anomalies})

# Compute semantic metrics (experimental, requires NLTK)
metrics = compute_metrics("Some text to analyze...")
# Returns: {'entropy': 0.85, 'ai_score': 0.42, ...}

See API Documentation for complete details on all available functions, parameters, and return values.

Brief Examples

Pipe / Filter (STDIN to STDOUT)

cat file.txt | cleanup-text > cleaned.txt

Batch Clean

cleanup-text *.txt

In-Place (Safe) Clean

cleanup-text -t myfile.txt

Preserve Temp File for Backup

cleanup-text -t -p myfile.txt
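
The safe in-place pattern behind -t can be sketched as: write the cleaned text to a sibling .tmp file, then swap it over the original. This is a hedged illustration rather than the tool's implementation; `clean_in_place`, its `cleaner` callback, and the exact `.tmp` naming are assumptions based on the help text.

```python
import os
import shutil


def clean_in_place(path, cleaner, preserve_tmp=False):
    """Clean a file via a .tmp sibling; return the .tmp path if it was kept."""
    tmp_path = path + ".tmp"
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding="utf-8", newline="") as src:
        cleaned = cleaner(src.read())
    with open(tmp_path, "w", encoding="utf-8", newline="") as tmp:
        tmp.write(cleaned)
    if preserve_tmp:
        # Copy back and leave the .tmp file behind (the -p behaviour).
        shutil.copyfile(tmp_path, path)
        return tmp_path
    # Rename over the original; atomic when both are on the same filesystem.
    os.replace(tmp_path, path)
    return None
```

Writing the temporary file first means a crash mid-clean leaves the original untouched, which is the point of the "safe" label.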

Audit only (no changes), human-readable

cleanup-text --report foo.txt

Audit as JSON

cleanup-text --report --json foo.txt
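
For a feel of what a per-file audit payload might contain, here is a stdlib sketch that buckets suspicious characters and serializes the counts. The category names (`smart_quotes`, `dashes`, `invisible`) and the JSON layout are hypothetical; the real schema is whatever cleanup-text --report --json emits.

```python
import json
import unicodedata
from collections import Counter


def audit_counts(text):
    """Count characters per (hypothetical) anomaly category without changing text."""
    counts = Counter()
    for ch in text:
        if ch in "\u2018\u2019\u201c\u201d":
            counts["smart_quotes"] += 1
        elif ch in "\u2013\u2014":
            counts["dashes"] += 1
        elif unicodedata.category(ch) == "Cf":
            counts["invisible"] += 1
    return {"total": sum(counts.values()), "categories": dict(counts)}


# One entry per file name, mirroring the "one row per file" idea of the CSV mode.
report = {"foo.txt": audit_counts("\u201cHi\u201d \u2014 there\u200b")}
print(json.dumps(report, indent=2, sort_keys=True))
```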

Audit with Semantic Metrics (experimental)

cleanup-text --metrics foo.txt
cleanup-text --report --json --metrics foo.txt

--metrics now implies report mode, so the first command prints a human-readable report with metrics and the second emits JSON. If you explicitly request cleaned output with -o or -t, the clean output is still written and the human-readable report is emitted on stderr. Install the optional NLP support with ./setup.sh --nlp.

Report without blocking commits

cleanup-text --report --metrics --exit-zero foo.txt

Emits the full report (and metrics if requested) but always returns exit code 0, so informational pre-commit hooks and dashboards can surface issues without aborting the workflow.

Fail CI if anomalies reach the threshold

cleanup-text --report --threshold 1 some/*.txt
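
Going by the help text, the exit status in report mode depends on --threshold and --exit-zero. The decision can be sketched as below. `report_exit_code` is a hypothetical helper; that --exit-zero wins over --threshold is an assumption drawn from its "always exit with code 0" description, and the exit status when no threshold is given is not modelled (assumed 0 here).

```python
def report_exit_code(total_anomalies, threshold=None, exit_zero=False):
    """Return the process exit code for an audit run (sketch, see assumptions)."""
    if exit_zero:
        # Informational mode: never block the hook or CI job.
        return 0
    if threshold is not None and total_anomalies >= threshold:
        # Help text: "exit 1 if total anomalies >= N".
        return 1
    return 0


print(report_exit_code(3, threshold=1))                  # 1
print(report_exit_code(0, threshold=1))                  # 0
print(report_exit_code(3, threshold=1, exit_zero=True))  # 0
```

With --threshold 1, a single anomaly is enough to fail the job, which makes it a strict gate.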

Using in vi/vim/macvim

:%!cleanup-text

Works great for vi/Vim purists, VS Code hipsters, or anyone who just wants their text to behave like text.
Also handy if you're trying to slip your AI-generated code past your CS prof without curly quotes giving you away.

You can run it from Vim, VS Code in Vim mode, or as a pre-commit. Use it for email, blog posts, whatever. Ignore the naysayers - this is real-world convenience.

See cleanup-text.md for deeper dives and arcane options.

  • Make sure your Python environment is activated before launching your editor, or wrap it in a shell script that does it for you.
  • Adjust your editor's shell settings as needed for best results.

What's New / What's Cool

CodexExorcism+ Release (Sept 2025)

  • CI/CD pipeline hardened: full cross-platform test matrix (Ubuntu/macOS × Python 3.9–3.12) with integration tests + lint + shellcheck.
  • Cross-platform verified: tested on GitHub Actions ubuntu-latest and macos-latest runners across Python 3.9–3.12 (see .github/workflows/ci.yml for the current OS images).
  • Regression protection: added/strengthened newline preservation validation to prevent accidental line-collapsing failures.
  • CLI/report consistency: clarified and aligned filter vs file modes, output handling, and audit/report formatting.
  • Scanner improvements: improved anomaly detection and reporting accuracy; cleaner category breakdowns.
  • Transform refinements: tightened Unicode cleaning behavior while preserving text structure (EOL handling, whitespace normalization).
  • Docs refresh: README + CLI docs updated to better explain "clean mode vs audit mode" and the "forensics" use case.

CodexExorcism+ Release (Sept 2025)

The follow-up release keeps the Unicode exorcism vibe but layers on early-stage semantics:

  • Semantic metrics preview – opt into --metrics for entropy, diversity, repetition, and a heuristic AI-likeness score; it automatically switches into report mode.
  • Metrics legend on demand – --metrics-help explains every stat plus the ↑/↓ hints.
  • Hook-friendly reporting – --exit-zero means pre-commit hooks can flag anomalies without blocking your commit.
  • Slimmer all-in-one test harness – tests/test_all.sh derives its run list from data/, handles STDIN/STDOUT quirks, and writes per-scenario diffs and word-count deltas.
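
To give a sense of what metrics such as entropy, diversity, and repetition measure, here is a stdlib-only sketch using textbook definitions: Shannon entropy over words, type-token ratio, and the share of repeated bigrams. These are not UnicodeFix's formulas, which are experimental and live in the optional NLP extras; the function name and output keys are illustrative.

```python
import math
from collections import Counter


def toy_metrics(text):
    """Compute three simple lexical statistics over whitespace-split words."""
    words = text.lower().split()
    if not words:
        return {"entropy": 0.0, "diversity": 0.0, "repetition": 0.0}
    counts = Counter(words)
    total = len(words)
    # Shannon entropy in bits per word.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Type-token ratio: unique words divided by total words.
    diversity = len(counts) / total
    # Fraction of adjacent word pairs that repeat an earlier pair.
    bigrams = list(zip(words, words[1:]))
    repeated = len(bigrams) - len(set(bigrams))
    repetition = repeated / len(bigrams) if bigrams else 0.0
    return {"entropy": entropy, "diversity": diversity, "repetition": repetition}


print(toy_metrics("the cat sat on the mat and the cat sat again"))
```

None of these numbers proves anything about authorship on its own; they only describe how varied or repetitive the wording is.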

Fun fact: Python will not run code that uses "curly quotes" as string delimiters; it raises a SyntaxError. Your IDE, email client, and browser all sneak these in. UnicodeFix hunts them down and torches them, so your coding homework looks lovingly hand-crafted at 4:37 a.m. rather than like LLM spawn.

Keep It Fresh

Pull requests/issues always welcome - especially if your AI friend slipped a new weird Unicode gremlin past me. I found a few more while preparing this release too...🙄


Shortcut for macOS

UnicodeFix ships with a macOS Shortcut for direct Finder integration.

Right-click files, pick a Quick Action, and - bam - no terminal required.

To add the Shortcut

  1. Open the Shortcuts app.
  2. Choose File -> Import.
  3. Select the Shortcut in macOS/Strip Unicode.shortcut.
  4. Edit it to point to the cleanup-text executable in your active environment.
  5. Relaunch Finder (Cmd+Opt+Esc → select Finder → Relaunch) if needed.
  6. After setup, right-click files, choose Quick Actions, select Strip Unicode.

What's in This Repository


Testing and CI/CD

UnicodeFix comes with a full, automated test suite:

  • Drives every scenario against the canonical file list in data/
  • Writes diffs and normalized word-count summaries into test_output/<scenario>/
  • Run it with: tests/test_all.sh
  • Clean up with: tests/test_all.sh clean
  • STDIN/STDOUT coverage skips the binary fixtures (everything else still runs on them)
  • Plug into your CI/CD pipeline or just use it as a "paranoia check" before shipping anything
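
The diff and word-count artifacts can be approximated in Python for quick spot checks outside the harness. The sketch below uses `difflib` and is only an illustration; tests/test_all.sh is a shell script with its own logic, and `summarize_change` is a hypothetical name.

```python
import difflib


def summarize_change(before, after, name="sample.txt"):
    """Return (unified diff text, word-count delta) for a before/after pair."""
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=name,
            tofile=name + ".clean",
        )
    )
    # A non-zero delta means cleaning merged or split words, which is worth a look.
    word_delta = len(after.split()) - len(before.split())
    return diff, word_delta


diff_text, delta = summarize_change("a \u2014 b\n", "a - b\n")
print(diff_text)
print("word delta:", delta)
```

An empty diff with a zero delta means the input was already clean.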

Pro tip: Run the tests before you merge, publish, or email a "final" version.

See docs/test-suite.md for the deep dive.


Contributing

Feedback, bug reports, and patches welcome.

If you've got a better integration path for your favorite platform, let's make it happen.
Pull requests with attitude, creativity, and clean diffs appreciated.


Support This and Other Projects

If UnicodeFix (or my other projects) saved your bacon or made you smile, please consider fueling my caffeine habit and indie dev obsession...

Quite a bit of effort goes into preparing these releases. *One coffee = one more tool released to the wild...*🤔

Thank you for keeping solo development alive!


Changelog

See CHANGELOG.md for the latest drop.


License

Copyright 2025
unixwzrd@unixwzrd.ai

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.