
rust-word-analyzer


A CLI tool that reads .txt and .docx files, counts word frequencies, and displays results sorted alphabetically and by count — built with custom linked list data structures in Rust.

This is an educational project that solves a real problem (word frequency analysis) using deliberately low-level data structures to explore Rust's ownership model, unsafe code, and pointer semantics.

Why Linked Lists?

A HashMap<String, usize> counts words in 5 lines. This project uses hand-rolled linked lists instead — not because it's practical, but because linked lists are the canonical hard problem in Rust.

Rust's ownership system makes linked lists genuinely difficult: every node owns the next, you can't have cycles without Rc/RefCell, and mutation requires careful management of borrows. This project tackles that head-on with real, working implementations.
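Even the simplest operation shows the friction. A minimal sketch (the type and field names here are illustrative, not the project's actual code): prepending a node forces you to move the old head out with Option::take before the new node can own it.

```rust
struct Node {
    word: String,
    next: Option<Box<Node>>,
}

struct List {
    head: Option<Box<Node>>,
}

impl List {
    fn new() -> Self {
        List { head: None }
    }

    // Even a simple prepend needs Option::take: the old head must be
    // moved out of self before the new node can own it, otherwise the
    // borrow checker rejects the code.
    fn push_front(&mut self, word: &str) {
        let node = Box::new(Node {
            word: word.to_string(),
            next: self.head.take(),
        });
        self.head = Some(node);
    }
}

fn main() {
    let mut list = List::new();
    for w in ["three", "two", "one"] {
        list.push_front(w);
    }
    // Walk the list by shared reference to print it.
    let mut cur = &list.head;
    let mut words = Vec::new();
    while let Some(node) = cur {
        words.push(node.word.clone());
        cur = &node.next;
    }
    assert_eq!(words, ["one", "two", "three"]);
}
```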

Learning Goals

Concept                          Where it appears
Box<T> heap allocation           Node storage — each node owns the next via Box<WordNode>
Raw pointer manipulation         Tail pointer as *mut WordNode for O(1) append
unsafe blocks                    Dereferencing raw pointers for tail updates
Ownership transfer               Moving nodes between positions during sort
Insertion sort on a linked list  Alphabetical ordering during initial word collection
Merge sort on a linked list      Re-sorting by word count after collection
Error handling with Result       File I/O, XML parsing, ZIP extraction
Custom macros                    gprintln! and rprintln! for colored output
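As a rough illustration of the alphabetical insertion sort during collection, here is a self-contained sketch (type and field names are assumptions, not the project's actual code) that inserts each word at its sorted position and bumps the count on a repeat:

```rust
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>,
}

impl WordList {
    fn new() -> Self {
        WordList { head: None }
    }

    fn add(&mut self, word: &str) {
        let mut cur = &mut self.head;
        // Walk forward while the current node's word sorts before `word`.
        while cur.as_ref().map_or(false, |n| n.word.as_str() < word) {
            cur = &mut cur.as_mut().unwrap().next;
        }
        // Repeat word: just increment the count.
        if let Some(node) = cur.as_mut() {
            if node.word == word {
                node.count += 1;
                return;
            }
        }
        // New word: splice a node in at the sorted position.
        let next = cur.take();
        *cur = Some(Box::new(WordNode {
            word: word.to_string(),
            count: 1,
            next,
        }));
    }
}

fn main() {
    let mut list = WordList::new();
    for w in "the quick fox the quick the".split_whitespace() {
        list.add(w);
    }
    let mut cur = &list.head;
    while let Some(node) = cur {
        println!("{} x{}", node.word, node.count); // fox x1, quick x2, the x3
        cur = &node.next;
    }
}
```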

What makes this tricky in Rust

// This pattern — keeping a raw tail pointer alongside an owned head — is
// the core tension. Box gives you ownership, but the tail needs to mutate
// a node that Box already owns. You end up in unsafe territory:

struct WordList {
    head: Option<Box<WordNode>>,   // Owns the list
    tail: *mut WordNode,           // Points into it (unsafe)
}

Other languages let you do this trivially with garbage collection. Rust forces you to reason about who owns what, and this project is a worked example of navigating that.
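Building on the struct above, a sketch of how the owned-head/raw-tail combination enables O(1) append (illustrative types; the project's real implementation may differ):

```rust
use std::ptr;

struct WordNode {
    word: String,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>, // Owns the list
    tail: *mut WordNode,         // Points into it (unsafe)
}

impl WordList {
    fn new() -> Self {
        WordList { head: None, tail: ptr::null_mut() }
    }

    // O(1) append: the raw tail pointer lets us mutate the last node
    // without walking the list, at the cost of an unsafe block.
    fn push_back(&mut self, word: &str) {
        let mut node = Box::new(WordNode { word: word.to_string(), next: None });
        // Take a raw pointer to the heap allocation before moving the Box;
        // the allocation itself does not move, so the pointer stays valid.
        let raw: *mut WordNode = &mut *node;
        if self.tail.is_null() {
            self.head = Some(node);
        } else {
            // SAFETY: tail points at the last node, which is still owned
            // by the list (via head) and outlives this call.
            unsafe { (*self.tail).next = Some(node); }
        }
        self.tail = raw;
    }
}

fn main() {
    let mut list = WordList::new();
    for w in ["one", "two", "three"] {
        list.push_back(w);
    }
    let mut cur = &list.head;
    while let Some(node) = cur {
        println!("{}", node.word); // one, two, three in insertion order
        cur = &node.next;
    }
}
```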

Features

  • Reads .txt files (line-by-line word extraction)
  • Reads .docx files (ZIP archive extraction + XML content parsing)
  • Singly linked list with alphabetical insertion sort
  • Doubly linked list variant (experimental)
  • Merge sort to re-order by word count
  • Colored terminal output
  • Integration tests with known expected output

Usage

cargo run -- path/to/file.txt
cargo run -- path/to/document.docx

Example Output

Printing list sorted alphabetically:
Node 1: Count: 5 Word: eight
Node 2: Count: 1 Word: five
Node 3: Count: 1 Word: four
...

Printing list sorted by word count:
Node 1: Count: 13 Word: one
Node 2: Count: 9 Word: two
Node 3: Count: 5 Word: ten
...

Total non-unique words: 50
Total unique words: 10

Project Structure

src/
├── main.rs                            # CLI entry point, file type routing
├── readfile.rs                        # .txt and .docx file parsers
├── word_tracker_single_linkedlist.rs  # Singly linked list (primary)
├── word_tracker_double_linkedlist.rs  # Doubly linked list (experimental)
└── utilities/
    ├── mod.rs                         # Module declarations
    └── print_utils.rs                 # Colored terminal output macros

tests/
├── sort_by_word_count_txt_test.rs     # Integration test for .txt files
├── sort_by_word_count_docx_test.rs    # Integration test for .docx files
└── data/
    ├── input.txt                      # Test fixture
    └── input.docx                     # Test fixture (same content as .txt)
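The gprintln! and rprintln! macros might be built roughly like this. This is a hypothetical, dependency-free reconstruction using raw ANSI escape codes; the real project delegates to the colored crate:

```rust
// Hypothetical stand-ins for the project's gprintln!/rprintln! macros,
// written with raw ANSI escapes so the sketch needs no external crates.
fn green(s: &str) -> String {
    format!("\x1b[32m{}\x1b[0m", s)
}

fn red(s: &str) -> String {
    format!("\x1b[31m{}\x1b[0m", s)
}

// Forward any println!-style arguments, then wrap the result in color.
macro_rules! gprintln {
    ($($arg:tt)*) => {
        println!("{}", green(&format!($($arg)*)))
    };
}

macro_rules! rprintln {
    ($($arg:tt)*) => {
        println!("{}", red(&format!($($arg)*)))
    };
}

fn main() {
    gprintln!("Total unique words: {}", 10);     // green output
    rprintln!("error: unsupported file type");   // red output
}
```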

Testing

cargo test

Tests run the full binary against known input files and validate:

  • Alphabetical sort order and word counts
  • Word-count sort order (descending)
  • Correct handling of runs of whitespace and line breaks

Dependencies

Crate      Purpose
zip        Extract .docx ZIP archives
quick-xml  Parse Word document XML
colored    Colored terminal output

License

MIT License. See LICENSE for details.