simons-hub/rust-word-analyzer
A word frequency analyzer built on custom linked list data structures in Rust. An educational project exploring ownership, unsafe code, and pointer semantics.
rust-word-analyzer
A CLI tool that reads .txt and .docx files, counts word frequencies, and displays results sorted alphabetically and by count — built with custom linked list data structures in Rust.
This is an educational project that solves a real problem (word frequency analysis) using deliberately low-level data structures to explore Rust's ownership model, unsafe code, and pointer semantics.
Why Linked Lists?
A HashMap<String, usize> counts words in 5 lines. This project uses hand-rolled linked lists instead — not because it's practical, but because linked lists are the canonical hard problem in Rust.
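For comparison, the HashMap baseline could look roughly like this. It is a minimal sketch, not code from this repository: `count_words` is a hypothetical name, and words are split on whitespace only, with no case folding or punctuation handling.

```rust
use std::collections::HashMap;

/// Counts whitespace-separated words in `text`.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Insert the word with a count of 0 if it is new, then increment.
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_words("one two one\nthree  one two");
    println!("{:?}", counts.get("one"));
}
```

The map gives no ordering, so the sorted views this tool prints would still need an extra sorting step.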
Rust's ownership system makes linked lists genuinely difficult: every node owns the next, back-references and cycles need Rc/RefCell (or raw pointers), and mutation requires careful management of borrows. This project tackles that head-on with real, working implementations.
Learning Goals
| Concept | Where it appears |
|---|---|
| Box<T> heap allocation | Node storage: each node owns the next via Box<WordNode> |
| Raw pointer manipulation | Tail pointer as *mut WordNode for O(1) append |
| unsafe blocks | Dereferencing raw pointers for tail updates |
| Ownership transfer | Moving nodes between positions during sort |
| Insertion sort on a linked list | Alphabetical ordering during initial word collection |
| Merge sort on a linked list | Re-sorting by word count after collection |
| Error handling with Result | File I/O, XML parsing, ZIP extraction |
| Custom macros | gprintln! and rprintln! for colored output |
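To illustrate the merge sort row, here is a minimal sketch of a top-down merge sort over a Box-based list, ordering by descending count. The `WordNode` fields, the `Link` alias, and every function name are assumptions made for illustration; the project's own implementation and its tie-breaking may differ. This version uses only safe code and leaves equal counts in their existing order.

```rust
// Hypothetical node layout; field names are assumptions.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

type Link = Option<Box<WordNode>>;

/// Splits a list into (first half, second half).
fn split(mut head: Link) -> (Link, Link) {
    let mut len = 0usize;
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        len += 1;
        cur = node.next.as_deref();
    }
    // Walk to the link that starts the second half and detach it.
    let mut cursor = &mut head;
    for _ in 0..len / 2 {
        cursor = &mut cursor.as_mut().unwrap().next;
    }
    let back = cursor.take();
    (head, back)
}

/// Merges two lists that are each already sorted by descending count.
fn merge(mut a: Link, mut b: Link) -> Link {
    let mut head: Link = None;
    let mut tail = &mut head; // always the empty link at the end of the output
    loop {
        // Ties take from `a`, which keeps the sort stable.
        let take_a = match (a.as_deref(), b.as_deref()) {
            (Some(x), Some(y)) => x.count >= y.count,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        let source = if take_a { &mut a } else { &mut b };
        let mut node = source.take().unwrap();
        *source = node.next.take(); // ownership of the rest stays with the source
        *tail = Some(node);
        tail = &mut tail.as_mut().unwrap().next;
    }
    head
}

fn sort_by_count(head: Link) -> Link {
    // Lists of length 0 or 1 are already sorted.
    if head.as_ref().map_or(true, |n| n.next.is_none()) {
        return head;
    }
    let (front, back) = split(head);
    merge(sort_by_count(front), sort_by_count(back))
}

// Helpers for building and inspecting a list in examples.
fn from_pairs(pairs: &[(&str, usize)]) -> Link {
    let mut head: Link = None;
    for &(word, count) in pairs.iter().rev() {
        head = Some(Box::new(WordNode { word: word.to_string(), count, next: head }));
    }
    head
}

fn to_pairs(head: &Link) -> Vec<(String, usize)> {
    let mut out = Vec::new();
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        out.push((node.word.clone(), node.count));
        cur = node.next.as_deref();
    }
    out
}

fn main() {
    let list = from_pairs(&[("eight", 5), ("five", 1), ("one", 13), ("two", 9)]);
    for (word, count) in to_pairs(&sort_by_count(list)) {
        println!("Count: {count} Word: {word}");
    }
}
```

Nodes are moved between lists by transferring their Box, so no word is cloned or reallocated during the sort.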
What makes this tricky in Rust
// This pattern — keeping a raw tail pointer alongside an owned head — is
// the core tension. Box gives you ownership, but the tail needs to mutate
// a node that Box already owns. You end up in unsafe territory:
struct WordList {
    head: Option<Box<WordNode>>, // Owns the list
    tail: *mut WordNode,         // Points into it (unsafe)
}
Garbage-collected languages make this trivial. Rust forces you to reason about who owns what, and this project is a worked example of navigating that.
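One way an O(1) append on this layout could work is sketched below. It is an illustration under assumptions, not the repository's code: the `WordNode` fields and the `new`, `push_back`, and `entries` methods are hypothetical. The sketch installs the new node first and only then takes the raw pointer from the node's final location.

```rust
use std::ptr;

// Hypothetical node layout; field names are assumptions.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>, // Owns the list
    tail: *mut WordNode,         // Last node, or null while the list is empty
}

impl WordList {
    fn new() -> Self {
        WordList { head: None, tail: ptr::null_mut() }
    }

    /// Appends a node in O(1) by writing through the raw tail pointer.
    fn push_back(&mut self, word: &str) {
        let node = Box::new(WordNode { word: word.to_string(), count: 1, next: None });
        // Find the empty link the new node belongs in.
        let slot: &mut Option<Box<WordNode>> = if self.tail.is_null() {
            &mut self.head
        } else {
            // SAFETY: a non-null tail always points at the last node, which the
            // list still owns, so the allocation is live and has not moved.
            unsafe { &mut (*self.tail).next }
        };
        // Install the node, then derive the raw pointer from its final home.
        let installed: &mut Box<WordNode> = slot.insert(node);
        self.tail = &mut **installed;
    }

    /// Collects (word, count) pairs in list order using only safe code.
    fn entries(&self) -> Vec<(&str, usize)> {
        let mut out = Vec::new();
        let mut cur = self.head.as_deref();
        while let Some(node) = cur {
            out.push((node.word.as_str(), node.count));
            cur = node.next.as_deref();
        }
        out
    }
}

fn main() {
    let mut list = WordList::new();
    for word in ["one", "two", "three"] {
        list.push_back(word);
    }
    println!("{:?}", list.entries());
}
```

Mixing Box ownership with raw pointers is easy to get subtly wrong, so running a design like this under Miri is a reasonable extra check.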
Features
- Reads .txt files (line-by-line word extraction)
- Reads .docx files (ZIP archive extraction + XML content parsing)
- Single linked list with alphabetical insertion sort
- Double linked list variant (experimental)
- Merge sort to re-order by word count
- Colored terminal output
- Integration tests with known expected output
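The alphabetical insertion step listed above could be sketched as follows. This is a minimal illustration under assumptions, not the project's code: `WordNode`, `WordList`, `record`, and `entries` are hypothetical names, and words are compared byte-wise with no case folding.

```rust
// Hypothetical node layout; field names are assumptions.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

#[derive(Default)]
struct WordList {
    head: Option<Box<WordNode>>,
}

impl WordList {
    /// Inserts `word` at its alphabetical position, or bumps its count
    /// if the word is already in the list.
    fn record(&mut self, word: &str) {
        let mut cursor = &mut self.head;
        // Advance while the current node sorts strictly before `word`.
        while cursor.as_ref().map_or(false, |n| n.word.as_str() < word) {
            cursor = &mut cursor.as_mut().unwrap().next;
        }
        let found = cursor.as_ref().map_or(false, |n| n.word == word);
        if found {
            cursor.as_mut().unwrap().count += 1;
        } else {
            // Splice a new node in front of whatever the cursor points at.
            let rest = cursor.take();
            *cursor = Some(Box::new(WordNode { word: word.to_string(), count: 1, next: rest }));
        }
    }

    /// Collects (word, count) pairs in list order.
    fn entries(&self) -> Vec<(&str, usize)> {
        let mut out = Vec::new();
        let mut cur = self.head.as_deref();
        while let Some(node) = cur {
            out.push((node.word.as_str(), node.count));
            cur = node.next.as_deref();
        }
        out
    }
}

fn main() {
    let mut list = WordList::default();
    for word in "two one three one two one".split_whitespace() {
        list.record(word);
    }
    for (i, (word, count)) in list.entries().iter().enumerate() {
        println!("Node {}: Count: {} Word: {}", i + 1, count, word);
    }
}
```

Each insertion walks the list from the head, so collecting n words this way costs O(n × u) comparisons for u unique words.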
Usage
cargo run -- path/to/file.txt
cargo run -- path/to/document.docx
Example Output
Printing list sorted alphabetically:
Node 1: Count: 5 Word: eight
Node 2: Count: 1 Word: five
Node 3: Count: 1 Word: four
...
Printing list sorted by word count:
Node 1: Count: 13 Word: one
Node 2: Count: 9 Word: two
Node 3: Count: 5 Word: ten
...
Total non-unique words: 50
Total unique words: 10
Project Structure
src/
├── main.rs # CLI entry point, file type routing
├── readfile.rs # .txt and .docx file parsers
├── word_tracker_single_linkedlist.rs # Single linked list (primary)
├── word_tracker_double_linkedlist.rs # Double linked list (experimental)
└── utilities/
    ├── mod.rs # Module declarations
    └── print_utils.rs # Colored terminal output macros
tests/
├── sort_by_word_count_txt_test.rs # Integration test for .txt files
├── sort_by_word_count_docx_test.rs # Integration test for .docx files
└── data/
    ├── input.txt # Test fixture
    └── input.docx # Test fixture (same content as .txt)
Testing
cargo test
Tests run the full binary against known input files and validate:
- Alphabetical sort order and word counts
- Word-count sort order (descending)
- Correct handling of multiple whitespace and line breaks
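A check like the descending-order validation might be written along these lines. This is a sketch under assumptions rather than the repository's test code: `parse_node_line` and `is_descending_by_count` are hypothetical helpers, and the line format is taken from the example output shown in this README.

```rust
/// Parses a line such as "Node 1: Count: 13 Word: one" into (count, word).
/// Returns None for headings, totals, and anything else that does not match.
fn parse_node_line(line: &str) -> Option<(usize, String)> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    match tokens.as_slice() {
        ["Node", _, "Count:", count, "Word:", word] => {
            Some((count.parse().ok()?, (*word).to_string()))
        }
        _ => None,
    }
}

/// True if the node lines in `output` never increase in count.
fn is_descending_by_count(output: &str) -> bool {
    let counts: Vec<usize> = output
        .lines()
        .filter_map(parse_node_line)
        .map(|(count, _)| count)
        .collect();
    counts.windows(2).all(|pair| pair[0] >= pair[1])
}

fn main() {
    let output = "Printing list sorted by word count:\n\
                  Node 1: Count: 13 Word: one\n\
                  Node 2: Count: 9 Word: two\n\
                  Node 3: Count: 5 Word: ten\n";
    println!("descending: {}", is_descending_by_count(output));
}
```

Splitting on whitespace keeps the parser tolerant of column padding in the printed output.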
Dependencies
| Crate | Purpose |
|---|---|
| zip | Extract .docx ZIP archives |
| quick-xml | Parse Word document XML |
| colored | Colored terminal output |
Further Reading
- Learn Rust With Entirely Too Many Linked Lists — the definitive guide to linked lists in Rust, covering safe and unsafe approaches
- The Rustonomicon — Rust's guide to unsafe code and raw pointers
License
MIT License. See LICENSE for details.